Canadian Journal of Remote Sensing
Journal canadien de télédétection
Volume 49, 2023 - Issue 1
Research Article

Large-Scale LoD2 Building Modeling using Deep Multimodal Feature Fusion

Modélisation de bâtiments LoD2 à grande échelle à l’aide de la fusion de caractéristiques multimodales profondes

Article: 2236243 | Received 10 Feb 2023, Accepted 28 Jun 2023, Published online: 24 Jul 2023

Abstract

In today’s rapidly urbanizing world, accurate 3D city models are crucial for sustainable urban management. The existing technology for 3D city modeling still relies on an extensive amount of manual work and the provided solutions may vary depending on the urban structure of different places. According to the CityGML3 standard of 3D city modeling, in LoD2, the roof structures need to be modeled which is a challenging task due to the complexity and diversity of roof types. While high-resolution images can be utilized to classify roof types, they have difficulties in areas with poor contrast or shadows. This study proposes a deep learning approach that combines RGB optical and height information of buildings to improve the accuracy of roof type classification and automatically generate a 3D city model. The proposed methodology is divided into two phases: (1) classifying roof types into the nine most popular roof types in New Brunswick, Canada, using a multimodal feature fusion network, and (2) generating a large-scale LoD2 3D city model using a model-driven approach. The evaluation results show an overall accuracy of 97.58% and a Kappa coefficient of 0.9705 for the classification phase and an RMSE of 1.03 (m) for the 3D modeling.

Résumé

Dans le monde d’aujourd’hui qui s’urbanise rapidement, des modèles de ville 3D précis sont cruciaux pour une gestion urbaine durable. La technologie existante pour la modélisation 3D de la ville repose encore sur une grande quantité de travail manuel et les solutions fournies peuvent varier en fonction de la structure urbaine des différents lieux. Selon la norme CityGML3 de modélisation 3D de la ville, dans LoD2, les structures de toit doivent être modélisées, ce qui est une tâche difficile en raison de la complexité et de la diversité des types de toitures. Bien que les images haute résolution puissent être utilisées pour classer les types de toitures, elles ont des difficultés dans les zones à faible contraste ou ombragées. Cette étude propose une approche d’apprentissage en profondeur qui combine les informations RVB optiques et de hauteur des bâtiments pour améliorer la précision de la classification des types de toitures et générer automatiquement un modèle de ville 3D. La méthodologie proposée est divisée en deux phases: (1) classer les types de toitures pour les neuf types les plus populaires au Nouveau-Brunswick, Canada, à l’aide d’un réseau de fusion d’entités multimodales, et (2) générer un modèle de ville 3D LoD2 à grande échelle à l’aide d’une approche basée sur la modélisation. Les résultats de l’évaluation montrent une précision globale de 97,58% et un coefficient Kappa de 0,9705 pour la phase de classification et un RMSE de 1,03 (m) pour la modélisation 3D.

Introduction

The United Nations’ population division reports that 55% of the world’s population lives in urban areas (United Nations Citation2018; Department of Economic and Social Affairs of the United Nations Citation2018). Accurate 3D city models are crucial to sustainably managing and planning for the growing urban population. 3D city models provide a comprehensive understanding of a city’s layout, infrastructure, and resources, empowering engineers, urban planners, and decision-makers to assess the current state of the city and make decisions that align with the city’s long-term objectives. 3D city models can be used in different areas, such as smart city applications, disaster management, urban planning, tourism, navigation, and facilities management (Doulamis and Preka Citation2016; Peters et al. Citation2022).

According to the Open Geospatial Consortium (OGC) standard for 3D city modeling, CityGML 3.0, 3D city models can be produced at different Levels of Detail (LoD) to address different applications. This standard allows for detailed modeling of buildings at four LoDs, LoD0 to LoD3, which progressively increase the complexity and accuracy of the building model. In LoD0, buildings are represented by their 2D footprints, and LoD1 is a block-shaped model of the buildings. In LoD2, buildings include the structure of the roof, and finally, detailed architectural elements such as windows, doors, and the full exterior are included in LoD3.

Hence, in LoD2+ models, buildings have adequately modeled roof structures and thematically differentiated surfaces. The majority of the studies in the literature can be divided into data-driven or model-driven approaches. Data-driven approaches extract geometric components from a building and use them to model the building. Model-driven approaches, on the other hand, select, from a predefined library of models, the model that best fits the building data (Krafczek and Jabari Citation2022). Although data-driven approaches can be more flexible in modeling roof types, if the majority of roofs in an area follow basic shapes such as gable and pyramid, model-driven methods can be faster and simpler to implement (Partovi et al. Citation2014).

Thus, in model-driven LoD2 city generation methods, the type of building roofs plays a crucial role in fitting a 3D model to each building (Buyukdemircioglu et al. Citation2021). As a result, accurate estimation of building roof types is a key step toward 3D city modeling in model-driven approaches. In general, the accuracy of the roof type classification process directly affects the final precision of the LoD2 3D building model.

Due to the high amount of information in urban areas and the diversity of building roof types, the classification of building roof types is an active research area in photogrammetry, remote sensing, and computer vision.

Although classification algorithms have been developed, creating a 3D city model based on fully automatic roof-type classification is still a challenging task. With the rapid development of big data and high-performance computers, deep learning, especially Convolutional Neural Networks (CNNs), can help with classification tasks, e.g., image classification, segmentation, and scene understanding (LeCun et al. Citation2010).

High-resolution RGB optical airborne/satellite images provide rich information content that can be used for roof type classification. However, semantic classification and extracting 3D information using 2D optical images suffer from difficulty in distinguishing objects in areas with poor contrast and shadows. Digital elevation models can overcome this limitation since each type of roof has its own height pattern. Flat roofs, for example, have a constant height across their surface, while gable roofs have a decreasing height from the peak to the bottom. 3D point cloud data containing elevation and intensity measurement information can be used as an independent source of information for roof-type classification.

To improve the accuracy of recognizing roof types and generating a large-scale LoD2 3D city model, we propose a multimodal network that combines deep RGB optical and height features of each building. To the best of our knowledge, this is the first study in the literature that fuses these features for roof type classification based on the model proposed in . Our proposed method aims to increase the accuracy of LoD2 3D city modeling by improving the overall accuracy of roof type classification.

The presented methodology consists of two phases. The first phase focuses on building roof type classification by fusing the RGB optical features extracted from high-resolution orthophotos with the height information of buildings extracted from LiDAR data. The concatenated features are fed to a fully connected classifier to recognize the roof types. To the best of our knowledge, existing CNN-based roof classification training datasets, such as those of Alidoost and Arefi (Citation2018), Buyukdemircioglu et al. (Citation2021), and Wang et al. (Citation2022), cover up to seven roof types, including flat, hip, half-hip, pyramid, gable, and complex. The study area for this work includes Fredericton and Moncton, two major cities in New Brunswick, located in the Atlantic region of Canada. In New Brunswick, as in many other urban/suburban areas in Canada, the majority of rooftops can be classified into nine groups, namely flat, gable, hip, cross-hip, gable-flat, pyramid, cross-gable, gambrel, and dutch. As the existing datasets do not cover roof types specific to this area, it was necessary to develop a roof type dataset that includes the common roof types of Eastern Canada. A small number of buildings do not fit into the aforementioned groups, and these we classify as complex roofs. We employed a decision tree method to separate complex building roof types from all the buildings; the remaining buildings were then classified into the nine roof types using the proposed deep feature fusion DL network. In this paper, we are thus able to identify a total of 10 types of building roofs.

The second phase of this work focuses on large-scale 3D city modeling using a model-driven approach that fits a 3D model to the point clouds. This part can be divided into two steps: (1) extracting the eave and ridge heights of each building using the Digital Elevation Model (DEM) and the Digital Surface Model (DSM); and (2) assigning a 3D model to each building using a preexisting roof type library.

In summary, this paper contributes to the literature in three respects:

  • Designing a multimodal feature fusion solution to classify roof types. This solution utilizes both RGB optical and LiDAR data features, which are fused through a double deep learning network.

  • Optimizing the fit of a 3D model to the building points in the point cloud by using a model-driven approach.

  • Creating a roof-type classification dataset using high-resolution orthophotos (72 mm) and LiDAR data (6 points/m²). The dataset will be published for open access and can be employed in other CNN-based roof-type classification applications.

Related works

There is extensive literature on building roof type detection and LoD2 3D city model reconstruction. This section briefly addresses some of the related work in these two areas. These works are dedicated to building recognition and building roof type classification using machine learning and deep learning approaches.

Qian et al. (Citation2022) proposed a deep learning network called Deep Roof Refiner for refining the delineation of roof structure lines using satellite imagery, which can then be used to model the roofs. Alidoost and Arefi (Citation2018) developed a model-based approach for automatic building detection and roof type classification using a single aerial image. They classified three different roof types, including flat, gable, and hip shapes, with an accuracy of 92%. Partovi et al. (Citation2017) classified roof types into seven classes using WorldView-2 pan-sharpened multispectral satellite imagery and the VGG-Net model. In another study, Bittner et al. (Citation2019) proposed a multi-task conditional generative adversarial network for DSM refinement and roof-type classification. Their network produces a refined DSM, which is then used for dense pixel-wise rooftop classification, assigning object class labels to each pixel in the DSMs with 80.03% precision. Buyukdemircioglu et al. (Citation2021) classified six roof types using a shallow CNN model. They fine-tuned their model with three well-known pre-trained networks, i.e., VGG-16, EfficientNetB4, and ResNet-50. The results showed that after fine-tuning the network, the accuracy of the model increased by 3% to 6%. Using machine learning approaches, Assouline et al. (Citation2017) classified roof types as well as aspect (azimuth) and slope (tilt) classes for large-scale solar photovoltaic (PV) deployment and obtained an accuracy of 67%.

Conventional methods of 3D city model generation can be divided into three categories: Data-driven, model-driven and hybrid techniques.

  1. Data-driven techniques, which are also called bottom-up approaches, are used to detect the roof planes and extrude roof shapes based on geometric components such as lines, edges, and points (Park and Guldmann Citation2019). There are various methods for segmenting the LiDAR point clouds and determining roof planes, including edge-based methods (Jiang and Bunke Citation1994), region-growing methods (Alharthy and Bethel Citation2004), random sample consensus (RANSAC) methods (Hartley and Zisserman Citation2003), and clustering methods (Shan and Toth Citation2018), as well as combinations of two or more algorithms (Dorninger and Pfeifer Citation2008). Huang et al. (Citation2011) introduced generative modeling of building roofs with an assembly of primitives allowing overlap, using the Reversible Jump Markov Chain Monte Carlo algorithm.

    Huang et al. (Citation2022) and Li et al. (Citation2022) presented methodologies for reconstructing 3D models of buildings from airborne LiDAR point clouds using a data-driven approach. In both works, point clouds are segmented into planar patches; a 3D optimization technique is then applied to create a topologically consistent 3D building model from its compositional primitives.

  2. Model-driven approaches, which are also known as top-down approaches. Lafarge et al. (Citation2010) and Huang et al. (Citation2013) proposed methods to reconstruct buildings from a digital surface model. This process involved breaking down the building footprints into components, either manually or automatically, and then utilizing a Gibbs model to fit the 3D block models onto the building footprints. A Bayesian decision was used to find the most appropriate roof primitives from the pre-defined library to represent the point clouds, utilizing a Markov Chain Monte Carlo sampler and original proposition kernels.

  3. Hybrid methods were developed as a result of the inherent weakness of model-driven approaches in modeling complicated buildings and the complexity of data-driven methods. Model-driven and hybrid approaches are reviewed here, as they are related to the adopted workflow. Pepe et al. (Citation2021) and Tripodi et al. (Citation2020) used stereo satellite imagery to build the digital surface model and extract the height of each object using the DSM. The latter used deep learning to extract the contour polygons of the buildings and then the digital terrain model. Zhao et al. (Citation2021) proposed a reconstruction framework to recover a 3D model with a complete shape and accurate scale from a single image. Their method uses two convolutional neural networks to create watertight mesh models and optimizes them using another CNN. Krafczek and Jabari (Citation2022) proposed a decision-tree-based methodology for generating LOD2 3D city models. They decomposed the building footprints into building primitives to obtain a better height estimate for each building’s parts.

3D city models can also be generated directly from 3D point clouds. These methods use Terrestrial Laser Scanners (TLS) (Akmalia et al. Citation2014) to generate dense point clouds and then perform segmentation to detect building façades and features. Li et al. (Citation2022) utilized deep learning to model building roofs from raw LiDAR data automatically. They extracted PointNet++ deep features from the input building roof point clouds to detect the roof corners. The corners are clustered to make a set of accurate vertices, which are fed to a graph algorithm that finds the valid edges between vertices, providing the results used to make the final roof model. Similarly, Dehbi et al. (Citation2021) presented a new method for reconstructing 3D buildings from LiDAR data. They used an active sampling strategy that combines a series of filters to focus on promising samples. The filters are based on prior knowledge represented by density distributions. The method uses surflets (3D points with normal vectors) to provide parameters for model candidates, such as azimuth, inclination, and ridge height. Building footprints are derived in a preprocessing step using machine learning methods. Kada (Citation2022) also used LiDAR point clouds to reconstruct simple 3D models, extracting the geometrical features of buildings needed for 3D modeling using a DL network.

Data preparation

In this research, we developed a roof-type dataset consisting of 2,483 buildings from nine common roof types. To create this dataset, we used four input layers; the details of the input data are provided in Table 1.

Table 1. Specification of the data used in this study.

LiDAR point clouds were used to create a DSM with a spatial resolution of 72 mm after resizing. Next, high-resolution orthophotos were mosaicked and used to manually digitize, modify, and label building footprints. This process resulted in nine groups of roof types extracted from the high-resolution orthoimages and DSM images. Each dataset includes building images for each class, along with the corresponding labels.

The training dataset was created using the DSM and orthophoto of Fredericton in the first scenario, while in the second scenario, the training dataset was created using the DSM and orthophoto of Moncton. To evaluate the performance of the model, separate testing and validation datasets were generated using the DSM and orthophoto of Moncton in the first scenario and of Fredericton in the second. For each scenario, the whole training city’s building data was used for training; 25% of the test dataset was dedicated to validating the model performance, while the remaining 75% was used to test the model.

The test and train datasets were then cropped and resized. Figure 1 shows the preprocessing and preparation diagram.

Figure 1. Preprocessing and preparing the training and testing datasets.

To reduce the over-fitting problem in deep learning, the training data was augmented. The augmentation process involved horizontal and vertical flipping of the training images, as well as rotations by 45, 60, and 90 degrees clockwise. These augmentations ensured an equal number of images for each class; as a result, there were 1000 samples per class in the training datasets. A sample of the RGB optical image and height layer for each building is shown in Table 2.

Table 2. RGB optical and DSM sample images of the roof types.
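
As an illustration, the following is a minimal sketch of this augmentation scheme using torchvision; the function name and the tensor-image input are assumptions, not the authors' implementation.

```python
import torchvision.transforms.functional as TF

def augment(image):
    """Return flipped and clockwise-rotated copies of one training chip.

    `image` is assumed to be a PIL image or CHW tensor; the flips and
    45/60/90-degree clockwise rotations follow the scheme described above.
    """
    augmented = [TF.hflip(image), TF.vflip(image)]
    for angle in (45, 60, 90):
        # torchvision rotates counter-clockwise for positive angles,
        # so a negative angle gives a clockwise rotation.
        augmented.append(TF.rotate(image, -angle))
    return augmented
```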

Methodology

As shown in Figure 2, this work is divided into two phases. Phase 1 focuses on the classification of roof types. In this phase, we first determine complex roof types, which have irregular geometries that cannot be handled by model-driven methods. We detect these roof types and mark them for manual 3D reconstruction. In the next step, we classify the roof types of the remaining buildings. Phase 2 of this work involves the reconstruction of a large-scale LoD2 3D city model. The following subsections provide details about each phase.

Figure 2. Overall workflow of the multimodal feature fusion network and 3D city model reconstruction.

Phase 1: Roof type classification based on multimodal feature fusion network

Recognizing complex roof types

These buildings have multi-plane roof structures with multiple peaks and edges. We utilized a decision tree method, depicted in Figure 3, to detect complex roof types. The first step in this process is to determine the number of roof edges and planes for each building. To calculate the number of edges, we simply counted the number of vertices of each building footprint.

Figure 3. Complex roof type distinguishing using a decision-tree based method.

To determine the planes of each roof, the RANdom SAmple Consensus (RANSAC) method was used (Derpanis Citation2010). RANSAC is an algorithm used to identify and extract multiple planes from a LiDAR point cloud dataset. The algorithm works by iteratively selecting a random subset of points from the point cloud and using them to estimate a plane that fits the data. It then uses a distance metric to evaluate how well the estimated plane fits the remaining points in the point cloud. Points within a certain distance threshold from the plane are considered inliers and assigned to that plane. Then, in a repetitive process, new subsets of points are selected and fitted to planes until a satisfactory number of planes is found that explains the majority of the data. Once the RANSAC algorithm has identified multiple planes, other points in the point cloud can be classified and assigned to those planes (Derpanis Citation2010; Zeineldin and El-Fishawy Citation2017).
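
For illustration, the sketch below shows this iterative plane extraction using Open3D's built-in RANSAC plane segmentation; it is a stand-in under assumed thresholds, not the implementation used in this study.

```python
import numpy as np
import open3d as o3d

def extract_roof_planes(points, dist_thresh=0.15, min_inliers=50, max_planes=10):
    """Iteratively fit planes to a building's roof points with RANSAC.

    points: (N, 3) array of LiDAR returns over one building footprint.
    Returns a list of (plane_model, inlier_points) pairs; all thresholds
    are illustrative.
    """
    cloud = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    planes = []
    while len(planes) < max_planes and len(cloud.points) >= min_inliers:
        # Fit one plane ax + by + cz + d = 0 and get its inlier indices.
        model, inliers = cloud.segment_plane(
            distance_threshold=dist_thresh, ransac_n=3, num_iterations=1000)
        if len(inliers) < min_inliers:
            break  # the remaining points do not support another plane
        planes.append((model, np.asarray(cloud.points)[inliers]))
        # Remove the inliers and search for the next plane.
        cloud = cloud.select_by_index(inliers, invert=True)
    return planes
```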

Pretraining baseline networks

Residual neural network (ResNet) is a CNN architecture used for image classification that was designed to solve the vanishing gradient problem, in which gradients become vanishingly small during backpropagation due to sequential multiplication. ResNet addresses this issue by incorporating skip connections, which allow gradients to flow directly to deep layers without being attenuated to small values (Sarwinda et al. Citation2021; Tan et al. Citation2018). In this study, we used ResNet as the CNN body of our baseline networks and of the proposed method.

Traditionally, CNNs are trained from a random initial set of weight parameters, but this approach requires a large amount of training data and a significant amount of memory. In this work, a ResNet pre-trained on the ImageNet dataset (Deng et al. Citation2009) is employed to avoid the overfitting problem. We utilized this pre-trained model as the baseline, or starting point, for the classification task and then fine-tuned its parameters specifically for our target dataset using transfer learning principles (Bengio Citation2012; Donahue et al. Citation2013).

To customize the pre-trained ResNet for building roof type classification, we replaced the last fully connected layer of these pre-trained models with a fully connected layer matching the number of roof type classes. Additionally, we utilized transfer learning techniques to fine-tune our model further and optimize its performance. The rest of the model is used as a fixed feature extractor over our two datasets. Next, the SoftMax classifiers of these networks are trained on the new datasets using discriminative learning rates.
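
The following is a minimal sketch of this head replacement and discriminative learning-rate setup in PyTorch (assuming a recent torchvision); the learning-rate values are illustrative, not the ones reported in Table 4.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 9  # non-complex roof types in this study

# Start from an ImageNet-pretrained ResNet and swap in a new classifier head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Discriminative learning rates: small updates for the pretrained body,
# larger updates for the freshly initialized head (values illustrative).
optimizer = torch.optim.Adam([
    {"params": (p for n, p in model.named_parameters() if not n.startswith("fc")),
     "lr": 1e-5},
    {"params": model.fc.parameters(), "lr": 1e-3},
])
```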

Deep feature extraction and fusion

After creating the baseline networks pre-trained on the ImageNet dataset, we proceeded to develop the third network, which extracts meaningful RGB optical and height features of each building. While the CNN body of this network is ResNet, it is pre-trained on the RGB optical and height datasets from the previous step (see Figure 4). After feeding the weight parameters from the previous step into the multimodal feature fusion network, the last layer of this network is replaced with a newly developed classifier head covering the roof type classes. The multimodal feature fusion network extracts the RGB optical and height features of each building and then concatenates them. This process results in a feature map, which is subsequently fed to the flatten layer. Figure 4 presents a schematic diagram of the multimodal feature fusion network.

Figure 4. Multimodal feature fusion framework. After generating two ImageNet-based pretrained ResNet networks and fine-tuning them based on RGB optical and height datasets in step 1 and step 2, the weight parameters of these two networks are fed to the proposed multimodal feature fusion network in step 3.
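
To make the fusion step concrete, here is a hedged PyTorch sketch of a two-branch network of this kind; the class name is hypothetical, plain ImageNet weights stand in for the fine-tuned baseline weights, and the DSM chips are assumed to be replicated to three channels to fit the standard ResNet stem.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultimodalRoofClassifier(nn.Module):
    """Sketch of the described fusion network: one ResNet branch for RGB
    orthophoto chips and one for DSM (height) chips; their pooled features
    are concatenated and passed to a new classifier head."""

    def __init__(self, num_classes=9):
        super().__init__()
        # In the paper these branches are initialized from the two fine-tuned
        # baseline networks; ImageNet weights are used here as a stand-in.
        self.rgb_branch = models.resnet50(weights="IMAGENET1K_V1")
        self.height_branch = models.resnet50(weights="IMAGENET1K_V1")
        feat_dim = self.rgb_branch.fc.in_features
        self.rgb_branch.fc = nn.Identity()     # keep the pooled feature vector
        self.height_branch.fc = nn.Identity()
        self.classifier = nn.Sequential(
            nn.Flatten(),                          # the "flatten layer" in the text
            nn.Linear(2 * feat_dim, num_classes),  # SoftMax applied in the loss
        )

    def forward(self, rgb, height):
        # Concatenate the deep RGB and height features of each building.
        fused = torch.cat([self.rgb_branch(rgb), self.height_branch(height)], dim=1)
        return self.classifier(fused)
```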

Phase 2: Model-driven 3D city model reconstruction

In Level of Detail 2 (LoD2), 3D building reconstruction using a model-driven approach requires roof attributes such as roof type, ridge height, and eave height. Even though the roof types are recognized in phase 1, the ridge height and eave height still need to be retrieved from the point cloud. To obtain the ridge height, a zonal maximum statistics tool is applied to the nDSM (Normalized Digital Surface Model), which calculates the maximum height for each building. To generate the nDSM, the LiDAR point cloud data is first used to extract both the Digital Surface Model (DSM) and the Digital Terrain Model (DTM), which represents the ground surface. Next, the DTM is subtracted from the DSM, generating the nDSM, which represents the height of buildings above the ground surface.
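
A minimal NumPy sketch of this step is given below, assuming the two rasters share one grid and that building footprints are supplied as a labeled raster; the function name is hypothetical.

```python
import numpy as np

def ridge_heights(dsm, dtm, footprint_labels):
    """Compute nDSM = DSM - DTM and the zonal maximum per building.

    dsm, dtm: 2D arrays on the same grid; footprint_labels: 2D integer
    array in which each building footprint carries a unique positive id.
    Returns {building_id: ridge height above ground}.
    """
    ndsm = dsm - dtm  # height above the ground surface
    ridges = {}
    for bid in np.unique(footprint_labels):
        if bid == 0:
            continue  # 0 marks the background
        ridges[int(bid)] = float(ndsm[footprint_labels == bid].max())
    return ridges
```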

The eave height of a building is defined as the minimum height of its largest sloping roof plane. To determine this, the following steps are taken, as shown in Figure 5: (1) calculate the slope of each building and assign zero if it falls outside the minimum and maximum slope range; (2) create the minimum height threshold DSM, where the threshold is defined as follows: if nDSM >= Min Roof Height, the pixel is assigned 1, else 0; (3) calculate the aspect of this layer; (4) reclassify the aspect map into a number of classes; (5) convert the classes into polygons; and (6) calculate the area of each polygon and use the one with the largest area (larger than the minimum sloping-roof area) as the roof plane. As a result, the largest sloping roof plane is obtained, and its minimum height is taken as the eave height. The roof height can then be calculated by subtracting the eave height from the ridge height of each building.

Figure 5. Flowchart of the eave height calculation process.
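
The sketch below approximates these steps for a single building chip in NumPy/SciPy; connected regions of equal aspect class stand in for the polygons of steps (5) and (6), and all thresholds are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def eave_height(ndsm, min_roof_height=2.5, aspect_classes=8, min_plane_area=10):
    """Approximate the eave-height workflow for one building.

    ndsm: 2D height-above-ground array clipped to the footprint (NaN outside).
    min_roof_height (m) and min_plane_area (pixels) are illustrative.
    """
    # Steps 1-2: keep only pixels plausibly belonging to the roof.
    roof = np.where(ndsm >= min_roof_height, ndsm, np.nan)
    # Step 3: aspect (downslope direction) from the height gradient.
    dy, dx = np.gradient(roof)
    aspect = np.degrees(np.arctan2(-dx, dy)) % 360.0
    # Step 4: reclassify the aspect into coarse classes; -1 marks no-data.
    binned = np.floor(aspect / (360.0 / aspect_classes))
    binned[np.isnan(aspect)] = -1
    # Steps 5-6: connected regions of equal aspect stand in for the polygons;
    # the largest sufficiently big region is taken as the main sloping plane.
    best_region, best_area = None, 0
    for c in range(aspect_classes):
        labels, n = ndimage.label(binned == c)
        for r in range(1, n + 1):
            area = int((labels == r).sum())
            if area >= max(min_plane_area, best_area + 1):
                best_region, best_area = labels == r, area
    # The eave height is the minimum height on that plane.
    return float(np.nanmin(roof[best_region])) if best_region is not None else None
```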

Key attributes, including ridge height, eave height, and roof type, have thus been acquired to fit 3D models to the point clouds and reconstruct the 3D buildings. Using a Computer Generated Architecture (CGA) rule, these attributes are utilized within ESRI CityEngine for 3D creation purposes. Buildings are extruded to their eave height, and then the roof shape and roof height of each building are applied.

Experiments

Phase I: Roof type classification

Based on visual inspection, the majority of the buildings in Atlantic Canada comprise ten different types, including complex, as specified earlier. Using the decision-tree-based approach, building footprints that have more than six planes and 13 edges are classified as complex. These thresholds were obtained by trial and error according to the structure of buildings in Eastern Canada (Table 3).

Table 3. Number of average edges and planes for each building roof types.
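
As a minimal sketch, the threshold rule amounts to the following check (the function name is hypothetical; the plane count comes from the RANSAC step and the edge count from the footprint vertices):

```python
def is_complex_roof(num_planes, num_edges, max_planes=6, max_edges=13):
    """Flag a building as complex when it exceeds both thresholds found
    by trial and error; complex roofs are set aside for manual modeling."""
    return num_planes > max_planes and num_edges > max_edges
```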

After recognizing complex roof types, we created two baseline deep networks to recognize non-complex building roof types (see Figure 4). The first network was based on high-resolution orthoimage data, and the second on digital surface model data. To extract meaningful RGB optical and height features of each building, we tested three pre-trained networks, ResNet-18, ResNet-50, and ResNet-101, as the backbones of our baseline networks. We were able to effectively transfer the knowledge gained from training on the ImageNet dataset to our baseline models, resulting in better accuracy for our specific classification tasks. Each network was trained for 200 epochs.

In the next step, we saved the weight parameters of the two baseline DL models and later used them as initial weights for our proposed deep multimodal feature fusion network. This approach allowed us to leverage the preexisting knowledge of the baseline networks trained on the roof type dataset, which is more relevant to our specific task of classifying roof types, and helped ensure that the initial parameters of our fusion network were related to the target domain.

Next, we extracted two sets of descriptors from these two DL baseline networks. These descriptors are concatenated together using the proposed multimodal feature fusion network, and the fused descriptors are fed into the SoftMax classifier head. This deep feature fusion approach enabled the generation of more robust features (Dai et al. Citation2021).

We fine-tuned the proposed network to adapt the initial weight parameters from the previous step to the proposed classification network. For fine-tuning, all layers except the last fully connected layer were first frozen, and the network was trained for 100 epochs; thus, the linear classifier was trained from scratch. Then, all layers were unfrozen, the network was trained for another 100 epochs, and the RGB optical and height features of each building were extracted.
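
In PyTorch terms, this two-stage schedule can be sketched as follows; the train helper is hypothetical, and the layer name fc assumes a torchvision-style ResNet head.

```python
def fine_tune(model, train_loader, head_epochs=100, full_epochs=100):
    """Two-stage fine-tuning: train the new head with the body frozen,
    then unfreeze everything and continue training."""
    # Stage 1: freeze all layers except the final fully connected classifier.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("fc")
    train(model, train_loader, epochs=head_epochs)  # hypothetical helper

    # Stage 2: unfreeze the whole network and keep training.
    for param in model.parameters():
        param.requires_grad = True
    train(model, train_loader, epochs=full_epochs)  # hypothetical helper
```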

To find an optimal learning rate, we trained the network with a range of learning rates (0.1, 0.01, 0.001, 0.0001, 0.00001, and 0.000001), each for just four epochs. The learning rate that results in the minimum loss is selected as the starting point, and we then refine it by exploring three decimal places below and above the chosen value. The rate with the minimum loss is chosen as the optimal learning rate. The learning rates of these networks are shown in Table 4.

Table 4. Learning rate and epoch values.
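
The coarse stage of this search can be sketched as below; make_model, train, and evaluate_loss are hypothetical helpers standing in for the training loop described above.

```python
def pick_learning_rate(make_model, train_loader,
                       candidates=(1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6)):
    """Train a fresh model for four epochs at each candidate rate and
    keep the rate that yields the smallest loss."""
    losses = {}
    for lr in candidates:
        model = make_model()                             # fresh weights
        train(model, train_loader, epochs=4, lr=lr)      # hypothetical helper
        losses[lr] = evaluate_loss(model, train_loader)  # hypothetical helper
    return min(losses, key=losses.get)
```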

We employed two scenarios to ensure the reliability of the baseline networks and the proposed method. In the first scenario, we trained the method on buildings in Fredericton, New Brunswick, Canada, and tested it in Moncton, the largest city in the province. We conducted a second scenario to further validate the proposed method’s accuracy; in this scenario, we trained the networks on buildings in Moncton and tested them on buildings in Fredericton. Table 5 shows the number of training, validation, and test samples for each scenario (individual building roof types).

Table 5. Training, testing and validation datasets.

To quantify the performance of the proposed network for roof type classification, we used two metrics, namely the Overall Accuracy (OA) and the Kappa coefficient. The Overall Accuracy represents the proportion of correctly classified test samples among all test samples, and the Kappa coefficient measures the agreement between two raters. The formulas for the Overall Accuracy and Kappa coefficient are presented in Equations (1) and (2):

$$\mathrm{OA} = \frac{\sum \text{correctly classified roof types}}{\text{number of buildings}} \times 100 \tag{1}$$

$$\kappa = 1 - \frac{1 - P_o}{1 - P_e} \tag{2}$$

where $P_o$ is the relative observed agreement among raters and $P_e$ is the hypothetical probability of chance agreement, computed by using the observed data to estimate the probability of each rater randomly assigning each category (Cohen Citation1960).
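
For reference, both metrics are available off the shelf; a short sketch using scikit-learn (an assumption, not the authors' tooling):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

def classification_metrics(y_true, y_pred):
    """y_true/y_pred: per-building roof-type labels on the test set."""
    oa = accuracy_score(y_true, y_pred) * 100   # Equation (1), in percent
    kappa = cohen_kappa_score(y_true, y_pred)   # Equation (2)
    return oa, kappa
```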

Phase 2: Model-driven 3D city model reconstruction

After classifying the roof types and extracting building attributes such as eave height and ridge height, we used this information in the CityEngine program with the ESRI CGA rule for 3D representation purposes. Next, we assigned the extracted roof types to the corresponding buildings in the model, utilizing the ESRI building library. Since the library does not contain models for complex and cross-hip roofs, we postponed modeling those to another study. Therefore, an LoD2 3D city model of flat, gable, hip, cross-gable, pyramid, gambrel, and Dutch roof buildings was created for the cities of Moncton and Fredericton using a model-driven approach.

It is essential to assess whether the proposed LOD2 3D city modeling method performs properly in the eave height, ridge height, and roof type detection tasks; thus, we needed to assess the final 3D model. The accuracy of the final 3D model depends on the accuracy of the roof type classification and building decomposition steps. While CityGML 3.0 does not prescribe any fixed values, according to the CityGML 2.0 standard, the geometric error of 3D models should not exceed two meters.

To evaluate the accuracy of the final 3D model, we used the digital surface model as the ground truth and calculated the root mean square error (RMSE) (Chai and Draxler Citation2014) between each 3D building model and the DSM. The formula for the RMSE is presented in Equation (3):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \hat{x}_i \right)^2} \tag{3}$$

where $N$ is the number of buildings, $x_i$ is the DSM height of building $i$, and $\hat{x}_i$ is the corresponding height of the 3D building model.
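
Computed per building over the test set, this is a one-liner in NumPy; the following hedged sketch assumes paired per-building height samples.

```python
import numpy as np

def model_rmse(dsm_heights, model_heights):
    """Equation (3): RMSE between DSM heights and 3D-model heights."""
    x = np.asarray(dsm_heights, dtype=float)
    x_hat = np.asarray(model_heights, dtype=float)
    return float(np.sqrt(np.mean((x - x_hat) ** 2)))
```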

Result and discussion

We report the results of each phase and discuss them in the following sections.

Roof type classification

To the best of our knowledge, existing roof type training datasets cover at most seven classes, so with such limited samples, DL methods cannot satisfactorily classify the building roof types in our study area. We therefore created a roof type dataset based on the nine common roof types existing in New Brunswick. This dataset can be used for further CNN-based classification purposes and is published through the University of New Brunswick library (https://www.unb.ca/).

The quality indices for the RGB optical-based and DSM (height)-based baseline deep learning networks in each scenario are presented in Table 6. The number following “ResNet” indicates the network’s number of layers.

Table 6. Overall accuracy and Kappa coefficient of the proposed method.

In the first scenario, the OA of the proposed network with ResNet-18, ResNet-50, and ResNet-101 is 91.96%, 97.03%, and 97.58%, respectively, while the corresponding Kappa coefficients are 0.9030, 0.9639, and 0.9705. In addition, the overall accuracy of the first DL network on the RGB optical dataset with ResNet-18, ResNet-50, and ResNet-101 is 89.54%, 91.74%, and 92.62%, respectively, with Kappa coefficients of 0.8727, 0.8996, and 0.9102. The results also show that the OA on the DSM (height)-based dataset with ResNet-18, -50, and -101 bodies is 90%, 94.96%, and 95.26%, respectively, with Kappa coefficients of 0.8430, 0.9533, and 0.9426.

The proposed method employing ResNet-101 as the CNN body demonstrated a significant increase of approximately 5% in OA and 0.06 in Kappa coefficient compared to using only RGB optical images. When compared to the height (DSM)-based network, the proposed method shows improvements of 2.3% in OA and 0.03 in Kappa metrics. Further, using ResNet-50 and ResNet-18 bodies enhanced the OA by approximately 6% and 2%, respectively, compared to RGB optical networks, along with a Kappa enhancement of approximately 0.06 and 0.03. Additionally, the proposed method using these two bodies improved the accuracy of the height-based network, resulting in an increase in OA for ResNet-50 and ResNet-18 of approximately 3% and 2%, respectively, and an improvement in Kappa metrics of approximately 0.01 and 0.06 for ResNet-50 and ResNet-18 bodies, respectively.

In the second scenario, the proposed network using ResNet-18 achieved an OA of 91.70% and a Kappa coefficient of 0.8980, while ResNet-50 achieved an OA of 95.02% and a Kappa coefficient of 0.9387. Both were outperformed by ResNet-101, which had an OA of 96.30% and a Kappa coefficient of 0.9544. On the RGB optical dataset, the overall accuracy of the first DL network with ResNet-18, ResNet-50, and ResNet-101 was 82.35%, 84.04%, and 87.10%, respectively, with corresponding Kappa coefficients of 0.7986, 0.8039, and 0.8416. Moreover, on the DSM (height)-based dataset, ResNet-18 achieved an OA of 87.66% and a Kappa coefficient of 0.8532, ResNet-50 an OA of 88.59% and a Kappa coefficient of 0.8616, and ResNet-101 an OA of 94.89% and a Kappa coefficient of 0.9371.

We thus obtained a similar result in the second scenario. The results show that the OA of the proposed method utilizing ResNet-18, ResNet-50, and ResNet-101 increased by around 9%, 11%, and 9%, respectively, compared to the RGB optical network, and by 4%, 7%, and 2% compared to the height (DSM)-based network. In addition, the Kappa metrics of the proposed method improved by around 0.10, 0.13, and 0.11 for the ResNet-18, -50, and -101 bodies compared to the RGB optical-based network, and by 0.04, 0.07, and 0.02 compared to the DSM (height)-based network.

Based on the interpretation of these numbers, presented in Table 6, the results can be categorized into three main aspects:

  1. The multimodal feature fusion strategy has the highest quality indices in distinguishing roof types compared to DSM-based or RGB optical-based features. Hence, combining the height information with the RGB optical information of each building can compensate for the limitations of RGB images, such as poor contrast and shadow areas, and improve the accuracy of roof type classification. Some snapshots of shadow areas are shown in Figure 6. Our method, which uses both RGB optical and DSM datasets, shows higher accuracy than studies using only airborne images, such as Alidoost and Arefi (Citation2018), who achieved an accuracy of 92% for classifying three different roof types, and Buyukdemircioglu et al. (Citation2021), who achieved an accuracy of 86% for six roof types. In another study, Wang et al. (Citation2022) reported an accuracy of 90% for five roof types.

  2. The increase in ResNet layers from 18 to 101 leads to an improvement of around 6% in OA and 0.07 in Kappa for the proposed network in the first scenario, and of 5% in OA and 0.06 in Kappa in the second scenario. By increasing the number of ResNet layers, the OA and Kappa metrics of each baseline network are also enhanced. The results also show that increasing the ResNet depth from 18 to 101 layers in the first scenario leads to an increase of approximately 3% in OA and 0.04 in Kappa for the RGB optical-based baseline network, and of around 5% in OA and 0.10 in Kappa for the DSM (height)-based baseline network. In the second scenario, OA improved by 5% and Kappa by 0.05 for the RGB optical-based baseline network, and OA by 6% and Kappa by 0.08 for the DSM (height)-based baseline network.

    Adding hidden layers to ResNet increases the complexity and capacity of the network, allowing it to capture more complicated patterns and features in the data and leading to more accurate predictions.

  3. The analysis also demonstrates that roof type classification based on LiDAR data (height-based) performs better than classification based on RGB optical images, even though the majority of studies in the literature use RGB optical images for roof-type classification purposes.

Figure 7 shows the confusion matrices of the proposed network. As shown in this figure, many of the roof types misclassified by ResNet-18 are cross-hip and gable roofs, which are misclassified as hip and flat roofs, respectively. The structures of these roof types share some similarities, as shown in Figure 8. Hip and cross-hip roofs, for instance, have similar edges, and even flat roof types can have parts with different heights, similar to gable roofs. The network with a ResNet-18 body could not classify them accurately, whereas both ResNet-50 and ResNet-101 achieved significantly higher accuracy on these two classes. This result shows that deeper networks can have higher resolving power for similar roof types.

Figure 6. Snapshots of the poor contrast or shadow areas in RGB optical image.

As shown in Figure 9, the DSM (height)-based network performs better than the RGB optical-based network in most classes but struggles with certain roof types such as hip or dutch gable. The RGB optical-based network, on the other hand, performs better in these classes. Therefore, combining each building’s RGB optical and DSM (height) features leads to improved classification accuracy.

LoD2 3D reconstruction

The final 3D city model is generated based on the roof type classes extracted using the proposed multimodal method. Snapshots of the LoD2 3D model of the city of Moncton with these different roof types are displayed in Figure 10.

Figure 7. Confusion matrices of the proposed method.

The RMSE of the final 3D model and of each roof type is presented in Table 7. As shown in the table, the RMSE of the large-scale 3D city model is 1.03 meters, and the RMSEs of individual roof types such as flat, gable, hip, cross-gable, pyramid, gambrel, and dutch range from 1.02 to 1.86 meters. As noted above, the geometric error of the 3D city model should not exceed two meters. The results demonstrate high accuracy for the presented model on 1208 buildings, though its performance varies across roof types (Table 7). Furthermore, for a better understanding of how the model correlates with the LiDAR point cloud, the overlay of the LiDAR point cloud on the 3D city model is shown in Figure 11, together with the Google Earth 3D representation of the building.

Figure 8. Most common misclassified roof types. Some cross-hip roofs have been misclassified as hip rooftops; this figure shows the similarities between them. (a) Samples of misclassified cross-hip roofs. (b) A sample of a hip roof.

Figure 9. Confusion matrices of the two baseline networks with ResNet-101 bodies.

Figure 10. Snapshots of the 3D city model of Moncton.

Figure 11. Comparison of LiDAR point clouds to the 3D model. (a) Generated 3D city model. (b) LiDAR point clouds. (c) Overlay of the LiDAR point cloud on a building with a flat roof type. (d) Overlay of the LiDAR point cloud on the generated 3D city model. (e) Google Earth 3D visualization.

Table 7. RMSE result of the 3D city model of Moncton.

While the results demonstrate high accuracy in generating a 3D city model, the proposed method is still limited with regard to complex roof types. The challenge in the large-scale 3D city modeling in this study is that buildings with complex structures could not be modeled using a model-driven approach. In future studies, we will develop a hybrid method to construct 3D buildings with complex roof types.

Conclusion

In this study, we proposed a multimodal feature fusion deep learning-based network for classifying roof types into nine standard roof classes and constructing a large-scale LoD2 3D city model.

The methodology takes as input high-resolution orthophotos, LiDAR point cloud data, and a building footprint layer, and follows two phases to create the large-scale 3D model. In the first phase, a decision-tree-based method recognizes complex roof types. Then, the roof type of each building is classified by fusing the RGB optical and DSM (height) features through a deep multimodal feature fusion network, whose initial parameters come from the DSM (height)-based and RGB optical-based baseline networks. In the second phase, a 3D model is fitted to each footprint using the building’s roof information, such as roof type, eave height, and ridge height. As shown in this paper, our roof type classification network confirmed that utilizing the RGB optical and height features of each building improves roof type classification accuracy, thereby enhancing the overall accuracy of 3D building reconstruction.

Acknowledgment

This project is partially funded by an NBIF-POSS (Priority Occupation Student Support) grant and an NSERC Discovery grant. The authors would like to thank the funding agencies for their support. The authors would also like to thank GeoNB, the provincial government’s geospatial information service, and the municipalities for providing airborne images and LiDAR point cloud data, as well as ESRI. The deep learning framework was developed using PyTorch and run on Google Colaboratory, and the authors thank these organizations for providing such valuable platforms.

References

  • Akmalia, R., Setan, H., Majid, Z., Suwardhi, D., and Chong, A. 2014. “TLS for generating multi-LOD of 3D building model.” IOP Conference Series: Earth and Environmental Science, Vol. 18(No. 1): pp. 012064. doi:10.1088/1755-1315/18/1/012064.
  • Alharthy, A., and Bethel, J. 2004. “Detailed building reconstruction from airborne laser data using a moving surface method.” In 20th Congress of International Society for Photogrammetry and Remote Sensing, 213–218.
  • Alidoost, F., and Arefi, H. 2018. “A CNN-based approach for automatic building detection and recognition of roof types using a single aerial image.” PFG – Journal of Photogrammetry, Remote Sensing and Geoinformation Science, Vol. 86(No. 5-6): pp. 235–248. doi:10.1007/s41064-018-0060-5.
  • Assouline, D., Mohajeri, N., and Scartezzini, J.-L. 2017. Building rooftop classification using random forests for large-scale PV deployment. doi:10.1117/12.2277692.
  • Bengio, Y. 2012. Deep Learning of Representations for Unsupervised and Transfer Learning (Vol. 27). http://www.causality.inf.ethz.ch/unsupervised-learning.php.
  • Bittner, K., Körner, M., Fraundorfer, F., and Reinartz, P. 2019. “Multi-task cGAN for simultaneous spaceborne DSM refinement and roof-type classification.” Remote Sensing, Vol. 11(No. 11): pp. 1262. doi:10.3390/rs11111262.
  • Buyukdemircioglu, M., Can, R., and Kocaman, S. 2021. “Deep learning based roof type classification using very high resolution aerial imagery.” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XLIII-B3-2021(No. B3-2021): pp. 55–60. doi:10.5194/isprs-archives-XLIII-B3-2021-55-2021.
  • Chai, T., and Draxler, R.R. 2014. “Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature.” Geoscientific Model Development, Vol. 7(No. 3): pp. 1247–1250. doi:10.5194/gmd-7-1247-2014.
  • Cohen, J. 1960. “A coefficient of agreement for nominal scales.” Educational and Psychological Measurement, Vol. 20(No. 1): pp. 37–46. doi:10.1177/001316446002000104.
  • Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y., and Barnard, K. 2021. Attentional Feature Fusion. https://github.com/YimianDai/open-aff.
  • Dehbi, Y., Henn, A., Gröger, G., Stroh, V., and Plümer, L. 2021. “Robust and fast reconstruction of complex roofs with active sampling from 3D point clouds.” Transactions in GIS, Vol. 25(No. 1): pp. 112–133. doi:10.1111/tgis.12659.
  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.
  • Department of Economic and Social Affairs of the United Nations. 2018. “68% of the world population projected to live in urban areas by 2050.” Available from https://www.un.org/development/desa/en/news/population/2018-revision-of-world-urbanization-prospects.html
  • Derpanis, K.G. 2010. Overview of the RANSAC Algorithm. Image Rochester NY, Vol. 4(No. 1): pp. 2–3.
  • Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. 2013. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. http://arxiv.org/abs/1310.1531.
  • Dorninger, P., and Pfeifer, N. 2008. “A comprehensive automated 3D approach for building extraction, reconstruction, and regularization from airborne laser scanning point clouds.” Sensors (Basel, Switzerland), Vol. 8(No. 11): pp. 7323–7343. doi:10.3390/s8117323.
  • Doulamis, A., and Preka, D. 2016. 3D Building Modeling in LoD2 using the CityGML Standard. https://www.researchgate.net/publication/309384841.
  • Hartley, R., and Zisserman, A. 2003. Multiple View Geometry in Computer Vision. Cambridge university press.
  • Huang, H., Brenner, C., and Sester, M. 2011. “3D building roof reconstruction from point clouds via generative models.” In Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 16–24.
  • Huang, H., Brenner, C., and Sester, M. 2013. “A generative statistical approach to automatic 3D building roof reconstruction from laser scanning data.” ISPRS Journal of Photogrammetry and Remote Sensing, Vol. 79: pp. 29–43. doi:10.1016/j.isprsjprs.2013.02.004.
  • Huang, J., Stoter, J., Peters, R., and Nan, L. 2022. City3D: Large-scale building reconstruction from airborne LiDAR point clouds. Remote Sensing, Vol. 14(No. 9): pp. 2254.
  • Jiang, X., and Bunke, H. 1994. Fast segmentation of range images into planar regions by scan line grouping. Machine Vision and Applications, Vol. 7: pp. 115–122.
  • Kada, M. 2022. “3D reconstruction of simple buildings from point clouds using neural networks with continuous convolutions (CONVPOINT).” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XLVIII-4/W4-2022(No. 4/W4-2022): pp. 61–66. doi:10.5194/isprs-archives-XLVIII-4-W4-2022-61-2022.
  • Krafczek, M., and Jabari, S. 2022. “Generating LOD2 city models using a hybrid-driven approach: A case study for New Brunswick urban environment.” Geomatica, Vol. 75(No. 1): pp. 130–147. doi:10.1139/geomat-2021-0016.
  • Lafarge, F., Descombes, X., Zerubia, J., and Pierrot-Deseilligny, M. 2010. “Structural approach for building reconstruction from a single DSM.” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32(No. 1): pp. 135–147. doi:10.1109/TPAMI.2008.281.
  • LeCun, Y., Kavukcuoglu, K., and Farabet, C. 2010. Convolutional networks and applications in vision. Proceedings of 2010 IEEE International Symposium on Circuits and Systems, 253–256. doi:10.1109/ISCAS.2010.5537907.
  • Li, L., Song, N., Sun, F., Liu, X., Wang, R., Yao, J., and Cao, S. 2022. Point2Roof: End-to-end 3D building roof modeling from airborne LiDAR point clouds. ISPRS Journal of Photogrammetry and Remote Sensing, Vol. 193: pp. 17–28.
  • Park, Y., and Guldmann, J.M. 2019. “Creating 3D city models with building footprints and LIDAR point cloud classification: A machine learning approach.” Computers, Environment and Urban Systems, Vol. 75: pp. 76–89. doi:10.1016/j.compenvurbsys.2019.01.004.
  • Partovi, T., Krauß, T., Arefi, H., Omidalizarandi, M., and Reinartz, P. 2014. Model-driven 3D building reconstruction based on integration of DSM and spectral information of satellite images. In 2014 IEEE Geoscience and Remote Sensing Symposium, pp. 3168–3171. IEEE.
  • Partovi, T., Fraundorfer, F., Azimi, S., Marmanis, D., and Reinartz, P. 2017. “Roof type selection based on patch-based classification using deep learning for high resolution satellite imagery.” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XLII-1/W1(No. 1W1): pp. 653–657. doi:10.5194/isprs-archives-XLII-1-W1-653-2017.
  • Pepe, M., Costantino, D., Alfio, V.S., Vozza, G., and Cartellino, E. 2021. “A novel method based on deep learning, GIS and geomatics software for building a 3D city model from VHR satellite stereo imagery.” ISPRS International Journal of Geo-Information, Vol. 10(No. 10): pp. 697. doi:10.3390/ijgi10100697.
  • Peters, R., Dukai, B., Vitalis, S., van Liempt, J., and Stoter, J. 2022. Automated 3D reconstruction of LoD2 and LoD1 models for all 10 million buildings of the Netherlands. Photogrammetric Engineering & Remote Sensing, Vol. 88(No. 3): pp. 165–170.
  • Qian, Z., Chen, M., Zhong, T., Zhang, F., Zhu, R., Zhang, Z., … and Lü, G. 2022. Deep Roof Refiner: A detail-oriented deep learning network for refined delineation of roof structure lines using satellite imagery. International Journal of Applied Earth Observation and Geoinformation, Vol. 107: pp. 102680.
  • Sarwinda, D., Paradisa, R.H., Bustamam, A., and Anggia, P. 2021. “Deep learning in image classification using residual network (ResNet) variants for detection of colorectal cancer.” Procedia Computer Science, Vol. 179: pp. 423–431. doi:10.1016/j.procs.2021.01.025.
  • Shan, J., and Toth, C. K. (Eds.). 2018. Topographic laser ranging and scanning: principles and processing. CRC Press.
  • Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu, C. 2018. A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4–7, 2018, Proceedings, Part III 27, 270–279. Springer International Publishing.
  • Tripodi, S., Duan, L., Poujade, V., Trastour, F., Bauchet, J.P., Laurore, L., and Tarabalka, Y. 2020. “Operational pipeline for large-scale 3D reconstruction of buildings from satellite images.” In IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, 445–448. IEEE.
  • United Nations. 2018. Retrieved from United Nations, Department of Economic and Social Affairs: https://www.un.org/development/desa/en/news/population/2018-revision-of-world-urbanization-prospects.html.
  • Wang, Y., Li, S., Teng, F., Lin, Y., Wang, M., and Cai, H. 2022. “Improved mask R-CNN for rural building roof type recognition from UAV high-resolution images: A case study in hunan province, China.” Remote Sensing, Vol. 14(No. 2): pp. 265. doi:10.3390/rs14020265.
  • Zeineldin, R.A., and El-Fishawy, N.A. 2017. “A survey of RANSAC enhancements for plane detection in 3D point clouds.” Menoufia Journal of Electronic Engineering Research, Vol. 26(No. 2): pp. 519–537. doi:10.21608/mjeer.2017.63627.
  • Zhao, C., Zhang, C., Yan, Y., and Su, N. 2021. “A 3D reconstruction framework of buildings using single off-nadir satellite image.” Remote Sensing, Vol. 13(No. 21): pp. 4434. doi:10.3390/rs13214434.