Deep learning models for crack segmentation in 3d images of concrete trained on semi-synthetic data

Concrete is the number one building material in civil engineering. Various types of concrete enable broad applicability. Concrete typically consists of cement, aggregates, air pores, and reinforcements. For example, steel fibers are used to increase the structural strength of concrete and to control crack propagation. The assessment of the concrete’s micro-structure is important to gain deeper insight into failure mechanisms. Computed tomography (CT) images the micro-structure spatially and non-destructively. A prerequisite for quantitative analysis is to segment the components of interest in the reconstructed images. Previous experiments showed that the 3d Unet – a deep learning model originally designed for segmentation in biomedical image data – reliably segments cracks in concrete, too, and generalizes well to cracks in different kinds of concrete. Consistent and sufficient training data can be generated by impressing simulated crack structures on images of real concrete. In this work we further improve the 3d Unet by applying a modern backbone for enhanced feature learning. Additionally, we train another encoder-decoder based network, the 3d Feature Pyramid Network, and equip it with the same backbone. We apply both models to real data of steel fiber-reinforced concrete and to semi-synthetic images with simulated cracks on the steel-reinforced concrete background. The latter allows not only for a qualitative but also for a quantitative comparison.


Introduction
Concrete is a ubiquitous and favored building material, employed across a wide spectrum of structures ranging from buildings to bridges. Despite its widespread use and reputation for durability, concrete is susceptible to deterioration over time, and one common manifestation of this process is the development of cracks. The occurrence of cracks in concrete structures is a significant concern, as it can compromise both aesthetic and structural integrity. As a proactive measure, structure monitoring is an essential element of construction maintenance, allowing for the timely identification of potential issues. In this context, micro-computed tomography (µCT) enables us to explore the internal structure of concrete samples, providing insights into concrete behavior. This understanding serves as a foundation for developing tools and solutions aimed at enhancing the overall quality of concrete. CT scans provide detailed spatial images of internal concrete structures, enabling the detection of cracks and flaws without altering the micro-structure. Cracks are intricate spatial structures and should therefore be analyzed in 3d, too. Classical techniques for segmenting cracks in 3d CT images can be traced back to a collaboration between the Federal Institute for Materials Research and Testing and the Zuse Institute Berlin in 2011 [5,16,17]. Today, machine learning, in particular deep learning, is the dominant solution for image segmentation tasks. In a recent study [1], we analyzed and compared several methods from both classical image processing and machine learning. We trained and evaluated these methods using semi-synthetic image data along with corresponding ground truths. Hessian-based percolation [5] and the 3d Unet [21] were identified as the most effective methods. Both approaches were successfully applied to real CT data of concrete, too [2,7,8]. When propagating through concrete, cracks typically form continuous and jointed structures, with some elements of the concrete matrix interacting with the crack. This applies in particular to steel reinforcements, which usually do not break during crack propagation. Moreover, steel reinforcements appear as bright cylindrical objects in the reconstructed CT images, while cracks are dark elements. We focus on three types of concrete with steel reinforcements: crimped fibers, fibers with hooked ends (hooked-end fibers), and straight fibers, all scanned using the laboratory µCT device at the Fraunhofer ITWM, Kaiserslautern. We use deep learning to segment the cracks. A 3d Unet [21] has proven to be very effective for this task [1]. We therefore modify and further improve this method. For comparison, we incorporate another popular deep learning segmentation method, the Feature Pyramid Network (FPN) [13]. The methods are trained on semi-synthetic data, where cracks are modeled as minimal surfaces on the facet system of random Voronoi tessellations as described in [9]. The cracks are discretized, dilated to the desired local thickness, and embedded into real 3d CT gray value images. The method can generate synthetic crack structures on varying scales and with an arbitrary number of branches. Having ground truths for the semi-synthetic crack images, we can compare the selected models quantitatively. Furthermore, we can observe how specific types of steel reinforcements affect the prediction quality of segmentation methods. Finally, the models are tested and qualitatively evaluated on CT images of steel fiber-reinforced concrete (SFRC) with real crack structures, too.

Data
In this section, we describe the details of the laboratory tests used to induce cracks in a concrete sample, on which we then apply crack segmentation methods. Moreover, we introduce the imaging process of that sample and present the generation process of semi-synthetic data.

CT data imaging and characteristics
We reuse CT images of the straight steel fiber-reinforced ultra high performance concrete samples described in [14], acquired at Fraunhofer ITWM in Kaiserslautern, Germany. To mitigate gray value variations in the acquired images, the cuboidal specimen was encased in a cylindrical shell of ultra high performance concrete (UHPC). ITWM's CT device features a Feinfocus FXE 225.51 tube with a maximum acceleration voltage of 225 kV and a maximum power of 20 W. A Perkin Elmer XRD 1621 detector with 2,048 × 2,048 pixels is used. The tube voltage was set at 190 kV, the target current at 65 µA, and the power at 12 W. Tomographic reconstructions were generated from a total of 800 projections. The flexural characteristics of the concrete sample were analyzed through four-point bending tests on an unnotched specimen, as illustrated in Figure 1. During testing, the specimen was rotated by 90° about the z-axis, bringing the concrete surface to the front. The bending test was executed under displacement control, with a deliberately low load rate set at 0.1 mm/min to allow for tracking of crack formation. The deflection at the midspan was monitored using an extensometer, and the test concluded upon reaching a midspan deflection of 5 mm.

Figure 1: Arrangement for the four-point bending test from [14]. Specimen (labeled as 1), extensometers positioned on both sides to gauge deflection at the midspan (labeled as 2), a force gauge (labeled as 3), and a load cell (labeled as 4).
The UHPC sample is cuboidal and reinforced with straight steel fibers, a feature that manifests in high gray values in the CT image. The crack continuously spans multiple scales, with a large opening that gradually narrows towards the bottom. Additionally, the crack branches several times, adding complexity to its overall morphology. The image cropped to the actual sample has 760 × 810 × 1,580 voxels with an edge length of 49.4 µm. The image has a color depth of 64 bits. 2d slices of the 3d image are presented in Figure 2.

Semi-synthetic data
Machine learning models usually require a large amount of labeled training data. VoroCrack3d [10] generates such data by embedding simulated crack structures into CT scans of real concrete, see Figure 3. Synthetic cracks are generated according to [9]. We briefly summarize the procedure below. The crack model is based on 3d Voronoi diagrams [15] bounded by a cuboid. First, a contour on the boundary of the cuboid consisting of Voronoi edges is selected. Then, a minimum-weight surface consisting of a connected subset of the Voronoi facets and bounded by that contour is computed by solving a binary integer program. The crack generation is illustrated in Figure 4. Voronoi diagrams are generated by point processes. The regularity of the chosen point process model influences the cell shapes in the resulting tessellation. We choose Poisson, Matérn cluster, and hard-core point processes. The latter arises from force-biased sphere packings [3]. See Figure 5 for examples. In the next step, the surfaces are discretized to binary images, which are then adaptively dilated to obtain locally varying crack widths. To emulate the brittle crack boundary, a second Voronoi diagram is computed from another Poisson point process. Cells of this second, very fine tessellation touching the dilated crack are integrated into the crack structure. The cracks are then embedded into the CT images, see Figure 6. The voxel gray value distribution of the crack is assumed to be the same as that in the air pores. Multiple cracks can be combined to model crack branching. To obtain the ground truth, we note that steel reinforcements are typically not prone to cracking during tensile tests. That is, voxels where steel reinforcements intersect the crack are assigned the value 0 (background) in the label images. Similarly, air pores that intersect a crack are not considered to belong to the crack and are also assigned the value 0.
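The gray value embedding step described above can be sketched as follows: crack voxels are assigned gray values drawn from the empirical distribution of the segmented air-pore voxels. This is a minimal sketch of our reading of the procedure, not the VoroCrack3d implementation; the helper `embed_crack` and its interface are hypothetical.

```python
import numpy as np

def embed_crack(image, crack_mask, pore_mask, seed=None):
    """Assign crack voxels gray values drawn from the empirical
    distribution of the air-pore voxels (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    out = image.copy()
    pore_values = image[pore_mask > 0]   # empirical pore gray value sample
    out[crack_mask > 0] = rng.choice(pore_values, size=int(crack_mask.sum()))
    return out
```

In practice the pore mask would come from the air-pore segmentation discussed below, so the crack inherits the dark gray value statistics of the pores.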
Steel reinforcements appear as tubular, bright structures and air pores as round, dark structures in the CT images. They can be segmented by local shape filters such as Frangi's [6]. Steel fiber and air pore segmentation are demonstrated in Figure 6. The ground truth is then obtained from the synthesized binary crack image by setting all those voxel gray values to 0 where fibers or air pores intersect the crack.
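The ground truth construction just described reduces to a simple masking operation on binary volumes. A minimal sketch, where the helper `make_ground_truth` is our naming, not the authors' code:

```python
import numpy as np

def make_ground_truth(crack, fibers, pores):
    """Binary label image: 1 on crack voxels, except where segmented
    steel fibers or air pores intersect the crack (those voxels are
    background, value 0), as described in the text."""
    gt = (crack > 0).astype(np.uint8)
    gt[(fibers > 0) | (pores > 0)] = 0
    return gt
```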
Copyright 2024 by the Authors. Licensed under a Creative Commons Attribution 4.0 International License.

Methods
This section focuses on describing the semantic segmentation deep learning methods we use as well as how to choose the hyperparameters in practice.

EfficientNet backbone
In semantic segmentation networks, the "backbone" refers to the initial part of the network responsible for feature extraction from the input image. It typically consists of a series of convolutional and pooling layers that capture hierarchical features at different scales. The backbone's primary role is to analyze the input image and extract relevant information before passing it to the subsequent layers responsible for segmentation. We use EfficientNet, first introduced in [20], as backbone. As a backbone, it forms the foundational structure of a larger neural network. EfficientNet replaces the convolutional blocks made from two convolutional layers in the original versions of FPN and Unet by a module called Mobile Inverted Bottleneck Convolution (MBConv) [19]. The MBConv consists of depthwise separable convolutions, which are further described in [4]. The use of such convolutions significantly reduces the number of learnable network parameters. A sketch of the EfficientNet variant we use, 'efficientnet-b0', is shown in Figure 8.

Figure 6: 2d slices from 3d images. From left to right: CT image of straight steel fiber-reinforced concrete, fibers, air pores, synthetic crack structure (ground truth), crack embedding. Note that the threshold for the fiber system is chosen such that not only the fibers are guaranteed to be covered but also voxels at their edges that are brighter than the concrete matrix due to the partial volume effect. Voxels where steel fibers or pores intersect the crack are set to 0 in the ground truth.
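The parameter saving of depthwise separable convolutions can be illustrated by counting the weights of a single 3d layer (biases and normalization omitted; the channel numbers are illustrative, not taken from 'efficientnet-b0'):

```python
def conv3d_params(c_in, c_out, k):
    # standard 3d convolution: one k*k*k kernel per (input, output) channel pair
    return c_in * c_out * k ** 3

def depthwise_separable3d_params(c_in, c_out, k):
    # depthwise: one k*k*k kernel per input channel,
    # followed by a 1x1x1 pointwise convolution mixing channels
    return c_in * k ** 3 + c_in * c_out

standard = conv3d_params(32, 64, 3)                  # 55,296 weights
separable = depthwise_separable3d_params(32, 64, 3)  # 2,912 weights
print(standard, separable)
```

For this illustrative layer the separable variant needs roughly one nineteenth of the weights, which is where the small parameter counts reported below come from.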

3d Unet
The 3d Unet is the convolutional neural network architecture we have used for our semantic segmentation task so far. The original Unet was described in [18], while its 3d version was introduced in [21]. The Unet architecture is characterized by its U-shaped design, consisting of an encoder path that performs downsampling through convolutional and pooling layers and a symmetric decoder path that performs upsampling to recover spatial resolution. Skip connections, also known as shortcut connections, are extensively used to concatenate feature maps from the encoder and decoder paths, allowing for the integration of both local and global information. For the 3d Unet designed for crack segmentation, which we use later in this work, 'efficientnet-b0' [20] was used as backbone. The classification head of the original 'efficientnet-b0' network was replaced with a 3d convolutional layer. The decoder part of the 3d Unet consists of four decoder blocks; each block includes two 3d convolutional layers, each of which is followed by batch normalization and ReLU activation. A dropout layer (probability 0.5) is applied to the final output of the encoder. Our 3d Unet with EfficientNet backbone (3d Unet EfficientNet) has 2,833,689 learnable parameters.

3d Feature Pyramid Network
Originally, the Feature Pyramid Network (FPN) was used for object detection [13]. It can, however, be modified for semantic segmentation and has already been used for this purpose in medical applications [12]. Its architecture, designed to detect multi-scale objects, combines features from several levels of a backbone network through lateral connections and top-down pathways. This creates a feature pyramid for handling objects at different scales and effectively capturing both fine details and global context. FPN allows both bottom-up and top-down information flow. Lateral connections merge feature maps from higher-resolution levels with those from lower-resolution levels through addition, resulting in a richer representation of objects at various scales. The main and innovative idea is to predict on each level of the network, so that the total epoch training error eventually takes into account the errors on each level. As in the case of the 3d Unet described before, we use 'efficientnet-b0' as the backbone for the 3d FPN. Our 3d FPN with EfficientNet backbone (3d FPN EfficientNet) has 1,107,521 learnable parameters. A dropout layer with probability 0.5 is applied after each lateral connection.
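The two fusion strategies, Unet-style concatenation and FPN-style lateral addition, can be contrasted on dummy feature maps. This is a sketch with illustrative shapes, not an excerpt from either network:

```python
import numpy as np

# Dummy feature maps at one resolution level: (channels, D, H, W)
encoder = np.ones((8, 16, 16, 16))
decoder = np.ones((8, 16, 16, 16))

# Unet-style skip connection: concatenation along the channel axis;
# the channel count doubles and a following convolution fuses the maps.
unet_skip = np.concatenate([encoder, decoder], axis=0)

# FPN-style lateral connection: element-wise addition (the encoder map
# is assumed to have been projected/interpolated to a matching shape).
fpn_lateral = encoder + decoder

print(unet_skip.shape, fpn_lateral.shape)
```

Concatenation keeps both feature sets intact at the cost of wider layers, whereas addition keeps the channel count fixed but blends the features, which is consistent with the behavior of the two models discussed in the conclusion.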

Training
We trained both networks on 90 semi-synthetic 3d images, each of size 256³ voxels. To enhance the models' robustness and generalization, we augmented 60 of these images, incorporating operations such as flipping, rotation, and adding noise in addition to that described in Subsection 2.2. The volumetric images are partitioned into cubic patches of 64³ voxels for computational feasibility. A 14-voxel overlap minimizes potential edge effects in the segmentation results. Throughout the training, a batch size of 2 was employed, optimizing the models' capacity to process and learn from smaller subsets of the dataset at each iteration. The Adam optimizer was used [11]. The learning rate was initially set at 0.001, with a scheduled decay of 0.5 after every 5 epochs, contributing to the model's adaptability over the training duration. In total, the training regime spanned 20 epochs, ensuring a comprehensive exposure of the neural network to the augmented dataset, allowing it to learn the underlying patterns within the volumetric images.
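Assuming the tiling clamps the last patch to the image border, the start positions of the 64³ patches with a 14-voxel overlap (stride 50) along one 256-voxel axis can be computed as follows. The helper `patch_starts` is our sketch of such a tiling, not the authors' code:

```python
def patch_starts(size, patch, overlap):
    """Start indices of overlapping patches covering [0, size);
    the last patch is clamped so it ends exactly at the border."""
    stride = patch - overlap
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] + patch < size:   # cover the remainder at the border
        starts.append(size - patch)
    return starts

starts = patch_starts(256, 64, 14)
print(starts, len(starts) ** 3)   # per-axis positions and total patch count
```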

Results
In this section we describe the metrics we apply for comparing and assessing our methods based on both real and semi-synthetic data.

Metrics
We evaluate the models by Intersection over Union (IoU), recall, and precision. IoU, also known as the Jaccard index, quantifies the overlap between the predicted segmentation mask and the ground truth mask by comparing the area of intersection to the area of their union. It effectively gauges how well the model identifies the correct regions in the segmentation task by considering both true positive and false positive cases. High IoU scores indicate visually accurate segmentation, making it a valuable metric for model comparison. Recall is calculated by dividing the number of true positives by the sum of true positives and false negatives. It measures the ability of a model to correctly identify all relevant instances of the positive class and thus emphasizes minimizing false negatives: the fewer false negatives, the higher the recall. Precision is calculated by dividing the number of true positives by the sum of true positives and false positives. It assesses the accuracy of the model by measuring the proportion of correctly identified positive instances among all predicted positive instances: the fewer false positives, the higher the precision. Precision and recall are related but complement each other. In the case of crack segmentation, higher recall implies that the model is more effective in capturing the majority of crack voxels and that fewer true crack voxels remain undetected. Higher precision indicates that fewer voxels that actually belong to the background have been misclassified as crack.
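All three metrics follow directly from the voxel-wise confusion counts. A minimal sketch for binary masks:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Voxel-wise IoU (Jaccard index), recall, and precision
    for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)     # crack voxels correctly predicted
    fp = np.sum(pred & ~gt)    # background voxels predicted as crack
    fn = np.sum(~pred & gt)    # crack voxels missed
    iou = tp / (tp + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return iou, recall, precision
```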

Segmentation quality for semi-synthetic data
To assess the model performance, we use 27 images the networks have not seen during training. The 3d images of 256³ voxels equally represent the crack widths and steel reinforcement types. Original images as well as augmented ones (rotated, flipped, with noise) are included. In comparing the models, our interest extends beyond overall performance to efficiency in handling various crack sizes (small, medium, large) and different concrete types, reinforced with different steel reinforcements: straight steel fibers, crimped steel fibers, and hooked-end steel fibers. The plot in Figure 9 and Table 1 detail the performance. The three metrics evaluated on the whole test dataset show that both the 3d FPN and the 3d Unet with the EfficientNet backbone perform very well, with mean IoU differing by only 0.04. The difference in mean recall is even smaller, but this time 3d Unet EfficientNet outperforms the FPN-based architecture. On the other hand, according to mean precision, 3d FPN EfficientNet is significantly better, with a difference of over 0.1. Remember that IoU measures the overlap of predicted and ground truth regions. Thus a higher mean IoU indicates better alignment of predicted and actual segmentation. Consequently, the higher mean IoU for the 3d FPN EfficientNet model suggests that overall it segments voxel-wise more accurately. The slightly lower mean recall for the FPN-based architecture suggests that it might miss some of the true positive instances compared to the 3d Unet EfficientNet, so it might not catch all the actual crack voxels. Finally, the much higher precision indicates that, when the model predicts positive instances, it is more likely to be correct. We also evaluated the performance on subsets of the test dataset containing only single types of concrete or crack widths. Here, only mean IoU was estimated. According to Figure 9, 3d FPN EfficientNet scores better for every subset, but the models show some common tendencies: the performance is the worst for images with cracks simulated in hooked-end steel fiber-reinforced concrete, when it comes to the reinforcement type, and for the large cracks, when it comes to the crack size. For the crimped steel fiber-reinforced concrete images, the mean IoU is the highest for both models and equals 0.91 for 3d FPN EfficientNet and 0.87 for 3d Unet EfficientNet. This is a very satisfactory result. However, on the subset of hooked-end steel fiber-reinforced concrete, the models perform significantly worse (mean IoU < 0.6) than in the other cases. The abilities of both models are similar in the segmentation of medium and small cracks, with slightly better results for the medium cracks.

Qualitative results assessment on real data
To identify cracks in the straight steel fiber-reinforced concrete sample described in Subsection 2.1, we applied the models to images downscaled to {0.25, 0.5, 0.75, 1} times the original size. Each of the models was then employed to predict cracks at each scale. Subsequently, we resized the images back to the original dimensions using spline interpolation and calculated the voxel-wise maximum of the four segmentation maps to derive the final segmentation after global thresholding with 0.5. The crack was segmented very well by both models, but there were many false positives in the areas without crack. Thus, we extract the largest connected component to remove them. As the final step, we trim the edges of the segmentation maps to reduce edge effects. The final size of the segmentation map from each model is 730 × 600 × 1,481 voxels. A ground truth for validating segmentation results is unavailable due to the impracticality of manually labeling crack voxels in large 3d images. Consequently, discussions about results are confined to qualitative assessments following visual inspections of 2d slice views, presented in Figure 10, and 3d renderings, presented in Figure 11. For better visualization and results assessment, the original image was cropped to the size of the segmentation maps. In a general assessment, the 3d Unet model demonstrated a tendency to over-segment, that is, to classify too many voxels as cracks, which explains its higher recall metrics. This over-segmentation phenomenon is evident in the 3d rendering, where numerous fiber-like structures are discernible. On the other hand, the FPN model exhibited a considerably smoother prediction, yet was more prone to under-segmentation. However, this apparent under-segmentation did not compromise accuracy; instead, the FPN model demonstrated an ability to discern fibers from the actual crack and classify them correctly. The net effect is a visually perceptible thinning of the fracture. Notably, both methods successfully connected the crack along its entire length, emphasizing the robustness of the segmentation models in capturing the continuous nature of the crack.
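The post-processing chain, voxel-wise maximum over the per-scale probability maps, global thresholding at 0.5, and extraction of the largest connected component, can be sketched with scipy.ndimage. This is a sketch of the described steps, not the authors' implementation; the maps are assumed to be already resized back to the original grid:

```python
import numpy as np
from scipy import ndimage

def fuse_and_clean(prob_maps, threshold=0.5):
    """Fuse per-scale crack probability volumes (already resized to the
    original grid) by voxel-wise maximum, apply a global threshold, and
    keep only the largest connected component."""
    fused = np.maximum.reduce(prob_maps)
    binary = fused > threshold
    labels, n = ndimage.label(binary)   # default 6-connectivity in 3d
    if n == 0:
        return binary
    sizes = ndimage.sum(binary, labels, index=range(1, n + 1))
    return labels == (int(np.argmax(sizes)) + 1)
```

Keeping only the largest component exploits the observation that the real crack forms one connected structure, while false positives appear as small isolated clusters.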

Conclusion
This study compares two deep learning methods for semantic segmentation of crack structures in 3d images: 3d FPN and 3d Unet with EfficientNet backbones, applied to both semi-synthetic and real data of steel-reinforced concrete. On our semi-synthetic data, 3d FPN EfficientNet consistently outperforms 3d Unet EfficientNet across all metrics except for mean recall. Notably, application to distinct subsets of the test set, specifically focusing on different steel reinforcement types or varying crack sizes, revealed differences in mean IoU. The results of these experiments suggest that the type of steel reinforcement has a significant effect on the segmentation results, as the subset of images of crimped steel fiber-reinforced concrete scored the best of all subsets, and the images of hooked-end steel fiber-reinforced concrete scored the worst. In terms of crack sizes, large cracks were segmented worst. When applying the models to cracks larger than those provided in the test set, the multi-scale approach from Subsection 4.3 is indispensable.
In this study we compared two encoder-decoder convolutional neural networks. In the Unet architecture, the information flow from encoder to decoder is enabled by skip connections, where feature maps of the same size are concatenated. This allows the Unet to capture fine-grained details in the segmentation mask. The FPN, on the other hand, uses lateral connections, where the feature maps from encoder levels are interpolated to restore the appropriate size and added to the decoder at the corresponding level. The Unet's skip connections help preserve spatial information and gradients throughout the network, which can lead to a better preservation of fine details in the segmentation masks. The FPN's lateral connections combine features from multiple scales, which can result in a coarser representation of details compared to the Unet and an emphasis on the main and most important features. These properties of the convolutional networks are particularly prominent in the application to the real concrete sample in Section 4.3. The Unet is better at learning fine details, but in the case of steel fiber-reinforced concrete this increases the number of voxels incorrectly classified as crack. The FPN-based architecture captures the big picture of the crack well, but might be insufficient when the crack has many small, barely visible branches.

Figure 2 :
Figure 2: 2d slices of 3d CT images of concrete reinforced with straight steel fibers. On the left: view of the SFRC sample from the top. On the right: view of the SFRC sample from the side.

Figure 3 :
Figure 3: CT images of ultra-high performance concrete (UHPC) without cracks, from left to right: concrete reinforced with straight steel fibers (voxel edge length 49.4 µm), concrete reinforced with hooked-end steel fibers (voxel edge length 88.5 µm), and concrete reinforced with crimped steel fibers (voxel edge length 106 µm).

Figure 4 :
Figure 4: Crack generation. From left to right: bounded Voronoi diagram, contour on the cuboid boundary, minimum-weight surface.

Figure 5 :
Figure 5: Minimum-weight surfaces in Voronoi diagrams bounded by a 400 × 150 × 400 voxel cuboid. The generators are realizations of a Poisson (left), a Matérn cluster (middle), and a hard-core process (right).

Figure 7 :
Figure 7: Straight steel fiber-reinforced concrete with synthetic crack structures. From left to right: thin crack (1 voxel) with several branches, single thin crack (1 voxel), single intermediate crack (5 voxels), multi-scale crack with branching, and multi-scale crack with branching and noise added to the image. Crack widths may deviate locally by a small amount due to the very fine second tessellation integrated into the crack for more realistic surface roughness.

Figure 8 :
Figure 8: Sketch of the architecture of the EfficientNet variant called 'efficientnet-b0'.

Figure 9 :
Figure 9: Performance of 3d FPN EfficientNet and 3d Unet EfficientNet on semi-synthetic data. IoU all, Recall all, and Precision all are the respective means obtained from evaluating the whole test dataset. The remaining scores present the mean IoU for a specific subset of the test dataset (e.g., straight steel fiber IoU was assessed on the subset containing solely images with cracks simulated on the straight steel fiber-reinforced concrete background). Each subset consists of 9 images.

Table 1 :
Performance of both models on the whole test dataset and different subsets of the test dataset.