The long journey to the training of a deep neural network for segmenting pores and fibers

Even though it is a crucial step for achieving suitable results, the preprocessing of data before it is used as input to deep neural networks is often only described as a side note. This work elaborates on the required steps of this preprocessing procedure. Specifically, we provide insights into the selection of appropriate segmentation algorithms to generate reference volumes from X-ray computed tomography (XCT) scans as training data. Furthermore, this work evaluates the criteria for selecting an appropriate deep learning network architecture and provides a quantitative comparison between networks based on U-Net and V-Net.


Introduction
Machine learning is increasingly used for the analysis of XCT scans [1]. We are already past the initial hype set off by a number of recent publications, which have shown impressive architectures for medical 3D data and also for material science [2][3][4]. Most new work in this area no longer develops completely new network architectures, but adapts existing architectures and optimizes their training to achieve reasonable results for a specific dataset. Well-established frameworks like Keras 1, MONAI 2, and TensorBoard support such workflows. MONAI (Medical Open Network for Artificial Intelligence) is quite popular in the field of medical image analysis; with it, one can easily create a training pipeline that covers all necessary steps. Especially when going in the direction of certification, the focus moves away from the network architecture itself and more towards the training pipeline [5]. For certified deep learning systems, the most important aspect is the ease of reproducing the results. In this paper we show how we designed our training pipeline to compare two different network architectures (U-Net and V-Net). We used the MONAI framework for this purpose, which allows designing a whole training pipeline within a day.

Related work
A machine learning pipeline typically consists of several steps. The first step is the generation of the ground truth/reference. Subsequently, several steps of preprocessing typically follow in the form of data augmentation. The last step is the actual training of the networks, where the analysis of the loss metric and the evaluation metric play an important role [6].

Generation of reference data for training
Having or generating a reasonable reference is the most critical step for the quality of a deep neural network segmentation pipeline. A network is always bound to the quality of the input data used in training. In the medical sciences, ground truth segmented by humans is often available in the form of standard datasets, e.g., from the BraTS challenge [7] or from the Medical Segmentation Decathlon [8]. For material science, no such datasets of XCT scans are publicly available, especially not for very specific specimens such as high-resolution scans of carbon fiber reinforced polymers (CFRP). Therefore, it is necessary to produce a properly labelled reference. The best results are typically expected from manual segmentation by material experts. However, this process is highly time consuming, so it is typically impractical or not even feasible to produce a sufficient amount of training data that way. Another option is to use the results of some other segmentation algorithm as input to the training. To analyze CFRP parts after an XCT scan, typically a global threshold segmentation is performed. Commonly used methods are ISO50 and Otsu thresholding; besides these, there are also manufacturer-specific methods [9]. The Otsu method automatically selects a global threshold: it divides the gray value histogram into two classes and chooses the threshold that maximizes the variance between the classes [10]. K-means segmentation is based on unsupervised K-means clustering, which divides N samples with M dimensions into K clusters; in the case of images, the color channels are the dimensions and the pixels/voxels are the samples [11][12]. As an additional method not based on global thresholding, the watershed transformation/segmentation can be used. It is based on the topology of an image: gradients are interpreted as elevation, the image is flooded from its local minima, and watershed lines form where adjacent basins meet.
1 Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow; keras.io
2 MONAI is a freely available, community-supported, PyTorch-based framework for deep learning in healthcare imaging; monai.io
11th Conference on Industrial Computed Tomography, Wels, Austria (iCT 2022), www.ict-conference.com/2022

Data augmentation
It is important to accumulate enough data for the training of the network, especially in areas with limited availability of data. Data augmentation is a way to generate additional training data from a limited collection of references. Different operations can be performed on a pair of input data and corresponding reference data to produce slightly different data for training. Commonly used operations are [6]:
• Flip: The input pair is mirrored horizontally or vertically.
• Gaussian noise: Gaussian noise is added to the input image only.
• Rescaling: The size of the image pair is changed. This can be applied in one or more directions (x/y/z).
• Jittering: A small random value (± 1-5) is added to the intensity of each voxel in the input image only.
• Gaussian blur: A Gaussian filter is used to blur the input image only.
• Affine transformation: The input pair is rotated or shifted in one of the directions (x/y/z).
These and other transformations may be performed randomly on the initially available labeled reference dataset, resulting in additional training data.
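The operations above can be sketched in plain numpy; this is an illustrative stand-in for the MONAI random transforms used later in the pipeline, and the probabilities and value ranges are chosen for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(image, label):
    """Randomly augment one image/label pair. Geometric operations (flip) are
    applied to both; intensity operations (noise, jitter) to the image only."""
    # Flip: mirror image and label along a randomly chosen axis
    if rng.random() < 0.5:
        axis = int(rng.integers(image.ndim))
        image = np.flip(image, axis=axis)
        label = np.flip(label, axis=axis)
    # Gaussian noise: added to the input image only
    if rng.random() < 0.5:
        image = image + rng.normal(0.0, 0.05, image.shape)
    # Jittering: small random value added per voxel, input image only
    if rng.random() < 0.5:
        image = image + rng.uniform(-5.0, 5.0, image.shape)
    return image, label

img = rng.random((32, 32, 32))
lbl = (img > 0.5).astype(np.uint8)
aug_img, aug_lbl = augment(img, lbl)
```

Because the operations are drawn randomly, calling `augment` repeatedly on the same pair yields a stream of slightly different training samples.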

Network
The U-Net architecture was introduced by Ronneberger et al. in 2015 for image segmentation in 2D [14]. The name is derived from the shape of the network. Due to its strong performance, it was later adapted to 3D [2]. The V-Net architecture is based on U-Net; the main difference is that V-Net uses residual connections within its blocks and replaces the max pooling layers with convolutional down-sampling [3]. A residual connection means that the input data is forwarded to the output in parallel to the convolutional filters. In V-Net, the filtered and residual (bypass) paths are summed, and down-sampling is then performed by an additional convolutional filter.
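The idea of a residual (bypass) connection can be illustrated with a toy 1-D example; the single convolution here is only a stand-in for the stacked 3-D convolutional filters in a real V-Net block:

```python
import numpy as np

def conv1d(x, w):
    """Toy 1-D convolution standing in for V-Net's 3-D convolutional filters."""
    return np.convolve(x, w, mode="same")

def residual_block(x, w):
    """Residual connection as used in V-Net blocks: the input bypasses the
    convolution and is summed with the filtered signal."""
    return conv1d(x, w) + x  # filtered path + identity (bypass) path

x = np.arange(8, dtype=float)
w = np.array([0.25, 0.5, 0.25])
y = residual_block(x, w)
```

The bypass path lets the block learn only the residual on top of the identity, which eases the training of deep networks.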

Normalization
Normalization in the network plays an important role in improving the learning rate. Normalization here is performed between the layers of the network; the goal is to scale and center the outputs of the network layers in the same way. There are different normalization strategies [15]:
• Instance: Each convolutional filter (also called channel) is normalized separately.
• Batch: This strategy extends instance normalization to the whole batch (the number of samples trained in one step); one channel of the network is normalized over the whole input batch.
• Layer: The layer normalization strategy is executed over the whole layer.
• Group: The group normalization strategy is a combination of the instance and layer strategies. It uses multiple channels of a layer, but not the complete layer.
Wu and He showed that group normalization has a huge benefit when training on multi-GPU systems [15], especially when small batch sizes are used; with larger batch sizes, batch normalization performs better.
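The four strategies differ only in the axes over which mean and variance are computed. A numpy sketch for a batch of 3D volumes in (batch, channel, z, y, x) layout; the layout, sizes, and group count are illustrative:

```python
import numpy as np

def normalize(x, axes):
    """Zero-mean, unit-variance normalization over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    std = x.std(axis=axes, keepdims=True)
    return (x - mean) / (std + 1e-5)

# A batch of N=4 samples with C=8 channels and 6x6x6 spatial extent
x = np.random.default_rng(1).normal(2.0, 3.0, (4, 8, 6, 6, 6))

inst  = normalize(x, axes=(2, 3, 4))     # instance: one channel of one sample
batch = normalize(x, axes=(0, 2, 3, 4))  # batch: one channel over the whole batch
layer = normalize(x, axes=(1, 2, 3, 4))  # layer: all channels of one sample
# group: split the 8 channels into 2 groups of 4 and normalize per group
g = x.reshape(4, 2, 4, 6, 6, 6)
group = normalize(g, axes=(2, 3, 4, 5)).reshape(x.shape)
```

Note that batch normalization is the only strategy whose statistics depend on the batch size, which is why it degrades for very small batches.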

Optimizer
Generally, training consists of two steps: In the first step (the forward pass), a prediction is computed and compared to the reference using the loss function. In the second step, the error/loss is back-propagated to update the weights of the network. A network consists of millions of weights, so updates have to be performed in small steps. For this, methods like stochastic gradient descent (SGD) [16] are used. The method most commonly used today is Adaptive Moment Estimation (Adam) [17], which uses an adaptive learning rate instead of a fixed one as in SGD.
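A minimal numpy sketch of one Adam update, following the standard formulation of [17]; the toy objective ||w||² with gradient 2w is chosen only for illustration:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient (m) and of its
    square (v) give every weight its own adaptive step size."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)  # bias correction for the first steps
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = ||w||^2, whose gradient is 2w
w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 501):
    w, m, v = adam_step(w, 2 * w, m, v, t)
```

Because m_hat / sqrt(v_hat) is roughly of unit magnitude for a gradient of consistent sign, the effective step size stays close to the learning rate regardless of the gradient's scale.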

Dice Metric
The Sørensen-Dice coefficient is an overlap measure [18]: it measures the overlap of the predicted label P with the label R in the reference, Dice(P, R) = 2 |P ∩ R| / (|P| + |R|) (Equation 1). A result of one means that prediction and reference are identical; a result of zero means there is no overlap between prediction and reference.
Optimizers for the training of the network try to find the smallest possible value of the loss. Therefore, the Dice coefficient needs to be inverted to be usable as a loss. This is achieved by subtracting the Dice coefficient from one, L_Dice = 1 − Dice (Equation 2).
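Equations 1 and 2 can be sketched in numpy for binary masks; a small epsilon is added to avoid division by zero for empty masks:

```python
import numpy as np

def dice_coefficient(pred, ref, eps=1e-8):
    """Sørensen-Dice: twice the overlap divided by the total number of
    foreground voxels in prediction and reference (Equation 1)."""
    intersection = np.sum(pred * ref)
    return 2.0 * intersection / (np.sum(pred) + np.sum(ref) + eps)

def dice_loss(pred, ref):
    """Dice loss: 1 - Dice, so perfect overlap gives a loss of 0 (Equation 2)."""
    return 1.0 - dice_coefficient(pred, ref)

a = np.array([1, 1, 0, 0])
b = np.array([1, 0, 0, 0])
# overlap 1, foreground sizes 2 and 1 -> Dice = 2*1/(2+1) = 2/3
```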

Analysis pipeline
This section describes all steps that were necessary to train on our XCT data, starting with the acquisition of the data and all necessary preprocessing steps. We also describe our machine learning pipeline and the performed image augmentation, along with some of the challenges we encountered with the network.

Data Acquisition
We analyzed fiber reinforced composite datasets containing multiple pores, which were scanned using a GE Nanotom 180 NF XCT device. The goal was to extract the pores and fibers. The chosen resolution was a voxel size of (3.3 µm)³.

Segmentation
Because of the large amount of training data required for a 3D CNN, it is not feasible to manually create a perfect reference by labelling every single voxel by hand. Therefore, the datasets were segmented with four algorithms (K-means, Otsu thresholding, watershed, and manual threshold selection) to produce labels (binary masks) as reference for our experiments more efficiently.
For K-means, we identified the positions of the air and material peaks in the histogram of the volume; these two values were used as the starting centroids of the two classes. For watershed, we first extracted the gradient magnitude image of the volume and then performed the watershed transformation with a level of 0 and a threshold of 0.13; the watershed parameters were determined empirically. Figure 2 shows a comparison of the different segmentation methods. As seen in these images, the Otsu and K-means algorithms perform best, which is why we decided to use only K-means and Otsu for the training. Watershed classifies slightly more voxels as pores than the other algorithms; the biggest differences can be found in the porous area of the specimen. Nevertheless, even for material specialists it is not easy to find the correct threshold manually. The mentioned algorithms were used because they produce results within reasonable time and memory limits: for a volume of 1220x854x976 voxels, a segmentation with K-means took 2.5 hours, while Otsu thresholding took less than 5 minutes, performed in the open source software open_iA [19]. The results were later evaluated by material specialists to safeguard the quality of the segmentation.

Preprocessing
To improve performance and reproducibility, the XCT scans were split into smaller chunks of 140x140x140 voxels. This allows MONAI to process multiple volumes in parallel, reduces the memory footprint, and improves the total throughput, which is particularly important for fully utilizing the capabilities of state-of-the-art GPUs. We used the MONAI framework for the training (see Figure 1). The first step in the pipeline normalized the XCT scan data to the range 0 to 1. After this step, the volume was split into smaller sub-volumes of size 128x128x128; the same was done for the reference data. Additionally, image augmentation was performed: the intensity of the image was randomly changed, and affine transformations were applied (shifts in x, y and z direction as well as rotation and zoom). In PyTorch/MONAI such operations are called transforms, and they are executed in each epoch, so the network sees slightly different images in every epoch because of this data augmentation.
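The normalization to the range 0 to 1 and the splitting into sub-volumes can be sketched in plain numpy; the volume size is illustrative, and MONAI provides equivalent transforms for both steps:

```python
import numpy as np

def split_into_chunks(volume, size):
    """Split a 3-D volume into non-overlapping cubic sub-volumes with the
    given edge length; border voxels that do not fill a chunk are dropped."""
    nx, ny, nz = (s // size for s in volume.shape)
    chunks = [volume[i * size:(i + 1) * size,
                     j * size:(j + 1) * size,
                     k * size:(k + 1) * size]
              for i in range(nx) for j in range(ny) for k in range(nz)]
    return np.stack(chunks)

vol = np.random.default_rng(3).normal(100.0, 25.0, (256, 256, 128))
# Min-max normalization of the intensities to the range 0 to 1
vol_norm = (vol - vol.min()) / (vol.max() - vol.min() + 1e-8)
sub = split_into_chunks(vol_norm, 128)  # four 128x128x128 sub-volumes
```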

Training
For the training we used a combination of Otsu thresholding and K-means segmentation: in total, 4224 samples generated with Otsu thresholding and 4224 samples generated with K-means segmentation were used as training data. They were shuffled and then split into 80% for training and 20% for validation. For the training, U-Net and V-Net architectures were used. For U-Net, we changed the standard normalization from instance to batch normalization, because the training was more stable with this change in our empirical trials. We used the hyper-parameters shown in Table 1.

Evaluation
In the evaluation step, we applied the Dice coefficient to the validation dataset using a sliding window. After the prediction, a layer with a sigmoid activation function was used to produce values between 0 and 1. In the final step before computing the Dice coefficient, a threshold of 0.5 was applied to generate a binary image.
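This post-processing of the raw network output can be sketched as follows; the logit values are purely illustrative:

```python
import numpy as np

def sigmoid(x):
    """Squash raw network outputs (logits) into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([-3.0, -0.2, 0.2, 3.0])  # raw network outputs
probs = sigmoid(logits)                    # per-voxel foreground probability
binary = (probs > 0.5).astype(np.uint8)    # thresholded binary prediction
```

The binary mask is then compared voxel-wise against the reference with the Dice coefficient.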
In the results after 50 epochs, shown in Figure 3, it becomes visible that U-Net and V-Net perform similarly. As shown in Table 2, the U-Net architecture achieved a slightly higher evaluation Dice value than V-Net, and its best value was already reached in epoch 6 of 50. With the same data, V-Net needed six times as many epochs (36) for its best evaluation result. For both networks, however, the final result after 50 epochs was worse than the best result. This means that the networks run into overfitting, i.e., they start fitting the training data too closely and lose generalization.

Performance
The training pipeline was tested on a workstation with an 18-core Intel Xeon W and an Nvidia RTX A6000 with 48 GB of video RAM. To reduce loading times, the datasets were stored on an NVMe SSD. During the training, we found that the image augmentation caused high CPU utilization; to increase the throughput of the training, the image augmentation was therefore executed with parallel workers. Additionally, 5% of the datasets were cached in RAM to reduce the latency of the NVMe SSD. Because of the computation-intensive data augmentation, we also had to use smaller batch sizes than the GPU (in terms of video RAM utilization) would have been capable of; otherwise, the preprocessing/augmentation of huge batches would have led to long dead times between epochs with the GPU idle. The training of one epoch took 440 seconds for U-Net and 739 seconds for V-Net; these times include loading of the data, preprocessing/augmentation, and the training itself.

Conclusion
As shown in this paper, the choice of a fitting network type is just one part of setting up a segmentation based on deep neural networks. Setting up a proper pipeline for obtaining reasonable and usable training data for the network is at least equally important. We based our pipeline on the MONAI framework, which makes it easy to adjust the individual parts of the training pipeline. With the increasing performance of accelerator cards (GPUs), the challenge moves more and more away from the network itself: with more computing power, it becomes easy to try different hyper-parameters or network types, and especially multi-GPU systems are fast enough to load and preprocess the data for the training. In our future work, the goal is to further automate the training by using MONAI's workflow features to achieve better hyper-parameter optimization, and to make it easier for material science researchers to apply such pipelines.