Automated defect detection on inductive thermography images using supervised and semi-supervised Deep Learning methods

Inductive infrared thermography has proven to be an interesting solution for the inspection of surface defects. To automate the inspection, defect detection methods based on convolutional neural networks have proved more efficient than traditional methods for complex detection tasks. Both supervised and semi-supervised learning approaches have been proposed for the inspection task. While the supervised approach remains the most common one, it requires images of both defective and non-defective parts during the training phase. Unfortunately, in many industries where the scrap rate is low, acquiring images of defective parts is difficult and time-consuming, which can delay the deployment of such solutions. This paper compares these two learning approaches by illustrating the advantages and disadvantages of each from an industrial point of view. In conclusion, we describe an inspection deployment strategy which combines the two approaches to ensure robust inspection with rapid deployment.


Introduction
For several years, CETIM has been developing and evaluating nondestructive testing (NDT) techniques based on inductive infrared thermography to detect defects on metallic materials, such as forging laps, hardening taps, and welding or grinding cracks.
Inductive infrared thermography has proven to be an interesting alternative to penetrant testing and magnetic particle inspection, which are still widely used in industry for the inspection of surface defects [1]. It enables the same detection quality as these testing techniques while being more environmentally friendly and more energy efficient, and it does not require the use of any chemicals [1]. Furthermore, it is of even greater interest since it is easier to fully automate than the other NDT techniques. Indeed, both penetrant testing and magnetic particle inspection require cleaning the inspected parts after their inspection, which makes them difficult to fully automate.
In order to fully automate the inspection process with inductive infrared thermography, we must automate the defect detection task, that is, develop defect detection methods to efficiently detect defects in inductive infrared thermography images. During the last decade, several defect detection methods based on deep learning algorithms, more precisely convolutional neural networks (CNNs), have been proposed and have proved more efficient than conventional image processing methods in several tasks [2].
Regarding thermography imaging, several research works have investigated the use of different types of CNNs for defect detection. Classification using the VGG model has been investigated in [3]. Several object detection models have also been investigated, such as R-CNN, Faster R-CNN and YOLO [4]-[6]. Some papers also investigated the use of segmentation models such as U-Net and Mask R-CNN [7], [8].
One common factor in all the studies mentioned above is that they all consider a supervised learning approach to train the neural networks. In a supervised context, both defective and non-defective images are required to train a classification model, while both object detection and segmentation models can be trained on only defective images in which the defects are labeled. The goal is to train the neural network to identify real defects, classify them by type, and in some cases locate them in the image. This supervised approach, while very robust in the detection of defects, requires a dataset containing a large number of images of defective parts in which all the types of defects to be detected must be present. In an industrial environment, the presence of a large quantity of defective parts representative of all types of defects is not always guaranteed, especially in industries where the scrap rate is low. This in turn causes major delays in developing and deploying deep learning based inspection systems in production. For this reason, most of the papers mentioned above considered a dataset of images containing artificially induced defects of different shapes and sizes. While such studies can provide important insights into the performance of CNNs for defect detection, they fail to capture the complexity of real defects encountered in industrial use cases, which are usually more difficult to detect due to the variety of defect shapes and sizes and the material variations from batch to batch of produced parts.
In [9], the U-Net segmentation model was used for the detection of crack-type defects on images of forged steel parts acquired through inductive infrared thermography. This paper studies a real industrial use case. However, due to the aforementioned difficulty of acquiring defective images, the whole study was conducted on only 44 forged parts, including 2 non-defective parts. A total of 22 defective parts were used for training and 22 parts for testing. To leverage the limited training data, only a small area of the part was examined, in which the probability of defects is high. While the results presented in the paper proved promising, the small dataset used in the study does not allow for a conclusive evaluation. Indeed, it can be seen that the lack of defective parts is a major limitation when it comes to using deep learning based inspection methods, which can delay developing and deploying such solutions in production. To overcome this limitation, anomaly detection approaches, and specifically methods that only require images of non-defective parts, present an interesting alternative. This approach consists in learning the normal state of a part (non-defective state) and subsequently identifying any part that exhibits a significant deviation from that normal state. This deviation is thus referred to as an anomaly rather than a defect.
In this paper, we propose to compare the supervised and semi-supervised inspection approaches for the inspection of forged steel parts by illustrating the advantages and disadvantages of each approach from an industrial point of view. A large dataset of both defective and non-defective parts is collected to conduct this study. For supervised learning, two main methods of defect detection are compared: classification and segmentation. For semi-supervised learning, we consider a method that allows both the detection and localization of anomalies in the image. In conclusion, we propose an inspection deployment strategy, which combines the two approaches to ensure robust inspection with rapid deployment. This paper is organized as follows. In section 2.1, the acquisition configuration using the inductive infrared thermography is briefly described. Section 2.2 provides a description of the industrial dataset used in this work. Then, sections 2.3 and 2.4 present the supervised learning approaches: the classification and semantic segmentation respectively. Next, section 2.5 presents the anomaly detection approach. Following the results of both learning approaches, section 3 presents the proposed deployment strategy. Finally, section 4 concludes the paper.

Defect Detection using convolutional neural networks
In this work, a real industrial use case is considered from the automotive industry. When it comes to vehicle engines, more precisely piston engines, one of the most important components is the connecting rod. A connecting rod has a crucial role in transferring the load from the piston to the crankshaft and is thus exposed to high stress loads. Since modern diesel engines achieve high torque values at low speeds, this results in high stresses on the connecting rods. This high amplitude of operating stress over time greatly influences the fatigue life of connecting rods, and fatigue is considered the most common cause of their failure [10]. Connecting rod fatigue may cause fractures and lead to breakage. These fractures are initiated by small crack-like defects located on or slightly below the surface of the connecting rod. High loads and stresses on the connecting rod allow these small cracks to grow and lead to failure. These small cracks can be found anywhere on the connecting rod surface, with certain areas being more crucial than others. Therefore, the entire surface of the connecting rod must be inspected to detect these cracks during the manufacturing process. The inspection process to detect such defects is typically performed using magnetic particle testing (MT). In recent years, inductive infrared thermography has been considered as an alternative to the MT technique. With such an inspection technique, connecting rods can be tested without affecting their usability and future usefulness, allowing for a 100% inspection of the production.

Acquisition configuration
In inductive infrared thermography, an inductor (air-cored coils or coils with magnetic core) is placed near the surface of a conductive component (the surface to be inspected). An alternating current will flow within the material causing heating in the form of resistive losses. The presence of a discontinuity or a defect causes an interruption to the current flow forcing it to find a path around the defect. This will lead to a change in the current density at this area of the inspected surface, generating a change in heat levels which can be visualized and evaluated with an infrared camera. The indication of the defect will be much wider than the defect itself, thus turning an otherwise invisible defect into a visible one.
Acquisitions on connecting rods were performed using a FLIR X6580sc cooled infrared camera with high-frequency induction excitation. An inductor consisting of a Helmholtz coil was used; it generates induced currents at a frequency of 61 kHz. The images used in this paper are 4.5 Hz phase images obtained in the frequency domain by applying a Fourier transform to the sequence of images acquired during induction heating (120 ms) and part of the cooling (100 ms).
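The computation of such a phase image can be sketched as follows: each pixel's temperature evolution is Fourier-transformed over time, and the phase of the bin nearest the analysis frequency is kept. The frame rate and sequence length used in the example are illustrative assumptions, not the actual acquisition parameters.

```python
import numpy as np

def phase_image(frames: np.ndarray, frame_rate: float, f_analysis: float) -> np.ndarray:
    """Per-pixel phase at f_analysis, from a (T, H, W) thermal sequence.

    Each pixel's temperature evolution is Fourier-transformed over time,
    and the phase of the bin closest to the analysis frequency is kept.
    """
    T = frames.shape[0]
    spectrum = np.fft.rfft(frames, axis=0)           # FFT along the time axis
    freqs = np.fft.rfftfreq(T, d=1.0 / frame_rate)   # frequency of each bin
    k = int(np.argmin(np.abs(freqs - f_analysis)))   # bin nearest f_analysis
    return np.angle(spectrum[k])                     # (H, W) phase image
```

Phase images are popular in lock-in and pulse-phase thermography because the phase is less sensitive than raw amplitude to non-uniform heating and surface emissivity variations.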

Dataset description
A total of 1348 forged connecting rods were collected for this study, with almost half of them containing at least one defect. In order to inspect the whole surface of the connecting rod, a total of 4 images are taken for each part, each focusing on a specific area of the part. Therefore, the acquired dataset contains 5392 original inductive thermography images.
The dataset has been labeled by NDT professionals to manually identify all the defects in the acquired images. This step is very important since many defects are difficult to identify for non-experts. In total, 854 images have been identified as containing at least one defect; the other 4538 images do not contain any defects. The variety of defects and their locations on the connecting rod surface are representative of the real-case scenario in forged connecting rod quality control.
The acquired images of the connecting rods are very complex to inspect. Figure 1 shows images of both non-defective (first row) and defective (second row) connecting rods acquired with the imaging setup described above. The defects are highlighted with a red ellipse. As can be seen, the areas that correspond to the defects generally present a crack-shaped indication and are brighter than other areas in the image. Some defects can be clearly identified while others are much more difficult to locate. This inspection complexity is due to many factors. First, even with consistent acquisition conditions, there is a large variation in contrast and texture between acquired images due to the nature of the inspected parts and of the inductive infrared thermography technique. Secondly, defects can be located anywhere on the surface of the inspected part, and they vary in intensity, size, and shape. And most importantly, we notice that for all inspected parts, whether defective or non-defective, the surface may contain some irregularities and deformities that are not considered real defects and thus must be distinguished from them.
In such a wide detection environment, conventional image processing methods lack the necessary adaptability and robustness to the different scenarios that may occur, since they are generally tuned to deal with a specific scenario. In the last decade, several defect detection methods based on deep learning algorithms, more precisely convolutional neural networks (CNNs), have been proposed. These methods have achieved excellent results and proved more efficient than traditional defect detection methods when dealing with highly textured surfaces.
CNN based defect detection methods can be categorized based on the learning approach used to train the CNN. In this paper, we will focus on two main learning approaches: supervised defect detection methods and semi-supervised anomaly detection methods. For the supervised approach, we can identify three main levels for defect detection methods: classification, object detection and semantic segmentation. Only the classification and the semantic segmentation approaches are studied in this paper due to the shape of the defects on the inspected parts. For the semi-supervised approach, an anomaly detection method based on a pretrained CNN is used.

Supervised approach: Classification
Classification is considered the most common task performed by CNNs, where given an image, the CNN is expected to output a discrete label that represents the class of the image. In our case, we only have two classes of images, defective and non-defective. To train the classification model, the full dataset has been manually labeled, with non-defective images assigned the label 0 and defective images the label 1. Then, the dataset is split into two subsets: the training set, which is used to train the CNN, and a test set, which is used to evaluate its performance. In this work, we use the 80/20 split rule, which means that 80% of the dataset is used for training and the remaining 20% for testing. To respect the ratio between defective and non-defective images in both the training and testing subsets, the 80/20 split has been applied to each class independently. The dataset split for both classes is shown in Table 1. We can notice that the number of non-defective images highly exceeds the number of defective images in the dataset. This imbalance between classes, especially in the training subset, is a common problem in machine learning, since the classifier will be biased towards the majority class, hence losing accuracy [11]. Many techniques have been proposed in the literature to reduce the effect of class imbalance. In this work we used the balanced weight technique, which consists in modifying the class weights of the majority and minority classes during the training process. During training, wrong predictions on the minority class are penalized more by giving them more weight in the loss function. This balances the training in order to achieve better overall performance.
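A minimal sketch of such a balancing scheme is given below, assuming the common "balanced" weighting formula w_c = n / (n_classes * n_c) (as used, for instance, by scikit-learn); the exact weights used in this work are not specified in the paper.

```python
import numpy as np

def balanced_class_weights(labels: np.ndarray) -> dict:
    """One common balancing heuristic: w_c = n_samples / (n_classes * n_c).

    The minority class receives a weight greater than 1, so its
    misclassifications are penalized more heavily in the training loss.
    """
    classes, counts = np.unique(labels, return_counts=True)
    n = labels.size
    return {int(c): n / (len(classes) * cnt) for c, cnt in zip(classes, counts)}
```

With a training set dominated by non-defective images, as here, the defective class would receive a weight several times larger than the non-defective one, and these weights would typically be passed to the weighted cross-entropy loss during training.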
Regarding the classification model, the 50-layer version of the ResNet family, referred to as ResNet-50 [12], has been used in this work. ResNet-50 achieves high results on the ImageNet dataset while being computationally efficient, which is essential in any real-time application. Using a known CNN model rather than developing a custom one allows us to benefit from transfer learning to initialize the ResNet-50 weights for training. Transfer learning is a concept that is widely used in machine learning in general. It consists in transferring knowledge from one model to another even if the task itself is different. In CNNs, the first convolution layers learn basic image features that are mostly common to all images regardless of the task. If the dataset used to pretrain the original model is large enough and very diverse, these learned features will be useful even in other tasks. Consequently, recovering the weights of the convolution layers of such a CNN gives the new CNN a large variability of features that would not necessarily be easily learned on a smaller dataset. Hence, in our study, we initialize the feature extraction layers of ResNet-50 using the weights trained on the ImageNet dataset. Transfer learning has been proven to improve the performance of the model in similar defect detection tasks [2]. The ResNet-50 model is trained on images of size 256×256 pixels with normalized values between 0 and 1. The loss function used to train the model is cross-entropy. The performance of the model on the test set is presented in a confusion matrix format in Table 2.
The model achieves an overall accuracy of 98.51% on the whole testing dataset. It can be seen that the model is better at identifying non-defective images where only 0.55% of non-defective images are misclassified as defective, while 6.47% of defective images are misclassified as non-defective. This can also be highlighted using the Precision and Recall metrics. The Precision value, defined as the proportion of correctly predicted defective images among all images predicted as defective by the model, is high and reaches 96.95%. The Recall, defined as the proportion of correctly predicted defective images among all defective images, only reaches 93.53%.
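These metrics follow directly from the confusion-matrix counts. The counts used in the example below (159 true positives, 5 false positives, 11 false negatives) are reconstructed from the reported rates and the 908/170 test split, so they are illustrative rather than quoted from Table 2.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Counts reconstructed from the reported rates (illustrative):
p, r = precision_recall(tp=159, fp=5, fn=11)  # ~96.95% and ~93.53%
```

Precision answers "of the images flagged as defective, how many really are?", while Recall answers "of the truly defective images, how many were caught?"; the latter is the critical one in quality inspection.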
In an industrial quality inspection application such as the one studied in this work, the most important objective is to avoid missing defective parts during the inspection. This translates into minimizing the number of defective images wrongly predicted as non-defective by the classifier, that is, minimizing the value of false negatives (FN). In the ideal case of FN = 0, the value of Recall will be 1. To improve the Recall, it is proposed to modify the prediction threshold. The new confusion matrix is presented in Table 3. The overall accuracy of the model decreased to 97.4%. However, due to the imbalance between classes, the overall accuracy is not the best metric to evaluate the performance of the model. As mentioned above, the Recall value is the most important metric in inspection tasks. In this second experiment the Recall value reaches 96.47%, with only 3.53% of defective images misclassified as non-defective. Finally, Figure 2 shows in the first row two false negative cases where the defects highlighted with a red ellipse have been missed, and in the second row two false positive cases. Defective images that are misclassified by the model present small defects on the side of the part that are difficult to identify even visually. Non-defective images that are misclassified by the model present anomalies that are very similar in shape and contrast to real defects but are not considered defects by NDT experts. They are usually the result of small irregularities on the part surface that do not affect its conformity.
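One way to carry out this threshold adjustment is to pick, on a labeled validation set, the largest threshold on the defect-class score that still reaches a target Recall; the helper below is a hypothetical sketch of that idea, not the exact procedure used in the paper.

```python
import numpy as np

def threshold_for_recall(p_defect: np.ndarray, labels: np.ndarray,
                         target_recall: float) -> float:
    """Largest score threshold such that at least target_recall of the
    labeled defective samples are still classified as defective."""
    pos = np.sort(p_defect[labels == 1])[::-1]   # defect scores, descending
    k = int(np.ceil(target_recall * pos.size))   # defects we must still catch
    return float(pos[min(k, pos.size) - 1])      # score of the k-th defect
```

Lowering the threshold trades false positives for false negatives: more borderline images are flagged as defective, which raises Recall at the cost of some overall accuracy, exactly the trade-off reported between Table 2 and Table 3.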

Supervised approach: Segmentation
Unlike classification, the output of a semantic segmentation model is not just a single label assigned to the input image, but rather an image of the same size as the input image in which each pixel is classified into a class. Hence, it can be considered a pixel-wise classification where each pixel of the input image is assigned a label. This allows both the classification and localization of each defect in the image.
To train the model to distinguish between defective and non-defective pixels, we need to manually annotate all the images that contain at least one defect. The annotation is achieved by assigning a label 0 to all non-defective pixels and a label 1 to all defective pixels. The more accurate the annotation is, the better the model will perform. The non-defective images do not need to be annotated since all their pixels belong to the same class. After training, the model is expected to output an image similar to the annotation image, in which each pixel has a prediction score between 0 and 1. A threshold applied to the prediction image then yields the final binary image.
To be able to compare the performance of the semantic segmentation model with the classification model, the same testing subset used in classification will also be used to test the segmentation model. However, this will not be the case for the training subset. In fact, since the goal of the segmentation model training is to teach the model how to distinguish between defective and non-defective pixels, training using only defective images is usually enough, as these images contain both defective and non-defective pixels. In our case, since the majority of our dataset is comprised of non-defective images, these images could contain a bigger variety of non-defective areas that the model will not be able to learn solely from the defective images. Therefore, to train the segmentation model, it is decided to use all the defective images along with 200 non-defective images chosen randomly from the training subset, which represent around 22% of the training dataset. Adding more non-defective images to the training dataset might potentially reduce the false alarm rate but might affect the ability of the model to detect defects. Additional experiments can be conducted on finetuning this value to find the best model accuracy. The dataset split is shown in Table 4.

                 Training set   Testing set   Total
  Non-defective       200            908       1108
  Defective           684            170        854
  Total               884           1078       1962

Regarding the segmentation model, a modified version of the U-Net architecture [13] has been used in this work, which will be referred to as SegNet. This modified version is detailed in [2] and proved its efficiency on a similar task. Briefly, compared to the original U-Net, this modified version adds more convolutional layers in the decoder part and introduces batch normalization after each convolution.
Batch normalization is considered an optimization for the network training since it helps the network to train faster, introduces some regularization eliminating the need for dropout, and generally leads to better results [14].
The SegNet model is trained on images of size 512×512 pixels with normalized values between 0 and 1. It is important to choose a proper loss function when training the model on defect detection tasks. In fact, the usual cross-entropy loss function averages the pixel-wise prediction error giving each pixel an equal representation in the overall loss regardless of its class. In defect detection, since defects usually occupy a small area of the image, the ratio of defective and non-defective pixels is highly imbalanced which will lead to a biased training towards the majority class. In such cases, a better loss function that deals with this class imbalance problem is the Dice loss [15] which is essentially a measure of overlap between the ground truth and the prediction. In this work, a combination of cross-entropy loss and Dice loss is used.
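A minimal numpy formulation of the two loss terms is sketched below; the mixing weight between them is an assumption, as the paper does not give the exact ratio used.

```python
import numpy as np

def dice_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """Soft Dice loss: 1 - 2|P.G| / (|P| + |G|). Because it measures overlap,
    the large non-defective background barely dilutes the defect term."""
    inter = float(np.sum(pred * target))
    return 1.0 - (2.0 * inter + eps) / (float(np.sum(pred) + np.sum(target)) + eps)

def bce_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Pixel-averaged binary cross-entropy."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))

def combined_loss(pred: np.ndarray, target: np.ndarray, w: float = 0.5) -> float:
    # w is an assumed mixing weight; the paper does not specify the ratio.
    return w * bce_loss(pred, target) + (1.0 - w) * dice_loss(pred, target)
```

Note how a prediction that misses a small defect entirely incurs a Dice loss close to 1 regardless of how few pixels the defect occupies, whereas the pixel-averaged cross-entropy for the same prediction would remain small; this is exactly why the Dice term counteracts the pixel imbalance.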
To be able to compare the results of the classification model and the segmentation model, the same evaluation metrics will be used and presented in a confusion matrix. A defective image is considered a true positive if at least one defect is detected, and a false negative if no defect is detected. A non-defective image is considered a true negative if no defects are detected, and a false positive if at least one defect is detected by the model. The performance of SegNet on the test set is presented in a confusion matrix format in Table 5. The model achieves an overall accuracy of 98.33% on the full testing dataset. The Recall value of 94.71% is slightly lower than the one obtained by the classification model, with a false negative rate of 5.29%. The false alarm rate of 0.99% is lower than the one obtained by the classification model (2.43%). Again, we can modify the detection threshold to adjust the detection rate of the model depending on the quality inspection requirements of the industry. In some industries, a higher false alarm rate is accepted if the detection rate is improved. By decreasing the detection threshold, we can achieve a better Recall value. The updated confusion matrix is presented in Table 6. The overall accuracy of the model decreased to 97.2%, but the Recall value is now 99.41%, with only one defective image misclassified as non-defective. However, this improved capacity to detect defects comes at the cost of a higher false alarm rate, which reaches 3.2%. A potential way to lower the false alarm rate is to add more non-defective images to the training subset. This will help the model better identify the non-defective areas of the inspected part.
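The image-level decision rule used for this confusion matrix can be expressed as a small helper; the min_pixels parameter is a hypothetical generalization, and the paper's rule corresponds to min_pixels = 1.

```python
import numpy as np

def image_verdict(prob_map: np.ndarray, threshold: float = 0.5,
                  min_pixels: int = 1) -> int:
    """Image-level decision from a pixel-wise prediction map: the image is
    flagged defective (1) if at least min_pixels pixels pass the threshold.
    The rule described in the text corresponds to min_pixels = 1."""
    detected = np.count_nonzero(prob_map >= threshold)
    return int(detected >= min_pixels)
```

Raising min_pixels above 1 is one way a deployment could suppress isolated-pixel false alarms, at the risk of missing very small defects.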
Finally, Figure 3 shows some results of the segmentation model. The first column corresponds to the input image, the second column to the ground truth and the third column represents the segmentation result overlayed above the input image. The model is able to detect even small defects located at the edge of the inspected part. On the other hand, the model detects some areas in the non-defective images as defective since they share similarities with real defects.
In conclusion, both classification and semantic segmentation trained in a supervised learning approach can achieve good detection results on the inspection task studied in this paper, with a slight advantage to segmentation both in performance and in the ability to not only classify the images but also locate the defects in the defective ones.

Semi-supervised approach
A main issue that we encountered in both supervised approaches studied in this paper is the significant imbalance between defective and non-defective images. In many industrial applications, defects are a rare event on the production line. Therefore, the presence of a large quantity of defective parts representative of all types of defects is not always guaranteed, especially in industries where the scrap rate is low. This causes major delays in developing and deploying deep learning based inspection systems in production.
Anomaly detection approaches, and more specifically semi-supervised anomaly detection, which only requires images of non-defective parts, present an interesting alternative to supervised learning. This approach consists in learning the normal state of a part (non-defective state) and subsequently identifying any part that exhibits a significant deviation from that normal state. This deviation is thus referred to as an anomaly rather than a defect. To clarify the terminology in a surface inspection context, an anomaly can be described as an unexpected variation in the inspected part (irregularity, smudge, deformity, etc.) that can be the result of many factors such as the manufacturing process, the part handling, or the material variability. Some of these anomalies cannot be prevented, depending on the application, and are thus accepted by both the manufacturer and the client. When an anomaly surpasses certain criteria defined by quality control, it becomes a defect to be rejected.
Like classification and segmentation, we can differentiate between anomaly detection and anomaly localization methods. Anomaly detection can be considered as a binary classification between normal and non-normal images by assigning an anomaly score to the whole image. On the other hand, anomaly localization is seen as a segmentation approach that assigns an anomaly score to each pixel in the image, or in some cases each area of the image, allowing for a more precise and interpretable result.
In recent years, several deep learning based methods have been proposed that perform anomaly detection and anomaly localization at the same time. Several reviews of such methods can be found in [16]-[18]. According to these reviews, anomaly detection methods can be classified into five categories.
In this work, the method developed in [19], referred to as PaDiM, is used for anomaly detection and localization. PaDiM uses a pre-trained CNN to extract features (embedding extraction) from different semantic levels. These embeddings are divided into small patches representing small areas in the image, and each patch is modeled by a multivariate Gaussian distribution. During the learning stage, the normal class is modeled through the set of Gaussian distributions. Then, during the testing stage, the final anomaly map is generated by measuring the Mahalanobis distance of the features at each patch to the "template" Gaussian model of the normal class. If the Mahalanobis distance at a patch surpasses a pre-defined threshold, the corresponding area of the image is considered anomalous.
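The modeling and scoring steps can be sketched as follows, assuming pre-extracted patch embeddings; the covariance regularization constant is an assumption (PaDiM adds a small epsilon term to keep the covariance invertible), and this sketch omits the feature concatenation and dimensionality reduction of the full method.

```python
import numpy as np

def fit_patch_gaussians(train_feats: np.ndarray, reg: float = 0.01):
    """train_feats: (N, P, D) embeddings of N normal images, P patches each,
    D feature dimensions. One Gaussian (mean, inverse covariance) per patch
    models the normal class; reg * I regularizes the covariance (value assumed)."""
    N, P, D = train_feats.shape
    mus = train_feats.mean(axis=0)                         # (P, D) per-patch means
    inv_covs = np.empty((P, D, D))
    for p in range(P):
        centered = train_feats[:, p, :] - mus[p]
        cov = centered.T @ centered / (N - 1) + reg * np.eye(D)
        inv_covs[p] = np.linalg.inv(cov)
    return mus, inv_covs

def anomaly_map(feats: np.ndarray, mus: np.ndarray, inv_covs: np.ndarray) -> np.ndarray:
    """feats: (P, D) embeddings of a test image. Returns per-patch Mahalanobis
    distances to the normal model; a large distance marks an anomalous patch."""
    d = feats - mus
    return np.sqrt(np.einsum('pd,pde,pe->p', d, inv_covs, d))
```

A threshold on the resulting distances then separates normal patches from anomalous ones, yielding both an image-level anomaly score (e.g. the maximum distance) and a localization map.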
In this work, the PaDiM method is used, more precisely the version using WideResNet-50-2 [20] as a pre-trained model for embedding extraction. Again, to be able to compare the performance of the anomaly detection approach with the supervised approach, the same testing subset used in both classification and segmentation is also used to test PaDiM. Regarding the training subset, only non-defective images are used during the training stage. To evaluate the capacity of PaDiM to properly model the normal class in such a complex inspection task, several experiments have been conducted with an increasing number of non-defective images used during the training stage. This helps in evaluating the effect of the number of non-defective training images on the overall anomaly detection performance. Figure 4 presents some detection results using the anomaly detection model. For each case, we show the input image, the ground truth (empty means non-defective), the predicted heat map (anomaly map), the predicted mask after thresholding, and the detection result overlayed on the input image. Figure 5 presents the overall accuracy obtained on the testing subset for different numbers of non-defective images used during the training stage. This number starts at 50 non-defective images chosen randomly from the training subset and increases to 1,000 images. It can be seen that the test accuracy increases overall with the number of images used during the training phase. The overall test accuracy at 50 images is around 90% and goes up to 92.2% at 1,000 images. Hence, by using 20 times more images during the training stage, only a 2.2% increase in accuracy is gained. This shows that the model is already able to properly model the normal state with only a limited number of non-defective images.
As expected, the test accuracy using the anomaly detection approach is much lower than the accuracy obtained by both supervised approaches, with a loss of around 5%. This can be explained by the fact that in supervised learning the models are trained on both defective and non-defective images (or pixels in the case of segmentation), which enables them to learn to distinguish between a real defect and an anomaly that is not considered a defect by the NDT experts who annotated the dataset. On the other hand, the anomaly detection approach only learns on a limited number of non-defective images to create a reference model of the normal state. Any anomaly that is not modeled during the training stage will present a significant deviation from the normal model and will thus be detected as an anomaly. As for the real defects, some share similarities in shape and contrast with non-defective areas and thus might be absorbed into the reference normal model and missed during the test. This causes an increase in both false positives and false negatives, leading to a worse overall accuracy. Therefore, the limited performance of the anomaly detection approach is related to the complexity of the inspection task studied in this work. Inductive infrared thermography images of forged connecting rods present a significant variation within the non-defective class. Even with consistent acquisition conditions, there is a large variation in contrast and texture between acquired images, with many irregularities and deformities visible on the surface of the inspected part. Therefore, creating a reference model for the normal state is difficult to achieve. In addition, the images in our case are not perfectly aligned, meaning that the inspected part can move slightly from one image to another, which can affect the normal state model as indicated in [21].

Deployment strategy
The efficiency of manual visual inspection has been a concern for many years. Several research studies have focused on evaluating the error rate of manual inspection conducted by human operators in many industries [22], [23]. The reported error rate of manual inspection has been estimated to be around 3-10% for simple inspection tasks, while for the majority of more complex inspection tasks it reaches an average of 20-30%. This high error rate is attributed to many factors related to the inspection task, the working environment, and human nature, among others [24]. In fact, on the particular task studied in this work, two NDT experts annotated the dataset independently. On around 40% of the images, the annotations of the two experts differed. This clearly highlights the amount of subjectivity in manual inspection.
With this high error rate in manual inspection, it becomes clear that CNN-based methods provide an interesting alternative for many industries in order to automate the quality inspection task. In the inspection task studied in this work, both supervised approaches reached a high level of accuracy, up to 98%. However, one main requirement in developing these supervised solutions was the availability of defective images. In many industries, collecting enough defective images to train a robust supervised model is very challenging, especially in the early phase of production, and can take several weeks to several months depending on the industry and its scrap rate. At the same time, even though the anomaly detection approach, which does not require any defective images, reached an accuracy of 92%, which is higher than the average accuracy of manual inspection, it remains much less robust than the supervised models. Therefore, it cannot be considered an autonomous system to replace the manual inspection conducted by a human operator.
A potential solution for industries that wish to rapidly reduce the workload of human operators while still ensuring robust inspection is a two-phase deployment strategy.
In the first phase, the anomaly detection model is deployed on the production line while keeping the human operator in the inspection loop. The anomaly detection model can be regarded as an assistant to the operator and can serve several purposes. The first purpose is to reduce the inspection effort of the operator by reducing the number of parts to be manually inspected. The anomaly detection model automatically accepts any part that does not present any anomaly, i.e. any part with a low anomaly score, without the need for manual inspection. Any part with an anomaly score above a certain threshold is flagged for manual inspection. Although lowering the threshold to increase the recall value leads to an increase in the false alarm rate, hence increasing the manual inspection workload, we argue that the anomaly detection model will still be able to filter out a good percentage of non-defective images, thus reducing the number of parts to be manually inspected. Figure 6 illustrates the recall percentage compared to the false alarm percentage for the two cases where either 50 or 1,000 non-defective images are used during the training stage. In order to guarantee that no defective parts are accepted by the anomaly detection model, the detection threshold should be set to obtain a 100% recall. At the 100% recall value, the false alarm rate reaches 92.5% and 84.34% for the 50 and 1,000 image cases respectively. This means that the anomaly detection model reduces the number of parts to be manually inspected by roughly 8% to 16% while guaranteeing that no defective parts are accepted. If the threshold is instead set to obtain a 98% recall, the false alarm rate drops to 62.95% and 43.88% respectively, thus reducing the manual inspection workload by about half.
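The threshold-tuning step described above can be sketched as follows: given per-part anomaly scores and defect labels, pick the highest threshold that still achieves a target recall on defective parts, then measure the resulting false alarm rate. The score distributions below are synthetic and purely illustrative, not the paper's data.

```python
import numpy as np

def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold whose recall on defective parts (label 1)
    is at least target_recall; parts scoring above the threshold
    are sent to manual inspection."""
    defect_scores = np.sort(scores[labels == 1])
    n = len(defect_scores)
    k = int(np.ceil(target_recall * n))   # defects that must be flagged
    return defect_scores[n - k] - 1e-9    # just below the k-th highest score

def false_alarm_rate(scores, labels, thr):
    """Fraction of non-defective parts (label 0) flagged for inspection."""
    return np.mean(scores[labels == 0] > thr)

# Illustrative scores: defective parts tend to score higher,
# but the two distributions overlap.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(1.0, 1.0, 900),    # non-defective
                         rng.normal(3.0, 1.0, 100)])   # defective
labels = np.concatenate([np.zeros(900), np.ones(100)]).astype(int)

thr100 = threshold_for_recall(scores, labels, 1.00)
thr98 = threshold_for_recall(scores, labels, 0.98)
# Relaxing recall from 100% to 98% lowers the false alarm rate,
# i.e. fewer non-defective parts go to manual inspection.
```

This is the trade-off visible in Figure 6: a guaranteed 100% recall comes at the price of a high false alarm rate, while a small concession on recall halves the manual workload.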
Figure 6: The recall percentage compared to the false alarm percentage for 50 and 1,000 non-defective images used during the training stage.
In addition, the anomaly detection model serves a second very important purpose. Since defects can be present anywhere on the surface of the inspected part, manual inspection consists of examining the whole surface of the part to make sure all potential defects are detected. Since the anomaly detection model also provides a localization of the anomalous areas in the inspected part, this localization highlights the areas where the operator should focus during the manual inspection. Only these areas need to be inspected by the operator, thus reducing both the manual inspection effort and time.
The third and most important purpose that the anomaly detection model serves is assisting in the collection of the dataset for the second-phase supervised model. The operator either confirms or rejects the decision of the anomaly detection model on the conformity of the part. If the operator rejects the decision, i.e. the image is actually non-defective, the image is saved to the dataset under the non-defective class. On the other hand, if the operator confirms the decision, the image is indeed defective and is saved to the dataset under the defective class. This enables both a fast collection of a classification dataset and an online annotation of the dataset. Note that this process is not limited to binary classification (defective and non-defective) but can also be applied to multi-class classification if multiple types of defects are to be separated into different classes. Another note worth mentioning is that it is not required to save the images that were automatically accepted by the anomaly detection model as non-defective, since these images do not present any significant deviation from the normal state and are thus not relevant for the training of the supervised model. In addition to the classification annotation, and since the anomaly detection model highlights the anomalous areas in the image as seen in Figure 4, if the image is indeed defective this anomaly map can serve as an annotation to train a supervised segmentation model for the second phase. Of course, the annotation will need to be adjusted to delineate the exact area of the defect more accurately.
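The phase-one routing logic described above can be summarized in a few lines. The function below is a hypothetical sketch of that decision flow (the names and return values are ours, not part of any deployed system): low-score parts are auto-accepted and not stored, while flagged parts are labeled online by the operator's verdict.

```python
def triage(anomaly_score, threshold, operator_confirms_defect=None):
    """Route one inspected part during phase one of the deployment.

    Returns the dataset class the image is saved under, or None if
    the part is auto-accepted (low score, image not kept for training).
    """
    if anomaly_score <= threshold:
        return None                  # auto-accept; not relevant for training
    # Part flagged: the operator inspects it and gives a verdict.
    if operator_confirms_defect:
        return "defective"           # anomaly map reusable as a coarse mask
    return "non_defective"           # false alarm, still a useful sample
```

Extending this to multiple defect classes only requires the operator's verdict to carry a defect type instead of a binary confirmation.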
The dataset collected during phase one with the assistance of the anomaly detection model serves as the training dataset for the supervised model. When the supervised model reaches the accuracy required by quality control, we can move to phase two of the deployment strategy, which consists in deploying the supervised learning model to fully automate the visual inspection task. This deployment can be fully autonomous, replacing manual inspection. Alternatively, the operator can still intervene with a manual inspection when the prediction confidence of the supervised model is low. These low-confidence images can then be added to the training dataset to improve the robustness of the supervised model. This process is referred to as active learning [25].
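A simple way to implement the active-learning selection mentioned above is to route to the operator any part whose predicted defect probability falls in a band around the decision boundary. The band width below is an illustrative assumption, not a value from this work.

```python
import numpy as np

def select_for_review(probabilities, band=0.15):
    """Indices of predictions whose defect probability lies within
    `band` of the 0.5 decision boundary; these low-confidence parts
    are sent back to the operator and, once labeled, added to the
    training set (active learning)."""
    probabilities = np.asarray(probabilities)
    return np.where(np.abs(probabilities - 0.5) < band)[0]

probs = [0.02, 0.48, 0.55, 0.97, 0.62]
# -> indices 1, 2 and 4 fall inside the 0.35-0.65 band
reviewed = select_for_review(probs)
```

Confident predictions (here 0.02 and 0.97) are handled autonomously, so the manual workload shrinks as the supervised model improves.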

Conclusions
In this paper, both supervised and anomaly detection approaches have been evaluated and compared on the visual inspection task of forged connecting rods. Supervised methods achieve better overall detection results than the anomaly detection approach since they are trained on both defective and non-defective images and are thus capable of distinguishing between real defects and other anomalies that can appear on the surface of non-defective parts. Since anomaly detection is trained on non-defective images only, this method is susceptible to more false alarms and missed detections. Nevertheless, anomaly detection is easier to develop and deploy, especially at an early production stage where defective images are not available for supervised learning. It also eliminates the subjectivity of manual inspection, where expert disagreement was as high as 40% on the task studied in this paper. To benefit from both approaches, this paper discusses a two-phase deployment strategy that rapidly reduces the workload of human operators while still ensuring robust inspection. The first phase consists in using the anomaly detection method as an assistant to the operator with the aim of reducing the number of parts manually inspected and alleviating the challenge of collecting and annotating defective images. The second phase consists in deploying a robust supervised learning model, trained on the images collected during the first phase, to fully automate the visual inspection task.