NDT.net - December 2002, Vol. 7 No. 12
Many radiographic exposures and film evaluations are made daily in a typical industrial NDT laboratory. Yet questions remain regarding the precise probability of detecting discontinuities, including the reliability of each individual inspector or laboratory. Although each laboratory's most experienced inspectors evaluate each radiograph, the actual quantitative reliability of these inspectors remains unknown.
To estimate the performance of representative inspectors and laboratories involved in NDE, volunteer laboratories from Croatia and Hungary conducted a series of round-robin inspections using X-ray films of welds. While the round-robin test (RRT) is still in progress, the authors summarize the results of the “second round” and discuss the evaluation method. The results of the “first round” have already been reported. The focus here is to present the new results in terms of
So far, the results of a total of 35 inspectors from the different laboratories have been evaluated. The inspectors selected were those who routinely interpret radiographic films. Their professional experience ranges from half a year to 35 years. For a better overview we formed four groups: up to 4 years of experience, 5 to 12 years, 15 to 25 years, and 26 to 35 years.
In discussions about the circumstances of the RRT, many participants said that they could not follow the test prescription exactly. In their professional experience, small discontinuities can be omitted because even the strictest acceptance level accepts them. The most experienced evaluators therefore did not enter the small gas and slag indications in the list, which yields a misleading reliability of detection. As a consequence, we had to carry out two types of evaluation:
In addition, we critically reviewed the “true values”. With these refinements, the new evaluation results for the inspectors are more precise and reliable.
The corresponding results are evaluated and presented in an ROC diagram in which p(TP) is plotted against p(FP). Here p(TP) is the probability of a true positive indication (correct defect finding) and p(FP) is the probability of a false positive indication (false alarm). From the mean of all inspectors' maximum operating points we estimate a representative overall reliability: a correct indication rate of about 68 % in the old evaluation (64 % in the new one) and a false call rate of about 17 % in the old evaluation (8 % in the new one). This is a reasonable result given the strict 1 cm scoring.
Many radiographic exposures and film interpretations are made daily in a typical industrial test laboratory. Yet, questions remain regarding the precise probability of detecting specific discontinuities, including the reliability of each individual inspector or laboratory. Although each laboratory's most experienced inspectors evaluate each radiograph, the actual quantitative reliability of these inspectors remains unknown.
In order to estimate the performance of inspectors and laboratories involved in NDE, volunteer laboratories from Croatia, Hungary and Poland conducted a series of round-robin inspections using X-ray films of welds. A set of well-characterized X-ray films was sent to each laboratory for their individual assessment. The selected films were initially evaluated by BAM's Reliability Laboratory with the assistance of an image-processing computer program. This established a baseline of "true values" for each weld image. Individuals at the participating laboratories, in turn, evaluated each image.
Using the results obtained from the round-robin study, ROC diagrams were established to give an overview of the overall reliability of routine inspections.
The paper is organized as follows: the next section gives a short background of the ROC method. Then the practical procedure of the RRT is described. Finally, the results in terms of ROC diagrams and K-values are presented and discussed.
To describe the efficiency of an NDT system it is necessary to distinguish between device parameters such as signal-to-noise ratio or spatial resolution, which merely guarantee that the NDE equipment functions, and the actual testing performance in defect detection and classification. The NDE diagnosis system extends from the interaction of rays or waves with a defect in the material to an indication in the inspection report, and along this chain both the useful indication signal and the noise signal have an eventful history. Taking radiographic weld testing as an example, the physical process up to the creation of the image on the film can be modeled theoretically or described by empirical parameters to a certain accuracy.
Fig 1: The four cases of NDT-diagnosis.
For the evaluation of the whole testing chain, including the human inspector and his or her interaction with complex technology, it is helpful to perform a statistical evaluation of defect findings. This holds especially for NDE systems for which standards are not yet available and the experience of the "NDE world" is limited. The statistical evaluation is based on the Receiver Operating Characteristic (ROC) (1-5), which is derived from the general theory of signal detection. The four possible situations in NDT diagnosis are presented in figure 1. The idea of the ROC method is to characterize the accuracy of an inspection system by evaluating the true positive detection rate versus the false positive detection rate for a set of possible decision criteria (the decision whether a signal is noise or a defect signal), which represent a varying sensitivity or recording level. The creation of an ROC curve in this way is shown in figure 2: following the curve from the lower left corner to the upper right, the sensitivity of the system rises.
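The threshold sweep behind a theoretical ROC curve can be illustrated with a minimal Python sketch. The function name `roc_points` and all amplitudes and thresholds are hypothetical and chosen only for illustration; they are not data from the RRT.

```python
def roc_points(signal_values, noise_values, thresholds):
    """Return one (p_fp, p_tp) pair per decision threshold.

    A value at or above the threshold counts as an indication.
    """
    points = []
    for t in thresholds:
        # Fraction of real defect signals that are called (true positives).
        p_tp = sum(v >= t for v in signal_values) / len(signal_values)
        # Fraction of pure noise signals that are called (false positives).
        p_fp = sum(v >= t for v in noise_values) / len(noise_values)
        points.append((p_fp, p_tp))
    return points


# Hypothetical indication amplitudes in arbitrary units (assumed values).
defect_signals = [0.9, 0.8, 0.7, 0.6, 0.4]
noise_signals = [0.5, 0.3, 0.2, 0.1, 0.1]

# Lowering the threshold raises the sensitivity, so the operating point
# moves from the lower left towards the upper right of the ROC diagram.
curve = roc_points(defect_signals, noise_signals,
                   thresholds=[0.85, 0.65, 0.45, 0.05])
print(curve)
```

With these assumed values the first two thresholds pick up defects without any false calls, while the lowest threshold accepts every signal and ends at the upper right corner.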
Fig 2: Creation of an ROC curve (theory).
In the lower part of the curve, the highest signals (correct indications) are included together with only a small amount of noise (false calls). In the higher part, more and more of the defects are taken into account, but a greater number of false calls has to be paid as the price. In practice it is not always possible, especially for manual methods, to apply continuously increasing signal thresholds and to count the correct and false call rates for each. Therefore, discrete categories of signal counting are defined, to be applied by the inspectors during the evaluation, as indicated in figure 3. These categories might correspond to the visibility of defects on a radiographic film or to an echo height in an ultrasonic A-scan. We refer to this as detectability below. This yields five experimental points in the ROC diagram; in the current RRT investigation we reduced them to four. The maximum point represents the actual possible operating point in detection performance. The shape of the whole curve, which can be obtained by a special regression method based on the binormal model, indicates the overall capability of the system. For example, it allows a forecast of what will happen when the sensitivity of the system is raised: is there a gain in defect finding, or does only the false alarm rate increase? The area under the ROC curve (see figure 4) may vary from 0.5 (pure chance, curve 1) up to 1.0, which corresponds to an ideal NDT system whose step curve passes through the upper left corner. For the fictitious systems shown in figure 4, the performance increases from curve 1 to curve 7. The K-vector value according to van Dijk (8) gives a good summary performance value showing the integral capability of the method to distinguish between a flawed and an unflawed film image.
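The step from discrete detectability categories to experimental ROC points can be sketched as follows. This is a minimal illustration under stated assumptions: the counts are invented, the helper names are hypothetical, and the trapezoidal area is a simple stand-in for the binormal regression mentioned in the text, which is not reproduced here. `binormal_area` only evaluates the standard closed-form area of a binormal ROC curve for given intercept `a` and slope `b`; fitting `a` and `b` is not shown.

```python
from math import sqrt
from statistics import NormalDist


def cumulative_roc(tp_counts, fp_counts, n_flawed, n_unflawed):
    """Accumulate per-category counts into ROC operating points.

    Categories are ordered from most clearly visible to barely visible,
    so each further category corresponds to a higher sensitivity.
    """
    points = []
    tp_total = 0
    fp_total = 0
    for tp, fp in zip(tp_counts, fp_counts):
        tp_total += tp
        fp_total += fp
        points.append((fp_total / n_unflawed, tp_total / n_flawed))
    return points


def trapezoid_area(points):
    """Area under the polygon through (0, 0), the points and (1, 1)."""
    path = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def binormal_area(a, b):
    """Area under a binormal ROC curve with intercept a and slope b."""
    return NormalDist().cdf(a / sqrt(1.0 + b * b))


# Hypothetical counts for four detectability categories (assumed values).
tp_counts = [40, 15, 10, 5]   # correct indications per category
fp_counts = [2, 6, 12, 20]    # false calls per category
points = cumulative_roc(tp_counts, fp_counts, n_flawed=100, n_unflawed=200)

max_operating_point = points[-1]   # highest sensitivity actually used
area = trapezoid_area(points)
print(points, max_operating_point, round(area, 5))
```

An empty point list gives the chance diagonal with an area of 0.5, and `binormal_area(0.0, 1.0)` returns the same value, which matches the lower limit stated above.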
Fig 3: Practical creation of an ROC curve.
Fig 4: Differentiating NDT-systems by ROC curves.
To estimate the reliability of the human factor in radiographic film evaluation, an international round-robin test (RRT) was organized. The RRT is in progress in Croatian, Hungarian and Polish laboratories with volunteer participants. For the purpose of the RRT, a set of films was selected in the Reliability Laboratory of BAM in Germany, which provides the scientific and technical support. The 38 selected films contain more than 200 defect indications of different types and dimensions.
The selected films were scanned with a state-of-the-art film digitizer (LS85 SDR, Lumisys). The digital images were evaluated with the help of a dedicated image-processing computer program of BAM. The results of this evaluation are taken as the true values of the discontinuities. The types of the discontinuities were discussed and agreed upon by a small group of Hungarian experts. At the same time, the X-ray films were copied with a laser printer (AGFA Scopix LR5200); for further information see (7). Four sets of films were printed on AGFA Scopix Laser films, which were sponsored by AGFA. In this way all participants of the RRT evaluate exactly the same films.
The volunteer evaluators were provided with clear instructions on the procedure and with specific forms to support the evaluation work and to aid computer pre-processing of the results.
The forms contain columns for identification, the defect/imperfection code according to EN 26520:1991, the coordinates, dimensions and detectability of the discontinuities and, for completeness, the IQI and the optical density of the films.
The inspectors were asked to evaluate and identify each weld image centimeter by centimeter, which is a very strict prescription. Additionally, they were required to fill in a form on the circumstances of the film evaluation, including the length of the evaluation time. Evaluating the 38 films took 5 to 8 hours. The longer the evaluation time, the more detailed the indicated discontinuities were.
The evaluation results were entered into an Excel table to support the data processing of the indication results for the ROC statistics.
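One way to picture the centimeter-by-centimeter scoring is to treat every 1 cm segment of a weld image as one decision and to compare the inspector's list with the true values. The sketch below is an assumption about how such a comparison could be coded, not the evaluation program used in the RRT; the function names and the segment numbers are hypothetical.

```python
def score_film(film_length_cm, true_defect_cm, reported_cm):
    """Count the four NDT diagnosis cases over the 1 cm segments of a film.

    Segments are numbered 0 .. film_length_cm - 1.
    """
    flawed = set(true_defect_cm)
    reported = set(reported_cm)
    all_segments = set(range(film_length_cm))

    tp = len(flawed & reported)                  # defect present, indicated
    fn = len(flawed - reported)                  # defect present, missed
    fp = len(reported - flawed)                  # no defect, but indicated
    tn = len(all_segments - flawed - reported)   # no defect, no indication
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Return (p_tp, p_fp) from the four counts."""
    p_tp = tp / (tp + fn)
    p_fp = fp / (fp + tn)
    return p_tp, p_fp


# Hypothetical 20 cm weld image (assumed values, not RRT data).
counts = score_film(20, true_defect_cm=[2, 3, 7, 12, 15],
                    reported_cm=[2, 3, 8, 12])
p_tp, p_fp = rates(*counts)
print(counts, p_tp, p_fp)
```

In this example an indication placed one centimeter away from the true position (segment 8 instead of 7) counts both as a miss and as a false call, which shows why the 1 cm scoring is strict.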
So far, the results of a total of 35 inspectors from the different laboratories have been evaluated. The inspectors selected were those who routinely interpret radiographic films. Their professional experience ranges from half a year to 35 years. For a better overview we formed four groups: up to 4 years of experience, 5 to 12 years, 15 to 25 years, and 26 to 35 years.
In discussions about the circumstances of the RRT, many participants said that they could not follow the test prescription exactly. In their professional experience, small discontinuities can be omitted because even the strictest acceptance level accepts them. The most experienced evaluators therefore did not enter the small gas and slag indications in the list, which yields a misleading reliability of detection. As a consequence, we had to carry out two types of evaluation:
The corresponding ROC results are presented in figures 5 a, b, c and d. As explained above, p(TP) is the probability of a true positive indication (correct defect finding) and p(FP) is the probability of a false positive indication (false alarm). The right-hand images show the actual experimental operating points of the individual inspectors. Under nuclear power conditions we would select the uppermost points, with the highest indication rate, as the valid operating points. For testing of, e.g., ordinary water pipes we would select one of the lower points, with a lower indication rate but also a much lower false call rate, for better economy. From the maximum operating points we estimate for the overall reliability a correct indication rate of about 68 % (old evaluation) and 64 % (new evaluation) and a false call rate of about 17 % (old evaluation) and 8 % (new evaluation). This is quite reasonable given the strict 1 cm scoring. We also see that a performance of about 90 % true indications and 2 % false indications is possible for the best inspector. The error cross indicates the standard deviation for each group in p(FP) and p(TP), respectively. When the unnecessary small flaws are removed from the evaluation, detection performance improves in terms of fewer false calls and less scatter. The left-hand diagrams show the full ROC curves calculated by a maximum likelihood fit. The full ROC curves show the potential of each individual inspector or of the whole group. By varying the detectability level it is possible to shift the operating point along the curve. The red curves indicate the mean of each group. From a statistical point of view the results of the four groups are identical. A more detailed look reveals the best indication rates and the smallest scatter for group d and, surprisingly, group b.
This means that the very new inspectors are still somewhat unsure but full of energy, which develops into a “blooming” level with over five years of experience in group b. The most promising ROC curves are in fact found in group b. Stage b is followed by a saturation level in group c. The “master level” is reached in group d, with over 25 years of experience, the lowest number of false indications and the lowest scatter. This impression is confirmed by the K-values. The K-value according to van Dijk describes the probability that the system distinguishes a 1 cm weld segment with a defect from a 1 cm weld segment without a defect, and gives a hint about the care taken in the evaluation. The K-values follow the described rise and fall of inspection performance with the years of experience. More important for the performance of the inspectors than the years of experience, however, seem to be their individual talent and training, as the very different curves show, especially in group c.
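The group mean and the error cross of the maximum operating points can be computed as in the following minimal sketch. The operating points are hypothetical, the function name is an illustration only, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def group_summary(operating_points):
    """Mean and standard deviation of (p_fp, p_tp) points for one group."""
    p_fp = [fp for fp, _ in operating_points]
    p_tp = [tp for _, tp in operating_points]
    return {
        "mean_p_fp": mean(p_fp),
        "mean_p_tp": mean(p_tp),
        # Sample standard deviation (n - 1 in the denominator), assumed here.
        "sd_p_fp": stdev(p_fp),
        "sd_p_tp": stdev(p_tp),
    }


# Hypothetical maximum operating points of three inspectors in one group.
group = [(0.10, 0.60), (0.20, 0.70), (0.30, 0.80)]
summary = group_summary(group)
print(summary)
```

The two means locate the centre of the error cross in the ROC diagram, and the two standard deviations give the half-lengths of its horizontal and vertical arms.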
Fig 5a: ROC curves of inspectors with half a year to 4 years of experience.
Fig 5b: ROC curves of inspectors with 5 to 12 years of experience.
Fig 5c: ROC curves of inspectors with 15 to 25 years of experience.
Fig 5d: ROC curves of inspectors with 26 to 35 years of experience.
Fig 6a: The summarised K values of groups in the old evaluation.
Fig 6b: The summarised K values of groups in the new evaluation.
From the international RRT with 35 human inspectors we learned about the value of years of professional experience for high testing performance and small scatter of results. We further learned about the influence of the individual capability of each person. On the basis of this knowledge we can recommend ROC tests for film evaluators to exclude pure chance results. The lesson from the second round of the RRT is that, for optimum performance, the requirements of the client should be clearly defined in terms of the minimum defect size to be detected according to the acceptance level.
We thank Dr. Uwe Zscherpel and Gisela Malitte from BAM for supporting the digitization and film copy work. We thank AGFA for providing the film material for the copies. We further thank all the volunteer inspectors for their effort in evaluating the films.
© NDT.net - info@ndt.net