ECNDT '98, Session: NDT of Welds
Experiences and Problems of a Manual Ultrasonic Round-Robin Test
Ferenc Fücsök, Eur. Ing.
ERÖKAR Co., Budapest, P.O. Box 67, 1601 Hungary
During the European-American Workshop "Determination of Reliability and Validation Methods of NDE", the participants agreed on a "model", an empirical formula:
R = f(IC) - g(AP) - h(HF)
where:
R: the reliability of an NDE system,
IC: the Intrinsic Capability of the system,
AP: the effect of Application Parameters,
HF: the effect of Human Factors. (1)
The IC is generally considered an upper bound; the AP and the HF generally reduce the capability of the NDE system. Everybody agreed that the human factor is the most important part of reliability and also its weakest part. It depends on several factors and changes rapidly. Several tests have been carried out to measure reliability and how it changes with the circumstances (2).
There are several ways of measuring and evaluating the reliability of NDE systems, and the evaluation of the reliability of operators is similar. But the human factor has some distinctive features. The operator knows that he is under test, he thinks about his work, and his performance is not constant; so it is very hard to measure the reliability of the operator, and even harder to enhance it.
An accredited industrial laboratory needs to measure the operators' reliability to demonstrate both the operators' and the laboratory's performance, but the operators also need to understand the results of the evaluation. They do not understand mathematical expressions, but they are interested in who is the best. For this reason, and to provide reliable NDE, it is necessary to organise intra- or inter-laboratory blind round-robin tests from time to time, evaluate them, and explain the results to the participants.
Fig. 1: Sketch of the tested weld, marked 7.1
Last year a manual ultrasonic round-robin test was organised by ERÖKAR Co. in Hungary, with 24 participating operators from 8 laboratories. The specimen, marked 7.1 and shown in Fig. 1, was a 60 mm thick weld with four artificial defects and one natural root discontinuity.
The artificial defects were manufactured by spark cutting (Fig. 2), which is a good way to make volumetric discontinuities of defined size.

The real shape of a defect can be seen in Fig. 3; the photo was taken after fracturing another, similar specimen.
Fig. 3: The real shape of an artificial defect
For testing the specimen, a strict ultrasonic testing procedure was given to the volunteer operators. They examined the weld at a registration level of a Ø 3 mm flat-bottomed hole (FBH), using 2 MHz angle probes with 45°, 60° and 70° angles of incidence, in four directions. The test lasted approximately eight hours.
The means of the values measured by the 24 operators are given in Table 1, and the corresponding standard deviations in Table 2. The real (planned) dimensions of the artificial flaws are shown in Table 3. As can be seen, the manufacturing of the flaws is quite good, except for reflector No. 5. Naturally, the No. 1 root defect has no planned dimensions.
Table 1: Means of the values measured by the 24 operators
No  Beginning of flaw [mm]  Length of flaw [mm]  y [mm]  z [mm]  FBH [mm]
1  1.12  289.79  0.17  58.73  6.60
2  42.27  13.62  7.74  31.40  3.73
3  133.76  21.59  8.03  31.04  3.87
4  233.76  26.14  2.10  31.36  4.23
5  325.40  44.76  2.22  31.37  4.53
Table 2: Standard deviations of the measured values
No  Beginning of flaw [mm]  Length of flaw [mm]  y [mm]  z [mm]  FBH [mm]
1  6.83  53.3  4.31  4.33  2.16
2  5.91  7.48  4.11  6.26  0.79
3  6.14  7.55  4.23  6.26  0.81
4  10.05  9.31  7.76  3.78  1.00
5  8.83  13.9  7.74  3.98  1.00
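Table 3: Planned values of the artificial flaws ("-" marks a value that is not given)
No  Beginning of flaw [mm]  Length of flaw [mm]  y [mm]  z [mm]  Code of flaw
1  -  -  -  -  -
2  45  15  -  31.5  B 7.1
3  133  30  -  31.5  B 7.2
4  245  15  -  30  KB 7.1
5  343  30  -  30  KB 7.2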
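As a rough arithmetic check on the comparison between measured and planned values, the deviation of the operators' mean values from the planned values can be computed directly from Tables 1 and 3. The sketch below is only an illustration: the variable and function names are illustrative and not part of the original evaluation, the numbers are transcribed from the tables (written with decimal points), and flaw No. 1 is left out because it has no planned values.

```python
# Values in mm, transcribed from Table 1 (means over 24 operators)
# and Table 3 (planned values); the dictionary keys are the flaw numbers.
mean_begin = {2: 42.27, 3: 133.76, 4: 233.76, 5: 325.40}
mean_length = {2: 13.62, 3: 21.59, 4: 26.14, 5: 44.76}
planned_begin = {2: 45.0, 3: 133.0, 4: 245.0, 5: 343.0}
planned_length = {2: 15.0, 3: 30.0, 4: 15.0, 5: 30.0}


def deviations(measured, planned):
    """Return measured minus planned for every flaw number present in both."""
    return {no: measured[no] - planned[no] for no in planned if no in measured}


begin_dev = deviations(mean_begin, planned_begin)
length_dev = deviations(mean_length, planned_length)

# Print one line per flaw with signed deviations in mm.
for no in sorted(begin_dev):
    print(f"flaw {no}: beginning {begin_dev[no]:+.2f} mm, "
          f"length {length_dev[no]:+.2f} mm")
```

A positive value means that the operators' mean is larger than the planned value.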
The reliability of the operators can be evaluated if the real values corresponding to the measured values are known. In the ultrasonic test the real values can be determined by:
At present we could evaluate the reliability of the operators only on the basis of the mean of all values measured by all operators and of the planned dimensions. The other methods of determining the real values are still in progress, so their results are not yet available.
The SPC evaluation
For the first evaluation, statistical process control (SPC) was used, as McEvans suggests (3). In this version of SPC the reliability is calculated like a signal-to-noise ratio and is expressed in dB. The dB scale is more understandable for NDT operators than mathematical expressions, because they use it every day. In this calculation the real value is the signal, and the error of the measurement is the noise. The formula is:
S/N = -10 \log_{10}(MSD)  (dB)
where:
S/N = signal-to-noise ratio,
MSD = mean squared deviation,
and
MSD = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - R)^{2}
where
x_{i} = measured value,
R = real value,
n = number of measurements.
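The calculation can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation software used for the round-robin test. It assumes the sign convention S/N = -10 log10(MSD), so that smaller deviations give higher dB values, and it takes the MSD as the mean of the squared deviations (x_{i} - R). The function names and the example readings are hypothetical.

```python
import math


def mean_squared_deviation(measured, real):
    """MSD = (1/n) * sum((x_i - R)^2) for n measurements of one real value R."""
    if not measured:
        raise ValueError("at least one measurement is required")
    return sum((x - real) ** 2 for x in measured) / len(measured)


def signal_to_noise_db(measured, real):
    """S/N in dB, ASSUMING the convention S/N = -10 * log10(MSD)."""
    msd = mean_squared_deviation(measured, real)
    if msd == 0:
        return math.inf  # perfect agreement: no noise at all
    return -10.0 * math.log10(msd)


# Hypothetical readings (mm) of a flaw length whose real value is 15 mm.
careful = [14.0, 16.0, 15.0]  # MSD = 2/3
sloppy = [10.0, 20.0, 15.0]   # MSD = 50/3

print(round(signal_to_noise_db(careful, 15.0), 2))
print(round(signal_to_noise_db(sloppy, 15.0), 2))
```

With this convention the operator whose readings lie closer to the real value obtains the higher dB value.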
Fig.4: The reliability of testers 
The results of the calculations can be seen in Fig. 4. This method of evaluation is quick and understandable enough, but it cannot handle the problem of missed reflectors.
If an operator misses a reflector, the evaluation lacks a measured value. If the missing value is entered as x_{i} = 0, the reliability of the operator is underestimated, because the MSD contains the square of (x_{i} - R).
Another problem exists with this calculation method: if the relative value of the missing x_{i} is greater than one (e.g. reflector 5), it has more influence than a missing smaller value (e.g. reflector 2). To repair this problem, the (x_{i} - R) value of a missed reflector was set to 0 in the evaluation. This seems to be a good solution, but the 0 is a "gift", and operators with more missed reflectors get higher dB values. The SPC is understandable for the operators, but it does not include the probability of detection (POD).
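The effect of the two ways of treating a missed reflector can be illustrated with a small worked example. It is a sketch under stated assumptions: the S/N convention is taken as -10 log10(MSD), the planned flaw lengths of reflectors 2 to 5 from Table 3 serve as the real values, and the operator readings are invented for illustration.

```python
import math


def sn_db(deviations):
    """S/N = -10*log10(MSD), with MSD the mean of the squared deviations."""
    msd = sum(d * d for d in deviations) / len(deviations)
    return math.inf if msd == 0 else -10.0 * math.log10(msd)


def deviations(readings, real_values, miss_rule):
    """Deviation x_i - R_i per reflector; None marks a missed reflector.

    miss_rule "zero_reading":   a miss is entered as x_i = 0, so x_i - R_i = -R_i
    miss_rule "zero_deviation": a miss is entered as x_i - R_i = 0 (the "gift")
    """
    result = []
    for x, r in zip(readings, real_values):
        if x is None:
            result.append(-r if miss_rule == "zero_reading" else 0.0)
        else:
            result.append(x - r)
    return result


# Planned lengths of reflectors 2..5 from Table 3 (mm).
planned = [15.0, 30.0, 15.0, 30.0]
# HYPOTHETICAL readings (mm): one operator misses reflector 5, one finds it.
with_miss = [14.0, 28.0, 17.0, None]
found_all = [14.0, 28.0, 17.0, 29.0]

penalised = sn_db(deviations(with_miss, planned, "zero_reading"))
gifted = sn_db(deviations(with_miss, planned, "zero_deviation"))
complete = sn_db(deviations(found_all, planned, "zero_reading"))

print(f"miss as x = 0:     {penalised:.2f} dB")
print(f"miss as deviation 0: {gifted:.2f} dB")
print(f"all found:         {complete:.2f} dB")
```

In this example the "gifted" operator who missed reflector 5 obtains a higher dB value than the operator who found it with a 1 mm error, while entering the miss as x_{i} = 0 lets the single miss dominate the result.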
Fig. 5: The testers' Probability of Detection 
The sophisticated POD evaluation
It is clear to an operator, too, that the result of an NDT examination has four possible outcomes:
True Positive (TP): a defect indication was reported and there was an actual defect present.
True Negative (TN): no defect was reported and no defect was present.
False Positive (FP): a defect indication was reported and there was no defect.
False Negative (FN): no defect was reported and an actual defect was present.
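The four outcomes can be written down as a small decision function. This is a minimal sketch; the function name and the example data are illustrative assumptions and not part of the round-robin evaluation.

```python
from collections import Counter


def outcome(reported: bool, present: bool) -> str:
    """Classify one inspected location by what was reported and what was there."""
    if reported:
        return "TP" if present else "FP"
    return "FN" if present else "TN"


# HYPOTHETICAL (reported, present) pairs for six inspected locations.
observations = [(True, True), (False, False), (True, False),
                (False, True), (True, True), (False, False)]

# Count how often each of the four outcomes occurs.
tally = Counter(outcome(r, p) for r, p in observations)
print(dict(tally))
```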
To evaluate the probabilities in the usual way, the weld is divided into cells. Our weld was divided into 2 cm long cells. It is easy to show that the relations between the four possible outcomes, expressed in numbers of cells, are: (4)
TP + FN = 100 % = number of all flawed cells,
TN + FP = 100 % = number of all unflawed cells.
These relations are valid for the cells of the tested specimens. The formula for calculating the POD is:
p(TP) = POD = TP / (TP + FN)
and the probability of false alarm (PFA) is:
p(FP) = PFA = FP / (FP + TN)
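The cell-based evaluation can be sketched as follows. The example is hedged in several ways: the weld length of 400 mm is an assumption (the real length is not stated), the natural root discontinuity No. 1 is ignored because it has no planned values, a cell counts as flawed or reported as soon as any part of an interval falls inside it, and the operator's report is invented. Only the planned positions of flaws 2 to 5 come from Table 3.

```python
import math

CELL_MM = 20.0          # 2 cm cells, as in the text
WELD_LENGTH_MM = 400.0  # ASSUMPTION: the real weld length is not stated


def cells_hit(intervals, cell_mm=CELL_MM, weld_mm=WELD_LENGTH_MM):
    """Indices of all cells touched by (start, length) intervals given in mm."""
    n_cells = math.ceil(weld_mm / cell_mm)
    hit = set()
    for start, length in intervals:
        first = int(start // cell_mm)
        last = int((start + length - 1e-9) // cell_mm)  # end point is exclusive
        hit.update(i for i in range(first, last + 1) if 0 <= i < n_cells)
    return hit


def pod_pfa(flawed, reported, n_cells):
    """Count TP, FN, FP, TN over the cells and derive POD and PFA."""
    tp = len(flawed & reported)
    fn = len(flawed - reported)
    fp = len(reported - flawed)
    tn = n_cells - tp - fn - fp
    pod = tp / (tp + fn) if tp + fn else float("nan")
    pfa = fp / (fp + tn) if fp + tn else float("nan")
    return tp, fn, fp, tn, pod, pfa


# Planned (beginning, length) of flaws 2..5 from Table 3, in mm.
planned = [(45, 15), (133, 30), (245, 15), (343, 30)]
# HYPOTHETICAL report of one operator, same format.
reported_intervals = [(42, 14), (134, 22), (300, 10), (326, 45)]

n_cells = math.ceil(WELD_LENGTH_MM / CELL_MM)
tp, fn, fp, tn, pod, pfa = pod_pfa(
    cells_hit(planned), cells_hit(reported_intervals), n_cells)
print(f"TP={tp} FN={fn} FP={fp} TN={tn} POD={pod:.2f} PFA={pfa:.2f}")
```

Each operator evaluated in this way yields one (PFA, POD) pair, that is, one point in a POD versus PFA diagram.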
The evaluation of POD versus PFA is shown in Fig. 6, where one point represents the performance of one operator. Because the points of all 24 operators would overlap, only some examples are shown in the figure.
Fig.6: The POD diagram of nine operators 
This type of performance demonstration is correct, because it contains both the accuracy and the probability of detection. But it is very hard for the operators to understand, because they cannot see who is the best of the team. We have to teach them not only the necessity of evaluating the reliability of their work, but also the interpretation of the results.
The figure is practical for the heads of non-destructive testing laboratories, who can see which operator has a good probability of detection and which has good reliability. Unfortunately, it is very hard at this time to express the acceptance level of testers, or the acceptance area of the PFA versus POD diagram.
The requirements of ASME Code Section XI, Appendix VIII are clear (5) and can be expressed as an area of the POD diagram, but they are valid only under the particular circumstances the Code mentions. In other areas the requirements depend on the NDT methods, the specimens and the working conditions.
During the test and the evaluation some experience was gathered. The operators were not interested in the artificial volumetric flaws, because they considered finding and measuring them an easy exercise. In addition, they did their work without responsibility, and the long duration of the test made it boring for them.
Furthermore, the specimen did not provide sufficient information for the evaluation of a sophisticated POD curve, because there were flaws of only two different sizes. Another question is whether the test is still a blind test when the operator tests the same specimen a second time with another probe.
It became clear that the specimens are the key to a round-robin test. The number of specimens and the sizes and shapes of the flaws have to be planned according to the goal of the evaluation. The set of specimens must contain unflawed ones. The operators will work with greater interest if all or most of the flaws are real, because in this way they gain real experience for their everyday work.
In the future we have to find the simplest correct evaluation methods, which are understandable for non-specialists, and collect enough information to determine the acceptance levels of reliability. These levels depend not only on the methods, but also on the needs of the industry, the manufacturing process, and the calculation methods the designers use.
It seems this field of reliability needs a lot of work and cooperation in the exchange of data.
This work has been performed within the framework of the Bilateral Cooperation No. D127 between the Fraunhofer Institut für Werkstoffmechanik and the Bay Zoltán Institute for Logistics and Production Systems, led by Dr. S. G. Blauel and Prof. L. Toth.
The evaluation of the results was supported by the Bundesanstalt für Materialforschung und -prüfung, Lab. VIII.33, Reliability of Non-Destructive Testing, led by Dr. rer. nat. Christina Nockemann. The author expresses his thanks to her and her staff for their help.
NDT.net 