One of the ways in which these quality system concerns are affecting day-to-day business operations in the European Community and European Free Trade Association is through European Standard EN 45001, the primary objective of which is to promote confidence in "testing laboratories" through establishing criteria to which such laboratories should conform, and which may be used in the accreditation of laboratories (or of test facilities in general). This standard addresses such topics as impartiality, technical competence, and cooperation with customers.
Other European or international standards documents, such as ISO Guide 25, have addressed many of the same topics, but have additionally drawn attention to the need to demonstrate that test methods are fit for their intended purpose. ISO Guide 25 is currently being rewritten to incorporate the requirements of EN 45001 into a single document that will explicitly require "validation" of test methods. The primary focus of such validation efforts has been on quantifying the uncertainty associated with the measurement process. This is also a primary focus of many statistical process control (SPC) programs. SPC techniques are usually thought of as providing means for monitoring the consistency with which a process produces a product (through the use of control charts, for example). However, in recent years there has been increasing interest in applying SPC techniques to demonstrating the fitness for purpose of test methods. Parameters such as "Gage R&R" express the Repeatability and Reproducibility of measurement devices or methods. Thus, although the motivation is slightly different, both SPC and ISO 9000 initiatives have led to requirements for "validation" of test methods through quantification of measurement uncertainties.
The idea of validation is clearly laudable, and implementation for measurement methods is usually quite straightforward. The importance of reaching consensus on requirements for validation is highlighted by the current trend in Europe to require validation of any test method, including NDE, before the facility using it can be accredited. However, significant problems have been found in trying to apply these standard assessment techniques to nondestructive evaluation (NDE) methods. It appears that the numerous differences between "detection" and "measurement" are partly responsible for these difficulties, but a large part is played by the lack of an explicit understanding of what is appropriate to the validation of NDE methods, both in efforts to apply SPC-derived techniques and in efforts to apply ISO 9000-derived techniques. This paper reviews these differences in more detail and discusses several alternatives for dealing with the validation of NDE methods, based mainly on existing experience in several sectors of industry.
These DAR and EUROLAB documents, largely based on suggestions by Morkowski, recognize that the key element in validation of a test method is determination of characteristic key data, such as (but not limited to) the uncertainty of the results, for defined boundary conditions and for an expected application range. Several ways of categorizing the techniques available for determining uncertainty have been suggested. For example, both the ISO and US guides for evaluating uncertainty in measurement distinguish two alternatives: (i) Type A: statistical processing of experimentally determined data; and (ii) Type B: estimation of uncertainty by other methods.
Type A evaluations involve standard statistical procedures associated with analysis of variance (ANOVA): multiple measurements should be made under essentially constant conditions (to assess repeatability), or with deliberate changes in at least one condition (to assess reproducibility). Typical conditions to be controlled, or deliberately changed, include the observer, the measuring instrument, and any other factor that may plausibly influence the variability of the results. Type B evaluations involve scientific judgement using all available relevant information, such as consideration of instrumental, technical, environmental and human factors; they might include previous calibration and measurement data, uncertainties assigned to reference data taken from handbooks, and prior knowledge of similar measuring instruments, for example.
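As an illustration of a Type A evaluation, the following minimal sketch computes repeatability and reproducibility components (in the spirit of "Gage R&R") from a small crossed study in which two operators each measure three parts twice. The data values and the simplified variance decomposition are illustrative assumptions, not material from any of the guides cited above.

```python
# Minimal sketch of a Type A (ANOVA-style) repeatability/reproducibility
# evaluation. All data values are hypothetical.
import numpy as np

# measurements[operator, part, trial] -- invented gauge-study data
measurements = np.array([
    [[10.1, 10.2], [12.0, 11.9], [ 9.8,  9.9]],   # operator A
    [[10.3, 10.4], [12.2, 12.1], [10.0, 10.1]],   # operator B
])
n_ops, n_parts, n_trials = measurements.shape

# Repeatability: pooled within-cell variance (same operator, same part)
cell_means = measurements.mean(axis=2, keepdims=True)
repeatability_var = ((measurements - cell_means) ** 2).sum() / (
    n_ops * n_parts * (n_trials - 1))

# Reproducibility: variance between operator averages, with the
# repeatability contribution to those averages subtracted out
op_means = measurements.mean(axis=(1, 2))
between_op_var = op_means.var(ddof=1)
reproducibility_var = max(0.0,
    between_op_var - repeatability_var / (n_parts * n_trials))

grr = np.sqrt(repeatability_var + reproducibility_var)
print(f"repeatability std:   {np.sqrt(repeatability_var):.3f}")
print(f"reproducibility std: {np.sqrt(reproducibility_var):.3f}")
print(f"Gage R&R std:        {grr:.3f}")
```

A full ANOVA treatment would also extract part-to-part variation and the operator-part interaction; the sketch keeps only the two components named in the text.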
Similar concepts are involved in other published or proposed guidelines, but grouped differently, or with different emphases. An EAL-EUROLAB position paper identifies at least eleven possible ways for validating a test method, and groups them as "scientific" or "comparative" approaches. DAR-ATF document ATF/27/96e further subdivides these approaches into four categories:
a) Systematic examination of parameters influencing the measurement result. This requires review of the quantities influencing the measurement method, and quantification of the associated uncertainty by the systematic variation of these quantities, using several test objects. This method is widely applicable, and is particularly useful where relevant reference standards are not readily available.
b) Calibration, with examination of the parameters influencing the measurement. This method is similar to (a) except that the output (measurement result) is compared to the input (reference standard) and that the evaluation of the variability of the reference standards must be included. The results are then valid only within the specific boundary conditions used.
c) Comparison with other test methods. Results from the test method under evaluation are compared with those from one or more other methods for which the variability has already been characterized. The smaller the difference between results from the methods being compared, the higher the confidence that the corresponding uncertainty of the new test method is small. This method is useful when no appropriate reference standards are available, or when approaches (a) or (b) would cause considerable additional effort.
d) Interlaboratory comparison tests. Results obtained by several test facilities examining one or more identical or similar test objects with the same test method are compared. The smaller the difference between results from the participating laboratories, the higher the confidence that the corresponding uncertainty of the test method is small. This corresponds to what is known in NDE as "Round Robin Tests".
ATF/27/96e proposes an additional alternative:
e) Controlled assessment of the uncertainty of the results. This method, illustrated in Fig. 1, involves dividing the test method into procedural modules (such as sample preparation, measurement, etc.) and estimating the uncertainty associated with each module using theoretical understanding of the test method and prior experience in its use, i.e. the assessor must be an expert in the test method. Separate estimates of uncertainty are made for trouble-free performance during each module, typically assumed to apply to about 2/3 of all test results, and for irregular performance, based on the assessor's past experience, typically taken as applicable to 95% of all results. This method may be the only one practicable during the early stages of development of a test method, but would normally be regarded as providing only interim estimates, to be replaced by one of the other four methods (a)-(d) as soon as possible.
Fig. 1: Characterization of testing methods by controlled assessment
None of these approaches provides explicit guidance about what constitutes an acceptable uncertainty for specific test methods. Generally this must be decided on a case-by-case basis as part of the validation process, taking into account the needs of the customer, while recognizing that pursuit of a low level of uncertainty for a single module is unlikely to be justifiable if another module of the same test method is already known to contribute a much larger uncertainty.
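To make the point about dominant modules concrete, the following sketch combines module uncertainties in quadrature, as the ISO/GUM framework prescribes for independent contributions. The module names and numerical values are hypothetical.

```python
# Minimal sketch: combined standard uncertainty from independent modules,
# u_total = sqrt(sum of u_i^2). Module names and values are invented.
import math

module_uncertainty = {
    "sample preparation": 0.05,
    "calibration":        0.02,
    "measurement":        0.20,   # dominant contributor
    "data evaluation":    0.03,
}

u_combined = math.sqrt(sum(u ** 2 for u in module_uncertainty.values()))
print(f"combined standard uncertainty: {u_combined:.3f}")

# Halving a small contributor barely changes the total, which is why
# effort spent there is hard to justify while "measurement" dominates:
module_uncertainty["calibration"] /= 2
u_new = math.sqrt(sum(u ** 2 for u in module_uncertainty.values()))
print(f"after halving calibration:     {u_new:.3f}")
```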
Other differences between measurement and detection can cause more significant problems. For example, special consideration is necessary in dealing with NDE techniques (such as penetrant or radiographic inspections) for which the result of an inspection is not a measured value but a binary classification (e.g. as "acceptable" or "rejectable"). Even for those NDE techniques (such as eddy-current or ultrasonic inspection) that do produce quantitative output signals, the frequent application over ranges which approach quite close to material or inspection system noise poses problems that are not usually faced in conventional measurement/metrology situations. Other difficulties arise because most NDE techniques provide only indirect information about the size of the "defect". This is recognized in most NDE control documents by requiring rejection of "indications" - not defects - larger than some specified size. The indication - the observable quantity - is often only moderately well correlated with defect size (since it is rarely dependent on defect size alone), although defect size is likely to be of major concern in assessing the fitness for purpose of the product. The distinction between "indications" and "defects" (which sometimes causes misunderstandings between NDE engineers and engineers in other disciplines) leads to questions about whether it is indication amplitudes or defect sizes that should properly be the focus for validation of NDE methods. Many documents specifying requirements for specific NDE methods make no mention of actual defect size; others require the estimation of defect size from the available inspection data, a process that is likely to be accompanied by large uncertainties.
Fig. 2: Four possible diagnosis results in NDT.
TP: true positive indication (hit)
TN: true negative indication (correct "no defect")
FN: false negative indication
FP: false positive indication
A closely related question is whether validation of an NDE method should be concerned with characterization of its detection capability, or only with characterization of its measurement performance. This in turn leads to the question of whether it is the Probability of Detection (POD) or the variability in the POD that is appropriate; the latter would more closely resemble what is done in validation of measurement methods. Finally, it has been suggested that the Probability of False Alarm (PFA) should also be taken into account in characterizing an NDE method. According to signal detection theory, and as realized in practice, there are four possible response results of the NDE system (Fig. 2). Of each complementary pair of outcomes (the pairs summing to 100%), one member has to be considered for a complete characterization of the system. No agreed national or international standard exists for dealing with validation of the detection aspects of NDE, except for the limits on defect detection rates and false alarm rates given in ASME Section XI, Appendix VIII. Another idea is that a reliable detection capability might be traced back to a certain set of essential variables of the NDE system.
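The four outcomes of Fig. 2 reduce to two independent rates, as the following sketch illustrates for hypothetical hit/miss trial data (the counts are invented for illustration): POD is complementary to the miss rate, and PFA to the correct-rejection rate, which is why one member of each pair suffices.

```python
# Minimal sketch: the four diagnosis outcomes of Fig. 2 computed from
# invented hit/miss trial data with known ground truth.
truth     = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]   # 1 = defect present
indicated = [1, 1, 0, 0, 1, 1, 0, 0, 1, 0]   # 1 = defect reported

TP = sum(t == 1 and i == 1 for t, i in zip(truth, indicated))
FN = sum(t == 1 and i == 0 for t, i in zip(truth, indicated))
FP = sum(t == 0 and i == 1 for t, i in zip(truth, indicated))
TN = sum(t == 0 and i == 0 for t, i in zip(truth, indicated))

POD = TP / (TP + FN)   # complementary to the miss rate FN/(TP+FN)
PFA = FP / (FP + TN)   # complementary to the correct-rejection rate
print(f"POD = {POD:.2f}, PFA = {PFA:.2f}")
```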
Standardization Based on Reference Objects
This approach is by far the most widely used, and is in use in most, or perhaps all, industries. One or more reference objects (such as DIN image quality indicators for radiography, or ASTM E127 reference blocks for ultrasonic inspection) are used, in combination with written procedures controlling details of the inspection method, to pursue and demonstrate consistency. The intent is to achieve essentially the same inspection conditions, independent of where, when, or by whom an inspection is conducted. This type of validation is often used to provide process control information of a qualitative nature: loss of control at some earlier manufacturing stage is indicated by the unexpected occurrence of numerous or large indications, for example.
Capability for detection of "real" (naturally occurring) defects is rarely addressed, but is sometimes inferred from the size of the simulated defects in the reference objects, or from the size of defects that have been detected in past inspections using the same conditions. This inferential process can often lead to false conclusions about the detectability of real defects, but it is commonly assumed that validation occurred implicitly during the development of the standard and its use over many years. Considering the example of the signal transfer chain of an NDE system in Fig. 3, the application of standards means that the performance of the NDE system is assured by defining the values of the parameters in each module. This is strongly related to method (a) for general testing.
Fig. 3: Signal transfer chain for radiography
Performance Demonstrations for Empirical Applications
In its simplest form this approach involves the use of material samples containing known defects as a basis for studying the effects on detectability of factors such as calibration, changes in inspection equipment, or inspector training programs. For the other inspection parameters, those not deliberately changed, consistency is still pursued through the use of reference objects and control procedures. Test programs of this kind are often used in conjunction with "round-robin" or other interlaboratory data acquisition procedures. This type of test can be applied equally well to NDE methods producing qualitative (i.e. pass/fail) or quantitative (i.e. signal amplitude) outputs, but in practice it appears to have been used most widely with qualitative methods, in the form of blind trials as indicated in Fig. 4, where the input (the true defect situation in the component) and the output (the defect indication in the inspection report) are compared and the signal transfer chain of Fig. 3 is treated as a black box.
Examples are known from the aerospace sector for penetrant and radiographic inspection, which typically show a large dependence on human factors, and for which the inspection threshold is consequently likely to vary significantly. To date, the major efforts in this type of performance demonstration have been the PISC program (focusing on characterization of all types of ultrasonic testing of nuclear power plant components) and the PDI program at the Electric Power Research Institute (EPRI) NDE Center in Charlotte, NC (under whose auspices some hundreds of testing companies have already passed examinations in manual ultrasonic testing and in automated testing of pressure vessel nozzles, according to ASME Code, Section XI, Appendix VIII).
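As an illustration of how blind-trial hit/miss data of this kind might be reduced to a POD curve, the following sketch fits a logistic POD(a) model to per-size hit fractions. The sizes, counts, and the choice of a logistic form are illustrative assumptions, not data or procedures from the PISC or PDI programs.

```python
# Minimal sketch: logistic POD(a) fit to invented hit/miss data.
import numpy as np
from scipy.optimize import curve_fit

sizes  = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0])   # defect size, mm
hits   = np.array([  1,   3,   6,   8,  10,  10])   # detections per size
trials = np.array([ 10,  10,  10,  10,  10,  10])   # attempts per size

def pod(a, a50, b):
    """Logistic POD model: 0.5 at a = a50, slope controlled by b."""
    return 1.0 / (1.0 + np.exp(-(a - a50) / b))

params, _ = curve_fit(pod, sizes, hits / trials, p0=[1.5, 0.5])
a50, b = params
print(f"a50 = {a50:.2f} mm (size detected with 50% probability)")
print(f"POD at 2 mm: {pod(2.0, a50, b):.2f}")
```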
Fig. 4: Principle of a "Performance Demonstration" assessment
Fig. 5: ROC (Receiver Operating Characteristic) as a reliability curve for assessment
Detection capability is usually expressed in one of two forms: POD as a function of defect size, or POD as a function of PFA. For the first of these formats, a single curve represents the relationship between POD and size, for specific inspection parameters (such as penetrant type) and for a single inspection threshold (such as "report all indications larger than 1 mm"); PFA is usually not explicitly considered in this data format. The second format (which is usually referred to as a Relative or Receiver Operating Characteristic, or ROC) sometimes incorporates a family of curves, each one of which represents the relationship between POD and PFA (see Fig. 5), as the inspection threshold is varied, for defects of a single size. From the signal detection point of view, these two formats are just different ways of expressing the same underlying relationships between "signal", "noise" and the selected inspection threshold. A set of POD curves, each having a different PFA, can be transformed into a set of ROC curves, each for a different flaw size, for example.
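The equivalence of the two formats can be made concrete with a simple signal detection model: assuming Gaussian amplitude distributions for "noise" and "signal plus noise" (the parameters below are illustrative), each inspection threshold yields one (PFA, POD) pair, and sweeping the threshold traces out an ROC curve.

```python
# Minimal sketch of the signal-detection view behind POD and ROC curves.
# Distribution parameters are illustrative only.
import numpy as np
from scipy.stats import norm

noise  = norm(loc=0.0, scale=1.0)   # amplitude response, no defect
signal = norm(loc=2.0, scale=1.0)   # amplitude response, defect present

for t in np.linspace(-2, 5, 8):     # sweep the inspection threshold
    pfa = noise.sf(t)               # P(noise amplitude > threshold)
    pod = signal.sf(t)              # P(signal amplitude > threshold)
    print(f"threshold {t:5.2f}:  PFA = {pfa:.3f}  POD = {pod:.3f}")

# Raising the threshold lowers PFA but also lowers POD; a larger defect
# (higher signal mean) shifts the whole ROC toward the top-left corner.
```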
In other applications only a single curve appears, usually as a result of deliberately combining data from defects of a range of sizes; in this form the underlying dependence of POD and PFA on defect size is suppressed. This single-curve ROC has been especially favored in reporting the results of capability demonstrations in the power generation industry; the results have been used on an empirical basis (i.e. to measure POD rather than test whether a specified POD value was attained) for a variety of purposes, such as comparison of inspection methods or evaluating the effects of inspector training programs.
Fig. 6: "â versus a" philosophy for the determination of POD; concept developed by the USAF (US Air Force)
Fig. 7: Modular validation for application to NDE methods
The modular approach facilitates the incorporation of results from simulation programs into the validation, for example; encourages selection of the best means for characterizing the uncertainty (or other quality characteristics, such as POD) for each module; and readily allows the overall validation results to be updated as new methods or new data become available.
This modular approach might provide a scientific basis for the "Technical Justification" developed empirically by ENIQ.
Fig. 8: Scheme of the example
Module 1: Physics of the method and influence of the device (radiography with tube, film)
POD1: Determined by modelling (see Fig. 9, with the resulting POD in Fig. 10a)
Interface: Signal on screen, film or C-scan (crack image with 0.01 O.D. minimum contrast on film)
Module 2: Performance of interpretation of the signals etc. by human inspectors
POD2: Hit/miss evaluation by the inspectors in a series of experiments (here: mean of the results of 5 inspectors on each crack of a certain depth, see Fig. 10b)
Fig. 9: Geometrical setup
For the model calculation, we applied a ray-tracing program to the model shown in Fig. 9. The POD in Fig. 10a is estimated by the "â versus a" approach. As â we used the difference in optical density between signal (notch region) and noise (surroundings of the notch), integrated over some mm of notch length. As a we used the notch depth as a percentage of wall thickness. In good agreement with the experience of radiographic experts, the POD (at 95% confidence) is 100% at 4% wall thickness. For module 2, we consider experimental data from ferritic tube welds with strain-induced stress corrosion cracking. In comparison to other cracking mechanisms (e.g. intergranular stress corrosion cracking), the crack shapes are flatter and straighter, so the notch model can be applied, although a certain zigzag line remains. On the cross-sections from the destructive investigation, we measured the effective crack depth parallel to the possible X-ray path for -5°, 0°, and +5° angles of incidence, and in no case was it smaller than 4% wall thickness. From the physical point of view, all of our cracks therefore produce an image on the film with delta O.D. > 0.01 (the usual detection criterion for human eyes). The experimental POD values were calculated from the indications of 7 human inspectors for each crack depth. The experimental raw data, averaged over the 7 inspectors' findings, are shown in Fig. 10b. There appears to be no further dependence on crack depth, only a random variation of the POD around 0.9.
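For readers unfamiliar with the "â versus a" approach, the following sketch shows its core steps under simplified assumptions: a linear fit of the response â against size a with Gaussian residual scatter, and POD(a) computed as the probability that â exceeds the decision threshold âdec. All numerical values are invented for illustration and are not the data behind Fig. 10a.

```python
# Minimal sketch of an "â versus a" POD estimate, assuming a linear
# response model with Gaussian scatter. Values are invented.
import numpy as np
from scipy.stats import norm

a     = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # depth, % wall thickness
a_hat = np.array([0.004, 0.009, 0.016, 0.021, 0.024, 0.031])  # delta O.D.

# Least-squares fit a_hat = b0 + b1 * a, with residual scatter sigma
b1, b0 = np.polyfit(a, a_hat, 1)
sigma = np.std(a_hat - (b0 + b1 * a), ddof=2)

a_dec = 0.01   # decision threshold (delta O.D. visible to the eye)
def pod(size):
    # P(â > âdec) for a defect of the given size
    return norm.sf(a_dec, loc=b0 + b1 * size, scale=sigma)

for size in (2, 4, 6):
    print(f"POD at {size}% wall thickness: {pod(size):.3f}")
```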
Result: for cracks greater than 4% wall thickness, PODTOT = POD1 * POD2 = 1.0 * 0.9 = 90%.
Fig. 10: a) Theoretical POD: "â versus a" from simulated notch experiment (modelling POD); object thickness: 20 mm, notch width: 40 µm, source diameter: 4 mm, source-to-film distance: 700 mm, film: AGFA D4, Dmin = 2, âdec = (deltaD)min = 0.02, â = (deltaD)local
b) Experimental results of film interpretation (mean of five inspectors)
On the other hand, since knowledge of POD is essential for category E, demonstration that adequate detectability has been achieved, and characterization of the uncertainty of both the detection and measurement aspects of such inspections, appear desirable. Finally, many of these issues seem sufficiently significant to require considerable additional attention before final decisions are reached. Resolution through inter-industry or international discussion and consensus seems both desirable and essential.
Most of these questions might be answered by applying integral methods such as risk-informed ISI (in-service inspection), where NDE is embedded in the whole process of component maintenance.