NDT.net • May 2005 • Vol. 10 No.5

## An Ethical Problem in the Statistics of Defect Detection Test Reliability

Terry Oldberg, M.S.E., M.S.E.E., P.E.
terry@oldberg.biz, 650-941-0533


Speech to the Golden Gate Chapter of the American Society for Nondestructive Testing. Prepared for presentation March 10, 2005.

### Introduction

Thanks for the opportunity to speak to you! My topic is an ethical problem in the statistics of defect detection test reliability.

To give you a taste of what's to come, a decade ago, the American Society of Mechanical Engineers published a paper that I had co-authored; it was entitled "Erratic Measure." This paper claimed inconsistency between theories of the reliability of defect detection tests and empirical reality.

In the wake of publication of this paper, two paths lay open to the NDT community. One was to refute the paper's claim. The other was to eliminate the inconsistency.

A decade later, the NDT community has taken neither path. This, in a nutshell, is the ethical problem.

While my focus will be on the ethical side, the scientific side provides context for the ethical discussion. Thus, I'll begin with an overview of the science.

### Overview of the Science

For future reference, I need to define the term "OR." As I will use the term, OR is synonymous with the phrase "inclusive disjunction" in logic. In particular, if A is a proposition, B is a proposition, and A OR B is a proposition, A OR B is true unless A is false and B is false.

We can use OR in determining an outcome of a coin flip that is certain to occur. In a coin flip, that Heads Occurs is one proposition. That Tails Occurs is another. That Heads Occurs OR Tails Occurs is still another proposition, and it is TRUE unless Heads Occurs and Tails Occurs are both FALSE. Because of the physical constraints imposed by a coin, either Heads Occurs or Tails Occurs must be TRUE. Thus, we may conclude that Heads Occurs OR Tails Occurs is TRUE.
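As an illustration, the definition of OR and the coin-flip argument can be sketched in a few lines of Python. This is only a minimal sketch: the names `inclusive_or`, `truth_table`, `possible_flips` and `heads_or_tails_certain` are assumptions made for the example and do not come from the talk.

```python
from itertools import product


def inclusive_or(a: bool, b: bool) -> bool:
    """Inclusive disjunction: false only when both propositions are false."""
    return a or b


# Truth table for A OR B over all four combinations of truth values.
truth_table = {
    (a, b): inclusive_or(a, b) for a, b in product([True, False], repeat=2)
}

# Physical constraint of a coin: in every flip, exactly one of
# "Heads Occurs" and "Tails Occurs" is true.
# Each tuple is (heads_occurs, tails_occurs).
possible_flips = [(True, False), (False, True)]

# "Heads Occurs OR Tails Occurs" holds for every flip the coin allows,
# so the event Heads OR Tails is certain to occur.
heads_or_tails_certain = all(inclusive_or(h, t) for h, t in possible_flips)
```

The only row of the truth table that is false is the one in which both propositions are false, and the coin rules that row out.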

A coin flip is an example of an "event." Using this terminology, we may restate the previous finding by stating that "the event of Heads OR Tails is certain to occur."

In the statistical sciences, the nature of the relation from a set of events that are certain to occur to a statistical sample plays a central role. In particular, if there is no relation or a relation that is not one-to-one, probability theory is empirically violated. If a relation exists and it is one-to-one, probability theory is empirically preserved.

A situation in which this relation is one-to-one is coin flipping. As I've already shown, in a coin flip, the event of Heads OR Tails is certain to occur. Events of this type relate one-to-one to flipped coins. The set of flipped coins is an example of a "sample." A flipped coin is an example of a "sampling unit."

Because the set of events that are certain to occur relates one-to-one to the associated sample, probability theory is empirically preserved for coin flipping. Thus, probability theory works in theorizing about coin flipping.
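The one-to-one condition can be made concrete with a small check. The following is a sketch under stated assumptions: events and sampling units are represented by made-up string labels, the relation is given as a list of (event, unit) pairs, and the function name `classify_relation` is hypothetical. The second and third examples are invented solely to show what a failure of the condition looks like; they are not data from any study.

```python
from collections import defaultdict


def classify_relation(events, units, pairs):
    """Classify the relation from certain-to-occur events to sampling units.

    events: identifiers of the events
    units:  identifiers of the candidate sampling units
    pairs:  (event, unit) tuples stating which event relates to which unit
    """
    events, units = set(events), set(units)
    units_per_event = defaultdict(set)
    events_per_unit = defaultdict(set)
    for event, unit in pairs:
        units_per_event[event].add(unit)
        events_per_unit[unit].add(event)

    if not units_per_event:
        return "no relation"
    # One-to-one: every event has exactly one unit and every unit one event.
    each_event_has_one_unit = all(len(units_per_event[e]) == 1 for e in events)
    each_unit_has_one_event = all(len(events_per_unit[u]) == 1 for u in units)
    if each_event_has_one_unit and each_unit_has_one_event:
        return "one-to-one"
    return "not one-to-one"


# Coin flipping: each Heads-OR-Tails event pairs with exactly one flipped coin.
flip_events = ["event-1", "event-2", "event-3"]
flipped_coins = ["coin-1", "coin-2", "coin-3"]
coin_relation = classify_relation(
    flip_events, flipped_coins, list(zip(flip_events, flipped_coins))
)

# Hypothetical failure: a single event relates to two candidate units.
broken_relation = classify_relation(
    ["event-1"], ["unit-1", "unit-2"],
    [("event-1", "unit-1"), ("event-1", "unit-2")],
)

# Hypothetical failure: the event relates to no candidate unit at all.
missing_relation = classify_relation(["event-1"], ["unit-1"], [])
```

Only the coin-flipping case is classified as one-to-one; the other two are the kinds of relation for which, by the argument above, probability theory is empirically violated.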

To pass on to the field of testing, any test that is subject to error generates events of the type A True Positive OR A False Negative OR A True Negative OR A False Positive that are certain to occur. The defect detection test is an example of a test that is subject to error. Thus, it generates a set of events of this type.
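For illustration, the four outcomes can be enumerated in code. This sketch assumes that there is a well-defined item under test for which "defect present" and "test indicates a defect" are each either true or false; whether such an item exists for defect detection tests is exactly the question taken up next. The names `Outcome` and `classify_outcome` are assumptions made for the example.

```python
from enum import Enum
from itertools import product


class Outcome(Enum):
    TRUE_POSITIVE = "true positive"
    FALSE_NEGATIVE = "false negative"
    TRUE_NEGATIVE = "true negative"
    FALSE_POSITIVE = "false positive"


def classify_outcome(defect_present: bool, test_indicates_defect: bool) -> Outcome:
    """Map the actual condition and the test's call to one of four outcomes."""
    if defect_present:
        if test_indicates_defect:
            return Outcome.TRUE_POSITIVE
        return Outcome.FALSE_NEGATIVE
    if test_indicates_defect:
        return Outcome.FALSE_POSITIVE
    return Outcome.TRUE_NEGATIVE


# Every combination of condition and call lands on exactly one outcome,
# so the disjunction of the four outcomes is certain to occur.
all_outcomes = {
    classify_outcome(present, indicated)
    for present, indicated in product([True, False], repeat=2)
}
```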

Question: Is probability theory valid in relation to statements about the reliability of defect detection tests? It is if the associated sampling units and sample are defined. For them to be defined, the one-to-one relation that I described earlier must be present.

In 1994, I searched the Engineering Library at Stanford University for a description of the sampling units and sample of defect detection reliability.

I didn't find a clear statement of this anywhere in the library. One work seemed obliquely to suggest that the sample was an inspected structure's discontinuities. This was incorrect: at most, the discontinuities could form a proper subset of the sample.

In the period between 1984 and 1995, I asked a number of people involved in defect detection reliability research to identify the sampling units of the studies in which they were engaged. I asked the statistician for a U.S. Nuclear Regulatory Commission study of the reliability of the ASME Section XI test of pressurized water reactor steam generator tubes to identify the sampling units. She never answered. I asked the NRC's Research Director. He delayed for months and then obliquely implied that they were discontinuities; this was incorrect.

I asked the ASME's Committee on Risk Based Inspection to identify the sampling units that they thought would underlie the probabilities of the risk-based inspection technology they were developing. They delayed for a year and then forwarded a photo of material containing a crack. Did they mean by this that a structure's discontinuities would form the sample? If so, they were wrong. Was there reason to believe that probability theory would be preserved by the Committee's sample? No.

It seemed that people were specifying and conducting statistical research without having a clear idea of the sampling units! This would have been unimaginable in other branches of the statistical sciences. Normally, determining the nature of the sampling units and sample is the first order of business in designing a statistical study.

Years after my fruitless attempt at getting a description of the sampling units and sample from the statistician for the NRC study of the reliability of the ASME Section XI test of pressurized water reactor steam generator tubes, I got hold of a copy of the final report for this study. Data presented in the report revealed the following three situations: 1) under the condition that the event of A True Negative OR A False Positive was certain to occur, it did not relate to objects in the empirical world; 2) under the condition that the event of A True Positive OR A False Negative was certain to occur and discontinuities were small, this event related one-to-one to discontinuities; 3) under the condition that the event of A True Positive OR A False Negative was certain to occur and discontinuities were not small, this event related one-to-many to discontinuities.

There were four implications for the ASME Section XI test: 1) a Probability of False Call was not defined for it; 2) a Probability of Detection was defined for it only in the case of small discontinuities; 3) the final report of the NRC study asserted that a Probability of Detection existed for non-small discontinuities, and this assertion was false; 4) if readers of the final report of the NRC study believed what was written there, they would falsely believe that a Probability of Detection was defined for non-small discontinuities.

The implication for defect detection tests, in general, was that tests sharing certain characteristics with the ASME Section XI test would share this test's statistical shortcomings. Pending a study of all defect detection tests, the number of such tests would be unknown. I reviewed a handful of tests and found none that preserved probability theory.

A byproduct of violations of probability theory was impairment of one's ability to communicate about statistical ideas. This followed from the fact that such terms as "probability," "sample," "sampling unit," "population," "signal" and "noise" assumed the validity of probability theory locally or globally. "Signal" and "noise" were widely used in reference to defect detection tests but implied a global preservation of probability theory that was not present.

Another byproduct of violations of probability theory was impairment or elimination of one's ability to make decisions about defect detection tests. The standard approach to decision making under uncertainty implements the Utility Theory of von Neumann and Morgenstern: knowing the probabilities of the outcomes of a process and the utilities of these outcomes, one computes the expected utility of the process. When alternatives are available, one computes the expected utility of each of them and selects the process that bears the greatest expected utility. One cannot compute the expected utility if a complete set of outcome probabilities is not present. In this way, one's ability to make decisions is impaired or eliminated.
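The expected-utility calculation, and the way it breaks down when an outcome probability is undefined, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `expected_utility`, the outcome labels and all of the numbers are hypothetical and were chosen only to keep the arithmetic easy to follow. An undefined probability is represented by `None`.

```python
from typing import Mapping, Optional


def expected_utility(
    probabilities: Mapping[str, Optional[float]],
    utilities: Mapping[str, float],
) -> Optional[float]:
    """Return the sum of probability times utility over all outcomes.

    Returns None when the set of outcome probabilities is incomplete:
    an outcome lacks a probability, a probability is undefined (None),
    or the probabilities do not sum to 1.
    """
    if set(probabilities) != set(utilities):
        return None
    if any(p is None for p in probabilities.values()):
        return None
    if abs(sum(probabilities.values()) - 1.0) > 1e-9:
        return None
    return sum(p * utilities[outcome] for outcome, p in probabilities.items())


# Hypothetical utilities shared by every alternative.
utilities = {"good": 10.0, "bad": -2.0}

# Two alternatives with complete sets of outcome probabilities.
eu_a = expected_utility({"good": 0.5, "bad": 0.5}, utilities)    # 5.0 - 1.0
eu_b = expected_utility({"good": 0.25, "bad": 0.75}, utilities)  # 2.5 - 1.5

# With complete probabilities, one selects the greater expected utility.
best = max({"A": eu_a, "B": eu_b}.items(), key=lambda item: item[1])[0]

# An alternative for which one outcome probability is not defined:
# its expected utility cannot be computed, so it cannot be compared.
eu_c = expected_utility({"good": None, "bad": 0.75}, utilities)
```

In this toy setting, alternatives A and B can be ranked, while alternative C cannot enter the comparison at all. That is the sense in which a missing outcome probability impairs or eliminates the decision.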

In mid-1994, a colleague and I cast a portion of these findings into the form of a paper; we titled the paper "Erratic Measure." The title alluded to the quantity which, rather than probability, would have to be used in characterizing the reliability of the ASME Section XI test and others like it. Probability had a maximum value of 1, but constancy of its maximum value would have to be sacrificed in this new quantity; hence, rather than being a measuring rod of fixed length, it would be a non-constant or "erratic" measuring rod. In the fall of 1994, we learned that an ASME conference committee had accepted the paper for publication (www.ndt.net/article/v04n05/oldberg/oldberg.htm) and oral delivery.

To my knowledge, in the period following publication of "Erratic Measure," no work has refuted or limited its claims. Thus, I have nothing else to report of a scientific nature.

### The Ethical Problem

I do have incidents to report that are of an ethical nature. The incidents I'll describe resulted from communications between me and four organizations in the wake of the acceptance of "Erratic Measure" for publication. The four are a) the Nuclear Regulatory Commission, b) the Federal Aviation Administration, c) the ASME Section XI Committee and d) the ASNT.

Let's start with the NRC. The NRC had seemed unaware of the situation revealed by "Erratic Measure." I felt that they should be immediately informed, so I mailed a preprint to a high-level NRC official. My cover letter recommended action on the situation revealed by the paper.

Neither he nor anyone else at the NRC ever responded to this letter. However, I found through the ASME conference chairman that the NRC had responded to the ASME. In its response, the NRC had asked the ASME to kill publication of "Erratic Measure."

The conference committee responded to the NRC's request by adding a fourth referee; he was an academic statistician. I learned that he had read the paper and judged it "excellent." Thereupon, the conference committee had rejected the NRC's motion to kill publication of "Erratic Measure." When I spoke to him, the conference chairman described the NRC's request as "technically unfounded."

Soon after publication, I wrote again to a high-level NRC official. In my letter, I pointed out that my position had prevailed within the peer review process and that the NRC's had been rejected. I called for action on the situation revealed by "Erratic Measure."

The NRC responded by stonewalling. This pattern of behavior continues to this day.

Another federal agency with responsibilities for enhancing public safety through NDT was the Federal Aviation Administration. After "Erratic Measure" was published, I wrote to the FAA's chief administrator, enclosing a copy of the paper and suggesting review of the inspection methods used for passenger aircraft, in light of the paper's conclusions.

A spokesperson for the FAA responded that the FAA did not compensate consultants for their time or travel expenses, as a matter of policy. As I earned my living as a consultant and resided several thousand miles from the FAA's engineering staff, this was a barrier to interaction. Sensing an opportunity to overcome this barrier, I informed the agency of a meeting my co-author and I were about to have with another federal agency near the FAA's offices; in this meeting, "Erratic Measure" was to be discussed. When I arrived at the meeting, I learned that the FAA was barred from attending. I heard that federal law prohibited inter-agency meetings that were not open to the public. I've heard nothing further from the FAA.

By the way, I've found what I believe to be unmistakable signs of empirical violations of probability theory in the final report on the FAA's research on the reliability of defect detection methods in inspecting the rivet holes of aging aircraft.

Another organization with responsibilities for public safety via NDT was the ASME Section XI Committee. The Section XI Committee was responsible for the design of the probability-theory-violating test of pressurized water reactor steam generator tubes. More generally, the Committee wrote the rules governing the inspection of nuclear power plants in the United States and some foreign countries.

Immediately following the publication of "Erratic Measure," I wrote to the Committee to offer to present the paper and discuss it with them, at a meeting they were scheduled to hold in San Francisco 9 months later.

In the next 6 months, I didn't receive a response.

Thus, I wrote again to the Section XI Committee, to ask what was happening.

A spokesman for the Section XI Committee responded by telephone. In a three-point reply, he said that a) I could come to the meeting but could not confer with the Committee on the issue on which I had proposed conferring, b) the Committee didn't understand statistics and c) the Committee did not wish to change the rules of inspection of nuclear plants because nuclear power was a dying, money-losing business.

In the period following publication of "Erratic Measure," I've attempted communication with the ASNT on a number of occasions.

Some of these have been constructive. For example, your Golden Gate Chapter has responded favorably on both occasions on which I have proposed delivering speeches. Also, a committee formed a few years ago to improve the technology of reliability assessment invited me to join; I had to decline for financial reasons.

Other contacts with the ASNT have been fruitless. For example, I contacted a number of past ASNT presidents on this issue. None responded constructively. Most didn't respond at all.

I'd like to share with you a story about my most recent interaction with the ASNT before this one. In 2001, Materials Evaluation published an article entitled "How Well Does Your NDT Work?" I found three problems with it.

First, referencing the pertinent prior literature is required of authors of scientific works. "Erratic Measure" pertains to the topic of how well NDT works and was published before "How Well Does Your NDT Work?" Yet the authors do not reference "Erratic Measure."

Second, "How Well Does Your NDT Work?" does not address the issue that is raised by "Erratic Measure."

Third, by assuming probability theory and recommending a course that generates violations of probability theory, the article leads its readers into the inconsistency between probability theory and empirical reality that is exhibited by the ASME Section XI test and NRC study.

"How Well Does Your NDT Work?" references a work produced by the Air Force Materials Lab in 1999: Military Handbook 1823. When I read it, I found that Military Handbook 1823 had the same deficiencies as "How Well Does Your NDT Work?"

In 2003, I submitted a Letter to the Editor of Materials Evaluation. The letter pointed out that 8 years had elapsed since publication of "Erratic Measure" without refutation of its claim or elimination of the inconsistency it revealed. My letter called for immediate refutation or elimination of the inconsistency.

Months later, the editor contacted me to tell me that ASNT referees had recommended against publication. He said he was required to follow their recommendation and would do so. The referees implied that Military Handbook 1823 had superseded "Erratic Measure" but did not explain how. They stated that, as authors of Military Handbook 1823 "...were well-versed in NDT reliability and had significant backgrounds in statistical methods, it is the opinion of the reviewers that it is highly unlikely that serious problems exist within that document."

I responded by posting an essay on the forum of the Web-based magazine ndt.net. The essay was entitled "Argumentum ad Verecundiam at the ASNT." Argumentum ad Verecundiam is one of the logical fallacies. Translated to English, "Argumentum ad Verecundiam" is "Argument from Authority."

In the essay, I point out that the ASNT referees' argument for rejecting publication of my Letter to the Editor is of the form of Argumentum ad Verecundiam. This form has the following three-element structure: 1) Person A is claimed to be an authority on subject S; 2) Person A makes claim C about subject S; 3) Therefore, claim C is true.

As none of the logical fallacies can be used in scientific discourse, Argumentum ad Verecundiam cannot be used. However, I learned from the editor of Materials Evaluation and from the ASNT's president that ASNT referees are free to use Argumentum ad Verecundiam in defeating submissions. They didn't say this directly, but their actions implied it. It followed that the ASNT was not a scientific institution, appearances to the contrary notwithstanding.

### Closure

The foregoing brings you up to date on my activities on the ethical front. Thus, it's time to close my talk. Before ending it, I'd like to leave you with a summary of what I've tried to convey.

A decade ago, the ASME published a paper claiming inconsistency between theories of the reliability of defect detection tests and empirical reality. In the wake of publication of this paper, two paths lay open to the NDT community. One was to refute the paper's claim. The other was to eliminate the inconsistency.

A decade later, the NDT community has taken neither path. This is the ethical problem that I have posed in this talk.

Now, I'd like to open the proceedings to questions.
