
The Situation

In many disability programs, a physician, psychologist, or other health professional conducts an exam with the claimant. Although it's not easy, a research study could determine an inter-rater reliability estimate[1] for these medical[2] examinations.

After the medical exam, lay[3] adjudicators review all relevant records, including questionnaires and statements from the claimant, along with the medical examiner's report, and apply regulatory standards to reach an administrative/legal decision regarding the disability claim. Researchers could conduct a study to estimate the inter-rater reliability for these adjudicative decisions.
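
For binary approve/deny decisions, such a study might summarize agreement between two raters with Cohen's kappa, which corrects raw agreement for chance. A minimal self-contained sketch (the six decisions are invented for illustration):

```python
def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    labels = set(ratings_a) | set(ratings_b)
    expected = sum((ratings_a.count(lab) / n) * (ratings_b.count(lab) / n)
                   for lab in labels)
    return (observed - expected) / (1 - expected)

# Two adjudicators' approve (1) / deny (0) decisions on six claims (invented).
kappa = cohen_kappa([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 0])
```

A kappa of 1 means perfect agreement; 0 means agreement no better than chance.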

As you can see, the medical examiner's report influences the adjudicative decision, at least to some extent. (In most systems, the examiner's report significantly influences adjudicative decisions.) Thus, the two inter-rater reliability estimates are not independent.

My Question

What is/are the best way(s) to calculate the overall reliability of such a disability determination process?

Brief Background

I'm asking this question because some organizations conduct research on the inter-rater reliability of the adjudicative decisions and describe the results as "the accuracy of our disability determination decisions". The unspoken assumption is that the inter-rater reliability of the medical examinations is 1.00.

Notes

  1. I wasn't sure if "coefficient" would be a better term than "estimate".
  2. Most disability programs refer to all exams as "medical" even though some are actually psychological, audiological, etc.
  3. lay, adj. - Not of or belonging to a particular profession; nonprofessional.

Prior research

I searched SE-Math but found just one related thread: combined reliability

I reviewed the following texts/tutorials (though I may have missed something relevant).

Gwet, Kilem L. Handbook of Inter-Rater Reliability. 4th ed. Gaithersburg, MD: Advanced Analytics, 2014.

Khan Academy, Statistics and Probability.

Stat Trek, Statistics and Probability.

Trochim, William M., James P. Donnelly, and Kanika Arora. Research Methods: The Essential Knowledge Base. 2nd ed. Boston: Cengage Learning, 2016.

1 Answer

I'm not a statistician, but here are some starting thoughts. I've made this into an answer rather than a comment due to its length.

Before you calculate anything, you need to model the situation. There is rarely a single, god-given model. Modeling requires making choices, and these choices are informed by the experience of subject experts.

Modeling I: Patients

You want to understand the inter-rater reliability of the medical and adjudicative evaluations. It seems likely that this reliability is a function of the underlying medical condition, as well as a host of other socio-economic factors. If you don't include this information in your model, you will get average estimates of inter-rater reliability that don't account for the variance across different patient groups. It's up to you to decide what level of granularity to adopt.

Modeling II: Evaluations

Just as there are different ways of modeling the patient pool, there are different ways of modeling the evaluations. One possibility is binary: approve or deny the claim. Binary outcomes make the most sense for the legal decisions. However, I'm imagining that the medical examiner's report contains more information, perhaps a value between $0$ and $5$ indicating the severity of the condition or the examiner's confidence in recommending approval.
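
If the examiner's report is modeled as such an ordinal score, agreement studies typically use a weighted kappa, which penalizes a 0-vs-5 disagreement more heavily than a 4-vs-5 near-miss. A minimal sketch with quadratic weights (the 0-5 scale and the data are assumptions for illustration):

```python
def weighted_kappa(ratings_a, ratings_b, k=6):
    """Quadratically weighted Cohen's kappa for ordinal ratings 0..k-1."""
    n = len(ratings_a)
    # Disagreement weights: 0 on the diagonal, 1 at the 0-vs-(k-1) extreme.
    w = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    # Observed joint distribution and each rater's marginal distribution.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        obs[a][b] += 1 / n
    pa = [ratings_a.count(i) / n for i in range(k)]
    pb = [ratings_b.count(i) / n for i in range(k)]
    disagree_obs = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    disagree_exp = sum(w[i][j] * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1 - disagree_obs / disagree_exp

# Two examiners score the same ten claimants on the assumed 0-5 scale.
kappa_w = weighted_kappa([0, 1, 2, 2, 3, 3, 4, 4, 5, 5],
                         [0, 1, 2, 3, 3, 4, 4, 5, 5, 5])
```

Like the unweighted version, it is 1 for perfect agreement and 0 for chance-level agreement.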

Modeling III: Connections

What is the relationship between healthcare professionals and legal adjudicators? Does an adjudicator treat every healthcare professional's opinion equally? How many healthcare professionals does a given claimant see? Are they assigned at random?

A Simple Model

A model that is too complicated is impossible to use in practice, because data are finite. Let me suggest the following simple model. Let $P$ be the set of patients; $P$ could contain all patients, or could be narrowed down to a group of patients with similar characteristics. Assuming every patient sees $d$ doctors, and each doctor gives a "yes-no" answer, we have a function $f: P \to \{0, 1, \cdots, d\}$ that records the total number of "approve" responses. Assuming every patient has $n$ adjudicators, we then have a function $g: \{0, 1, \cdots, d\} \to \{0, 1, \cdots, n\}$ that takes the total number of medical "approve" responses and outputs the total number of adjudicator "approve" responses.

If you look at the variance of the function $g \circ f$, you get the total variance of the process over the set $P$. If you look at the variance of $f$, you get the total variance of the medical evaluations. Lastly, to get the variance of the legal adjudicators for a fixed medical assessment, you need to take the variances of $g$ conditioned on particular outputs of $f$, and then average these quantities. (Strictly speaking, this last step treats $g$ as random rather than as a fixed function: adjudicators shown the same medical count may still disagree.)
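
Here is a rough simulation of this model showing how the three variances could be estimated from data; the pool size, the number of raters, and the approval probabilities are all invented for illustration:

```python
import random
from collections import defaultdict
from statistics import pvariance

random.seed(0)

# All parameters below are invented for illustration.
P_SIZE = 10_000  # patients in the (assumed homogeneous) pool P
d = 3            # doctors per patient
n = 2            # adjudicators per patient

def f(severity):
    """Medical stage: number of 'approve' votes among the d doctors."""
    return sum(random.random() < severity for _ in range(d))

def g(med_count):
    """Adjudicative stage: number of 'approve' votes among the n
    adjudicators; each approves with probability rising in med_count."""
    return sum(random.random() < med_count / d for _ in range(n))

severities = [random.uniform(0.2, 0.8) for _ in range(P_SIZE)]
med = [f(s) for s in severities]   # outputs of f
legal = [g(m) for m in med]        # outputs of g after f

var_total = pvariance(legal)    # variance of g ∘ f: the whole process
var_medical = pvariance(med)    # variance of f: the medical stage alone

# Variance of g conditioned on each output of f, averaged with weights
# proportional to how often that output occurs.
groups = defaultdict(list)
for m, l in zip(med, legal):
    groups[m].append(l)
var_adjudicative = sum(len(v) * pvariance(v) for v in groups.values()) / P_SIZE
```

The averaged conditional variance is never larger than the total variance, so the simulation also illustrates how much of the process's spread originates at the adjudicative stage.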

To my mind, variance is a good statistic for measuring reliability. If the patients in $P$ are all similar, an assumption of reliability would predict fairly consistent medical and legal evaluations. The larger the variance of the various functions described in the prior paragraph, the more disagreement among the medical and legal experts. Assuming patient homogeneity, this implies a lack of reliability.
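
For what it's worth, the recipe of averaging the conditional variances of $g$ ties these quantities together through the law of total variance. Writing $F = f(X)$ and $G$ for the adjudicator count of a patient $X$ drawn from $P$ (and treating $G$ given $F$ as random, as noted above):

$$\operatorname{Var}(G) \;=\; \underbrace{\mathbb{E}\big[\operatorname{Var}(G \mid F)\big]}_{\text{averaged adjudicative variance}} \;+\; \underbrace{\operatorname{Var}\big(\mathbb{E}[G \mid F]\big)}_{\text{variance inherited from the medical stage}}$$

so the total variance of the process splits exactly into a purely adjudicative part and a part carried over from the medical evaluations, which speaks directly to the original question about the two estimates not being independent.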