publication date: May 3, 2019
Trials & Tribulations
A comparison of study designs for estimating overdiagnosis in cancer screening
Barnett S. Kramer
Contractor, NCI Division of Cancer Control and Population Sciences, Office of the DCCPS Director
Former director of the NCI Division of Cancer Prevention
Philip C. Prorok
Contractor, Office of the Director, NCI Division of Cancer Prevention
Former chief of the Biometry Research Group at DCP
Overdiagnosis is defined as the diagnosis of an asymptomatic cancer that would not have become clinically evident during the person’s lifetime in the absence of screening or similar activities, such as diagnostic imaging tests that reveal “incidentalomas.”
The whole concept of overdiagnosis seems counterintuitive to the public and even to many clinicians, because the traditional view, encouraged by public health messages and by medical training, is that cancer is lethal unless detected early. This core belief system has driven the quest for increasingly sensitive screening tests that can detect as many asymptomatic cancers as possible.
However, the natural history of such asymptomatic tumors is unknown, and some may be very indolent or even non-progressive. Nevertheless, because cancer overdiagnosis and its potential to trigger overtreatment are important harms of screening, recent research has been devoted to estimating its frequency in association with various screening tests.
The aim of this commentary is to describe and contrast the various methods used to estimate the amount of overdiagnosis incurred by several types of cancer screening. The primary approaches are pathology or imaging studies, mathematical or simulation modeling, examination of trends in population data, and randomized trials.(1)
Pathology or imaging studies are aimed at predicting future behavior of cancers at a fixed point in time, based on pathologic or anatomic features. The identified features are assumed to be related to the natural history of the disease and are predictive of the ultimate progression, or lack thereof, and outcome of the cancer.
This key assumption, which infers an evolving process from a static picture, is difficult, if not impossible, to verify. The approach is therefore of limited utility, because it misses the dynamic aspects of lesion progression.
In a typical modeling approach, a simulation process is used to generate the time of clinical detection in the absence of screening for each screen-detected cancer, based on assumptions about the lead time distribution (lead time is the time between screen detection of a preclinical cancer and the time of clinical diagnosis had the screening not occurred). A time to death from other causes is also generated.
The cancer is overdiagnosed if death from other causes occurs prior to the simulated time of clinical detection. The resulting estimate of overdiagnosis is dependent upon the choice of the lead time distribution, which is virtually never known, because lead time is not observable for any given individual.
The challenge of this modeling approach, therefore, is obtaining a valid estimate of the lead time distribution. Compounding the challenge is the likelihood that for many cancers the preclinical cancers in a population are a mixture of progressive cancers (that would eventually become clinically manifest in the absence of screening) and non-progressive cancers (that would never become clinically manifest).
In practice, the lead time distribution is typically derived from the distribution of time with progressive cancer, and this is obtained from the distribution of the duration of preclinical cancer. Ideally, this preclinical duration distribution would include the probability of non-progressive cancer.
This becomes circular, since non-progressive cancers are an important component of the very entity being estimated. However, the assumption is often made that there is no non-progressive cancer, and further that the distribution of the duration of preclinical cancer used to estimate the mean lead time is exponential.
Together, these overly simplistic assumptions could lead to a substantially biased estimate of overdiagnosis. A more realistic distribution of the duration of preclinical cancer would be desirable, but it is then statistically difficult, if not impossible, to separate the distribution of time to progressive cancer from the probability of non-progressive cancer; and obtaining a reliable estimate of overdiagnosis is problematic.(2) Assumptions about the distribution may explain why microsimulation models of the same screening test have yielded a wide range of estimates of overdiagnosis.(1)
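To make the mechanics of this modeling approach concrete, here is a minimal simulation sketch in Python. It assumes, purely for illustration, exponential distributions for both the lead time and the time to death from other causes (the very simplifications criticized above), and all parameter values are hypothetical:

```python
import random

def simulate_overdiagnosis_fraction(n=100_000, mean_lead_time=2.0,
                                    mean_time_to_other_death=15.0, seed=1):
    """Estimate the fraction of screen-detected cancers that are overdiagnosed.

    For each screen-detected cancer, draw a lead time (years until the cancer
    would have surfaced clinically without screening) and a time until death
    from other causes; the case is overdiagnosed if other-cause death comes
    first. Both distributions are assumed exponential, and the parameter
    values are illustrative only.
    """
    rng = random.Random(seed)
    overdiagnosed = 0
    for _ in range(n):
        lead_time = rng.expovariate(1.0 / mean_lead_time)
        time_to_other_death = rng.expovariate(1.0 / mean_time_to_other_death)
        if time_to_other_death < lead_time:
            overdiagnosed += 1
    return overdiagnosed / n

# The estimate is highly sensitive to the assumed mean lead time:
for mlt in (1.0, 2.0, 5.0):
    print(mlt, simulate_overdiagnosis_fraction(mean_lead_time=mlt))
```

Re-running the sketch with different mean lead times shows how strongly the overdiagnosis estimate depends on that unobservable modeling choice.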
Another commonly used method estimates overdiagnosis as the difference between the annual incidence of cancer in a population or cohort receiving screening and the estimated annual incidence that would have occurred, counterfactually, had that population not been screened. Several estimates of the latter quantity have been used, all of which have important limitations.
Of major importance is the lack of an internal direct comparison group, as would be created by randomization. Absent an internal comparison group, the underlying annual incidence must be estimated by extrapolating prescreening cancer incidence trends, by using annual contemporaneous incidence in an unscreened geographic region, and/or by using annual contemporaneous incidence among persons who did not accept the screening invitation. All such comparison groups are prone to selection bias and other confounding factors. This is also the reason why single-arm screening studies cannot provide a reliable estimate of overdiagnosis.(1)
In principle, the most direct approach to estimate overdiagnosis is a randomized trial with a stop-screen design and sufficient duration of follow-up. (3,4) In a stop-screen randomized trial, participants in the screened arm(s) receive periodic screening until the start of a follow-up period. In this trial design, the number of persons overdiagnosed is estimated by an excess cumulative incidence, the difference in the cumulative number of incident cancers between the screened arm and the control arm that extends well beyond the active screening period.
Unbiased estimation requires that the length of follow-up exceed the longest lead time. The ideal design involves multiple screening rounds, nearly perfect compliance with screening, no screening in either study group after the prescribed screening period ends, and follow-up of all participants to death. This ideal is seldom, if ever, achieved, but at least some randomized trials have achieved useful approximations of it.
In a trial in which both study arms are of equal size, the number of overdiagnosed cases can be determined as the difference between the total numbers of cases in the two arms. Let ns be the total number of cases in the screened arm, nc the total number of cases in the control arm, and nO the number of overdiagnosed cases. Then nO = ns – nc.
In actual trials, not all participants are followed to death. However, if follow-up continues well after screening stops, the cumulative numbers of cases in the two study arms may become equal; in that case there is no overdiagnosis, and nO = 0. If instead the difference in total cases between the two arms becomes constant, nO follows as above. A useful design in practice is therefore a two-arm stop-screen randomized trial with follow-up long enough to determine nO.
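The logic above can be sketched as a small helper (hypothetical, for illustration) that takes cumulative case counts by follow-up year in each arm and reports nO once the between-arm difference has stabilized:

```python
def excess_cumulative_incidence(screened_by_year, control_by_year, stable_years=3):
    """Estimate the number of overdiagnosed cases (nO) in a stop-screen trial.

    Takes cumulative case counts by year of follow-up for equal-sized arms.
    Returns the excess in the screened arm once the between-arm difference
    has remained constant for `stable_years` consecutive years, or None if
    follow-up is too short to show a stable difference. In real data the
    plateau would be judged statistically, not by exact equality.
    """
    diffs = [s - c for s, c in zip(screened_by_year, control_by_year)]
    if len(diffs) >= stable_years and len(set(diffs[-stable_years:])) == 1:
        return diffs[-1]
    return None

# Illustrative (made-up) counts: the difference plateaus at 40 excess cases.
screened = [120, 260, 380, 470, 520, 545, 560, 570]
control  = [ 60, 170, 300, 420, 480, 505, 520, 530]
print(excess_cumulative_incidence(screened, control))  # 40
```

If the two cumulative curves instead converge, the function would report an excess of zero, corresponding to no overdiagnosis.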
For example, in the ovarian component of the PLCO trial, screening was conducted over a five-year period, but in the initial endpoint report follow-up continued through year 13. Compliance with the screening tests was about 80%, and contamination was very low, at less than 5%. A persistent excess of cumulative ovarian cancer cases in the screened arm was observed from year 7 through year 13, the telltale hallmark of overdiagnosis.(5)
When two screening tests for the same cancer are to be compared, and benefit has not been demonstrated for either test, an unscreened control group is ethical and desirable. The preferred design is the three-arm stop-screen randomized trial: participants in one arm are screened with one test, participants in a second arm are screened with the other test, and participants in the third arm serve as controls.
We assume equal numbers of participants in each arm and equal periods of screening in the two screened arms. Ideally, all participants would be followed to death. In practice, follow-up must be sufficiently long to establish either equivalence or a constant difference in total cases between the control arm and each screened arm.
From the control arm the total number of cases is nc. Let i = 1,2 index the two tests and the corresponding trial arms. The total number of cases in screened arm i is nsi. Each test arm is then compared to the control arm to determine the individual numbers of overdiagnosed cases as nOi = nsi – nc. This design thus allows one to observe the number of cases overdiagnosed by each test.
If an unscreened control arm is not ethical or practical, a two-arm stop-screen randomized design with follow-up still has utility for guiding screening policy. The number of overdiagnosed cases cannot be determined, because there is no control arm. However, it is possible to determine the difference in overdiagnosed cases, ∆nO = ns2 – ns1, which can aid in ranking the harm done by each screening program with respect to overdiagnosis.
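The bookkeeping for both comparative designs can be summarized in a few lines; the case counts below are hypothetical and assume equal-sized arms:

```python
def overdiagnosis_three_arm(n_s1, n_s2, n_c):
    """Per-test overdiagnosed cases against a shared unscreened control arm."""
    return n_s1 - n_c, n_s2 - n_c

def overdiagnosis_difference_two_arm(n_s1, n_s2):
    """Without a control arm, only the between-test difference is identifiable."""
    return n_s2 - n_s1

# Hypothetical total case counts after sufficient follow-up:
n_O1, n_O2 = overdiagnosis_three_arm(n_s1=540, n_s2=580, n_c=500)  # (40, 80)
delta = overdiagnosis_difference_two_arm(540, 580)                  # 40
```

Note that the two-arm comparison recovers only the 40-case difference between the tests, not the absolute numbers (40 and 80) that the control arm makes visible.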
It is worth noting a concern inherent in the often-used single-cohort paired design for comparing two screening tests, wherein all participants are in one cohort and each person receives both tests. Importantly, overdiagnosed cases cannot be determined from this design, because there is no control arm.
Cases will be identified by screening and during post-screening follow-up. The totality of cases existing during follow-up is a mixture of cases detected by one test or the other that would have been diagnosed clinically in the absence of screening, plus cases missed by both tests, plus cases newly developing after screening ceases, plus cases overdiagnosed by one test or the other test or both. However, the overdiagnosed cases cannot be identified nor linked to one test or the other in this design.
The extent of overdiagnosis can depend upon various features of a screening program. In the randomized trial designs, near-100% attendance at screening is often achievable, but this is very unlikely to occur in a population screening setting. Any reduction in attendance will likely reduce the estimate of potential overdiagnosis.
Further, in the designs with a control arm, screening contamination in that arm will bias the estimate of overdiagnosis downward. Similarly, when comparing two tests, differential compliance in the screening arms will bias the comparison of overdiagnosis. The number of screening rounds and the length of the interval between screens will also influence the amount of overdiagnosis.
In summary, estimates of the amount of overdiagnosis associated with a screening test have important implications for people making an informed decision based on the benefits and harms of the test, and for generating guidelines for the public and health professions.
Although understanding the level of harm is important, obtaining accurate estimates can present a challenge, and commonly used study designs can yield seriously biased results. The gold standard, if practical, is a large-scale randomized trial with a stop-screen design and sufficient follow-up.
1. Carter JL, Coletti RJ, Harris RP. Quantifying and monitoring overdiagnosis in cancer screening: a systematic review of methods. BMJ 2015; 350: g7773.
2. Baker SG, Prorok PC, Kramer BS. Challenges in quantifying overdiagnosis. JNCI 2017; 109(10): djx064.
3. Etzioni RD, Connor RJ, Prorok PC, Self SG. Design and analysis of cancer screening trials. Statistical Methods in Medical Research 1995; 4(1): 3-17.
4. Prorok PC, Kramer BS, Miller AB. Study designs for determining and comparing sensitivities of disease screening tests. J Med Screening 2015; 22(4): 213-220.
5. Buys SS, Partridge E, Black A, et al. Effect of screening on ovarian cancer mortality: the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. JAMA 2011; 305(22): 2295-2303.