Editorial

Changes to health screening – we need to remain vigilant

Pages 319-320 | Received 12 Apr 2022, Accepted 18 Apr 2022, Published online: 08 Jul 2022

The balance of risks and benefits may change when the technical aspects of a form of health screening are changed. The introduction of artificial intelligence (AI) into mammographic screening is discussed, with comments on the limitations of studies to date and on the characteristics of future studies that could provide the information needed to properly assess the impact of AI on screening mammography.

There are well-established criteria for the introduction of a new health screening program, described more than 50 years ago and still appropriate today [Citation1]. However, there are no such criteria for the review of existing screening practices. We should not assume that once a form of health screening is in place it should continue forever; there is a strong argument for review [Citation2], as the balance of the benefits and harms of screening can change over time. Review of a screening program does not mean it would cease – it could be stopped, continue unchanged or be modified in some way, such as a reduced frequency of screening or screening limited to individuals at higher-than-average risk. Ropers et al. have argued that review of an existing program could be justified for a range of reasons [Citation2], including the following:

  • the condition being screened for has become less common;

  • treatment for the condition may have improved to the point where the benefit of early diagnosis is less important than it was prior to the availability of such effective treatment;

  • new evidence might challenge the basis for the original initiation of the screening program, for example where harms of screening have become apparent that were not evident when the program was initiated; or

  • the test used for screening has changed.

In relation to the last point and mammographic screening, a change that has already happened is the move from screen-film to digital mammography. An area of active research is the evaluation of digital breast tomosynthesis as an alternative to two-view digital mammography for breast cancer screening (in breast tomosynthesis, multiple images of the breast are captured and used to construct a ‘three-dimensional’ image of the breast). Another modification to screening mammography on the horizon is the incorporation of AI into the interpretation of mammographic images. Can we assume that the balance of benefits and harms in screening mammography is unaffected by these changes?

Why is the introduction of AI into screening mammography being contemplated? Even the best performing screening mammography program (with a low rate of interval cancers coupled with a low recall rate) will miss some cancers which subsequently present as interval cancers prior to the next scheduled screening round. It is possible that the incorporation of AI will improve sensitivity and reduce the rate of interval cancers. Furthermore, the best performing screening mammography programs tend to use two independent readers and so are relatively expensive to run. The use of AI could help to reduce these costs.

Early publications about the value of AI in the analysis of radiology images were optimistic and there were even predictions that AI would threaten the future of the specialty of radiology [Citation3]. It might sound like a simple exercise to establish whether the incorporation of AI into mammographic screening improves the performance of a screening program. However, the reality is that, despite hundreds of publications on the subject in recent years, a systematic review by Freeman et al. has shown that the picture is still not clear [Citation4].

The ideal approach to evaluating AI in a mammographic screening program would be a randomized controlled trial in which the performance of a trained AI program is evaluated prospectively in a screening setting, with standard practice as the comparator. A randomized trial allows the least biased assessment of the performance of the new test [Citation5], and the criteria for judging performance need to be clear. In the case of screening for breast cancer, we need to be cautious that an apparent improvement in sensitivity with a new test is not just a function of identifying extra cancers that might never cause symptoms in a woman’s lifetime. If the trial design is such that women are randomly allocated to AI or to standard practice, short-term follow-up would allow comparison of interval cancer rates and long-term follow-up would allow assessment of the impact of the use of AI on mortality [Citation6]. In the short term, all mammograms identified as problematic by either assessment method would need to be followed up in the same way, and all women not recalled would need to be followed up until the next scheduled screening round to identify interval cancers. An alternative study design is one in which each mammogram is assessed independently using both modalities (standard practice and AI). Although this approach allows assessment of relative sensitivities, it does not allow comparison of interval cancer rates, as cases identified as problematic by either screening test will have been investigated and treated [Citation6].

I have described here only the simplest comparison, AI alone versus standard practice, but studies could evaluate the use of AI as a triage step, as a second reader or as a reading aid for a radiologist. The need for prospective evaluation in a real-world setting is especially important where AI is being used as a prompt to a human reader, as radiologists may react to AI prompts in unexpected ways, including becoming overly dependent on them or being distracted by them [Citation7,Citation8].

Freeman et al. retained only 12 studies in their systematic review and none were prospective evaluations of AI in a screening setting [Citation4]. Applying the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) [Citation9] criteria to the existing literature, Freeman et al. identified many issues with the evidence to date [Citation4].

In terms of risk of bias, issues included:

  • inappropriate exclusions or loss to follow-up;

  • the testing dataset not being independent of the training dataset;

  • readers not being blinded to whether or not the woman had been diagnosed with breast cancer;

  • the thresholds for the operation of the AI system not being pre-specified;

  • the human readers working in a ‘research’ rather than a ‘work as usual’ environment; and

  • the diagnosis of breast cancer being incomplete, either because only women identified by the index radiologist were recalled or because the remaining women were not followed up for at least 2 years (to ensure identification of interval cancers).

In terms of applicability, the issues included:

  • participants not being consecutive or a random sample of women attending for screening;

  • the AI program not being commercially available and pre-specified cut-offs not being used; and

  • the AI not being integrated into the standard screening pathway.

Ideally, accuracy needs to be reported as the actual numbers of true positives, false positives, true negatives and false negatives, which allows calculation of sensitivity and specificity for both the AI system and the comparator (and the difference between them).
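As a purely illustrative sketch (the counts below are hypothetical and are not drawn from any study cited here), the two measures follow directly from the four cell counts:

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}
\]

For example, hypothetical counts of TP = 80, FN = 20, TN = 9700 and FP = 200 would give a sensitivity of 80/100 = 0.80 and a specificity of 9700/9900 ≈ 0.98; the same calculation applied to the comparator arm gives the paired values needed to estimate the difference between the two approaches.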

Freeman et al. concluded that the evidence to date does not allow an assessment of the accuracy of AI in mammographic screening [Citation4].

A common study design is the assessment of the performance of AI in a retrospective cohort enriched with women with cancer. In these studies, a cancer diagnosis will have been made in the screening round only if the radiologist who originally read the mammograms recalled the case for further investigation. Women in these datasets therefore need to be followed up until the next scheduled screening round to ensure interval cancers are identified and included as cases of breast cancer; otherwise, a mammogram identified as high risk by the AI program but not recalled by the original radiologist could be counted as a false positive even if the woman subsequently presented with an interval cancer. One could also argue for including the findings of the subsequent screening round as well as the interval cancers, in case the AI program was detecting very slow-growing tumors – which raises the question of whether AI is detecting the same or a different spectrum of disease from that identified by radiologists. Studies have reported a different mix of cancer types identified by AI compared with radiologists, although the differences are not consistent across studies. In at least one study, the AI system was more likely to detect micro-calcifications, which are associated with ductal carcinoma in situ [Citation10].

This comes back to the issue I posed at the beginning in relation to the need to remain vigilant about health screening. We need to be cautious when new technology is introduced into a screening program and not assume that what seems like a simple modification leaves the balance of benefits and harms unchanged.

Potential conflict of interest

The author alone is responsible for the content and writing of the article. Robin Bell is a Co-Chief Investigator of a research project at Monash Health entitled ‘Artificial Intelligence (AI) Software in an Australian Screening Program’ that is receiving support in kind from ScreenPoint Medical.

Source of funding

Nil.

References

  • Wilson JM, Jungner YG. Principles and practice of mass screening for disease [Principios y metodos del examen colectivo para identificar enfermedades]. 1968. Available from: https://apps.who.int/iris/handle/10665/37650.
  • Ropers FG, Barratt A, Wilt TJ, et al. Health screening needs independent regular re-evaluation. BMJ. 2021;374:n2049.
  • Chockley K, Emanuel E. The end of radiology? Three threats to the future practice of radiology. J Am Coll Radiol. 2016;13(12 Pt A):1415–1420.
  • Freeman K, Geppert J, Stinton C, et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ. 2021;374:n1872.
  • Bell KJ, Bossuyt P, Glasziou P, et al. Assessment of changes to screening programmes: Why randomisation is important. BMJ. 2015;350:h1566.
  • Irwig L, Houssami N, Armstrong B, et al. Evaluating new screening tests for breast cancer. BMJ. 2006;332(7543):678–679.
  • Wallis MG. Artificial intelligence for the real world of breast screening. Eur J Radiol. 2021;144:109661.
  • Hickman SE, Baxter GC, Gilbert FJ. Adoption of artificial intelligence in breast imaging: evaluation, ethical constraints and limitations. Br J Cancer. 2021;125(1):15–22.
  • QUADAS-2. Available from: https://www.bristol.ac.uk/population-health-sciences/projects/quadas/quadas-2/.
  • Watanabe AT, Lim V, Vu HX, et al. Improved cancer detection using artificial intelligence: a retrospective evaluation of missed cancers on mammography. J Digit Imaging. 2019;32(4):625–637.
