Performance and Usability of Machine Learning for Screening in Systematic Reviews: A Comparative Evaluation of Three Tools

Research Report

This report is available in PDF only (Methods Report [PDF, 424 KB]).

Purpose of Project

For title and abstract screening, we explored the reliability of three machine learning tools when used to automatically eliminate irrelevant records or to complement the work of a single reviewer. We also evaluated the usability of each tool.

Key Messages

  • The reliability of the tools to automatically eliminate irrelevant records was highly variable; a median of 70% (range 0-100%) of relevant records were missed compared to dual independent screening.
  • Abstrackr and RobotAnalyst improved upon single reviewer screening by identifying studies that the single reviewer missed, but performance was not reliable. DistillerSR provided no advantage over single reviewer screening.
  • The tools' usability relied on multiple properties: user friendliness; qualities of the user interface; features and functions; trustworthiness; ease and speed of obtaining the predictions; and practicality of the export files.
  • Standards for conducting and reporting evaluations of machine learning tools for screening will facilitate their replication.

Structured Abstract

Background. Machine learning tools can expedite systematic review (SR) completion by reducing manual screening workloads, yet their adoption has been slow. Evidence of their reliability and usability may improve their acceptance within the SR community. We explored the performance of three tools when used to: (a) eliminate irrelevant records (Automated Simulation) and (b) complement the work of a single reviewer (Semi-automated Simulation). We evaluated the usability of each tool.

Methods. We subjected three SRs to two retrospective screening simulations. In each tool (Abstrackr, DistillerSR, and RobotAnalyst), we screened a 200-record training set and downloaded the predicted relevance of the remaining records. We calculated the proportion missed and the workload and time savings compared to dual independent screening. To test usability, eight research staff undertook a screening exercise in each tool and completed a survey, including the System Usability Scale (SUS).
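The two performance measures named above can be illustrated with a minimal sketch. The definitions used here are assumptions, since this abstract does not give the formulas: proportion missed is taken as the share of relevant records that the tool predicted to be irrelevant, and workload savings as the share of screening decisions avoided relative to dual independent screening (two decisions per record). The function names and the example numbers are hypothetical and are not data from the report.

```python
def proportion_missed(relevant_ids, predicted_relevant_ids):
    """Share of relevant records that the tool predicted to be irrelevant.

    Assumed definition: missed relevant records / all relevant records.
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant record is required")
    missed = relevant - set(predicted_relevant_ids)
    return len(missed) / len(relevant)


def workload_savings(decisions_made, total_records, reviewers=2):
    """Share of screening decisions avoided versus dual independent screening.

    Assumed baseline: every record is screened by `reviewers` people.
    """
    baseline = reviewers * total_records
    if baseline <= 0:
        raise ValueError("total_records and reviewers must be positive")
    return 1 - decisions_made / baseline


# Made-up example: 4 relevant records, of which the tool flagged only 2.
missed = proportion_missed({1, 2, 3, 4}, {1, 2, 5})

# Made-up example: 1,000 records; only a 200-record training set is
# screened, by two reviewers (400 decisions instead of 2,000).
saved = workload_savings(decisions_made=400, total_records=1000)

print(f"proportion missed: {missed:.0%}")
print(f"workload savings: {saved:.0%}")
```

Under these assumptions, a tool can show large workload savings while still missing a large share of relevant records, which is why the two measures are reported together.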

Results. Using Abstrackr, DistillerSR, and RobotAnalyst respectively, the median (range) proportion missed was 5 (0 to 28) percent, 97 (96 to 100) percent, and 70 (23 to 100) percent in the Automated Simulation and 1 (0 to 2) percent, 2 (0 to 7) percent, and 2 (0 to 4) percent in the Semi-automated Simulation. The median (range) workload savings was 90 (82 to 93) percent, 99 (98 to 99) percent, and 85 (85 to 88) percent for the Automated Simulation and 40 (32 to 43) percent, 49 (48 to 49) percent, and 35 (34 to 38) percent for the Semi-automated Simulation. The median (range) time savings was 154 (91 to 183), 185 (95 to 201), and 157 (86 to 172) hours for the Automated Simulation and 61 (42 to 82), 92 (46 to 100), and 64 (37 to 71) hours for the Semi-automated Simulation. Abstrackr identified 33 to 90 percent of records erroneously excluded by a single reviewer, while RobotAnalyst performed less well and DistillerSR provided no relative advantage. Based on reported SUS scores, Abstrackr fell in the usable, DistillerSR the marginal, and RobotAnalyst the unacceptable usability range. Usability depended on six interdependent properties: user friendliness, qualities of the user interface, features and functions, trustworthiness, ease and speed of obtaining predictions, and practicality of the export file(s).
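For readers unfamiliar with the SUS scores referred to above, the standard scoring of the ten-item questionnaire can be sketched as follows. Each item is rated from 1 to 5; odd-numbered items contribute the rating minus 1, even-numbered items contribute 5 minus the rating, and the sum is multiplied by 2.5 to give a score from 0 to 100. The function name is hypothetical, the example responses are made up, and the cut-offs that separate the usable, marginal, and unacceptable ranges are not stated in this abstract, so they are left out.

```python
def sus_score(responses):
    """Score one completed System Usability Scale questionnaire.

    `responses` holds the ten item ratings in questionnaire order,
    each an integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(responses) != 10:
        raise ValueError("the SUS has exactly 10 items")
    if any(rating not in (1, 2, 3, 4, 5) for rating in responses):
        raise ValueError("each rating must be an integer from 1 to 5")

    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded.
            total += rating - 1
        else:
            # Even-numbered items are negatively worded.
            total += 5 - rating
    return total * 2.5


# Made-up responses from one hypothetical participant.
print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
```

A tool's overall score is then typically summarized across participants, for example as the mean or median of the individual scores; which summary the report used is not stated here.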

Conclusions. The workload and time savings afforded in the Automated Simulation came with increased risk of erroneously excluding relevant records. Supplementing a single reviewer’s decisions with relevance predictions (Semi-automated Simulation) improved upon the proportion missed in some cases, but performance varied by tool and SR. Designing tools based on reviewers’ self-identified preferences may improve their compatibility with present workflows.

Journal Citation

Gates A, Guitard S, Pillay J, et al. Performance and usability of machine learning for screening in systematic reviews: a comparative evaluation of three tools. Systematic Reviews. Epub 15 November 2019.

Citation

Suggested citation: Gates A, Guitard S, Pillay J, Elliott SA, Dyson MP, Newton AS, Hartling L. Performance and Usability of Machine Learning for Screening in Systematic Reviews: A Comparative Evaluation of Three Tools. (Prepared by the University of Alberta Evidence-based Practice Center under Contract No. 290-2015-00001-I.) AHRQ Publication No. 19(20)-EHC027-EF. Rockville, MD: Agency for Healthcare Research and Quality; November 2019. Posted final reports are located on the Effective Health Care Program search page. DOI: https://doi.org/10.23970/AHRQEPCMETHMACHINEPERFORMANCE