


Developing and Testing a Tool for the Classification of Study Designs in Systematic Reviews of Interventions and Exposures

Research Report Jan 4, 2011


Structured Abstract

Background

Classification of study design can help provide a common language for researchers. Within a systematic review, definition of specific study designs can help guide inclusion, assess the risk of bias, pool studies, interpret results, and grade the body of evidence. However, recent research demonstrated poor reliability for an existing classification scheme.

Objectives

To review tools used to classify study designs; to select a tool for evaluation; to develop instructions for application of the tool to intervention/exposure studies; and to test the tool for accuracy and interrater reliability.

Methods

We contacted representatives from all AHRQ Evidence-based Practice Centers (EPCs), other relevant organizations, and experts in the field to identify tools used to classify study designs. Twenty-three tools were identified; 10 were relevant to our objectives. The Steering Committee ranked the 10 tools using predefined criteria. The highest-ranked tool was a design algorithm for studies of health care interventions developed, but no longer advocated, by the Cochrane Non-Randomised Studies Methods Group. This tool was used as the basis for our classification tool and was revised to encompass more study designs and to incorporate elements of other tools. A sample of 30 studies was used to test the tool. Three members of the Steering Committee developed a reference standard (i.e., the "true" classification for each study); 6 testers applied the revised tool to the studies. Interrater reliability was measured using Fleiss' kappa (κ), and accuracy of the testers' classification was assessed against the reference standard. Based on feedback from the testers and the reference standard committee, the tool was further revised and tested by another 6 testers using 15 studies randomly selected from the original sample.
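As an illustration of the two measures named above, the following is a minimal sketch of Fleiss' kappa and of accuracy against a reference standard. The function names, the toy ratings, and the definition of accuracy as the overall proportion of tester classifications that match the reference standard are assumptions made for this example; the report does not specify how its calculations were implemented.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of studies, each rated by the same number of raters.

    `ratings` is a list of rows; each row holds one category label per rater.
    Returns NaN when chance agreement is 1 (every rating is the same category).
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every study needs the same number (>= 2) of ratings")

    category_totals = Counter()
    observed_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this study.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    p_observed = observed_sum / n_subjects
    total_ratings = n_subjects * n_raters
    # Agreement expected by chance from the overall category proportions.
    p_expected = sum((t / total_ratings) ** 2 for t in category_totals.values())
    if p_expected == 1.0:
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


def accuracy(ratings, reference):
    """Share of all tester classifications that match the reference standard."""
    matches = sum(
        label == ref for row, ref in zip(ratings, reference) for label in row
    )
    return matches / (len(ratings) * len(ratings[0]))


# Hypothetical data: 4 studies, 3 testers each (not taken from the report).
toy_ratings = [
    ["RCT", "RCT", "RCT"],
    ["cohort", "cohort", "case-control"],
    ["cohort", "RCT", "case-control"],
    ["case-control", "case-control", "case-control"],
]
toy_reference = ["RCT", "cohort", "cohort", "case-control"]

if __name__ == "__main__":
    print(f"kappa = {fleiss_kappa(toy_ratings):.3f}")
    print(f"accuracy = {accuracy(toy_ratings, toy_reference):.2f}")
```

Kappa corrects the observed agreement for the agreement expected by chance, so it can be low even when raw agreement looks reasonable.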

Results

In the first round of testing, interrater reliability was fair among both the testers (κ = 0.26) and the reference standard committee (κ = 0.33). Disagreements occurred at all decision points in the algorithm, and the tool was revised based on the feedback. The second round of testing showed improved interrater reliability (κ = 0.45, moderate agreement) with improved, but still low, accuracy. The most common disagreements were whether the study was "experimental" (5/15 studies) and whether there was a comparison (4/15 studies). In both rounds of testing, the level of agreement was higher for testers who had completed graduate-level training than for those who had not.
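The labels "fair" and "moderate" used above match the commonly cited Landis and Koch bands for kappa. The sketch below maps a kappa value to those bands; that the report used this particular scale is an assumption based on the labels, and the function name is hypothetical.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the Landis and Koch agreement bands (assumed scale)."""
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label
    raise ValueError("kappa cannot exceed 1")


if __name__ == "__main__":
    # Kappa values reported in the abstract.
    for value in (0.26, 0.33, 0.45):
        print(value, interpret_kappa(value))
```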

Conclusions

Potential reasons for the observed low reliability and accuracy include the lack of clarity and comprehensiveness of the tool, inadequate reporting of the studies, and variability in user characteristics. Application of a tool to classify study designs in the context of a systematic review should be accompanied by adequate training, pilot testing, and documented decision rules.

Project Timeline


Jan 3, 2011
Topic Initiated
Jan 4, 2011
Research Report
Page last reviewed November 2017
Page originally created November 2017

Internet Citation: Research Report: Developing and Testing a Tool for the Classification of Study Designs in Systematic Reviews of Interventions and Exposures. Content last reviewed November 2017. Effective Health Care Program, Agency for Healthcare Research and Quality, Rockville, MD.
