Classification of study design can help provide a common language for researchers. Within a systematic review, definition of specific study designs can help guide inclusion, assess the risk of bias, pool studies, interpret results, and grade the body of evidence. However, recent research demonstrated poor reliability for an existing classification scheme.
To review tools used to classify study designs; to select a tool for evaluation; to develop instructions for application of the tool to intervention/exposure studies; and to test the tool for accuracy and interrater reliability.
We contacted representatives from all AHRQ Evidence-based Practice Centers (EPCs), other relevant organizations, and experts in the field to identify tools used to classify study designs. Twenty-three tools were identified; 10 were relevant to our objectives. The Steering Committee ranked the 10 tools using predefined criteria. The highest-ranked tool was a design algorithm for studies of health care interventions developed, but no longer advocated, by the Cochrane Non-Randomised Studies Methods Group. This tool was used as the basis for our classification tool and was revised to encompass more study designs and to incorporate elements of other tools. A sample of 30 studies was used to test the tool. Three members of the Steering Committee developed a reference standard (i.e., the "true" classification for each study); 6 testers applied the revised tool to the studies. Interrater reliability was measured using Fleiss' kappa (κ) and accuracy of the testers' classification was assessed against the reference standard. Based on feedback from the testers and the reference standard committee, the tool was further revised and tested by another 6 testers using 15 studies randomly selected from the original sample.
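For readers unfamiliar with the two measures named above, the following is a minimal sketch of how Fleiss' kappa and accuracy against a reference standard can be computed. It is an illustration only, not the report's actual analysis code: the function names, the data layout (one list of labels per study, one label per tester), and the example labels are all assumptions, and the four-study data set is invented for demonstration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of studies, each a list with one label per rater."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every study needs the same number (>= 2) of ratings")

    # Tally how many raters chose each category for each study.
    categories = sorted({label for row in ratings for label in row})
    counts = []
    for row in ratings:
        tally = Counter(row)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean over studies of the proportion of agreeing rater pairs.
    per_study = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_study) / n_subjects

    # Chance agreement: sum of squared overall category proportions.
    total = n_subjects * n_raters
    p_expected = sum(
        (sum(row[j] for row in counts) / total) ** 2
        for j in range(len(categories))
    )
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_expected) / (1 - p_expected)


def accuracy(ratings, reference):
    """Share of individual classifications that match the reference standard."""
    total = sum(len(row) for row in ratings)
    correct = sum(
        label == truth
        for row, truth in zip(ratings, reference)
        for label in row
    )
    return correct / total


# Hypothetical data: 4 studies, each classified by 3 testers.
example_ratings = [
    ["RCT", "RCT", "RCT"],
    ["cohort", "cohort", "case-control"],
    ["case-control", "case-control", "case-control"],
    ["RCT", "cohort", "cohort"],
]
example_reference = ["RCT", "cohort", "case-control", "cohort"]

print(f"kappa = {fleiss_kappa(example_ratings):.2f}")
print(f"accuracy = {accuracy(example_ratings, example_reference):.2f}")
```

Pooling accuracy across all tester-study pairs is one possible choice; it could equally be reported per tester or per study.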
In the first round of testing, the interrater reliability was fair among the testers (κ = 0.26) and the reference standard committee (κ = 0.33). Disagreements occurred at all decision points in the algorithm; revisions were made based on the feedback. The second round of testing showed improved interrater reliability (κ = 0.45, moderate agreement) with improved, but still low, accuracy. The most common disagreements were whether the study was "experimental" (5/15 studies) and whether there was a comparison (4/15 studies). In both rounds of testing, the level of agreement for testers who had completed graduate-level training was higher than for testers who had not completed such training.
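The verbal labels attached to the kappa values above ("fair", "moderate") match the widely used Landis and Koch benchmarks. The text does not name the scale, so the sketch below assumes that scale was used; the function name is hypothetical.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch (1977) agreement label."""
    if kappa < 0:
        return "poor"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Kappa values reported for the two rounds of testing.
for value in (0.26, 0.33, 0.45):
    print(value, landis_koch_label(value))
```

Under this assumed scale, the three reported values fall in the same bands the text describes.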
Potential reasons for the observed low reliability and accuracy include the lack of clarity and comprehensiveness of the tool, inadequate reporting of the studies, and variability in user characteristics. Application of a tool to classify study designs in the context of a systematic review should be accompanied by adequate training, pilot testing, and documented decision rules.