| Levels of Evidence | Footnotes & References | Recent Comments | Related material on Study
Designs and Quality Filters |
Frequently-Asked Questions about these Levels |
What are we to do when the irresistible force of the need to offer clinical advice meets with the immovable object of flawed evidence? All we can do is our best: give the advice, but alert the advisees to the flaws in the evidence on which it is based.
The ancestor of this set of pages was created by Suzanne Fletcher and Dave Sackett 20 years ago when they were working for the Canadian Task Force on the Periodic Health Examination [1]. They generated "levels of evidence" for ranking the validity of evidence about the value of preventive manoeuvres, and then tied them as "grades of recommendations" to the advice given in the report.
The levels have evolved over the ensuing years, most notably as the basis for recommendations about the use of anti-thrombotic agents [2], have grown increasingly sophisticated [3], and have even started to appear in a new generation of evidence-based textbooks that announce, in bold marginal icons, the grade of each recommendation that appears in the texts [4] in bold icons.
However, their orientation remained therapeutic/preventive, and when a group of members of the Centre embarked on creating a new-wave house officers manual (see the EBOC page), the need for levels and grades for diagnosis, prognosis, and harm became overwhelming and the current version of their efforts appears here. They are the work of Chris Ball, Dave Sackett, Bob Phillips, Brian Haynes, Sharon Straus, and Martin Dawes with lots of encouragement and advice from their colleagues.
Comments to this latest version are available. More are welcome as these continue to develop.
Periodic updates will appear here, and surfers are invited to suggest ways that they might be improved or further developed.
A final, cautionary note: these levels and grades speak only to the validity of evidence about prevention, diagnosis, prognosis, therapy, and harm. Other strategies, described elsewhere in the Centres pages, must be applied to the evidence in order to generate clinically useful measures of its potential clinical implications and to incorporate vital patient-values into the ultimate decisions.
Oxford Centre for Evidence-based Medicine Levels of Evidence (May 2001)
|
Level |
Therapy/Prevention, Aetiology/Harm |
Prognosis |
Diagnosis |
Differential diagnosis/symptom prevalence study |
Economic and decision analyses |
|
1a |
SR (with homogeneity*) of RCTs |
SR (with homogeneity*) of inception cohort studies; CDR† validated in different populations |
SR (with homogeneity*) of Level 1 diagnostic studies; CDR† with 1b studies from different clinical centres |
SR (with homogeneity*) of prospective cohort studies |
SR (with homogeneity*) of Level 1 economic studies |
|
1b |
Individual RCT (with narrow Confidence Interval‡) |
Individual inception cohort study with > 80% follow-up; CDR† validated in a single population |
Validating** cohort study with good††† reference standards; or CDR† tested within one clinical centre |
Prospective cohort study with good follow-up**** |
Analysis based on clinically sensible costs or alternatives; systematic review(s) of the evidence; and including multi-way sensitivity analyses |
|
1c |
All or none case-series |
All or none case-series |
Absolute better-value or worse-value analyses †††† |
||
|
2a |
SR (with homogeneity* ) of cohort studies |
SR (with homogeneity*) of either retrospective cohort studies or untreated control groups in RCTs |
SR (with homogeneity*) of Level >2 diagnostic studies |
SR (with homogeneity*) of 2b and better studies |
SR (with homogeneity*) of Level >2 economic studies |
|
2b |
Individual cohort study (including low quality RCT; e.g., <80% follow-up) |
Retrospective cohort study or follow-up of untreated control patients in an RCT; Derivation of CDR† or validated on split-sample§§§ only |
Exploratory** cohort study with good†††reference standards; CDR† after derivation, or validated only on split-sample§§§ or databases |
Retrospective cohort study, or poor follow-up |
Analysis based on clinically sensible costs or alternatives; limited review(s) of the evidence, or single studies; and including multi-way sensitivity analyses |
|
2c |
"Outcomes" Research; Ecological studies |
"Outcomes" Research |
Ecological studies |
Audit or outcomes research |
|
|
3a |
SR (with homogeneity*) of case-control studies |
SR (with homogeneity*) of 3b and better studies |
SR (with homogeneity*) of 3b and better studies |
SR (with homogeneity*) of 3b and better studies |
|
|
3b |
Individual Case-Control Study |
Non-consecutive study; or without consistently applied reference standards |
Non-consecutive cohort study, or very limited population |
Analysis based on limited alternatives or costs, poor quality estimates of data, but including sensitivity analyses incorporating clinically sensible variations. |
|
|
4 |
Case-series (and poor quality cohort and case-control studies§§ ) |
Case-series (and poor quality prognostic cohort studies***) |
Case-control study, poor or non-independent reference standard |
Case-series or superseded reference standards |
Analysis with no sensitivity analysis |
|
5 |
Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles" |
Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles" |
Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles" |
Expert opinion without explicit critical appraisal, or based on physiology, bench research or "first principles" |
Expert opinion without explicit critical appraisal, or based on economic theory or "first principles" |
Produced by Bob Phillips, Chris Ball, Dave Sackett, Doug Badenoch, Sharon Straus, Brian Haynes, Martin Dawes since November 1998.
Users can add a minus-sign "-" to denote the level of that fails to provide a conclusive answer because of:
· EITHER a single result with a wide Confidence Interval (such that, for example, an ARR in an RCT is not statistically significant but whose confidence intervals fail to exclude clinically important benefit or harm)
· OR a Systematic Review with troublesome (and statistically significant) heterogeneity.
· Such evidence is inconclusive, and therefore can only generate Grade D recommendations.
|
* |
By homogeneity we mean a systematic review that is free of worrisome variations (heterogeneity) in the directions and degrees of results between individual studies. Not all systematic reviews with statistically significant heterogeneity need be worrisome, and not all worrisome heterogeneity need be statistically significant. As noted above, studies displaying worrisome heterogeneity should be tagged with a "-" at the end of their designated level. |
|
† |
Clinical Decision Rule. (These are algorithms or scoring systems which lead to a prognostic estimation or a diagnostic category. ) |
|
‡ |
See note #2 for advice on how to understand, rate and use trials or other studies with wide confidence intervals. |
|
§ |
Met when all patients died before the Rx became available, but some now survive on it; or when some patients died before the Rx became available, but none now die on it. |
|
§§ |
By poor quality cohort study we mean one that failed to clearly define comparison groups and/or failed to measure exposures and outcomes in the same (preferably blinded), objective way in both exposed and non-exposed individuals and/or failed to identify or appropriately control known confounders and/or failed to carry out a sufficiently long and complete follow-up of patients. By poor quality case-control study we mean one that failed to clearly define comparison groups and/or failed to measure exposures and outcomes in the same (preferably blinded), objective way in both cases and controls and/or failed to identify or appropriately control known confounders. |
|
§§§ |
Split-sample validation is achieved by collecting all the information in a single tranche, then artificially dividing this into "derivation" and "validation" samples. |
|
†† |
An "Absolute SpPin" is a diagnostic finding whose Specificity is so high that a Positive result rules-in the diagnosis. An "Absolute SnNout" is a diagnostic finding whose Sensitivity is so high that a Negative result rules-out the diagnosis. |
|
‡‡ |
Good, better, bad and worse refer to the comparisons between treatments in terms of their clinical risks and benefits. |
|
††† |
Good reference standards are independent of the test, and applied blindly or objectively to applied to all patients. Poor reference standards are haphazardly applied, but still independent of the test. Use of a non-independent reference standard (where the 'test' is included in the 'reference', or where the 'testing' affects the 'reference') implies a level 4 study. |
|
†††† |
Better-value treatments are clearly as good but cheaper, or better at the same or reduced cost. Worse-value treatments are as good and more expensive, or worse and the equally or more expensive. |
|
** |
Validating studies test the quality of a specific diagnostic test, based on prior evidence. An exploratory study collects information and trawls the data (e.g. using a regression analysis) to find which factors are 'significant'. |
|
*** |
By poor quality prognostic cohort study we mean one in which sampling was biased in favour of patients who already had the target outcome, or the measurement of outcomes was accomplished in <80% of study patients, or outcomes were determined in an unblinded, non-objective way, or there was no correction for confounding factors. |
|
**** |
Good follow-up in a differential diagnosis study is >80%, with adequate time for alternative diagnoses to emerge (eg 1-6 months acute, 1 - 5 years chronic) |
|
A |
consistent level 1 studies |
|
B |
consistent level 2 or 3 studies or extrapolations from level 1 studies |
|
C |
level 4 studies or extrapolations from level 2 or 3 studies |
|
D |
level 5 evidence or troublingly inconsistent or inconclusive studies of any level |
"Extrapolations" are where data is used in a situation which has potentially clinically important differences than the original study situation.
"Extrapolations" are where data
is used in a situation which has potentially clinically important
differences than the original study situation.
| Click here to comment on Levels of Evidence |