

Comments on the draft Levels of Evidence

diagnostic gold standards
therapy ain't prevention
some inconsistencies
size (and heterogeneity) matters
Occam's razor
mathematical models


Dave Sackett comments:

i love all of 'em but one, and am not sure i understand that one.

it's the one that describes "positive and negative tests were verified using separate reference standards." hunh? would this apply where negative test results were presumed to be normal and never validated (eg, Goldschlager's gang considered doing angiograms on pts with normal exercise ECGs unethical, and assumed normal exercise ECGs meant normal coronary arteries)?

i guess you are penalizing the SYSTEMATIC non-application of the reference std to all patients (level 3b) by knocking it down an additional peg.

the basis for my disagreement with your proposal to downgrade the use of two different reference standards for the presence and absence of a single target disorder is as before, and dawned on a bunch of us in the hamilton clotting-mafia as we tried to develop better diagnostic tests for clinically-suspected pulmonary embolism:

1. what's the best reference standard for a clinically important pulmonary embolism? i reckon we'd all agree that it's a positive pulmonary angiogram in an appropriately symptomatic patient.

2. but what's the best reference standard for the absence of a clinically important pulmonary embolism? well, given the frequency with which PEs can break up and dissipate by the time we arrange for the angiogram, we decided that a negative pulmonary angiogram would be nowhere near as valid a reference standard as a symptomless 6 month follow-up on NO antithrombotic Rx.

in this case the highest level of diagnostic validity demands different reference standards for patients with and without the target disorder.

Martin Dawes adds

Maybe it's time we stopped talking about diagnostic tests and talked about inclusion and exclusion tests (with different gold standards for each - there is a similar problem for establishing negative gold standards for a lot of diseases). At least then people would get used to the idea that they are not necessarily (indeed rarely) the same, which for many clinicians is a novel approach.

Chris Ball comments

  1. notes §§ and *** are (in part) duplicates
  2. Like Dave, I wonder whether we should think about adding a comment to 'good reference standards' note indicating when an alternative reference standard is acceptable. I'm thinking about using follow-up for negative tests. This is a pragmatic reference standard in cases where the usual standard is risky or uncomfortable.

Klim MacPherson comments:

I found this interesting and very useful.

I would myself think twice before lumping 'Therapy' with 'Prevention'. There are all sorts of different issues in evaluating prevention strategies, not least: sometimes their community nature, the real possibility of contamination of the controls, the autonomy of the 'subjects' giving rise to complex issues of preference, possibilities of serious effect modification outside strictly biological mechanisms, etc. My concern is that Prevention may need variations or changes on the evidence based paradigm which adequately take account of these issues - otherwise the evidence perforce seems relatively barren.

Bob Phillips' response to this was

We have looked quite closely at the axes (columns) involved, and have come to the conclusion that even if often impractical, prevention studies should be assessed as therapeutic studies.

Klim MacPherson made a further comment;

I think the issues cannot be dealt with by implying that public health is just another clinical discipline - it is not.

It concerns essentially fit people for whom small (often delayed) benefits have massive effects on the health burden of communities. These people are autonomous (i.e. are not patients) and have priorities among which health is unlikely to be foremost.

This does lead to all sorts of different problems - the most acute of which is that randomisation between important preferences is not possible and, if these preferences affect efficacy and outcome, then randomisation (ignoring them) yields attenuated and biased results. But there are other intrinsic problems too.


Benjamin Djulbegovic comments:

Love them or hate them, but the idea that not all evidence is created equal is the lasting legacy of EBM. I have been using your system for some time now, and I think it is the best that is currently available. However, there are some inconsistencies that should be dealt with. The main one is that reference to the design architecture of the studies (which serves as a basis for the ranking system) is mixed with notions such as "outcome research", ecological studies, etc. I suggest that you stick with the design of the studies and rank evidence according to the magnitude of bias that one can expect with a particular study design (e.g. it is clear what you mean when you rank evidence as SR of RCTs, individual RCT with narrow CI, all or none, individual RCT with wide CI, individual cohort study, etc., but it is not clear what is meant by "outcome research", "ecological studies", "first principles", etc.). If you have in mind a particular technique (design) commonly used in outcome research, then use it instead. Otherwise, I suggest avoiding these terms (this would be similar to using the term "clinical trials" to replace RCT, cohort studies, etc.).

My second suggestion relates to the ranking of "Economic and decision analysis". For example, it is really not clear what constitutes level 1b or 2b evidence (and still less clear what is level 1a or 2a; could you provide any example of level 1a according to your definitions?). The problem here is that there are various modeling techniques in decision analysis, and which one should be applied depends on the nature of the problem. Perhaps you can use published guidelines on C-E analysis to improve your system (see JAMA 1996;276:1339-41; JAMA 1996;276:1253-8).

Finally, would you dare to recommend how much evidence is needed to make "grade A recommendations"? Note, for example, that the FDA requires "at least two adequate and well-controlled studies, each convincing on its own" or "a single multicenter study of excellent design" to approve a new drug in the US (obviously this is a somewhat different issue, but not in principle...). It would be great if you could illustrate each of your levels of evidence with an example (a link to a particular reference), so that readers are not in doubt about what is meant here.


Andrew Moore commented:

  1. heterogeneity tests are bunk.
  2. size not mentioned, but extremely important.
  3. sr of diagnostics tests useless since the methods of original studies are 99.9% biased
  4. proliferation of levels is confusing.

Martin Dawes replied;

  1. heterogeneity tests are bunk. - i agree, but the examination of clinical influences on the differences between results is not - so some assessment of agreement is useful, and at the moment a chi sq distribution is as good as we have got - we play a game with students - asking them to guess the p value of the chi sq - after about 10 trials (the magic number for any learning experience) they get pretty good!
  2. size not mentioned, but extremely important. - yes - much more important - but then so is concealment, but we don't make an issue of that - should we list this and blinding? I have always been worried that we are not specific enough about what we mean by a 'low quality' RCT - but overall, in terms of getting a feel for the believability, is this important? - probably for concealment but not for blinding
  3. sr of diagnostics tests useless since the methods of original studies are 99.9% biased - and they are so infrequent you can count them on the fingers of a polydactyl primate with an apposed primary digit
  4. proliferation of levels is confusing. - yeah - but if my life is on the line I want to know what the level is

and Dave Sackett commented

Re 2: although study size is vital, it's vital to precision, not validity
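The precision-versus-validity distinction can be illustrated with a small numerical sketch. This is not part of the original exchange: the event counts are invented, and the simple Wald interval for a risk difference is an assumed choice of method. Quadrupling the sample size at the same event rates halves the width of the confidence interval, but the point estimate, and therefore any bias built into it by the study design, stays exactly where it was.

```python
import math

def risk_difference_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Wald confidence interval for a risk difference (treated minus control).

    Hypothetical helper for illustration only; z=1.96 gives roughly 95% coverage.
    """
    p_t = events_t / n_t
    p_c = events_c / n_c
    rd = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return rd, rd - z * se, rd + z * se

# Invented counts: identical event rates, four times as many patients.
small = risk_difference_ci(10, 100, 20, 100)
large = risk_difference_ci(40, 400, 80, 400)

width_small = small[2] - small[1]
width_large = large[2] - large[1]

print("small trial: RD=%.3f, CI width=%.3f" % (small[0], width_small))
print("large trial: RD=%.3f, CI width=%.3f" % (large[0], width_large))
```

Nothing in the calculation knows whether allocation was concealed or outcomes were assessed blind, which is one way of seeing why size speaks to precision only.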

Mike Clarke commented;

In thinking about heterogeneity, one should be worried about clinical as well as (if not more than) statistical heterogeneity - the difficulty is that clinical heterogeneity can be much harder to assess than some numbers and a mathematical test. At the moment, the Notes deal with statistical heterogeneity (by talking about the "directions and degrees of results") but not clinical heterogeneity. Therefore, I think you should be more explicit about the importance of considering clinical heterogeneity in determining the quality of a piece of evidence.

To illustrate, a very mathematically precise result would be obtained if I combined all high-quality randomised trials that had an odds ratio of 0.85 (regardless of the interventions these trials assessed). This would be a highly homogeneous meta-analysis but clinically meaningless, unless I defined the clinical question in a purely post-hoc statistical way (i.e. what is the effect of interventions which have been assessed in trials that gave the result I am interested in). Although this sounds like nonsense, it strikes me as similar to the exclusive use of statistical tests for assessing heterogeneity, and so it is important that your levels of evidence do not give this impression.
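Mike Clarke's thought experiment can be sketched numerically. The code below uses Cochran's Q, one common form of the chi-square heterogeneity test; that this is the test the commenters had in mind is an assumption, and the odds ratios and variances are invented. Five unrelated trials that all happen to report an odds ratio of 0.85 give Q of essentially zero and a p value of essentially 1, so the statistic reports perfect homogeneity while knowing nothing about whether the interventions belong together.

```python
import math

def cochran_q(log_effects, variances):
    """Cochran's Q for inverse-variance weighted effects; returns (Q, df)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_effects))
    return q, len(log_effects) - 1

def chi2_sf(q, df):
    """Upper-tail chi-square probability, computed from the series expansion
    of the regularized lower incomplete gamma function (fine for small Q)."""
    if q <= 0:
        return 1.0
    a, x = df / 2.0, q / 2.0
    term, total, n = 1.0, 1.0, 0
    while term > 1e-15 * total:
        n += 1
        term *= x / (a + n)
        total += term
    lower = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - lower)

# Invented data: five clinically unrelated trials, each with OR = 0.85.
same_or = [math.log(0.85)] * 5
same_var = [0.01, 0.02, 0.015, 0.03, 0.01]
q_same, df_same = cochran_q(same_or, same_var)
p_same = chi2_sf(q_same, df_same)

# Invented data: three trials of one intervention with divergent results.
mixed_or = [math.log(0.5), math.log(0.85), math.log(1.4)]
mixed_var = [0.02, 0.02, 0.02]
q_mixed, df_mixed = cochran_q(mixed_or, mixed_var)
p_mixed = chi2_sf(q_mixed, df_mixed)

print("identical ORs: Q=%.3g, p=%.3g" % (q_same, p_same))
print("divergent ORs: Q=%.3g, p=%.3g" % (q_mixed, p_mixed))
```

The test passes the first set and flags the second, yet only a clinical judgement can say which of the two pooled estimates answers a sensible question.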


Fabian Jaimes commented:

I think (and I am sorry to disagree) that levels of evidence are becoming more complex every day, and more complex means less useful in clinical practice. This is not the goal of EBM. Because I believe strongly in the philosophy and conceptual framework of EBM, I believe that teaching and learning medicine call for Occam's rule: the easiest (in terms of usefulness and application) is the best.

And with reference to the phrase "wide confidence interval":

I think it could be "dangerous" to talk about wide or narrow CIs in the context of levels of evidence. As you say, clinical situations differ, and so the best CI ("wider or narrower") is different in every case.


Emilio Corrales asked:

what is the level of evidence from a study of a mathematical model (like a Monte-Carlo simulation based on a Markov model)?

Bob Phillips answered:

I don't think we've really addressed "theoretical" evidence specifically. My answer would be it is an estimation based upon other data, and that its level of evidence can be no higher than the level of the worst evidence on which the model is based. (for example, if the study used therapy results from a systematic review, but prevalence data from an expert panel, it should be level 5)
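Bob Phillips's "no higher than the worst input" rule amounts to taking the weakest level among a model's data sources. A minimal sketch follows, assuming the levels run 1a, 1b, 1c, 2a, 2b, 2c, 3a, 3b, 4, 5 from strongest to weakest; that ordering is taken on trust from the levels table, which is not reproduced on this page, and the function name is hypothetical.

```python
# Assumed ordering of levels, strongest first (not stated on this page).
LEVEL_ORDER = ["1a", "1b", "1c", "2a", "2b", "2c", "3a", "3b", "4", "5"]

def model_level(input_levels):
    """Return the weakest level among the evidence feeding a model.

    Raises ValueError for an empty list or an unrecognised level.
    """
    if not input_levels:
        raise ValueError("a model needs at least one evidence source")
    # The weakest level is the one furthest down the assumed ordering.
    return max(input_levels, key=LEVEL_ORDER.index)

# Bob's example: therapy results from a systematic review (taken here as 1a),
# prevalence data from an expert panel (level 5).
example = model_level(["1a", "5"])
print(example)
```

The rule gives a ceiling, not an assignment: a model whose inputs are all level 1a may still deserve a lower grade if its structure or assumptions are weak.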

Dave Sackett commented:

i agree with bob

Martin Dawes also added:

What i have not seen is any paper comparing various decision analysis mathematical models in the same way Ioannidis, Juni, Moher & others have done for RCT vs non-RCT. It may be that there are not enough papers containing these methods yet - but possibly it's worth doing the search!

What is also interesting is that the different software packages give different answers (not radically - but possibly enough to alter significance). As only 1% of doctors know about these modelling packages and less than 1% of this 1% could describe the mathematical process (no, this does not include me), it's getting to be a very specialised subject.

What is disturbing is the way in which they are held up by many (especially for economic policy decisions) to be some sort of magic answer to complex problems - so any unravelling or debate will be very helpful.