Bad Science
One particularly influential composite outcome came from a famous British trial called UKPDS, which looked to see whether intensively managing the blood-sugar levels of patients with diabetes made a difference to their real-world outcomes. This reported three endpoints: it found no benefit for the first two, which were death and diabetes-related death; but it did report a 12 per cent reduction in the composite outcome. This composite outcome consisted of lots of things:
sudden death
death from high or low blood sugar
fatal heart attack
non-fatal heart attack
angina
heart failure
stroke
renal failure
amputation
bleeding into the middle chamber of the eye
diabetes-related damage to the arteries in the eye requiring laser treatment
blindness in one eye
cataracts requiring extraction
That’s quite a list, and a 12 per cent reduction on all of it bundled up together certainly feels like ‘patient oriented evidence that matters’, as we say in the business (‘POEMs’ if you prefer). But most of the improvement in this composite outcome was caused by a reduction in the number of people referred for laser treatment for damage to the arteries in their eyes. That’s nice, but it’s hardly the most important thing on that list, and it’s very much a process outcome, rather than a concrete, real-world one. If you’re interested in real-world outcomes, there wasn’t even any change in the number of people experiencing visual loss, but in any case, it’s clearly a much less important outcome than heart attacks, deaths, strokes or amputation. Similarly, the trial found a benefit for some blood markers suggestive of kidney problems, but no change in actual end-stage kidney disease.
This is only interesting because UKPDS has a slightly legendary status, among medics, as showing the benefit, on multiple outcomes, from intensive blood-sugar control for people with diabetes. How was this widespread belief created? One enterprising group of researchers decided to find every one of the thirty-five diabetes review papers citing the UKPDS study, and see what they said about it.22 Twenty-eight said that the trial found a benefit for the composite outcome, but only one mentioned that most of this was down to improvements on the most trivial outcomes, and only six that it found no benefit for death, which is surely the ultimate outcome that matters. There is a terrifying reality revealed by this study: rumours, oversimplifications and wishful thinking can spread through the academic literature, just as easily as they do through any internet discussion forum.
Trials that ignore drop-outs
Sometimes patients leave a trial altogether, often because they didn’t like the drug they were on. But when you analyse the two groups in your trial you have to make sure you analyse all the patients assigned to a treatment. Otherwise you overstate the benefits of your drug.
One classic failure at the analysis stage which can pervert your data horribly is to analyse patients according to the treatment they actually took, rather than the treatment they were assigned at the randomisation stage of the trial. At first glance, this seems perfectly reasonable: if 30 per cent of your patients dropped out and didn’t take your new tablet, they didn’t experience the benefit, and shouldn’t be included in the ‘new tablet’ group at analysis.
But as soon as you start to think about why patients drop out of treatment in trials, the problems with this method start to become apparent. Maybe they stopped taking your tablets because they had horrible side effects. Maybe they stopped taking your tablets because they decided they didn’t work, and just tipped them in the bin. Maybe they stopped taking your tablets, and coming to follow-up appointments, because they were dead, after your drug killed them. Looking at patients only by the treatment they took is called a ‘per protocol’ analysis, and this has been shown to dramatically overstate the benefits of treatments, which is why it’s not supposed to be used.
If you keep all the patients prescribed your new treatment – including those who stopped taking it – in the ‘new treatment’ group when you do your final calculation, this is called an ‘intention to treat’ analysis. As well as being more conservative, this analysis makes much more sense philosophically. You’re going to use the results of a trial to inform your decision about whether to ‘give someone some tablets’, not ‘force some tablets down their throat compulsorily’. So you want the results to be from an analysis that looks at people according to what they were given by their doctor, rather than what they actually swallowed.
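To see how the two analyses can diverge, here is a small sketch in Python. Every number in it is invented for illustration: a thousand patients per arm, a drug that does nothing at all, and drop-outs who are disproportionately the people doing badly. For simplicity it also assumes nobody drops out of the placebo arm. None of this comes from a real trial.

```python
from dataclasses import dataclass


@dataclass
class Group:
    patients: int
    bad_outcomes: int

    @property
    def risk(self) -> float:
        """Proportion of patients in the group with a bad outcome."""
        return self.bad_outcomes / self.patients


# Hypothetical trial: all numbers are made up, and the drug has no real effect.
placebo = Group(patients=1000, bad_outcomes=200)

# 1,000 patients were randomised to the drug; 300 stopped taking it.
# Assumption: people who were doing badly were more likely to stop.
drug_completers = Group(patients=700, bad_outcomes=80)
drug_dropouts = Group(patients=300, bad_outcomes=120)

# Intention to treat: everyone randomised to the drug stays in the drug group.
drug_itt = Group(
    patients=drug_completers.patients + drug_dropouts.patients,
    bad_outcomes=drug_completers.bad_outcomes + drug_dropouts.bad_outcomes,
)

# Per protocol: only those who kept taking the drug are counted.
drug_per_protocol = drug_completers

print(f"placebo:            {placebo.risk:.1%}")
print(f"intention to treat: {drug_itt.risk:.1%}")
print(f"per protocol:       {drug_per_protocol.risk:.1%}")
```

With these made-up figures the intention to treat analysis gives 20 per cent in both arms, which is the honest answer for a drug that does nothing. The per protocol analysis gives about 11 per cent for the drug, an apparent benefit created entirely by who dropped out.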
I’ve had the joy of marking sixty exam papers – a Groundhog Day experience if ever there was one – in which a fifth of the marks were to be earned by explaining ‘intention to treat analysis’. This is at the absolute core of the evidence-based medicine curriculum, so it’s utterly bizarre that there are still endless ‘per protocol’ analyses being reported by the drugs industry. One systematic review looked at all the trial reports submitted by companies to the Swedish drug regulator, and then the published academic papers relating to the same trials (if they even existed).23 All but one of the submissions to the regulator featured both ‘intention to treat’ and ‘per protocol’ analyses, because regulators are, for all their faults and obsessive secrecy, at least a little sharper about methodological rigour than many academic journals. All but two of the academic papers, meanwhile, only reported one analysis, usually the ‘per protocol’ one that overstates the benefits. This is the version that doctors read. In the next section, we will see another example of how academic journals participate in the game of overstating results: often, for all their claims to be the gatekeepers for good-quality research, these journals do not do their job well.
Trials that change their main outcome after they’ve finished
If you measure a dozen outcomes in your trial, but cite an improvement in any one of them as a positive result, then your results are meaningless. Our tests for deciding if a result is statistically significant assume that you are only measuring one outcome. By measuring a dozen, you have given yourself a dozen chances of getting a positive result, rather than one, without clearly declaring that. Your study is biased by design, and is likely to find more positive results than there really are.
Imagine we’re playing with dice, and we make a simple (albeit one-sided) arrangement: if I throw a double six, you have to give me £10. So I roll the dice, and they come up double three. But I still demand my £10, claiming that our original agreement was in fact that you give me £10 if I roll a double three; and you still pay me, with the cheerful encouragement of everyone around us. This exact scenario is played out in clinical academic research, as a matter of routine, every day, when we tolerate people doing something called ‘switching the primary outcome’.
Before you begin a clinical trial, you write out the protocol. This is a document describing what you’re going to do: how many participants you’re going to recruit, where and how you’re going to recruit them, what treatment each group will receive, and what outcomes you’re going to measure. In a trial you’ll measure all kinds of things as possible outcomes: perhaps a few different rating scales for ‘pain’, or ‘depression’, or whatever you’re interested in; maybe ‘quality of life’, or ‘mobility’, that you’ll measure with some kind of questionnaire; possibly ‘death from all causes’, and death from each of a number of specific causes too; and lots of other things.
Among all of these many outcomes, you will specify one (or perhaps a couple more, if you account for this in your analysis) as the main, primary outcome. You do this before the trial starts, because you’re trying to avoid one simple problem: if you measure lots of things, some of them will come up as statistically significantly improved, simply from the natural random variation in all trial data. These are real people, remember, in the real world, and their pain, depression, mobility, quality of life and so on will all vary, for all kinds of reasons, many of which have nothing whatsoever to do with the intervention that you’re testing in your trial.
If you’re a pure-hearted researcher, you’re using statistical tests specifically to identify genuine benefits of the treatment you’re testing. You’re trying to distinguish these real changes from the normal random variation of background noise that you would expect to see in your patients’ results on various tests. More than anything, you want to avoid finding false positives.
The traditional cut-off for statistical significance is ‘one in twenty’. Roughly speaking, clearing this bar means that if you repeated the same study over and over again, with the same methods, in participants taken from the same population, you’d expect to get the same positive finding you’ve observed one time in every twenty, simply by chance, even if the drug really had no benefit. If you dip two cups into the same jar of white and red beads, every now and then, purely by chance, you will come out with an unusually small number of red beads in one cup, and an unusually large number of red beads in the other. The same is true for any measurement we take in patients: there will be some random variation, and it can sometimes make it look as if one treatment is better than another, on one scoring method, simply through chance. Statistical tests are designed to stop us being misled by that kind of random variation.
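The bead experiment is easy to try on a computer. The sketch below, in Python, is only an illustration under assumptions of my own choosing: a jar that is half red, 200 beads per cup, 2,000 repeated dips, and a standard two-proportion z-test as the statistical test. The function names are hypothetical too.

```python
import math
import random


def cups_look_different(beads_per_cup: int, red_fraction: float,
                        rng: random.Random) -> bool:
    """Dip two cups into the same jar and test whether their red-bead counts
    differ 'significantly' (two-sided z-test at the one-in-twenty level)."""
    red_a = sum(rng.random() < red_fraction for _ in range(beads_per_cup))
    red_b = sum(rng.random() < red_fraction for _ in range(beads_per_cup))
    pooled = (red_a + red_b) / (2 * beads_per_cup)
    # Standard error of the difference between the two proportions.
    se = math.sqrt(2 * pooled * (1 - pooled) / beads_per_cup)
    if se == 0:
        return False  # Both cups entirely one colour: no difference to test.
    z = (red_a - red_b) / beads_per_cup / se
    return abs(z) > 1.96


def false_alarm_rate(n_dips: int = 2000, beads_per_cup: int = 200,
                     red_fraction: float = 0.5, seed: int = 1) -> float:
    """Fraction of repeated dips in which the two cups look 'significantly'
    different, even though both came from the same jar."""
    rng = random.Random(seed)
    alarms = sum(cups_look_different(beads_per_cup, red_fraction, rng)
                 for _ in range(n_dips))
    return alarms / n_dips


print(f"false alarms: {false_alarm_rate():.1%}")
```

Since both cups always come from the same jar, every ‘significant’ difference here is a false alarm. The rate should land somewhere near one in twenty, although the exact figure will wobble with the seed and the number of dips.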
So now, let’s imagine you’re running a trial where you measure ten different, independent outcomes. If we set the cut-off for statistical significance as ‘one in twenty’, then even if your drug does nothing useful at all, in your single trial you’ve still got a chance of about 40 per cent of finding a positive benefit on at least one of your outcomes, simply from random variation in your data. If you didn’t pre-specify which of the many outcomes is your primary outcome before you started, you could be cheeky, and report any positive finding you get, in any of your ten outcomes, as a positive result from your trial.
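The arithmetic for a trial like this can be checked directly. If each outcome has a one-in-twenty chance of a false positive, and the outcomes are independent, the chance of at least one false positive is one minus the chance that every outcome stays negative. A minimal sketch in Python (the function name is mine, and independence is an assumption that real trial outcomes rarely meet exactly):

```python
def chance_of_any_false_positive(n_outcomes: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent outcomes clears the
    significance bar by chance alone, when the treatment does nothing."""
    return 1.0 - (1.0 - alpha) ** n_outcomes


for n in (1, 5, 10, 20):
    print(f"{n:>2} outcomes: {chance_of_any_false_positive(n):.0%}")
```

For ten outcomes this comes to about 40 per cent, and it keeps climbing as more outcomes are added.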
Could you get away with doing this openly, and simply saying: ‘Hey, we measured ten things, and one of them came up as improved, therefore our new drug is awesome’? Well, you probably could get away with it in some quarters, because the consumers of scientific papers aren’t universally switched on to this kind of bait and switch. But generally people would spot it: they would expect to see a ‘primary outcome’ nominated and reported, because they know that if you measure ten things, one of them is pretty likely to come up as improved simply through chance.
The problem is this: even though people know that you should nominate a primary outcome, these primary outcomes often change between the protocol and the paper, after the people conducting the research have seen the results. Even you – a random punter who’s picked up this book on a station platform, and not a professor of either statistics or medicine – can see the madness in this. If the primary outcome reported in the finished paper is different from the primary outcome nominated before the trial started, then that is absurd: the entire point of the primary outcome is that it’s the primary outcome nominated before the trial started. But people do switch their primary outcomes, and this is not just an occasional problem. In fact, it’s almost routine practice.
In 2009, a group of researchers got all the trials they could find on various uses of a drug called gabapentin.24 They then looked at those for which they could obtain internal documents, which meant they could identify the original, pre-specified primary outcome. Then they looked at the published academic papers that reported these trials. Of course, about half of the trials were never published at all (the scandal of this should not wear off with repetition). Twelve trials were published, and they checked to see if the things reported as primary outcomes in the academic papers really were pre-specified as primary outcomes in the internal documents, before the trial started.
What they found was a mess. Of the twenty-one primary outcomes pre-specified in the protocols, which should all have been reported, only eleven actually appeared. Six weren’t reported in any form, and four were reported, but reported as if they were secondary outcomes instead. You can also look at this from the other end of the telescope: twenty-eight primary outcomes were reported in the twelve published trials, but of those, about half were newly introduced, and were never really primary outcomes at all. This is nothing short of ridiculous: there is no excuse, not for the researchers doing the switching, and not for the academic journals failing to check. But that was only one drug. Was it a freak occurrence?
No. In 2004 some researchers published a paper looking at all areas of medicine: they took all the trials approved by the ethics committees of two cities over two years, then chased up the published papers.25 About half of all the outcomes were incorrectly reported. Of the published papers, almost two thirds had at least one pre-specified primary outcome that had been switched, and this was not being done at random: exactly as you’d expect, positive outcomes were more than twice as likely to be properly reported. Other studies on primary-outcome switching report similar results.
To be clear: if you switch your pre-specified primary outcome between the beginning and the end of your trial, without a very good explanation for why you’ve done so, then you’re simply not doing science properly. Your study is broken by design. It should be a universal requirement that all studies report their pre-specified primary outcome as the primary outcome. This should be enforced by all journals, and things should have been done this way since trials began. It’s really not difficult. Yet we have collectively failed to adhere to this simple, obvious core requirement on an epic scale.
For one final illustration of what this means in practice, I shall return to paroxetine, and the studies that were conducted in children. Remember, when an area of medicine is subject to some kind of litigation, documents often become available to researchers that would otherwise be hidden from view, allowing them to identify problems, discrepancies and patterns that would not normally be detectable. For the most part these are documents which should always be in the public domain, but are not. So paroxetine may not be worse than any other drug for this kind of mischief (in fact, as we have seen from the study just described, outcome switching happens across the board): it’s simply one of the cases about which we have the most detail.
In 2008 a group of researchers decided to go through the documents opened up by the litigation over paroxetine, and examine how the results of one clinical trial – ‘trial 329’ – had been published.26 As late as 2007 systematic reviews were still describing this trial as having a positive result, which is how it was reported in publications of its results. But in reality that was completely untrue: the original protocols specified two primary outcomes and six secondary ones. At the end of the trial there was no difference between paroxetine and placebo for any of these outcomes. At least nineteen more outcomes were also measured, making twenty-seven in total. Of those, only four gave a positive result for paroxetine. These positive findings were reported as if they were the main outcomes.
It would be tempting to regard the reporting of trial 329 as some kind of freak episode, an appalling exception in an otherwise sane medical world. Tragically, as the research above demonstrates, this behaviour is widespread.
So widespread, in fact, that there’s room for a small cottage industry, if there are any academics feeling brave enough to pursue the project. Someone somewhere needs to identify all the studies where the main outcomes have been switched, demand access to the raw data, and helpfully, at long last, conduct the correct analyses for the original researchers. If you choose to do this, your published papers will immediately become the definitive reference on these trials, because they will be the only ones to correctly present the pre-specified trial outcomes. The publications from the original researchers will be no more than a tangential and irrelevant distraction.
I’m sure they’ll be pleased to help.
Dodgy subgroup analyses
If your drug didn’t win overall in your trial, you can chop up the data in lots of different ways, to try and see if it won in a subgroup: maybe it works brilliantly in Chinese men between fifty-six and seventy-one. This is as stupid as playing ‘Best of three…Best of five…’ And yet it is commonplace.
Time and again we have come back to the same principle in this chapter: if you give yourself multiple chances at finding a positive result, but use statistical tests that assume you only had one go, then you vastly increase your chances of getting the result you want – if you flip a coin for long enough, you will eventually get four heads in a row.
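The coin-flipping claim can be made concrete. The sketch below, in Python, works out the exact chance that a given number of fair coin flips contains at least one run of four heads, by tracking the current streak of heads from flip to flip. The function name and the choice of flip counts are mine, for illustration only.

```python
def chance_of_run(n_flips: int, run_length: int = 4) -> float:
    """Probability that n fair coin flips contain at least one run of
    `run_length` consecutive heads."""
    # streak[k] = probability of having exactly k heads in a row right now,
    # without having completed a full run so far.
    streak = [0.0] * run_length
    streak[0] = 1.0
    hit = 0.0  # Probability that a full run has already happened.
    for _ in range(n_flips):
        new_streak = [0.0] * run_length
        for k, p in enumerate(streak):
            new_streak[0] += p * 0.5       # Tails: the streak resets.
            if k + 1 == run_length:
                hit += p * 0.5             # Heads: the run is complete.
            else:
                new_streak[k + 1] += p * 0.5  # Heads: the streak grows.
        streak = new_streak
    return hit


for n in (4, 10, 30, 100):
    print(f"{n:>3} flips: {chance_of_run(n):.1%}")
```

With only four flips the chance is one in sixteen; give yourself enough flips and four heads in a row becomes close to a certainty.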
A new way of doing this is the subgroup analysis. The trick is simple: you’ve finished your trial, and it had a negative result. There was no difference in outcome – the patients on placebo did just as well as those on your new tablets. Your drug doesn’t work. This is bad news. But then you dig a little more, do some analyses, and find that the drug worked great for Hispanic non-smoking men aged fifty-five to seventy.
If it’s not immediately obvious why this is a problem, we have to go back and think about the random variation in the data in any trial. Let’s say your drug is supposed to prevent death during the duration of the trial. We know that death happens for all kinds of reasons, at often quite arbitrary moments, and is – cruelly – only partly predictable on the basis of what we know about how healthy people are. You’re hoping that when you run your trial, your drug will be able to defer some of these random unpredictable deaths (though not all, because no drug prevents all causes of death!), and that you’ll be able to pick up that change in death rate, if you have a sufficiently large number of people in your trial.