The Emperor of All Maladies
The HIP filled a great void in insurance. By the mid-1950s, a triad of forces—immigration, World War II, and the Depression—had brought women out of their homes to comprise nearly one-third of the total workforce in New York. These working women sought health insurance, and the HIP, which allowed its enrollees to pool risks and thereby reduce costs, was a natural solution. By the early 1960s, the plan had enrolled more than three hundred thousand subscribers spread across thirty-one medical groups in New York—nearly eighty thousand of them women.
Strax, Shapiro, and Venet were quick to identify the importance of the resource: here was a defined—“captive”—cohort of women spread across New York and its suburbs that could be screened and followed over a prolonged time. The trial was kept deliberately simple: women enrollees in the HIP between the ages of forty and sixty-four were divided into two groups. One group was screened with mammography while the other was left unscreened. The ethical standards for screening trials in the 1960s made the identification of the groups even simpler. The unscreened group—i.e., the one not offered mammography—was not even required to give consent; it could just be enrolled passively in the trial and followed over time.
The trial, launched in December 1963, was instantly a logistical nightmare. Mammography was cumbersome: a machine the size of a full-grown bull; photographic plates like small windowpanes; the slosh and froth of toxic chemicals in a darkroom. The technique was best performed in dedicated X-ray clinics, but unable to convince women to travel to these clinics (many of them located uptown), Strax and Venet eventually outfitted a mobile van with an X-ray machine and parked it in midtown Manhattan, alongside the ice-cream trucks and sandwich vendors, to recruit women into the study during lunch breaks.
Strax began an obsessive campaign of recruitment. When a subject refused to join the study, he would call, write, and call her again to persuade her to join. The clinics were honed to a machinelike precision to allow thousands of women to be screened in a day:
“Interview . . . 5 stations X 12 women per hour = 60 women. . . . Undress-Dress cubicles: 16 cubicles X 6 women per hour = 96 women per hour. Each cubicle provides one square of floor space for dress-undress and contains four clothes lockers for a total of 64. At the close of the ‘circle,’ the woman enters the same cubicle to obtain her clothes and dress. . . . To expedite turnover, the amenities of chairs and mirrors are omitted.”
Curtains rose and fell. Closets opened and closed. Chairless and mirrorless rooms let women in and out. The merry-go-round ran through the day and late into the evening. In an astonishing span of six years, the trio completed a screening that would ordinarily have taken two decades to complete.
If a tumor was detected by mammography, the woman was treated according to the conventional intervention available at the time—surgery, typically a radical mastectomy, to remove the mass (or surgery followed by radiation). Once the cycle of screening and intervention had been completed, Strax, Venet, and Shapiro could watch the experiment unfold over time by measuring breast cancer mortality in the screened versus unscreened groups.
In 1971, eight years after the study had been launched, Strax, Venet, and Shapiro revealed the initial findings of the HIP trial. At first glance, it seemed like a resounding vindication of screening. Sixty-two thousand women had been enrolled in the trial; about half had been screened by mammography. There had been thirty-one deaths in the mammography-screened group and fifty-two deaths in the control group. The absolute number of lives saved was admittedly modest, but the fractional reduction in mortality from screening—almost 40 percent—was remarkable. Strax was ecstatic: “The radiologist,” he wrote, “has become a potential savior of women—and their breasts.”
The positive results of the HIP trial had an explosive effect on mammography. “Within 5 years, mammography has moved from the realm of a discarded procedure to the threshold of widespread application,” a radiologist wrote. At the National Cancer Institute, enthusiasm for screening rose swiftly to a crescendo. Arthur Holleb, the American Cancer Society’s chief medical officer, was quick to note the parallel to the Pap smear. “The time has come,” Holleb announced in 1971, “for the . . . Society to mount a massive program on mammography just as we did with the Pap test. . . . No longer can we ask the people of this country to tolerate a loss of life from breast cancer each year equal to the loss of life in the past ten years in Viet Nam. The time has come for greater national effort. I firmly believe that time is now.”
The ACS’s massive campaign was called the Breast Cancer Detection and Demonstration Project (BCDDP). Notably, this was not a trial but, as its name suggested, a “demonstration.” There was no treatment or control group. The project intended to screen nearly 250,000 women in a single year, nearly eight times the number screened by Strax in three years, in large part to show that it was possible to muscle through mammographic screening at a national level. Mary Lasker backed it strongly, as did virtually every cancer organization in America. Mammography, the “discarded procedure,” was about to become enshrined in the mainstream.
But even as the BCDDP forged ahead, doubts were gathering over the HIP study. Shapiro, recall, had chosen to randomize the trial by placing the “test women” and “control” women into two groups and comparing mortality. But, as was common practice in the sixties, the control group had not been informed of its participation in a trial. It had been a virtual group—a cohort drawn out of the HIP’s records. When a woman had died of breast cancer in the control group, Strax and Shapiro had dutifully updated their ledgers, but—trees falling in statistical forests—the group had been treated as an abstract entity, unaware even of its own existence.
In principle, comparing a virtual group to a real group would have been perfectly fine. But as the trial enrollment had proceeded in the mid-1960s, Strax and Shapiro had begun to worry whether some women already diagnosed with breast cancer might have entered the trial. A screening examination would, of course, be a useless test for such women since they already carried the disease. To correct for this, Shapiro had begun to selectively remove such women from both arms of the trial.
Removing such subjects from the mammography test group was relatively easy: the radiologist could simply ask a woman about her prior history before she underwent mammography. But since the control group was a virtual entity, there could be no such asking; prior histories would have to be culled "virtually." Shapiro tried to be dispassionate and rigorous by pulling equal numbers of women from the two arms of the trial. But in the end, he may have chosen selectively. Possibly, he overcorrected: more patients with prior breast cancer were eliminated from the screened group. The difference was small—only 434 patients in a trial of 30,000—but statistically speaking, fatal. Critics now charged that the unscreened group had been mistakenly overloaded with patients with prior breast cancer, and that its excess mortality was merely an artifact of the culling.
Mammography enthusiasts were devastated. What was needed, they admitted, was a fair reevaluation, a retrial. But where might such a trial be performed? Certainly not in the United States—with two hundred thousand women already enrolled in the BCDDP (and therefore not eligible for another trial), and its bickering academic community shadowboxing over the interpretation of shadows. Scrambling blindly out of controversy, the entire community of mammographers overcompensated as well. Rather than build experiments methodically on other experiments, they launched a volley of parallel trials that came tumbling out over each other. Between 1976 and 1992, enormous parallel trials of mammography were launched in Europe: in Edinburgh, Scotland, and in several sites in Sweden—Malmö, Kopparberg, Östergötland, Stockholm, and Göteborg. In Canada, meanwhile, researchers lurched off on their own randomized trial of mammography, called the National Breast Screening Study (CNBSS). As with so much in the history of breast cancer, mammographic trial-running had turned into an arms race, with each group trying to better the efforts of the others.
Edinburgh was a disaster. Balkanized into hundreds of isolated and disconnected medical practices, it was a terrible trial site to begin with. Doctors assigned blocks of women to the screening or control groups based on seemingly arbitrary criteria. Or, worse still, women assigned themselves. Randomization protocols were disrupted. Women often switched between one group and the other as the trial proceeded, paralyzing and confounding any meaningful interpretation of the study as a whole.
The Canadian trial, meanwhile, epitomized precision and attention to detail. In the summer of 1980, a heavily publicized national campaign involving letters, advertisements, and personal phone calls was launched to recruit thirty-nine thousand women to fifteen accredited centers for screening mammography. When a woman presented herself at any such center, she was asked some preliminary questions by a receptionist, asked to fill out a questionnaire, then examined by a nurse or physician, after which her name was entered into an open ledger. The ledger—a blue-lined notebook was used in most clinics—circulated freely. Randomized assignment was thus achieved by alternating lines in that notebook. One woman was assigned to the screened group, the woman on the next line to the control group, the third line to the screened, the fourth to the control, and so forth.
Note carefully that sequence of events: a woman was typically randomized after her medical history and examination. That sequence was neither anticipated nor prescribed in the original protocol (detailed instruction manuals had been sent to each center). But that minute change completely undid the trial. The allocations that emerged after those nurse interviews were no longer random. Women with abnormal breast or lymph node examinations were disproportionately assigned to the mammography group (at one site, seventeen to the mammography group versus five to the control arm). So were women with prior histories of breast cancer. So, too, were women known to be at "high risk" based on their past history or prior insurance claims (eight to mammography; one to control).
The reasons for this skew are still unknown. Did the nurses allocate high-risk women to the mammography group to confirm a suspicious clinical examination—to obtain a second opinion, as it were, by X-ray? Was that subversion even conscious? Was it an unintended act of compassion, an attempt to help high-risk women by forcing them to have mammograms? Did high-risk women skip their turn in the waiting room to purposefully fall into the right line of the allocation book? Were they instructed to do so by the trial coordinators—by their examining doctors, the X-ray technicians, the receptionists?
Teams of epidemiologists, statisticians, radiologists, and at least one group of forensic experts have since pored over those scratchy notebooks to try to answer these questions and decipher what went wrong in the trial. “Suspicion, like beauty, lies in the eye of the beholder,” one of the trial’s chief investigators countered. But there was plenty to raise suspicion. The notebooks were pockmarked with clerical errors: names changed, identities reversed, lines whited out, names replaced or overwritten. Testimonies by on-site workers reinforced these observations. At one center, a trial coordinator selectively herded her friends to the mammography group (hoping, presumably, to do them a favor and save their lives). At another, a technician reported widespread tampering with randomization with women being “steered” into groups. Accusations and counteraccusations flew through the pages of academic journals. “One lesson is clear,” the cancer researcher Norman Boyd wrote dismissively in a summary editorial: “randomization in clinical trials should be managed in a manner that makes subversion impossible.”
But such smarting lessons aside, little else was clear. What emerged from that fog of details was a study even more imbalanced than the HIP study. Strax and Shapiro had faltered by selectively depleting the mammography group of high-risk patients. The CNBSS faltered, skeptics now charged, by succumbing to the opposite sin: by selectively enriching the mammography group with high-risk women. Unsurprisingly, the result of the CNBSS was markedly negative: if anything, more women died of breast cancer in the mammography group than in the unscreened group.
It was in Sweden, at long last, that this stuttering legacy finally came to an end. In the winter of 2007, I visited Malmö, the site for one of the Swedish mammography trials launched in the late 1970s. Perched almost on the southern tip of the Swedish peninsula, Malmö is a bland, gray-blue industrial town set amid a featureless, gray-blue landscape. The bare, sprawling flatlands of Skåne stretch out to its north, and the waters of the Øresund strait roll to the south. Battered by a steep recession in the mid-1970s, the region had been frozen, economically and demographically, for nearly two decades. Migration into and out of the city had shrunk to an astonishingly low 2 percent for nearly twenty years. Malmö had been in limbo with a captive cohort of men and women. It was the ideal place to run a difficult trial.
In 1976, forty-two thousand women enrolled in the Malmö Mammography Study. Half the cohort (about twenty-one thousand women) was screened yearly at a small clinic outside the Malmö General Hospital, and the other half not screened—and the two groups have been followed closely ever since. The experiment ran like clockwork. “There was only one breast clinic in all of Malmö—unusual for a city of this size,” the lead researcher, Ingvar Andersson, recalled. “All the women were screened at the same clinic year after year, resulting in a highly consistent, controlled study—the most stringent study that could be produced.”
In 1988, at the end of its twelfth year, the Malmö study reported its results. Overall, 588 women had been diagnosed with breast cancer in the screened group, and 447 in the control group—underscoring, once again, the capacity of mammography to detect early cancers. But notably, at least at first glance, early detection had not translated into overwhelming numbers of lives saved. One hundred and twenty-nine women had died of breast cancer—sixty-three in the screened and sixty-six in the unscreened—with no statistically discernible difference overall.
But there was a pattern behind the deaths. When the groups were analyzed by age, women above fifty-five years had benefited from screening, with breast cancer deaths reduced by 20 percent. In younger women, by contrast, screening with mammography showed no detectable benefit.
This pattern—a clearly discernible benefit for older women, and a barely detectable benefit in younger women—would be confirmed in scores of studies that followed Malmö. In 2002, twenty-six years after the launch of the original Malmö experiment, an exhaustive analysis combining all the Swedish studies was published in the Lancet. In all, 247,000 women had been enrolled in these trials. The pooled analysis vindicated the Malmö results. In aggregate, over the course of fifteen years, mammography had resulted in 20 to 30 percent reductions in breast cancer mortality for women aged fifty-five to seventy. But for women below fifty-five, the benefit was barely discernible.
Mammography, in short, was not going to be the unequivocal “savior” of all women with breast cancer. Its effects, as the statistician Donald Berry describes it, “are indisputable for a certain segment of women—but also indisputably modest in that segment.” Berry wrote, “Screening is a lottery. Any winnings are shared by the minority of women. . . . The overwhelming proportion of women experience no benefit and they pay with the time involved and the risks associated with screening. . . . The risk of not having a mammogram until after age 50 is about the same as riding a bicycle for 15 hours without a helmet.” If all women across the nation chose to ride helmetless for fifteen hours straight, there would surely be several more deaths than if they had all worn helmets. But for an individual woman who rides her bicycle helmetless to the corner grocery store once a week, the risk is so minor that some would dismiss it outright.
In Malmö, at least, this nuanced message has yet to sink in. Many women from the original mammographic cohort have died (of various causes), but mammography, as one Malmö resident described it, "is somewhat of a religion here." On the windy winter morning that I stood outside the clinic, scores of women—some over fifty-five and some obviously younger—came in religiously for their annual X-rays. The clinic, I suspect, still ran with the same efficiency and diligence that had allowed it, after disastrous attempts in other cities, to rigorously complete one of the most seminal and difficult trials in the history of cancer prevention. Patients streamed in and out effortlessly, almost as if running an afternoon errand. Many of them rode off on their bicycles—oblivious of Berry's warnings—without helmets.
Why did a simple, reproducible, inexpensive, easily learned technique—an X-ray image to detect the shadow of a small tumor in the breast—have to struggle for five decades and through nine trials before any benefit could be ascribed to it?
Part of the answer lies in the complexity of running early-detection trials, which are inherently slippery, contentious, and prone to error. Edinburgh was undone by flawed randomization; the BCDDP by nonrandomization. Shapiro’s trial was foiled by a faulty desire to be dispassionate; the Canadian trial by a flawed impulse to be compassionate.
Part of the answer lies also in the old conundrum of over- and underdiagnosis—although with an important twist. A mammogram, it turns out, is not a particularly good tool for detecting early breast cancer. Its false-positive and false-negative rates make it far from an ideal screening test. But the fatal flaw in mammography lies in the fact that these rates are not absolute: they depend on age. For women above fifty-five, the incidence of breast cancer is high enough that even a relatively poor screening tool can detect an early tumor and provide a survival benefit. For women between forty and fifty years, though, the incidence of breast cancer sinks to a point at which a "mass" detected on a mammogram, more often than not, turns out to be a false positive. To use a visual analogy: a magnifying lens designed to make small script legible does perfectly well when the font size is ten or even six points. But then it hits a limit. Below a certain font size, the chances of reading a letter correctly become about the same as the chances of reading it incorrectly. In women above fifty-five, where the "font size" of breast cancer incidence is large enough, a mammogram performs adequately. But in women between forty and fifty, the mammogram begins to squint at an uncomfortable threshold, its inherent capacity as a discriminating test exceeded. No matter how intensively we test mammography in this group of women, it will always be a poor screening tool.
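The arithmetic behind this limit is Bayes' rule: the lower the underlying incidence of disease, the smaller the fraction of positive tests that reflect a true cancer. The short sketch below makes that dependence concrete; all numbers in it are hypothetical, chosen only to show the shape of the effect, and are not mammography's actual sensitivity, specificity, or incidence figures.

```python
# Illustrative only: how the predictive value of a positive screening
# test depends on disease incidence (Bayes' rule). The sensitivity,
# specificity, and prevalence values below are hypothetical.

def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive test reflects true disease."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same hypothetical test applied at two different incidence levels.
ppv_younger = positive_predictive_value(0.85, 0.90, 0.001)  # low incidence
ppv_older = positive_predictive_value(0.85, 0.90, 0.004)    # higher incidence

print(f"PPV at low incidence:  {ppv_younger:.1%}")
print(f"PPV at high incidence: {ppv_older:.1%}")
```

Under these made-up numbers, the identical test is roughly four times more likely to be right about a positive result in the higher-incidence group—the "font size" effect in miniature.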