Few scientists had studied this early transition of cancer cells as intensively as George Papanicolaou, a Greek cytologist at Cornell University in New York. Robust, short, formal, and old-worldly, Papanicolaou had trained in medicine and zoology in Athens and in Munich and arrived in New York in 1913. Penniless off the boat, he had sought a job in a medical laboratory but had been relegated to selling carpets at the Gimbels store on Thirty-third Street to survive. After a few months of truly surreal labor (he was, by all accounts, a terrible carpet salesman), Papanicolaou secured a research position at Cornell that may have been just as surreal as carpet selling: he was assigned to study the menstrual cycle of guinea pigs, a species that neither bleeds visibly nor sheds tissue during menses. Using a nasal speculum and Q-tips, Papanicolaou had nonetheless learned to scrape off cervical cells from guinea pigs and spread them on glass slides in thin, watery smears.
The cells, he found, were like minute watch-hands. As hormones rose and ebbed in the animals cyclically, the cells shed by the guinea pig cervix changed their shapes and sizes cyclically as well. Using their morphology as a guide, he could foretell the precise stage of the menstrual cycle often down to the day.
By the late 1920s, Papanicolaou had extended his technique to human patients. (His wife, Maria, in surely one of the more grisly displays of conjugal fortitude, reportedly allowed herself to be tested by cervical smears every day.) As with guinea pigs, he found that cells sloughed off by the human cervix could also foretell the stages of the menstrual cycle in women.
But all of this, it was pointed out to him, amounted to no more than an elaborate and somewhat useless invention. As one gynecologist archly remarked, “in primates, including women,” a diagnostic smear was hardly needed to calculate the stage or timing of the menstrual cycle. Women had been timing their periods—without Papanicolaou’s cytological help—for centuries.
Disheartened by these criticisms, Papanicolaou returned to his slides. He had spent nearly a decade looking obsessively at normal smears; perhaps, he reasoned, the real value of his test lay not in the normal smear, but in pathological conditions. What if he could diagnose a pathological state with his smear? What if the years of staring at cellular normalcy had merely been a prelude to allow him to identify cellular abnormalities?
Papanicolaou thus began to venture into the world of pathological conditions, collecting slides from women with all manner of gynecological diseases: fibroids, cysts, tubercles, inflammations of the uterus and cervix, streptococcal, gonococcal, and staphylococcal infections, tubal pregnancies, abnormal pregnancies, benign and malignant tumors, abscesses and furuncles. He hoped to find some pathological mark in the exfoliated cells.
Cancer, he found, was particularly prone to shedding abnormal cells. In nearly every case of cervical cancer, when Papanicolaou brushed cells off the cervix, he found “aberrant and bizarre forms” with abnormal, bloated nuclei, ruffled membranes, and shrunken cytoplasm that looked nothing like normal cells. It “became readily apparent,” he wrote, that he had stumbled on a new test for malignant cells.
Thrilled by his results, Papanicolaou published his method in an article entitled “New Cancer Diagnosis” in 1928. But the report, presented initially at an outlandish “race betterment” eugenics conference, generated only further condescension from pathologists. The Pap smear, as he called the technique, was neither accurate nor particularly sensitive. If cervical cancer was to be diagnosed, his colleagues argued, then why not perform a biopsy of the cervix, a meticulous procedure that, even if cumbersome and invasive, was considered far more precise and definitive than a grubby smear? At academic conferences, experts scoffed at the crude alternative. Even Papanicolaou could hardly argue the point. “I think this work will be carried a little further,” he wrote self-deprecatingly at the end of his 1928 paper. Then, for nearly two decades, having produced two perfectly useless inventions over twenty years, he virtually disappeared from the scientific limelight.
Between 1928 and 1950, Papanicolaou delved back into his smears with nearly monastic ferocity. His world involuted into a series of routines: the daily half-hour commute to his office with Maria at the wheel; the weekends at home in Long Island with a microscope in the study and a microscope on the porch; evenings spent typing reports on specimens with a phonograph playing Schubert in the background and a glass of orange juice congealing on his table. A gynecologic pathologist named Herbert Traut joined him to help interpret his smears. A Japanese fish and bird painter named Hashime Murayama, a colleague from his early years at Cornell, was hired to paint watercolors of his smears using a camera lucida.
For Papanicolaou, too, this brooding, contemplative period was like a personal camera lucida that magnified and reflected old experimental themes onto new ones. A decades-old thought returned to haunt him: if normal cells of the cervix changed morphologically in graded, stepwise fashion over time, might cancer cells also change morphologically in time, in a slow, stepwise dance from normal to malignant? Like Auerbach (whose work was yet to be published), could he identify intermediate stages of cancer—lesions slouching their way toward full transformation?
At a Christmas party in the winter of 1950, challenged by a tipsy young gynecologist in his lab to pinpoint the precise use of the smear, Papanicolaou verbalized a strand of thought that he had been spinning internally for nearly a decade. The thought almost convulsed out of him. The real use of the Pap smear was not to find cancer, but rather to detect its antecedent, its precursor—the portent of cancer.
“It was a revelation,” one of his students recalled. “A Pap smear would give a woman a chance to receive preventive care [and] greatly decrease the likelihood of her ever developing cancer.” Cervical cancer typically arises in an outer layer of the cervix, then grows in a flaky, superficial whirl before burrowing inward into the surrounding tissues. By sampling asymptomatic women, Papanicolaou speculated that his test, albeit imperfect, might capture the disease at its first stages. He would, in essence, push the diagnostic clock backward—from incurable, invasive cancers to curable, preinvasive malignancies.
In 1952, Papanicolaou convinced the National Cancer Institute to launch the largest clinical trial of secondary prevention in the history of cancer using his smearing technique. Nearly every adult female resident of Shelby County, Tennessee—150,000 women spread across eight hundred square miles—was tested with a Pap smear and followed over time. Smears poured in from hundreds of sites: from one-room doctor’s offices dotted among the horse farms of Germantown to large urban community clinics scattered throughout the city of Memphis. Temporary “Pap clinics” were set up in factories and office buildings. Once collected, the samples were funneled into a gigantic microscope facility at the University of Tennessee, where framed photographs of exemplary normal and abnormal smears had been hung on the walls. Technicians read slides day and night, looking up from the microscopes at the pictures. At the peak, nearly a thousand smears were read every day.
As expected, the Shelby team found its fair share of advanced cancerous lesions in the population. In the initial cohort of about 150,000, invasive cervical cancer was found in 555 women. But the real proof of Papanicolaou’s principle lay in another discovery: astonishingly, 557 women were found to have preinvasive cancers or even precancerous changes—early-stage, localized lesions curable by relatively simple surgical procedures. Nearly all these women were asymptomatic; had they never been tested, they would never have been suspected of harboring preinvasive lesions. Notably, the average age of diagnosis of women with such preinvasive lesions was about twenty years lower than the average age of women with invasive lesions—once again corroborating the long march of carcinogenesis. The Pap smear had, in effect, pushed the clock of cancer detection forward by nearly two decades, and changed the spectrum of cervical cancer from predominantly incurable to predominantly curable.
A few miles from Papanicolaou’s laboratory in New York, the core logic of the Pap smear was being extended to a very different form of cancer. Epidemiologists think about prevention in two forms. In primary prevention, a disease is prevented by attacking its cause—smoking cessation for lung cancer or a vaccine against hepatitis B for liver cancer. In secondary prevention (also called screening), a disease is prevented by screening for its early, presymptomatic stage. The Pap smear was invented as a means of secondary prevention for cervical cancer. But if a microscope could detect a presymptomatic state in scraped-off cervical tissue, then could another means of “seeing” cancer detect an early lesion in another cancer-afflicted organ?
In 1913, a Berlin surgeon named Albert Salomon had certainly tried. A dogged, relentless champion of the mastectomy, Salomon had whisked away nearly three thousand amputated breasts after mastectomies to an X-ray room where he had photographed them after surgery to detect the shadowy outlines of cancer. Salomon had detected stigmata of cancer in his X-rays—microscopic sprinkles of calcium lodged in cancer tissue (“grains of salt,” as later radiologists would call them) or thin crustacean fingerlings of malignant cells reminiscent of the root of the word cancer.
The next natural step might have been to image breasts before surgery as a screening method, but Salomon’s studies were rudely interrupted. Abruptly purged from his university position by the Nazis in the mid-1930s, Salomon escaped the camps to Amsterdam and vanished underground—and so, too, did his shadowy X-rays of breasts. Mammography, as Salomon called his technique, languished in neglect. It was hardly missed: in a world obsessed with radical surgery, since small or large masses in the breast were treated with precisely the same gargantuan operation, screening for small lesions made little sense.
For nearly two decades, the mammogram thus lurked about in the far peripheries of medicine—in France and England and Uruguay, places where radical surgery held the least influence. But by the mid-1960s, with Halsted’s theory teetering uneasily on its pedestal, mammography reentered X-ray clinics in America, championed by pioneering radiographers such as Robert Egan in Houston. Egan, like Papanicolaou, cast himself more as an immaculate craftsman than a scientist—a photographer, really, who was taking photographs of cancer using X-rays, the most penetrating form of light. He tinkered with films, angles, positions, and exposures, until, as one observer put it, “trabeculae as thin as a spider’s web” in the breast could be seen in the images.
But could cancer be caught in that “spider’s web” of shadows, trapped early enough to prevent its spread? Egan’s mammograms could now detect tumors as small as a few millimeters, about the size of a grain of barley. But would screening women to detect such early tumors and extricating the tumors surgically save lives?
Screening trials in cancer are among the most slippery of all clinical trials—notoriously difficult to run, and notoriously susceptible to errors. To understand why, consider the odyssey from the laboratory to the clinic of a screening test for cancer. Suppose a new test has been invented in the laboratory to detect an early, presymptomatic stage of a particular form of cancer, say, the level of a protein secreted by cancer cells into the serum. The first challenge for such a test is technical: its performance in the real world. Epidemiologists think of screening tests as possessing two characteristic performance errors. The first error is overdiagnosis—when an individual tests positive in the test but does not have cancer. Such individuals are called “false positives.” Men and women who falsely test positive find themselves trapped in the punitive stigma of cancer, the familiar cycle of anxiety and terror (and the desire to “do something”) that precipitates further testing and invasive treatment.
The mirror image of overdiagnosis is underdiagnosis—an error in which a patient truly has cancer but does not test positive for it. Underdiagnosis falsely reassures patients of their freedom from disease. These men and women (“false negatives” in the jargon of epidemiology) enter a different punitive cycle—of despair, shock, and betrayal—once their disease, undetected by the screening test, is eventually uncovered when it becomes symptomatic.
The trouble is that overdiagnosis and underdiagnosis are often intrinsically conjoined, locked perpetually on two ends of a seesaw. Screening tests that strive to limit overdiagnosis—by narrowing the criteria by which patients are classified as positive—often pay the price of increasing underdiagnosis because they miss patients that lie in the gray zone between positive and negative. An example helps to illustrate this trade-off. Suppose—to use Egan’s vivid metaphor—a spider is trying to invent a perfect web to capture flies out of the air. Increasing the density of that web, she finds, certainly increases the chances of catching real flies (true positives) but it also increases the chances of capturing junk and debris floating through the air (false positives). Making the web less dense, in contrast, decreases the chances of catching real prey, but every time something is captured, chances are higher that it is a fly. In cancer, where both overdiagnosis and underdiagnosis come at high costs, finding that exquisite balance is often impossible. We want every cancer test to operate with perfect specificity and sensitivity. But the technologies for screening are not perfect. Screening tests thus routinely fail because they cannot even cross this preliminary hurdle—the rate of over- or underdiagnosis is unacceptably high.
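The seesaw can be made concrete with a toy calculation. The sketch below is only an illustration: the marker levels are invented numbers, not data from any real test, and the names `healthy`, `cancer`, and `screen` are hypothetical. It assumes a test that calls a person positive whenever a serum marker is at or above a chosen threshold, then shows how moving that threshold trades sensitivity (cancers caught) against specificity (healthy people correctly cleared).

```python
# Hypothetical serum-marker levels in arbitrary units, invented for illustration.
healthy = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
cancer = [3.0, 4.0, 5.0, 6.0, 7.0]


def screen(threshold):
    """Return (sensitivity, specificity) when a level >= threshold counts as positive."""
    true_pos = sum(level >= threshold for level in cancer)
    false_pos = sum(level >= threshold for level in healthy)
    sensitivity = true_pos / len(cancer)  # fraction of real cancers caught
    specificity = (len(healthy) - false_pos) / len(healthy)  # fraction of healthy cleared
    return sensitivity, specificity


# A low threshold is the dense web; a high threshold is the sparse one.
for threshold in (2.0, 3.5, 5.0):
    sens, spec = screen(threshold)
    print(f"threshold {threshold}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

With these made-up numbers, the lowest threshold catches every cancer but flags most healthy people, while the highest flags no healthy person but misses two of the five cancers. No setting achieves both at once, which is the spider's dilemma in numerical form.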
Suppose, however, our new test does survive this crucial bottleneck. The rates of overdiagnosis and underdiagnosis are deemed acceptable, and we unveil the test on a population of eager volunteers. Suppose, moreover, that as the test enters the public domain, doctors immediately begin to detect early, benign-appearing, premalignant lesions—in stark contrast to the aggressive, fast-growing tumors seen before the test. Is the test to be judged a success?
No; merely detecting a small tumor is not sufficient. Cancer demonstrates a spectrum of behavior. Some tumors are inherently benign, genetically determined to never reach the fully malignant state; and some tumors are intrinsically aggressive, and intervention at even an early, presymptomatic stage might make no difference to the prognosis of a patient. To address the inherent behavioral heterogeneity of cancer, the screening test must go further. It must increase survival.
Imagine, now, that we have designed a trial to determine whether our screening test increases survival. Two identical twins, call them Hope and Prudence, live in neighboring houses and are offered the trial. Hope chooses to be screened by the test. Prudence, suspicious of overdiagnosis and underdiagnosis, refuses to be screened.
Unbeknownst to Hope and Prudence, identical forms of cancer develop in both twins at the exact same time—in 1990. Hope’s tumor is detected by the screening test in 1995, and she undergoes surgical treatment and chemotherapy. She survives five additional years, then relapses and dies in 2000, ten years after the cancer first developed. Prudence, in contrast, detects her tumor only when she feels a growing lump in her breast in 1999. She, too, has treatment, with some marginal benefit, then relapses and dies at the same moment as Hope in 2000.
At the joint funeral, as the mourners stream by the identical caskets, an argument breaks out among Hope’s and Prudence’s doctors. Hope’s physicians insist that she had a five-year survival: her tumor was detected in 1995 and she died in 2000. Prudence’s doctors insist that her survival was one year: Prudence’s tumor was detected in 1999 and she died in 2000. Yet both cannot be right: the twins died from the same tumor at the exact same time. The solution to this seeming paradox—called lead-time bias—is immediately obvious. Using survival as an end point for a screening test is flawed because early detection pushes the clock of diagnosis backward. Hope’s tumor and Prudence’s tumor possess exactly identical biological behavior. But since doctors detected Hope’s tumor earlier, it seems, falsely, that she lived longer and that the screening test was beneficial.
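The arithmetic behind the doctors' quarrel can be laid out in a few lines. The sketch below is a minimal illustration using only the years given in the twins' story; the `Patient` class and its method names are hypothetical conveniences, not part of any real analysis tool. It measures each twin's life two ways: from diagnosis, as the doctors do, and from the year the cancer actually began.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    name: str
    onset: int      # year the cancer actually developed
    diagnosis: int  # year the cancer was detected
    death: int      # year of death

    def survival_after_diagnosis(self):
        """Years lived after detection: the figure the doctors argue about."""
        return self.death - self.diagnosis

    def years_lived_with_cancer(self):
        """Years lived after the cancer began, regardless of when it was found."""
        return self.death - self.onset


hope = Patient("Hope", onset=1990, diagnosis=1995, death=2000)
prudence = Patient("Prudence", onset=1990, diagnosis=1999, death=2000)

# Lead time: how much earlier screening moved the moment of diagnosis.
lead_time = prudence.diagnosis - hope.diagnosis

for twin in (hope, prudence):
    print(twin.name, twin.survival_after_diagnosis(), twin.years_lived_with_cancer())
print("lead time:", lead_time)
```

Measured from diagnosis, Hope appears to survive four years longer than Prudence, and that gap is exactly the lead time. Measured from the onset of disease, the two lives are the same length.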
So our test must now cross an additional hurdle: it must improve mortality, not survival. The only appropriate way to judge whether Hope’s test was truly beneficial is to ask whether Hope lived longer regardless of the time of her diagnosis. Had Hope lived until 2010 (outliving Prudence by a decade), we could have legitimately ascribed a benefit to the test. Since both women died at the exact same moment, we now discover that screening produced no benefit.
A screening test’s path to success is thus surprisingly long and narrow. It must avoid the pitfalls of overdiagnosis and underdiagnosis. It must steer past the narrow temptation to use early detection as an end in itself. Then, it must navigate the treacherous straits of bias and selection. “Survival,” seductively simple, cannot be its end point. And adequate randomization at each step is critical. Only a test capable of meeting all these criteria—proving mortality benefit in a genuinely randomized setting with an acceptable over- and underdiagnosis rate—can be judged a success. With the odds stacked so steeply, few tests are powerful enough to withstand this level of scrutiny and truly provide benefit in cancer.
In the winter of 1963, three men set out to test whether screening a large cohort of asymptomatic women using mammography would prevent mortality from breast cancer. All three, outcasts from their respective fields, were seeking new ways to study breast cancer. Louis Venet, a surgeon trained in the classical tradition, wanted to capture early cancers as a means to avert the large and disfiguring radical surgeries that had become the norm in the field. Sam Shapiro, a statistician, sought to invent new methods to mount statistical trials. And Philip Strax, a New York internist, had perhaps the most poignant of reasons: he had nursed his wife through the torturous terminal stages of breast cancer in the mid-1950s. Strax’s attempt to capture preinvasive lesions using X-rays was a personal crusade to unwind the biological clock that had ultimately taken his wife’s life.
Venet, Strax, and Shapiro were sophisticated clinical trialists: right at the outset, they realized that they would need a randomized, prospective trial using mortality as an end point to test mammography. Methodologically speaking, their trial would recapitulate Doll and Hill’s famous smoking trial of the 1950s. But how might such a trial be logistically run? The Doll and Hill study had been the fortuitous by-product of the nationalization of health care in Great Britain—its stable cohort produced, in large part, by the National Health Service’s “address book” of registered doctors across the United Kingdom. For mammography, in contrast, it was the sweeping wave of privatization in postwar America that provided the opportunity to run the trial. In the summer of 1944, lawmakers in New York unveiled a novel program to provide subscriber-based health insurance to groups of employees in New York. This program, called the Health Insurance Plan (HIP), was the ancestor of the modern HMO.