The good can decay many ways.

  The good candy came anyways.

  The stuffy nose can lead to problems.

  The stuff he knows can lead to problems.

  Some others I’ve seen.

  Some mothers I’ve seen.

  Oronyms are often used in songs and nursery rhymes:

  I scream,

  You scream,

  We all scream

  For ice cream.

  Mairzey doats and dozey doats

  And little lamsey divey,

  A kiddley-divey do,

  Wouldn’t you?

  Fuzzy Wuzzy was a bear,

  Fuzzy Wuzzy had no hair.

  Fuzzy Wuzzy wasn’t fuzzy,

  Was he?

  In fir tar is,

  In oak none is.

  In mud eel is,

  In clay none is.

  Goats eat ivy.

  Mares eat oats.

  And some are discovered inadvertently by teachers reading their students’ term papers and homework assignments:

  Jose can you see by the donzerly light? [Oh say can you see by the dawn’s early light?]

  It’s a doggy-dog world. [dog-eat-dog]

  Eugene O’Neill won a Pullet Surprise. [Pulitzer Prize]

  My mother comes from Pencil Vanea. [Pennsylvania]

  He was a notor republic. [notary public]

  They played the Bohemian Rap City. [Bohemian Rhapsody]

  Even the sequence of sounds we think we hear within a word is an illusion. If you were to cut up a tape of someone’s saying cat, you would not get pieces that sounded like k, a, and t (the units called “phonemes” that correspond roughly to the letters of the alphabet). And if you spliced the pieces together in the reverse order, they would be unintelligible, not tack. As we shall see, information about each component of a word is smeared over the entire word.

  Speech perception is another one of the biological miracles making up the language instinct. There are obvious advantages to using the mouth and ear as a channel of communication, and we do not find any hearing community opting for sign language, though it is just as expressive. Speech does not require good lighting, face-to-face contact, or monopolizing the hands and eyes, and it can be shouted over long distances or whispered to conceal the message. But to take advantage of the medium of sound, speech has to overcome the problem that the ear is a narrow informational bottleneck. When engineers first tried to develop reading machines for the blind in the 1940s, they devised a set of noises that corresponded to the letters of the alphabet. Even with heroic training, people could not recognize the sounds at a rate faster than good Morse code operators, about three units a second. Real speech, somehow, is perceived an order of magnitude faster: ten to fifteen phonemes per second for casual speech, twenty to thirty per second for the man in the late-night Veg-O-Matic ads, and as many as forty to fifty per second for artificially sped-up speech. Given how the human auditory system works, this is almost unbelievable. When a sound like a click is repeated at a rate of twenty times a second or faster, we no longer hear it as a sequence of separate sounds but as a low buzz. If we can hear forty-five phonemes per second, the phonemes cannot possibly be consecutive bits of sound; each moment of sound must have several phonemes packed into it that our brains somehow unpack. As a result, speech is by far the fastest way of getting information into the head through the ear.
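
  The arithmetic behind that argument can be laid out in a few lines. This is only a sketch: the rates are the ones quoted above, and the function names are invented for the illustration. It asks how many milliseconds each unit would get if the units were strictly consecutive, and whether that rate would pass the point at which separate clicks fuse into a buzz.

```python
# Rates quoted in the passage, in units per second.
RATES = {
    "reading-machine noises (Morse-like)": 3,
    "casual speech (upper figure)": 15,
    "fast advertising pitch (upper figure)": 30,
    "artificially sped-up speech": 45,
}

# From the passage: clicks repeated about this fast blur into a low buzz.
FUSION_RATE = 20  # clicks per second


def ms_per_unit(rate_per_second):
    """Milliseconds available per unit if the units were strictly consecutive."""
    return 1000.0 / rate_per_second


def would_fuse(rate_per_second):
    """True if separate sounds repeated at this rate would be heard as a buzz."""
    return rate_per_second >= FUSION_RATE


for label, rate in RATES.items():
    print(f"{label}: {ms_per_unit(rate):.1f} ms each, fuses={would_fuse(rate)}")
```

  At forty-five phonemes per second each phoneme would get about 22 milliseconds, well past the fusion point, which is why the phonemes cannot be separate, consecutive chunks of sound.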

  No human-made system can match a human in decoding speech. It is not for lack of need or trying. A speech recognizer would be a boon to quadriplegics and other disabled people, to professionals who have to get information into a computer while their eyes or hands are busy, to people who never learned to type, to users of telephone services, and to the growing number of typists who are victims of repetitive-motion syndromes. So it is not surprising that engineers have been working for more than forty years to get computers to recognize the spoken word. The engineers have been frustrated by a tradeoff. If a system has to be able to listen to many different people, it can recognize only a tiny number of words. For example, telephone companies are beginning to install directory assistance systems that can recognize anyone saying the word yes, or, in the more advanced systems, the ten English digits (which, fortunately for the engineers, have very different sounds). But if a system has to recognize a large number of words, it has to be trained to the voice of a single speaker. No system today can duplicate a person’s ability to recognize both many words and many speakers. Perhaps the state of the art is a system called DragonDictate, which runs on a personal computer and can recognize 30,000 words. But it has severe limitations. It has to be trained extensively on the voice of the user. You…have…to…talk…to…it…like…this, with quarter-second pauses between the words (so it operates at about one-fifth the rate of ordinary speech). If you have to use a word that is not in its dictionary, like a name, you have to spell it out using the “Alpha, Bravo, Charlie” alphabet. And the program still garbles words about fifteen percent of the time, more than once per sentence. It is an impressive product but no match for even a mediocre stenographer.

  The physical and neural machinery of speech is a solution to two problems in the design of the human communication system. A person might know 60,000 words, but a person’s mouth cannot make 60,000 different noises (at least, not ones that the ear can easily discriminate). So language has exploited the principle of the discrete combinatorial system again. Sentences and phrases are built out of words, words are built out of morphemes, and morphemes, in turn, are built out of phonemes. Unlike words and morphemes, though, phonemes do not contribute bits of meaning to the whole. The meaning of dog is not predictable from the meaning of d, the meaning of o, the meaning of g, and their order. Phonemes are a different kind of linguistic object. They connect outward to speech, not inward to mentalese: a phoneme corresponds to an act of making a sound. A division into independent discrete combinatorial systems, one combining meaningless sounds into meaningful morphemes, the others combining meaningful morphemes into meaningful words, phrases, and sentences, is a fundamental design feature of human language, which the linguist Charles Hockett has called “duality of patterning.”
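
  A rough count shows how little the combinatorial trick demands of the mouth. The sketch below assumes a round inventory of 40 phonemes, a figure chosen for illustration rather than taken from the text, and it ignores the fact that real languages forbid many sequences. The function names are likewise invented.

```python
def count_strings(inventory_size, max_length):
    """Number of distinct strings of length 1..max_length over the inventory."""
    return sum(inventory_size ** n for n in range(1, max_length + 1))


def shortest_length_covering(inventory_size, vocabulary_size):
    """Smallest maximum string length that yields at least vocabulary_size strings."""
    length = 1
    while count_strings(inventory_size, length) < vocabulary_size:
        length += 1
    return length


PHONEMES = 40        # assumed round figure, not from the text
VOCABULARY = 60000   # the word count mentioned in the passage

print(count_strings(PHONEMES, 3))
print(shortest_length_covering(PHONEMES, VOCABULARY))
```

  Under these assumptions, strings of no more than three phonemes already number 65,640, more than enough labels for 60,000 words, even though no single phoneme means anything.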

  But the phonological module of the language instinct has to do more than spell out the morphemes. The rules of language are discrete combinatorial systems: phonemes snap cleanly into morphemes, morphemes into words, words into phrases. They do not blend or melt or coalesce: Dog bites man differs from Man bites dog, and believing in God is different from believing in Dog. But to get these structures out of one head and into another, they must be converted to audible signals. The audible signals people can produce are not a series of crisp beeps like on a touch-tone phone. Speech is a river of breath, bent into hisses and hums by the soft flesh of the mouth and throat. The problems Mother Nature faced are digital-to-analog conversion when the talker encodes strings of discrete symbols into a continuous stream of sound, and analog-to-digital conversion when the listener decodes continuous speech back into discrete symbols.

  The sounds of language, then, are put together in several steps. A finite inventory of phonemes is sampled and permuted to define words, and the resulting strings of phonemes are then massaged to make them easier to pronounce and understand before they are actually articulated. I will trace out these steps for you and show you how they shape some of our everyday encounters with speech: poetry and song, slips of the ear, accents, speech recognition machines, and crazy English spelling.

  One easy way to understand speech sounds is to track a glob of air through the vocal tract into the world, starting in the lungs.

  When we talk, we depart from our usual rhythmic breathing and take in quick breaths of air, then release them steadily, using the muscles of the ribs to counteract the elastic recoil force of the lungs. (If we did not, our speech would sound like the pathetic whine of a released balloon.) Syntax overrides carbon dioxide: we suppress the delicately tuned feedback loop that controls our breathing rate to regulate oxygen intake, and instead we time our exhalations to the length of the phrase or sentence we intend to utter. This can lead to mild hyperventilation or hypoxia, which is why public speaking is so exhausting and why it is difficult to carry on a conversation with a jogging partner.

  The air leaves the lungs through the trachea (windpipe), which opens into the larynx (the voice-box, visible on the outside as the Adam’s apple). The larynx is a valve consisting of an opening (the glottis) covered by two flaps of retractable muscular tissue called the vocal folds (they are also called “vocal cords” because of an early anatomist’s error; they are not cords at all). The vocal folds can close off the glottis tightly, sealing the lungs. This is useful when we want to stiffen our upper body, which is a floppy bag of air. Get up from your chair without using your arms; you will feel your larynx tighten. The larynx is also closed off in physiological functions like coughing and defecation. The grunt of the weightlifter or tennis player is a reminder that we use the same organ to seal the lungs and to produce sound.

  The vocal folds can also be partly stretched over the glottis to produce a buzz as the air rushes past. This happens because the high-pressure air pushes the vocal folds open, at which point they spring back and get sucked together, closing the glottis until air pressure builds up and pushes them open again, starting a new cycle. Breath is thus broken into a series of puffs of air, which we perceive as a buzz, called “voicing.” You can hear and feel the buzz by making the sounds ssssssss, which lacks voicing, and zzzzzzzz, which has it.

  The frequency of the vocal folds’ opening and closing determines the pitch of the voice. By changing the tension and position of the vocal folds, we can control the frequency and hence the pitch. This is most obvious in humming or singing, but we also change pitch continuously over the course of a sentence, a process called intonation. Normal intonation is what makes natural speech sound different from the speech of robots in old science fiction movies and of the Coneheads on Saturday Night Live. Intonation is also controlled in sarcasm, emphasis, and an emotional tone of voice such as anger or cheeriness. In “tone languages” like Chinese, rising or falling tones distinguish certain vowels from others.

  Though voicing creates a sound wave with a dominant frequency of vibration, it is not like a tuning fork or a test of the Emergency Broadcasting System, a pure tone with that frequency alone. Voicing is a rich, buzzy sound with many “harmonics.” A male voice is a wave with vibrations not only at 100 cycles per second but also at 200 cps, 300 cps, 400 cps, 500 cps, 600 cps, 700 cps, and so on, all the way up to 4000 cps and beyond. A female voice has vibrations at 200 cps, 400 cps, 600 cps, and so on. The richness of the sound source is crucial—it is the raw material that the rest of the vocal tract sculpts into vowels and consonants.
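
  The harmonic series just described is easy to generate. In this sketch the fundamentals of 100 and 200 cps come from the passage; the function names and the 1/k weighting of the k-th harmonic are assumptions made only to produce a plausible buzz.

```python
import math


def harmonics(fundamental_cps, ceiling_cps=4000):
    """Whole-number multiples of the fundamental, up to and including the ceiling."""
    return list(range(fundamental_cps, ceiling_cps + 1, fundamental_cps))


def buzz_sample(fundamental_cps, t_seconds, ceiling_cps=4000):
    """One instant of a crude buzz: every harmonic summed as a sine wave.

    The 1/k weighting of the k-th harmonic is an assumed rolloff, not a
    figure from the text.
    """
    total = 0.0
    for k, freq in enumerate(harmonics(fundamental_cps, ceiling_cps), start=1):
        total += math.sin(2.0 * math.pi * freq * t_seconds) / k
    return total


male = harmonics(100)
female = harmonics(200)
print(male[:7])
print(female[:3])
print(len(male), len(female))
```

  Because every component is a multiple of the fundamental, the summed wave repeats exactly once per cycle of the fundamental: `buzz_sample(100, t)` and `buzz_sample(100, t + 0.01)` agree up to rounding error.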

  If for some reason we cannot produce a hum from the larynx, any rich source of sound will do. When we whisper, we spread the vocal folds, causing the air stream to break apart chaotically at the edges of the folds and creating a turbulence or noise that sounds like hissing or radio static. A hissing noise is not a neatly repeating wave consisting of a sequence of harmonics, as we find in the periodic sound of a speaking voice, but a jagged, spiky wave consisting of a hodgepodge of constantly changing frequencies. This mixture, though, is all that the rest of the vocal tract needs for intelligible whispering. Some laryngectomy patients are taught “esophageal speech,” or controlled burping, which provides the necessary noise. Others place a vibrator against their necks. In the 1970s the guitarist Peter Frampton funneled the amplified sound of his electric guitar through a tube into his mouth, allowing him to articulate his twangings. The effect was good for a couple of hit records before he sank into rock-and-roll oblivion.

  The richly vibrating air then runs through a gantlet of chambers before leaving the head: the throat or “pharynx” behind the tongue, the mouth region between the tongue and palate, the opening between the lips, and an alternative route to the external world through the nose. Each chamber has a particular length and shape, which affects the sound passing through by the phenomenon called “resonance.” Sounds of different frequencies have different wavelengths (the distance between the crests of the sound wave); higher pitches have shorter wavelengths. A sound wave moving down the length of a tube bounces back when it reaches the opening at the other end. If the length of the tube is a certain fraction of the wavelength of the sound, each reflected wave will reinforce the next incoming one; if it is of a different length, they will interfere with one another. (This is similar to how you get the best effect pushing a child on a swing if you synchronize each push with the top of the arc.) Thus a tube of a particular length amplifies some sound frequencies and filters out others. You can hear the effect when you fill a bottle. The noise of the sloshing water gets filtered by the chamber of air between the surface and the opening: the more water, the smaller the chamber, the higher the resonant frequency of the chamber, and the tinnier the gurgle.
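
  For one idealized case the "certain fraction" can be made concrete. The sketch below assumes a uniform tube closed at one end and open at the other, a textbook simplification that the passage does not commit to, and a speed of sound of 343 meters per second. Such a tube resonates when its length is an odd number of quarter wavelengths. A real bottle or throat is not a uniform tube, so the figures show the trend, not actual measurements.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed: air at roughly room temperature


def tube_resonances(length_m, count=3):
    """First `count` resonances (cps) of an ideal tube closed at one end.

    Resonance occurs when the tube length equals 1/4, 3/4, 5/4, ... of the
    wavelength, so the n-th resonance is (2n - 1) * c / (4 * L).
    """
    return [
        (2 * n - 1) * SPEED_OF_SOUND_M_PER_S / (4.0 * length_m)
        for n in range(1, count + 1)
    ]


# Hypothetical air columns above the water in a bottle being filled.
for air_column_m in (0.20, 0.10, 0.05):
    lowest = tube_resonances(air_column_m)[0]
    print(f"{air_column_m:.2f} m of air: lowest resonance about {lowest:.0f} cps")
```

  As the air column shrinks from 20 to 5 centimeters, the lowest resonance in this model quadruples, which is the rising, tinnier gurgle of the filling bottle.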

  What we hear as different vowels are the different combinations of amplifications and filtering of the sound coming up from the larynx. These combinations are produced by moving five speech organs around in the mouth to change the shapes and lengths of the resonant cavities that the sound passes through. For example, ee is defined by two resonances, one from 200 to 350 cps produced mainly by the throat cavity, and the other from 2100 to 3000 cps produced mainly by the mouth cavity. The range of frequencies that a chamber filters is independent of the particular mixture of frequencies that enters it, so we can hear an ee as an ee whether it is spoken, whispered, sung high, sung low, burped, or twanged.

  The tongue is the most important of the speech organs, making language truly the “gift of tongues.” Actually, the tongue is three organs in one: the hump or body, the tip, and the root (the muscles that anchor it to the jaw). Pronounce the vowels in bet and butt repeatedly, e-uh, e-uh, e-uh. You should feel the body of your tongue moving forwards and backwards (if you put a finger between your teeth, you can feel it with the finger). When your tongue is in the front of your mouth, it lengthens the air chamber behind it in your throat and shortens the one in front of it in your mouth, altering one of the resonances: for the bet vowel, the mouth amplifies sounds near 600 and 1800 cps; for the butt vowel, it amplifies sounds near 600 and 1200. Now pronounce the vowels in beet and bat alternately. The body of your tongue will jump up and down, at right angles to the bet-butt motion; you can even feel your jaw move to help it. This, too, alters the shapes of the throat and mouth chambers, and hence their resonances. The brain interprets the different patterns of amplification and filtering as different vowels.
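
  One way to picture that last step is as a nearest-match lookup on the pair of resonances. The sketch below uses only the figures given above: the bet and butt values as stated, and for ee the midpoints of the two quoted ranges, which is a choice made for this illustration. The matching rule itself, straight-line distance to the nearest stored pair, is an assumption for illustration and not a claim about how the brain does it.

```python
import math

# (lower resonance, higher resonance) in cps, taken from the passage.
# The ee entry uses the midpoints of 200-350 and 2100-3000 (an assumption).
VOWEL_RESONANCES = {
    "ee": (275.0, 2550.0),
    "bet": (600.0, 1800.0),
    "butt": (600.0, 1200.0),
}


def classify_vowel(lower_cps, higher_cps):
    """Return the stored vowel whose resonance pair lies closest to the input."""
    def distance(label):
        ref_lower, ref_higher = VOWEL_RESONANCES[label]
        return math.hypot(lower_cps - ref_lower, higher_cps - ref_higher)

    return min(VOWEL_RESONANCES, key=distance)


print(classify_vowel(600, 1700))
print(classify_vowel(620, 1250))
print(classify_vowel(300, 2400))
```

  Note that the lookup never asks about the pitch or the source of the sound, only about which frequencies were amplified. That is the sense in which an ee stays an ee whether it is sung, whispered, or twanged.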

  The link between the postures of the tongue and the vowels it sculpts gives rise to a quaint curiosity of English and many other languages called phonetic symbolism. When the tongue is high and at the front of the mouth, it makes a small resonant cavity there that amplifies some higher frequencies, and the resulting vowels like ee and i (as in bit) remind people of little things. When the tongue is low and to the back, it makes a large resonant cavity that amplifies some lower frequencies, and the resulting vowels like a in father and o in core and in cot remind people of large things. Thus mice are teeny and squeak, but elephants are humongous and roar. Audio speakers have small tweeters for the high sounds and large woofers for the low ones. English speakers correctly guess that in Chinese ch’ing means light and ch’ung means heavy. (In controlled studies with large numbers of foreign words, the hit rate is statistically above chance, though just barely.) When I questioned our local computer wizard about what she meant when she said she was going to frob my workstation, she gave me this tutorial on hackerese. When you get a brand-new graphic equalizer for your stereo and aimlessly slide the knobs up and down to hear the effects, that is frobbing. When you move the knobs by medium-sized amounts to get the sound to your general liking, that is twiddling. When you make the final small adjustments to get it perfect, that is tweaking. The ob, id, and eak sounds perfectly follow the large-to-small continuum of phonetic symbolism.

  And at the risk of sounding like Andy Rooney on Sixty Minutes, have you ever wondered why we say fiddle-faddle and not faddle-fiddle? Why is it ping-pong and pitter-patter rather than pong-ping and patter-pitter? Why dribs and drabs, rather than vice versa? Why can’t a kitchen be span and spic? Whence riff-raff, mish-mash, flim-flam, chit-chat, tit for tat, knick-knack, zig-zag, sing-song, ding-dong, King Kong, criss-cross, shilly-shally, see-saw, hee-haw, flip-flop, hippity-hop, tick-tock, tic-tac-toe, eeny-meeny-miney-moe, bric-a-brac, clickety-clack, hickory-dickory-dock, kit and kaboodle, and bibbity-bobbity-boo? The answer is that the vowels for which the tongue is high and in the front always come before the vowels for which the tongue is low and in the back. No one knows why they are aligned in this order, but it seems to be a kind of syllogism from two other oddities. The first is that words that connote me-here-now tend to have higher and fronter vowels than words that connote distance from “me”: me versus you, here versus there, this versus that. The second is that words that connote me-here-now tend to come before words that connote literal or metaphorical distance from “me” (or a prototypical generic speaker): here and there (not there and here), this and that, now and then, father and son, man and machine, friend or foe, the Harvard-Yale game (among Harvard students), the Yale-Harvard game (among Yalies), Serbo-Croatian (among Serbs), Croat-Serbian (among Croats). The syllogism seems to be: “me” = high front vowel; me first; therefore, high front vowel first. It is as if the mind just cannot bring itself to flip a coin in ordering words; if meaning does not determine the order, sound is brought to bear, and the rationale is based on how the tongue produces the vowels.
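
  The ordering rule can be tried out on a handful of the pairs above. This is a deliberately crude sketch: it looks at spelling, not pronunciation, and treats the letters i and e as standing for high front vowels and a and o for low back ones. That assumption happens to hold for these examples but would fail for much of English. The names are invented for the illustration.

```python
FRONT_LETTERS = "ie"  # assumed spelling proxy for high front vowels
BACK_LETTERS = "ao"   # assumed spelling proxy for low back vowels


def first_vowel(word):
    """Return the first front or back vowel letter in the word."""
    for ch in word.lower():
        if ch in FRONT_LETTERS or ch in BACK_LETTERS:
            return ch
    raise ValueError(f"no usable vowel letter in {word!r}")


def follows_order(first, second):
    """True if the first half has a front vowel and the second a back vowel."""
    return first_vowel(first) in FRONT_LETTERS and first_vowel(second) in BACK_LETTERS


PAIRS = [
    ("fiddle", "faddle"), ("ping", "pong"), ("pitter", "patter"),
    ("dribs", "drabs"), ("spic", "span"), ("tick", "tock"),
    ("see", "saw"), ("hee", "haw"), ("criss", "cross"),
]

for first, second in PAIRS:
    print(f"{first}-{second}: {follows_order(first, second)}")
print(f"pong-ping: {follows_order('pong', 'ping')}")
```

  Every attested pair passes the check, and the reversed pong-ping fails it.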