PRINTED FROM the OXFORD RESEARCH ENCYCLOPEDIA, LINGUISTICS (© Oxford University Press USA, 2016. All Rights Reserved.) Personal use only; commercial use is strictly prohibited. For details, see the applicable Privacy Policy and Legal Notice.

date: 18 February 2018

Direct Perception of Speech

Summary and Keywords

The theory of speech perception as direct derives from a general direct-realist account of perception. A realist stance on perception is that perceiving enables occupants of an ecological niche to know its component layouts, objects, animals, and events. “Direct” perception means that perceivers are in unmediated contact with their niche (mediated neither by internally generated representations of the environment nor by inferences made on the basis of fragmentary input to the perceptual systems). Direct perception is possible because energy arrays that have been causally structured by niche components and that are available to perceivers specify (i.e., stand in 1:1 relation to) components of the niche. Typically, perception is multi-modal; that is, perception of the environment depends on specifying information present in, or even spanning, multiple energy arrays.

Applied to speech perception, the theory begins with the observation that speech perception involves the same perceptual systems that, in a direct-realist theory, enable direct perception of the environment. Most notably, the auditory system supports speech perception, but also the visual system, and sometimes other perceptual systems. Perception of language forms (consonants, vowels, word forms) can be direct if the forms lawfully cause specifying patterning in the energy arrays available to perceivers. In Articulatory Phonology, the primitive language forms (constituting consonants and vowels) are linguistically significant gestures of the vocal tract, which cause patterning in air and on the face. Descriptions are provided of informational patterning in acoustic and other energy arrays. Evidence is next reviewed that speech perceivers make use of acoustic and cross modal information about the phonetic gestures constituting consonants and vowels to perceive the gestures.

Significant problems arise for the viability of a theory of direct perception of speech. One is the “inverse problem,” the difficulty of recovering vocal tract shapes or actions from acoustic input. Two other problems arise because speakers coarticulate when they speak. That is, they temporally overlap production of serially nearby consonants and vowels so that there are no discrete segments in the acoustic signal corresponding to the discrete consonants and vowels that talkers intend to convey (the “segmentation problem”), and there is massive context-sensitivity in acoustic (and optical and other modalities) patterning (the “invariance problem”). The present article suggests solutions to these problems.

The article also reviews signatures of a direct mode of speech perception, including that perceivers use cross-modal speech information when it is available and exhibit various indications of perception-production linkages, such as rapid imitation and a disposition to converge in dialect with interlocutors.

An underdeveloped domain within the theory concerns the very important role of longer- and shorter-term learning in speech perception. Infants develop language-specific modes of attention to acoustic speech signals (and optical information for speech), and adult listeners attune to novel dialects or foreign accents. Moreover, listeners make use of lexical knowledge and statistical properties of the language in speech perception. Some progress has been made in incorporating infant learning into a theory of direct perception of speech, but much less progress has been made in the other areas.

Keywords: Articulatory Phonology, coarticulation, direct perception, invariance problem, inverse problem, multimodal perception, perceptual realism, phonetic gestures, segmentation problem, speech perception

1. Background for the Theory

The direct-realist theory of speech perception (e.g., Best, 1995; Fowler, 1986, 1996; 2011) was developed within the context of a universal, direct-realist theory of perception first proposed by James Gibson (1966, 1979; see also Chemero, 2009; Reed, 1996; Shaw, Turvey, & Mace, 1982). Gibson’s ecological theory (1979) explains how animals can know their environment, as they must do in order to survive and prosper in it. In proposing that animals can know their ecological niche, Gibson’s theory is a realist account. In its further proposal that animals are in unmediated contact with their environment (that is, in contact that is unmediated by mental representations or by inferences about the environment built upon stimulus input), the theory claims that perception is direct (see, e.g., Chemero, 2009; Kelley, 1986).

Direct perception is possible because energy arrays (in reflected light for vision, air for hearing, etc.) to which animals’ perceptual systems are sensitive are lawfully, causally structured over time by objects and events that compose an animal’s ecological niche. Distinctive properties of the objects and events structure energy arrays distinctively, so that the structure over time in energy arrays constitutes information for their causal source. In the theory, sufficiently active, exploratory animals intercept patterned energy in those arrays over time that specifies (stands in 1:1 relation to) components of the niche. They perceive the niche components, not the energy arrays that they intercept. Most typically, information in these arrays specifies “affordances” of the niche for animals, that is, possibilities for action. Canonically, specification is multimodal in the sense that it spans multiple energy arrays (cf. Stoffregen & Bardy, 2001) or else is redundantly present within multiple arrays.

1.1 Gibson’s Direct Realist Theory Applied to Speech

As applied to speech, the theory is that speech perceivers intercept patterned energy arrays that specify language forms, such as consonants, vowels, and word forms. For specification to be possible, those forms have to be public actions that over time can lawfully, causally, and distinctively structure one or more of the energy arrays to which speech perceivers are sensitive. In the account of speech perception as direct, the primitive language forms constituting or composing the consonants and vowels of a language are linguistically significant (“phonetic”) gestures of the vocal tract. They are characterized in section 1.2.1.

There is no logical reason having narrowly to do with speech or language why speech perceivers should recover articulation from the acoustic signal1 rather than going immediately from the acoustic signal to internal representations of language forms (if any). Indeed, the reason for supposing that they perceive phonetic gestures is that there is no special-purpose perceptual system for language (but see Benson, Richardson, Whalen, & Lai, 2006, who provide evidence for distinct brain processing areas for phonetic and nonphonetic perception of the same acoustic signals). Language users perceive speech with their auditory and visual perceptual systems, among others. In an ecological perceptual theory, all perceptual systems are adapted to extract information from patterned energy arrays that were caused by ecological events and that are, thereby, about those events; perceivers thus perceive the events. Listeners to speech, then, should intercept temporally extended patternings of acoustic energy that have been distinctively caused by phonetic gestures of the speaker and thereby detect the gestures that the patternings inform about. Listeners who are also able to see a speaker talking (Sumby & Pollack, 1954), to feel the speaker’s face (Chomsky, 1986), sometimes to feel airflow from the speaker’s vocal tract (Gick & Derrick, 2009), or, if they are talking themselves, to feel themselves talking (Sams, Möttönen, & Sihvonen, 2005) extract gestural information from these sources as well. Ordinarily (outside the laboratory), when more than one of these energy arrays is intercepted during an event of talking, the arrays provide consistent, somewhat redundant information about that one event.

There are many other theories of speech perception. Most contrast with the theory of direct realism in rejecting the idea that phonetic gestures are perceived, proposing instead that acoustic “cues” lead to identification of phonetic segments (or in one case, CV syllables; Massaro, 1998) that are represented in the mind of the perceiver (see, e.g., Diehl, Lotto, & Holt, 2004, and Samuel, 2011, for reviews of these and other accounts). The direct-realist theory also contrasts, although to a lesser extent, with the “motor theory” of Liberman and colleagues (e.g., Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Liberman & Whalen, 2000), which claims that recruitment of the speech motor system is required in perception to “decode” the coarticulated speech signal in order to identify phonetic gestures; that is, in the motor theory, speech signals do not specify their gestural sources.

1.2 Outline of a Research Program for a Theory of Phonetic Perception as Direct

Gibson (1979) presented three main components to a research program for an “ecological” theory of direct realism: description of an animal’s ecological niche to expose what there is to be perceived, description of information in energy arrays that specifies the perceivables of the niche, and presentation of evidence that animals use that information to perceive the niche. The same outline is observed here, following a preliminary discussion of the limited scope of the theory presented in this article.

The theory as outlined here is only about phonetic perception; that is, about how listeners identify primitive language forms produced by speakers. It is ironic, perhaps, that a theory developed from Gibson’s “ecological” theory of perception should be about components of language that are not the most ecologically impactful aspects of language. Broader ecological approaches to language are under development (see, e.g., Golonka, 2015; Raczaszek-Leonardi, 2012; Read & Szokolszky, 2016). However, speech is a valuable place on which to focus attention, because it provides a means by which the language forms that compose utterances can be directly perceived. This puts perceivers in real, unmediated contact with an event of speaking, realizing a central claim of a direct-realist theory.

1.2.1 Speech Perceivables

For a theory of direct perception of speech to be viable, language forms have to be entities or actions in humans’ ecological niche. As such, they lawfully cause patternings distinctive to themselves in one or more of the energy arrays that humans’ perceptual systems can intercept. That distinct language forms give rise to distinct patternings enables the patternings to provide information about the language forms, indeed to specify them. If, as in many accounts of language (see Diehl et al., 2004; Samuel, 2011 for reviews), language forms are, instead, entities (perhaps phonemes, features) that exist primarily in the mind, they cannot be directly perceived, because they are not capable of giving rise to specifying structure in perceptually detectable energy arrays.

Because the theory of direct perception of speech requires public perceivables, it has been developed in the context of compatible theories of linguistic phonology and speech production that identify kinds of language forms that can be perceived directly. Articulatory Phonology, first proposed by Browman and Goldstein (e.g., 1986, 1992), and significantly developed by Goldstein and colleagues (e.g., Pouplier & Goldstein, 2010) and by Gafos and colleagues (e.g., Gafos, 2002; Gafos & Benus, 2006), provides a plausible characterization of language forms. Saltzman’s (e.g., Saltzman & Munhall, 1989) “task dynamics” theory of speech production (among other actions) provides a compatible account of the implementation of Articulatory Phonology’s language forms as coordinated vocal tract actions that exhibit the “equifinality2” characteristic of speech and other actions.

In Articulatory Phonology, the primitive language forms are “phonetic gestures.” Phonetic gestures are coordinated actions of the vocal tract that make and release linguistically relevant constrictions in the vocal tract. Constrictions are distinguished by their location in the vocal tract and the degree to which the tract is open or closed. For example, production of [b], [p], and [m] involves making and releasing a constriction gesture at the lips (constriction location) that achieves complete closure (constriction degree) there; [w] involves closure at the same location, but to a more open constriction degree. English [d], [t], and [n] achieve a complete constriction closure with the tongue tip at the alveolar ridge of the palate. These phonetic gestures instantiate linguistic properties and serve functions identified with phonetic features or segments in other linguistic theories (e.g., N. Chomsky & Halle, 1968; Prince & Smolensky, 2004), but they differ from them in being public actions. The actions, implemented by transiently established coordinative relations among vocal tract articulators (“coordinative structures”), are identified by Saltzman and colleagues (e.g., Saltzman & Kelso, 1987) as dynamical systems. In a task dynamics model of speech articulation (e.g., Saltzman and Munhall, 1989), phonetic gestures exhibit the flexibility and equifinality of the actions of humans and other animals. For example, in the context of temporally overlapping production of low vowels, which pull the jaw down, or high vowels, which raise it, lip closure is achieved for [b] via jaw movements accompanied by actions of the upper and lower lips that compensate for the different pulls on the jaw by the vowels.
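The jaw-lip compensation just described can be sketched as a point-attractor simulation in the spirit of the task-dynamics account. This is a minimal sketch, assuming a single lip-aperture tract variable driven by a critically damped spring toward a closure target; the parameter values and the one-variable simplification are illustrative assumptions, not Saltzman and Munhall’s model:

```python
import math

def lip_closure(aperture0, k=100.0, dt=0.001, steps=3000):
    """Drive the lip-aperture tract variable to a closure target (0 mm)
    with a critically damped point attractor, in the spirit of task dynamics."""
    b = 2.0 * math.sqrt(k)          # critical damping: approach target without overshoot
    x, v = aperture0, 0.0           # initial aperture (mm) and velocity
    for _ in range(steps):
        a = -k * x - b * v          # spring force toward target 0, plus damping
        v += a * dt
        x += v * dt
    return x

# The same closure target is reached from jaw positions set by different vowels
final_low  = lip_closure(12.0)   # low-vowel context: jaw pulled down, wide initial aperture
final_high = lip_closure(4.0)    # high-vowel context: jaw raised, narrow initial aperture
print(round(final_low, 4), round(final_high, 4))  # both near 0: equifinality
```

From either vowel-determined starting aperture, the tract variable settles at the same closure value: equifinality with respect to the goal, achieved over different articulator trajectories.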

If language forms are public actions, then they can lawfully and distinctively cause structure in energy arrays such as those in air and reflected light that are available to perceivers; this is the topic of the next section. However, there is an important complication that has driven much of the research directed at listeners’ identification of the component consonants and vowels of a spoken utterance.

The complication is that speakers coarticulate. Coarticulation is temporal overlap among serially nearby phonetic gestures.3 It is not unique to speech, but it is especially salient in activities (such as speaking or typing) in which actions (phonetic gestures for speech, finger movements for typing) that are discrete components of a larger activity (e.g., producing an utterance, typing a sentence) are serially ordered. Temporal overlap of action components is an efficient, even necessary, way of producing such actions, but, in speech, it has led theorists to identify two “problems,” ostensibly for perceivers, but, if not for them, certainly for theorists of speech perception. These are the “segmentation” and “invariance” problems. The segmentation problem is that there are no discrete phonetic-segment- or -gesture-sized intervals in the speech signal that correspond to the component consonants and vowels of word forms. The invariance problem refers to a second consequence of overlap. At any point in time, the acoustic signal carries information about multiple phonetic gestures; accordingly, although invariant acoustic signatures of phonetic primitives have been sought (e.g., Stevens & Blumstein, 1981), they have not yet been found.
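The two problems can be made concrete with a toy “gestural score” in which each gesture’s influence waxes and wanes over an interval that overlaps its neighbors. This is a minimal sketch, assuming raised-cosine activation curves and invented timing values for an [iba]-like sequence:

```python
import math

def activation(t, onset, duration):
    """Raised-cosine activation: waxes from 0, peaks mid-gesture, wanes back to 0."""
    if t < onset or t > onset + duration:
        return 0.0
    phase = (t - onset) / duration            # 0..1 through the gesture's time course
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))

# Toy score for [iba]: tongue-body [i], lip closure [b], tongue-body [a]
# (onset, duration) pairs in seconds; the numbers are invented for illustration
gestures = {"i": (0.00, 0.25), "b": (0.15, 0.20), "a": (0.20, 0.30)}

t = 0.22  # a moment during the [b] closure
active = {g: round(activation(t, on, dur), 2) for g, (on, dur) in gestures.items()}
print(active)  # all three gestures contribute at this instant
```

At the chosen instant all three gestures are simultaneously active, so no context-free acoustic signature of any one of them is available (invariance problem), and no boundary isolates any one of them (segmentation problem); yet each activation still waxes and wanes around its own distinct peak, which is what makes tracking individual gestures conceivable.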

These characterizations of the acoustic speech signal have led some theorists to question whether speech is, in fact, composed of discrete language forms (e.g., Port, 2007). However, there is good evidence that speech is alphabetic (summarized, e.g., by Fowler, Shankweiler, & Studdert-Kennedy, 2016).4 Accordingly, an important problem to be solved, especially for a theory of direct perception of speech, is how the acoustic information available to speech perceivers can possibly specify discrete language forms.

1.2.2 Information for Phonetic Perception

Acoustic Information for Phonetic Perception

Effective information for language forms should, in a direct-realist theory, be patterning over time in perceptually accessible energy arrays that has been lawfully and distinctively caused by the language forms (in the present account, by phonetic gestures). Such patternings can specify language forms to perceivers. In the next section, evidence for information in energy patterns other than in air will be considered. Here, the focus is on the acoustic signal, which must be the primary source of information for language forms.

Articulatory actions for speech cause distinctive patterning in the acoustic signal. Following is a brief and highly simplified description of some of the patterning. (Additional information and some pictures of vowel and consonant acoustic patterns can be found online, for example on the “Speech Resource Page” of Macquarie University’s Linguistics Department website.) Generally, speakers produce speech on an expiratory airflow. Air from the lungs passes through the glottis, where the vocal cords can be open for unvoiced sounds or cycling open and closed for voiced sounds. For voiced sounds, periodically closing and opening the vocal folds creates successive puffs of air having energy at the fundamental frequency (the rate of vocal fold cycling) and, with progressively lower amplitudes, at integral multiples of the fundamental. Vowels are produced with fairly open constrictions made with the tongue body at different locations in the oral cavity. The resulting shape of the oral cavity serves as a filter on the frequencies in the airflow from the glottal region. The effect of the filtering is to maintain the energy levels in some frequency regions (at the resonant frequencies of the tract) but to attenuate energy in others. Resonances of the vocal tract are called “formants,” and the locations of the lower three formants, in particular, serve as distinctive information for vowels and vowel-like consonants (including [ɹ],5 [l], [w], [y] in English). Consonants made with tighter constrictions in the oral cavity, however, structure the air differently. Fricatives are produced with constriction gestures (for example, with the tongue tip at the alveolar ridge of the palate for English [s] and [z]) that leave a narrow passage for the expiratory airflow; the passage is narrow enough to create noise-like turbulent energy. Accordingly, frication noise constitutes an important kind of information for such consonants as [s], [z], [f], [v].
Constrictions for stop consonants, as their name suggests, are complete, and so completely stop the airflow briefly. Acoustic information for oral stop consonants ([b], [p], [d], [t], [g], [k], in English) includes near silence during the closure phase while the airflow is stopped, then a burst of energy upon release of the constriction (called a “burst”), and transitions of formants into the following open oral cavity toward the next vowel. In production of nasal stops (such as [m]), while the airflow is stopped in the oral cavity, there is airflow through the nose so that the closure interval has acoustic energy at the resonant frequency of the nasal cavity.
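The source-filter account sketched above can be simulated minimally: a pulse train at the fundamental frequency, carrying energy at the harmonics, is passed through a cascade of two-pole digital resonators centered at the formant frequencies. The recursion below is the standard two-pole formant resonator; the f0, formant, and bandwidth values are illustrative assumptions (roughly an [a]-like vowel), not measurements:

```python
import math

def resonator(signal, freq, bw, fs=16000):
    """Two-pole digital resonator: boosts energy near freq, attenuates elsewhere."""
    r = math.exp(-math.pi * bw / fs)                     # pole radius from bandwidth
    c1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)   # feedback coefficients
    c2 = -r * r
    gain = 1.0 - c1 - c2                                 # normalize gain at DC
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = gain * x + c1 * y1 + c2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

fs, f0, dur = 16000, 120, 0.2
# Glottal source: impulse train at the fundamental frequency (~120 Hz),
# whose spectrum has energy at the fundamental and its harmonics
source = [1.0 if n % (fs // f0) == 0 else 0.0 for n in range(int(fs * dur))]

# Cascade formant filters for an [a]-like vowel: (frequency, bandwidth) in Hz
vowel = source
for freq, bw in [(700, 80), (1200, 90), (2600, 120)]:
    vowel = resonator(vowel, freq, bw, fs)
```

The filtering leaves the harmonic structure of the source intact while reshaping its spectral envelope, which is exactly the division of labor between glottal source and vocal-tract filter described in the text.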

The foregoing characterization is a simplification in many respects. Perhaps most importantly in this context, it presents consonants and vowels as if they are produced in isolation. However, as noted, speakers coarticulate. For example, at the same time that a speaker closes his/her lips to produce [b] in an utterance such as [iba], the constriction locations and degrees of the tongue body smoothly change from those for the preceding vowel into those for the following vowel. Throughout much of the acoustic signal for [iba] it is shaped by two or even all three phonetic gestures (lack of invariance), and the signal cannot be segmented into three discrete zones each of which contains only information for one of the segments (segmentation problem).

No specifying acoustic information for phonetic gestures has been found, and there are at least two kinds of barriers to the possibility of finding any. One is coarticulation, already discussed, which gives rise to massive context sensitivity in the acoustic domain most closely associated with a given consonant or vowel. The second is the “inverse problem”—the difficulty, or even, in some accounts, the impossibility, of recovering articulatory actions from the acoustic signals they cause.

If the inverse problem is conceptualized as recovery of a static vocal-tract shape from instantaneous values of the lower three formants, inversion is indeterminate (e.g., Mermelstein, 1967; Schroeder, 1967; Sondhi, 1979). That is, multiple vocal tract shapes can be compatible with a given set of instantaneous values of the lower three formants. From a direct-perception perspective, however, this approach to inversion is analogous to, and doomed for the same reason as, a theory that a two-dimensional retinal image is the starting point for visual perception. Perceivers live in a world of events, and there are richly patterned sources of information in reflected light, air, and other media that provide specifying information over time. Theorists of visual perception are hamstrung if they imagine that visual perceivers are constrained to extracting information from successive freeze-frames on the retina. By the same token, speaking is action, not shape generation. Attempts at inversion are hamstrung if they restrict themselves to extraction of successive shapes from successive instantaneous measurements of the acoustic signal.

For proponents of theories of speech perception in which recovery of articulatory action plays no part, the findings of Mermelstein (1967) and Schroeder (1967), among others, have been judged definitive evidence against any theories in which listeners are proposed to perceive articulation (e.g., Diehl et al., 2004). However, this judgment is premature.6 Other approaches to the inverse problem have had more success than those already characterized. (For a review, see Iskarous, 2010.) For example, Yehia and Itakura (1996) augmented formant information with an articulatory cost function, essentially that the recovered vocal-tract shape must be one that minimizes distance from a neutral shape of the vocal tract. Using that constraint led to successful recovery of vocal tract shapes associated with a set of 12 French vowels. In a different approach, Iskarous (2010) chose only to require successful recovery of linguistically critical components of vowels: constriction locations and degrees and lip aperture, another phonetic property in the theory of Articulatory Phonology. Recovery of these phonetic properties of vowel shapes from formant frequencies and amplitudes was successful for 10 vowels of American English. Both Yehia and Itakura (1996) and Iskarous (2010) were sensitive to the fact that speech does not consist of successions of static vowels and expressed optimism that their respective approaches could be expanded to address recovery of dynamic articulatory movement. Some progress in recovery of sequences is reported by McGowan (1994) and by Hogden, Rubin, McDermott, Katagiri, and Goldstein (2007).
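Yehia and Itakura’s use of an articulatory cost function can be illustrated in miniature. In the toy below, two articulator parameters map many-to-one onto a single “acoustic” observation (their sum), so inversion alone is indeterminate; penalizing distance from a neutral configuration then selects a unique solution. The forward map, the grid search, and the neutral shape are invented for illustration and bear no relation to their actual vocal-tract model:

```python
def invert(observed, neutral=(0.5, 0.5), grid=101):
    """Among all articulator settings (a, b) whose toy forward map a + b
    matches the observation, return the one closest to the neutral shape."""
    best, best_cost = None, float("inf")
    for i in range(grid):
        a = i / (grid - 1)                 # candidate articulator 1 in [0, 1]
        b = observed - a                   # forced by the toy forward map a + b
        if not 0.0 <= b <= 1.0:
            continue                       # physically impossible setting
        cost = (a - neutral[0]) ** 2 + (b - neutral[1]) ** 2
        if cost < best_cost:
            best, best_cost = (a, b), cost
    return best

# Infinitely many (a, b) pairs sum to 0.8; the cost function selects a = b = 0.4,
# the compatible configuration nearest the neutral shape
print(invert(0.8))
```

The point of the toy is only that an articulatory constraint converts an underdetermined inversion into a well-posed one, which is the logic behind the cost-function approach described above.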

Unsurprisingly, none of these approaches deals with conversational speech in all its variability and complexity, but the approaches do constitute progress in addressing the inverse problem beyond earlier efforts to extract detailed static shapes from limited acoustic information. However, something important is missing from all of the approaches from a direct-realist stance.

A strategy from a direct-realist approach would share with that of Iskarous (2010) special attention to linguistically relevant components of vocal tract articulation. It would share with Yehia and Itakura (1996) attention to constraints associated with vocal tract action, and it would share with some other approaches attention to vocal tract action rather than vocal tract shape. However, attention needs to be focused as well on how coarticulated phonetic gestures structure acoustic signals, and that is missing from all approaches taken so far.

As noted, phonetic gestures are produced in overlapping time frames. Moreover, as physical actions, gestures have a necessary time course in which they gradually come into being and gradually terminate. As a gesture begins to be produced, it has small, perhaps undetectable, acoustic consequences that may be buried in the consequences of a preceding, ongoing gesture. Over time, however, the gesture’s impact on the signal grows as those of its predecessors weaken, reaches its maximum around the time that the gesture achieves its constriction location and degree, and then decreases. Such a wave of waxing, then waning, acoustic effects of a coherent constriction gesture may allow it to be tracked even though it is superimposed on temporally offset waxing-waning waves caused by preceding and following gestures. A strategy for recovering gestures should involve tracking those acoustic waves, because they are the distinctive patterns that the gestures cause. There is some evidence, described in section 1.2.3, that listeners attend to the acoustic speech signal in just this way.

Multimodal Information for Phonetic Perception

As animals move through their environment, their movements are guided by information available to many of their perceptual systems. The global path they take may be guided by visual information (and, for example, by auditory information for individuals seeking the source of a sound, or by olfactory information for individuals tracking a smell7). In addition, their stepping actions are guided by vestibular information for maintaining dynamic balance and by information provided to a variety of receptors throughout the body (for example, receptors in skin, muscle, and joints; see, e.g., Turvey & Carello, 2011) that is about self-movement, about the solidity and smoothness or bumpiness of the terrain on which they are stepping, and about the forces exerted on the body by the support surface. In short, perception is canonically multimodal in involving information available to these multiple systems; however, it can also be considered “amodal,” because locomoting animals experience just the one event of moving in their environment. They do not live separately in visual, auditory, olfactory, vestibular, and haptic worlds.

Of course, speech perception can be based on acoustic information only. Telephones are effective devices for conversational exchanges. However, in face-to-face situations, multimodal information is available. Yehia, Rubin, and Vatikiotis-Bateson (1998) measured facial movements (from markers placed on the face), articulatory movements (with magnetometer transducers placed on the tongue, jaw, and lips), and acoustic speech signals (measures derived from linear prediction coefficients and amplitudes) during utterance production. They found that substantial variance in all three domains was predicted by variability in the others. For example, 80% of the variance in vocal tract motions was predicted from measures of facial motion, and about 65% of variability in vocal tract actions and facial movements was predictable from the speech acoustics.8 Chandrasekaran, Trubanova, Stillittano, Caplier, and Ghazanfar (2009) also found correlations between facial measures (e.g., area of mouth opening) and acoustic speech measures (e.g., overall acoustic envelope, vocal-tract resonances).
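Claims of the form “80% of the variance in one domain was predicted from another” rest on linear estimation between measurement domains, scored by variance explained (R²). Below is a minimal single-channel sketch with synthetic, invented data; the cited studies used multichannel linear estimators over real recordings:

```python
import math, random

random.seed(1)

# Synthetic "articulatory" channel and a correlated "facial" channel plus noise
articulatory = [math.sin(0.05 * n) for n in range(400)]
facial = [0.9 * x + random.gauss(0.0, 0.1) for x in articulatory]

# Ordinary least squares: predict the articulatory channel from the facial one
mx = sum(facial) / len(facial)
my = sum(articulatory) / len(articulatory)
sxy = sum((x - mx) * (y - my) for x, y in zip(facial, articulatory))
sxx = sum((x - mx) ** 2 for x in facial)
slope = sxy / sxx
intercept = my - slope * mx

# R^2: proportion of articulatory variance explained by the facial predictor
pred = [slope * x + intercept for x in facial]
ss_res = sum((y - p) ** 2 for y, p in zip(articulatory, pred))
ss_tot = sum((y - my) ** 2 for y in articulatory)
r_squared = 1.0 - ss_res / ss_tot
print(round(r_squared, 2))  # high: most articulatory variance is predictable from the face
```

Because the two synthetic channels share a common cause (here, by construction), most of the variance in one is linearly predictable from the other; the empirical finding is that real facial, articulatory, and acoustic measures are coupled in just this way.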

Information for facial movements is available to individuals who can see the face of the speaker and to perceivers (perhaps users of the “Tadoma” method of speechreading; e.g., Chomsky, 1986) who feel the face of a speaker with a hand, or who are themselves the speaker and have tactile and other proprioceptive information for what they are doing. Research evidence (briefly reviewed in section 1.2.3) suggests that speech perceivers use all of these kinds of information. Perception may be amodal in that perceivers may not know which energy array—acoustic, reflected light, haptic, or even sometimes proprioceptive—is a source of critical perceived phonetic properties of a speech event. As Stoffregen and Bardy (2001) propose for other actions, some specifying information may span energy arrays.

1.2.3 Evidence That Listeners Use Acoustic and Other-Modal Information to Perceive Phonetic Gestures

Having identified the primitive speech perceivables as phonetic gestures of the vocal tract and having characterized the acoustic and other-modal information for gestures as best we can to date, this section reviews evidence that listeners, in fact, make use of that information in speech perception and, therefore, perceive phonetic gestures. Four lines of evidence are summarized here. None by itself “proves” that listeners perceive gestures. Taken together, the research lines converge on that conclusion.

The First Research Line: “When Articulation and Sound Wave Go Their Separate Ways”

This sentence fragment comes from Liberman (1957, p. 121), who summarized findings that led him to develop his “motor theory” of speech perception, which shares with the direct-perception account a claim that listeners perceive speech articulations. He goes on to note that, under those circumstances, perception tracks articulation. Literally, articulation and the sound wave cannot “go their separate ways,” because the one causes the other. However, due to effects of coarticulation, they can appear to do so under some conditions, and, as noted by Liberman (1957), in those cases, perception follows articulation: (a) when quite different consonantal acoustic patterns are caused by the same consonant gesture coarticulating with different vowels, listeners report hearing the same consonant; and (b) when different consonantal gestures coarticulating with different vowels give rise to the same consonantal acoustic patterns, listeners report hearing different consonants. An example of the first condition occurs with highly simplified synthetic speech syllables [di] and [du]. In both of these syllables, the critical information for the identity of the initial consonant is the transition of the second formant into the vowel. Remarkably, those transitions are acoustically very different. In [di], it is a high rise in frequency; in [du], it is a low fall. Excised from their contexts, they sound quite different, one a high-pitched, the other a low-pitched, “chirp.” In context, they both sound like [d] (Liberman, Delattre, Cooper, & Gerstman, 1954). In context, both transitions signal release of a [d] gesture into the vocal-tract shape for the following vowel. In an example of the second condition, acoustically the same stop burst sounds like [p] before vowels [i] and [u], but like [k] before [a] (Liberman, Delattre, & Cooper, 1952).
In this case, the burst, in [i] and [u] contexts, can only have been produced by release of a constriction at the lips; in the [a] context, it can only have been produced by release of a constriction by the tongue body against the soft palate. (For additional evidence of the same kind, see Fitch, Halwes, Erickson, & Liberman, 1980.)

The Second Research Line: Perceptual Use of Gestural Information Across Modalities

In a theory of direct perception, speech perceivers extract information for speech gestures across any detectable energy arrays that have been causally structured by the gestures. For example, speech in noise is more intelligible to listeners who can see the face of a speaker (e.g., Sumby & Pollack, 1954) than to listeners who cannot. Moreover, infants as young as four months of age detect the correspondence of visible speech gestures and their acoustic consequences (Kuhl & Meltzoff, 1982). Although native-language learning affects how six-month-olds extract information from visible speech gestures (Altvater-Mackensen, Mani, & Grossmann, 2016), Kuhl and Meltzoff’s findings (1982) suggest a native or near-native ability for cross-modal or amodal perception of speech.

Perceivers also extract gestural information from the face haptically. Deaf-blind individuals sometimes learn to perceive some speech by means of the “Tadoma” method (see, e.g., Chomsky, 1986), in which they place a hand on the face of a speaker in such a way that they can pick up facial gestures and detect vocal cord vibration to distinguish voiced from unvoiced consonants.

The McGurk effect has been a valuable tool for exploring cross-modal speech perception (McGurk & MacDonald, 1976). The effect occurs when a video of a speech event is appropriately dubbed. For example, dubbing a face mouthing [da] with acoustic [ma] leads most listener-observers to report hearing [na], a report that reflects perception of constriction location and degree from the visible speech gestures but perception of other gestures (e.g., the nasality of the consonant and the identity of the vowel) from the acoustic specification. It is striking that cross-modal perception occurs under other conditions as well. Haptically feeling the face of a speaker mouthing [ga] or [ba] synchronously with syllables ambiguous between [ga] and [ba] shifts judgments in the direction of the felt syllable (Fowler & Dekle, 1991). Remarkably, proprioceptive information for gestures is effective as well. Listeners who silently mouth [ka] while [pa] is presented acoustically give more [ka] judgments than when they mouth [pa] or when they only listen (Sams, Möttönen, & Sihvonen, 2005). Finally, perceivers who feel a puff of air on the neck presented synchronously with acoustic [ba] increase judgments that the syllable is (aspirated, breathy) [pa] relative to a condition in which no air puff is presented with the syllable (Gick & Derrick, 2009). In all cases, across all of these perceptual modalities, the common thread is information in energy arrays that has been caused (or plausibly has been caused) by phonetic gestures of the vocal tract. Multimodal perception means perception of one real-world event of talking.

The Third Research Line: Tracking Coarticulated Gestures and Possible Solutions to the Segmentation and Invariance Problems

Listeners behave as if they are tracking overlapping gestures when they perceive speech. To see how this can be determined, consider the highly schematized representation of the acoustic consequences of two overlapping phonetic gestures in Figure 1. The figure shows schematically how the acoustic consequences of two phonetic gestures (the stippled region in the figure for the first gesture, the grey region for the second) develop as each gesture is produced over time. As described earlier, a phonetic gesture has a gradually increasing impact on the acoustic signal. Its impact reaches a maximum and then decreases. Because speakers coarticulate, the acoustic consequences of multiple gestures (two in the figure) overlap so that, during most intervals (the region that is both stippled and grey), information for more than one gesture is present in the acoustic signal at the same time.


Figure 1. A schematic display of two “acoustic waves” generated by two temporally overlapping phonetic gestures.

In the display, a vertical line divides the acoustic interval into two parts. The division is placed where the first gesture ceases to have the larger acoustic impact and the next gesture takes over. In conventional measurements of speech from visual displays, a criterion like this is used to divide the display roughly into phonetic segments. For example, an interval with formant structure might be identified with a vowel, a following silent interval followed by a stop burst might be identified with a stop consonant, and so on. This is a convenient way to segment the signal for many purposes, but it is not how listeners identify discrete phonetic components of an utterance.

In a direct-realist account, listeners do not seek temporally discrete acoustic segments and so do not encounter the segmentation problem (the problem that there aren’t any discrete acoustic segments corresponding to phonetic segments; see 1.2.1). Instead, they track the lawfully caused acoustic consequences of distinct gestures over the detectable time courses of the gestures. Hypothetically, this strategy also involves detection of invariant acoustic patterning that is lawfully and distinctively caused by distinct, but temporally overlapping, gestures, a proposed solution to the invariance problem raised in 1.2.1. Invariants of this nature have not been sought or found, however.

There are several signatures of listeners’ strategy of tracking acoustic consequences of gestures. One is that, in context, despite the existence of context-colored regions (that are both grey and stippled) to the left and right of the vertical line in Figure 1, neither the earlier nor the later of the two phonetic segments sounds context-colored to listeners. The whole stippled region will count as information for the first gesture, and the whole grey region will count as information for the second. Some findings confirm this prediction of “parsing” (e.g., Fowler, 1981, 1984; Fowler & Smith, 1986); other findings show that parsing occurs, but is incomplete, so that, for example, a vowel in which a nasal gesture for a following consonant begins to affect the acoustic signal does sound slightly more nasalized than one with no overlapping nasal gesture (e.g., Fowler & Brown, 2000; Krakow & Beddor, 1999). However, it sounds less nasalized in context (where the nasalization can be ascribed to the consonant) than it does isolated from its context.

The first “signature” of parsing, just described, is about what listeners should not hear (context-coloring). A second signature of the proposed mode of extracting acoustic information from the signal is about what they do hear. The prediction is that “grey” information to the left of the vertical line in Figure 1 will serve as information for the second gesture even though it occurs during a time conventionally identified with the first. An interesting recent demonstration of this involves eye tracking, although other methods have also confirmed the prediction (e.g., Martin & Bunnell, 1981; Whalen, 1984). Beddor, McGowan, Boland, Coetzee, & Brasher (2013) presented listeners with sentences having final words such as bend (with a nasal consonant [n] after the vowel) or bed (with no nasal). In bend, coarticulatory nasality begins during the preceding vowel. Participants’ task was to choose which of two pictures, presented on either side of a fixation mark, represented the sentence-final word. Important data were provided by participants’ eye gazes as they listened. Evidence suggested that looks to the picture of bend were programmed during the vowel, before the closure interval for the nasal consonant began. That is, listeners used evidence for the nasal gesture during the vowel (so, in the grey area to the left of the vertical line in Figure 1) as information for a nasal consonant.

A final indication of listeners’ tracking of acoustic information for overlapping phonetic gestures is “compensation for coarticulation.” This finding, redundantly with the first two, is a way of showing that information in the domain of one gesture that has been caused by another (e.g., the grey region to the left of the vertical line in Figure 1 and the stippled region to the right) is ascribed to the coarticulating gesture. In a much-studied example (pioneered by Mann, 1980), listeners heard consonant-vowel syllables along a [da] to [ga] continuum. Three continuum members are shown superimposed on the right side of Figure 2. Continuum members differed in the onset frequency of the third formant (F3) and so also in the shape of the F3 transitions. The endpoint [da] transition (dashed line in the figure) fell in frequency to the F3 for [a], the endpoint [ga] transition (solid line) rose in frequency, and the F3s of intermediate syllables gradually shifted from that of [da] to that of [ga] in equal increments across the continuum. The figure shows just an illustrative mid-continuum transition (dotted line). Continuum members were preceded by the syllables [aɹ] and [al]. When continuum members were preceded by [aɹ], listeners gave more [d] judgments than when continuum members were preceded by [al]. This might occur because of the different coarticulatory effects that the back constriction of [ɹ] and the front constriction of [l] should have when they overlap with constriction gestures for [d] and [g]. Think of the acoustic consequences of [ɹ] and [l] as the stippled region of Figure 1, with [d] or [g] as the grey region. The coarticulatory effects of [ɹ] and [l] on [d] and [g] then are the stippled region to the right of the vertical line. In articulation, the tongue body constriction for [ɹ] might pull back the point along the palate at which the tongue tip makes a [d] constriction if the gestures overlap temporally. The tongue tip constriction of [l] overlapping with the tongue body gesture for [g] might pull the point of constriction along the palate forward for [g]. Those coarticulatory effects have an impact on the F3 transitions for [d] and [g]. Listeners, then, behave as if they are “compensating” for those effects, hearing as [d] a too-far-back (as signaled by the F3 transition) continuum member if the preceding syllable is [aɹ], but as [g] (now too-front) if it is [al].


Figure 2. Right side: Schematic spectrographic display of stimuli similar to those used in research by Mann (1980) illustrating the endpoint [da] F3 transition (dashed line), endpoint [ga] F3 transition (solid line) and a continuum midpoint F3 transition (dotted line). Left side: schematic critical F3s of [aɹ] and [al] precursor syllables. These F3s are provided to illustrate the spectral contrast account of compensation for coarticulation. See the text for explanation.

The finding just described reveals compensation for “carryover” coarticulation from (earlier) [ɹ] or [l] onto (later) [d] or [g]. Compensation is also found for anticipatory coarticulation (Mann & Repp, 1980) and for nondirectional coarticulation, as when an intonational peak in fundamental frequency (F0, roughly heard as voice pitch) sounds lower in pitch when its accompanying (coarticulating) vowel is [i] (associated itself with a high “intrinsic” F0) than when its vowel is [a] (with a low intrinsic F0; Silverman, 1987).

The foregoing interpretation of compensation for coarticulation has been disputed by investigators (e.g., Lotto & Kluender, 1998; Lotto & Holt, 2006; see also Kingston, Kawahara, Chambless, Key, Mash, & Watsky, 2014) who point out that many of the findings are consistent with the occurrence of spectral contrast effects. Figure 2 shows how the findings of Mann (1980) might be explained as spectral contrast. The left side of the figure shows schematically the critical F3 contours for [ɹ] (solid line) and [l] (dashed line). In a contrast account, the low ending F3 of [ɹ] makes a following F3 sound higher and so more like [d]; the high ending F3 of [l] makes a following F3 sound lower and so more like [g].

The contrast account and the account that compensation reflects gesture tracking make the same predictions in most contexts and so have been difficult to distinguish experimentally. However, in cases in which the two accounts make different predictions, the compensation account has been supported (Johnson, 2011; Viswanathan, Magnuson, & Fowler, 2010). In addition, compensation occurs, as noted, when there is no directionality and so no context to induce contrast (the F0 findings described above). Moreover, audiovisual compensation occurs (e.g., Mitterer, 2006; Viswanathan & Stephens, 2016), a result not subject to a contrast account. Overall, experimental results support the conclusion that listeners track gestural overlap, with compensation for coarticulation as a result.

The Fourth Research Line: Perception-Production Links

In contrast to the motor theory of speech perception (e.g., Liberman & Whalen, 2000), the theory of direct perception does not propose that listeners’ own motor systems necessarily are recruited in speech perception in order to decode coarticulatory effects on acoustics. However, it does propose that listeners perceive the phonetic gestures of speakers. This implies a closer link between perception and production than in acoustic theories in which articulation is not accessed in perception (e.g., accounts summarized in Diehl et al., 2004).

The links to production in a direct-realist theory are, indeed, stronger than that. As noted earlier, in Gibson’s theory (e.g., 1979), perception canonically is of affordances, that is, of possibilities for action; moreover, Gibson (e.g., 1966) conceived of information pickup in perception as a kind of “resonance.” In that context, findings of embodiment in perception, that is, of unintended dispositions to act in ways consistent with what is perceived (e.g., for speech: Fadiga, Craighero, Buccino, & Rizzolatti, 2002; for walking: Takahashi, Kamibayashi, Nakajima, Akai, & Nakazawa, 2008), are perhaps unsurprising, if not outright predicted by the theory.

Two findings in the speech perception literature are especially suggestive of this kind of perception-production link. One concerns the latency with which a listener can produce a spoken response to a spoken prod. In a simple response task, participants produce a fixed response to a prod; latencies are quite short, because there is no element of choice in response production. In one relevant speech example (Fowler, Brown, Sabadini, & Weihing, 2003), the prods were the spoken syllables [pa], [ta], and [ka] for all participants. For different groups of participants, the spoken response was [pa], [ta], or [ka]. On hearing any of [pa], [ta], or [ka], participants produced their response syllable as quickly as they could. For each group, then, on one-third of trials, the spoken response matched the prod. Response latencies were shorter on those trials than on trials on which the prod and the spoken response mismatched. A direct-perception account is that, on matching trials, the prod, perceived as overlapped phonetic gestures, provides instructions for the required response. When that is the case, latencies are especially short.

A second kind of evidence for a tight coupling of production and perception is the finding of phonetic convergence. This is a tendency for listener-talkers to converge phonetically with the speech of others whom they hear (or see). Convergence does not occur under all conditions (e.g., Babel, 2010; Bourhis & Giles, 1977; Pardo, 2006), but it often occurs (e.g., Babel, 2010; Goldinger, 1998). Although a tendency to converge might be interpreted as a socially affiliative behavior, it occurs even under nonsocial conditions (Goldinger, 1998). Accordingly, it is a kind of default or dispositional behavior. It may be dispositional because perception of a speaker’s gestures, as in the simple reaction time findings just described, serves as a kind of prod. Remarkably, phonetic convergence occurs even to lip-read (silent) speech (Miller, Sanchez, & Rosenblum, 2010), and greater convergence occurs to audiovisual than to audio-only speech, especially if the mouth region of the speaker is visible (Dias & Rosenblum, 2016).

Perception-production links of another kind are found as well. Research shows that changes in habitual ways of producing speech lead to related changes in perception. For example, Nasir and Ostry (2009) perturbed jaw motion during production of [æ] vowels (as in had). Those participants who compensated (in jaw motion control) for the perturbations also shifted their identifications of vowels along a head to had continuum. Moreover, across participants, the magnitude of the perceptual shift was correlated with the magnitude of the compensatory change in production.

These findings underscore the close coupling of perception and action in a direct-realist account. Of course, they only scratch the surface of what is expected in a theory in which perceivers canonically perceive affordances for action.

1.3 Learning and Direct Perception: An Underdeveloped Domain

With experience, animals become more skillful perceivers. Gibson (e.g., 1966) referred to this as “education of attention.” That is, animals may become sensitive to information at a finer grain over experience, or may learn to focus in on or “attune” to the most relevant information for their purposes.

Catherine Best, a proponent of direct realism, and her colleagues (e.g., Best, 1994, 2015; Best, Tyler, Gooding, Orlando, & Quann, 2009) have developed this idea in studying how infants and young children attune to the phonology of their native language. One result of this education of attention can be a loss of sensitivity to phonetic differences that the native language does not use linguistically (e.g., Werker & Tees, 1984); another is an increase in sensitivity to information that is relevant in the native language (Kuhl, Stevens, Hayashi, Deguchi, Kiritani, & Iverson, 2006).

There are many more effects of learning on speech perception than this. There are short-term attunements to unfamiliar accents (e.g., Kraljic, Brennan, & Samuel, 2008; Xie, Theodore, & Myers, 2017). There are also effects of lexical knowledge (including effects of lexical frequency and “density”) on phonetic perception (e.g., Marslen-Wilson, 1990; Vitevitch & Luce, 1998). In the research on convergence described earlier, Dias and Rosenblum (2016) found more convergence to utterances of lower- than higher-frequency words (see also Goldinger, 1998) and more to utterances of words from lower- than higher-density neighborhoods.

Although all of these learning effects may reflect “education of attention,” more needs to be explained than this, and no further explanation has been forthcoming from the theory of direct realism. This probably reflects, in part, the theory’s rejection of internal “representations” of things (such as words known to a language learner). Yet something needs to be said about how language users carry their learning history around with them. An ambitious account of adaptation to accent has been provided from a different theoretical perspective by Kleinschmidt and Jaeger (2015). Dias and Rosenblum (2016) invoke “exemplar” memory and related accounts, but neither approach is particularly satisfactory to direct realists. The lack of an approach to learning and memory within the direct-realist perspective that can yield distinctive and testable predictions remains an important gap in the theory.

1.4 Concluding Remarks

The direct-realist theory of speech perception is unique in having been developed within the context of the much more encompassing ecological theory of perception of James Gibson (1966; 1979). It shares with the motor theory of Liberman (e.g., Liberman et al., 1967; Liberman & Mattingly, 1985; Liberman & Whalen, 2000) the claim that listeners to speech perceive phonetic gestures, but disagrees about why. For Liberman and colleagues, perception of gestures reflects involvement of the motor system in speech perception, which is required to disentangle effects of coarticulation on the acoustic speech signal. In recruitment of the motor system, speech perception is “special,” unlike general auditory perception. In contrast, in a theory of speech perception as direct, gestures are perceived because perceptual systems generally extract information from “proximal stimuli” at the sense organs (e.g., reflected light, patterned air) that is about their causal sources in environmental events. In this account, speech perception is just like other kinds of perception. In most theories other than the direct-perception and motor theories, listeners perceive proximal acoustic patterns and map them to internal phonological or phonetic representations of consonants and vowels.

The direct-realist theory of speech perception can explain many of the behavioral findings on speech perception. Some findings, as summarized here, converge to support the theory. However, it faces two main challenges. One is to solve the inverse and invariance problems, that is, to show that acoustic speech signals do specify their gestural sources. A second is to develop an account of the effects of short-term and longer-term learning on speech perception. Neither problem should be insurmountable; however, neither has yet been surmounted.

Further Reading

Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171–204). Timonium, MD: York Press.

Best, C. T. (2015). Devil or angel in the details? Perceiving phonetic variation as information about phonological structure. In J. Romero & M. Riera (Eds.), The phonetics-phonology interface: Representations and methodologies (pp. 3–31). Amsterdam: John Benjamins.

Chemero, A. (2009). Radical embodied cognitive science. Cambridge, MA: MIT Press.

Fowler, C. A. (1986). An event approach to a theory of speech perception from a direct-realist perspective. Journal of Phonetics, 14(1), 3–28.

Fowler, C. A., Shankweiler, D., & Studdert-Kennedy, M. (2016). Perception of the speech code revisited: Speech is alphabetic after all. Psychological Review, 123(2), 125–150.

Gibson, J. J. (1966). The senses considered as perceptual systems. Boston: Houghton-Mifflin.

Gibson, J. J. (1979). The ecological approach to visual perception. Boston: Houghton-Mifflin.

Iskarous, K. (2010). Vowel constrictions are recoverable from formants. Journal of Phonetics, 38(3), 375–387.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431–461.

Liberman, A. M., & Whalen, D. H. (2000). On the relation of speech to language. Trends in Cognitive Sciences, 4(5), 187–196.

Reed, E. S. (1996). Encountering the world: Toward an ecological psychology. Oxford: Oxford University Press.

Rosenblum, L. D. (2008). Speech perception as a multimodal phenomenon. Current Directions in Psychological Science, 17(6), 405–409.

Samuel, A. (2011). Speech perception. Annual Review of Psychology, 62, 49–72.

Viswanathan, N., Magnuson, J. S., & Fowler, C. A. (2010). Compensation for coarticulation: Disentangling auditory and gestural theories of perception of coarticulatory effects in speech. Journal of Experimental Psychology: Human Perception and Performance, 36(4), 1005–1015.


Altvater-Mackensen, N., Mani, N., & Grossmann, T. (2016). Audiovisual speech perception in infancy: The influence of vowel identity and infants’ productive abilities on sensitivity to (mis)matches between auditory and visual speech cues. Developmental Psychology, 52(2), 191–204.Find this resource:

Babel, M. (2010). Dialect divergence and convergence in New Zealand English. Language in Society, 39(4), 437–456.Find this resource:

Beddor, P. S., McGowan, K. B., Boland, J. E., Coetzee, A. W., & Brasher, A. (2013). The time course of perception of coarticulation. The Journal of the Acoustical Society of America, 133(4), 2350–2366.Find this resource:

Benson, R. R., Richardson, M., Whalen, D. H., & Lai, S. (2006). Phonetic processing areas revealed by sinewave speech and acoustically similar nonspeech. Neuroimage, 31(1), 342–353.Find this resource:

Best, C. T. (1994). The emergence of native-language phonological influences in infants: A perceptual assimilation model. In J. Goodman & H. C. Nusbaum (Eds.), The development of speech perception: The transition from speech sounds to spoken words (pp. 167–224). Cambridge, MA: MIT Press.Find this resource:

Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171–204). Timonium, MD: York Press.Find this resource:

Best, C. T. (2015). Devil or angel in the details? Perceiving phonetic variation as information about phonological structure. In J. Romero & M. Riera (Eds.), The phonetics-phonology interface: Representations and methodologies (pp. 3–31). Amsterdam: John Benjamins.Find this resource:

Best, C. T., Tyler, M. D., Gooding, T. N., Orlando, C. B., & Quann, C. A. (2009). Development of phonological constancy: Toddlers’ perception of native- and Jamaican-accented words. Psychological Science, 20(5), 539–542.Find this resource:

Bourhis, R., & Giles, H. (1977). The language of intergroup distinctiveness. In H. Giles (Ed.), Language, ethnicity and intergroup relation (pp. 119–135). London: Academic Press.Find this resource:

Browman, C. P., & Goldstein, L. M. (1986). Towards an articulatory phonology. Phonology Yearbook, 3(1), 219–252.Find this resource:

Browman, C. P., & Goldstein, L. (1992). Articulatory phonology: An overview. Phonetica, 49(3–4), 155–180.Find this resource:

Chandrasekaran, C., Trubanova, A., Stillittano, S., Caplier, A., & Ghazanfar, A. A. (2009). The natural statistics of audiovisual speech. PLoS Computational Biology, 5(7), 1–18.Find this resource:

Chemero, A. (2009) Radical embodied cognition. Cambridge, MA: MIT Press.Find this resource:

Chomsky, C. (1986). Analytic study of the Tadoma Method: Language abilities of three deaf-blind subjects. Journal of Speech, Language, and Hearing Research, 29(3), 332–347.Find this resource:

Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York: Harper & Row.Find this resource:

Dell, G. S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93(3), 283–321.Find this resource:

Dias, J. W., & Rosenblum, L. D. (2016). Visibility of speech articulation enhances auditory phonetic convergence. Attention, Perception, & Psychophysics, 78(1), 317–333.Find this resource:

Diehl, R. L., Lotto, A. J., & Holt, L. L. (2004). Speech perception. Annual Review of Psychology, 55, 149–179.Find this resource:

Fadiga, L., Craighero, L., Buccino, G., & Rizzolatti, G. (2002). Speech listening specifically modulates the excitability of tongue muscles: A TMS study. European Journal of Neuroscience, 15, 399–402.Find this resource:

Fitch, H. L., Halwes, T., Erickson, D. M., & Liberman, A. M. (1980). Perceptual equivalence of two acoustic cues for stop consonant manner. Perception & Psychophysics, 27(4), 343–350.Find this resource:

Fowler, C. A. (1981). Production and perception of coarticulation among stressed and unstressed vowels. Journal of Speech, Language, and Hearing Research, 24(1), 127–139.Find this resource:

Fowler, C. A. (1984). Segmentation of coarticulated speech in perception. Perception & Psychophysics, 36(4), 359–368.Find this resource:

Fowler, C. (1986). An event approach to a theory of speech perception from a direct-realist perspective. Journal of Phonetics, 14, 3–28.Find this resource:

Fowler, C. A. (1996). Listeners do hear sounds, not tongues. The Journal of the Acoustical Society of America, 99(3), 1730–1741.Find this resource:

Fowler, C. A. (2011). Speech perception. In P. C. Hogan (Ed.), The Cambridge encyclopedia of the language sciences (pp. 793–796). Cambridge, UK: Cambridge University Press.Find this resource:

Fowler, C. A., & Brown, J. M. (2000). Perceptual parsing of acoustic consequences of velum lowering from information for vowels. Perception & Psychophysics, 62(1), 21–32.Find this resource:

Fowler, C. A., Brown, J. M., Sabadini, L., & Weihing, J. (2003). Rapid access to speech gestures in perception: Evidence from choice and simple response time tasks. Journal of Memory and Language, 49(3), 396–413.Find this resource:

Fowler, C. A., & Dekle, D. J. (1991). Listening with eye and hand: Cross-modal contributions to speech perception. Journal of Experimental Psychology: Human Perception and Performance, 17(3), 816–828.Find this resource:

Fowler, C. A. Shankweiler, D., & Studdert-Kennedy, M. (2016). Perception of the speech code revisited: Speech is alphabetic after all. Psychological Review, 132(2), 125–150.Find this resource:

Fowler, C. A., & Smith, M. R. (1986). Speech perception as “vector analysis”: An approach to the problems of segmentation and invariance. In J. Perkell & D. Klatt (Eds.), Invariance and variability of speech processes (pp. 123–135). Hillsdale, NJ: Lawrence Erlbaum Associates.Find this resource:

Fowler, C. A., & Xie, X. (2016). Involvement of the motor system in speech perception. In P. van Lieshout, B. Maassen, & H. Terband (Eds.), Speech motor control in normal and disordered speech: Future developments in theory and methodology (pp. 1–24). Rockville, MD: ASHA Press.Find this resource:

Gafos, A. I. (2002). A grammar of gestural coordination. Natural Language & Linguistic Theory, 20(2), 269–337.Find this resource:

Gafos, A. I., & Benus, S. (2006). Dynamics of phonological cognition. Cognitive Science, 30(5), 905–943.Find this resource:

Gibson, J. J. (1966). The senses considered as perceptual systems. Boston: Houghton-Mifflin.Find this resource:

Gibson, J. J., (1979). The ecological approach to visual perception. Boston: Houghton-Mifflin.Find this resource:

Gick, B., & Derrick, D. (2009). Aero-tactile integration in speech perception. Nature, 462(7272), 502–504.Find this resource:

Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105(2), 251–279.Find this resource:

Golonka, S. (2015). Laws and conventions in language-related behaviors. Ecological Psychology, 27(3), 236–250.Find this resource:

Hogden, J., Rubin, P., McDermott, E., Katagiri, S., & Goldstein, L. (2007). Inverting mappings from smooth paths through Rⁿ to paths through Rᵐ: A technique applied to recovering articulation from acoustics. Speech Communication, 49(5), 361–383.

Iskarous, K. (2010). Vowel constrictions are recoverable from formants. Journal of Phonetics, 38(3), 375–387.

Johnson, K. (2011). Retroflexed vs. bunched [r] in compensation for coarticulation. UC Berkeley Phonology Lab Annual Report, 114–127.

Kelley, D. (1986). The evidence of the senses: A realist theory of perception. Baton Rouge: Louisiana State University Press.

Kingston, J., Kawahara, S., Chambless, D., Key, M., Mash, D., & Watsky, S. (2014). Context effects as auditory contrast. Attention, Perception, & Psychophysics, 76(5), 1437–1464.

Kleinschmidt, D. F., & Jaeger, T. F. (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122(2), 148–203.

Krakow, R., & Beddor, P. (1999). Perception of coarticulatory nasalization by speakers of English and Thai: Evidence for partial compensation. The Journal of the Acoustical Society of America, 106(5), 2868–2887.

Kraljic, T., Brennan, S. E., & Samuel, A. G. (2008). Accommodating variation: Dialects, idiolects, and speech processing. Cognition, 107(1), 54–81.

Kuhl, P., & Meltzoff, A. (1982). The bimodal perception of speech in infancy. Science, 218(4577), 1138–1141.

Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., & Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2), F13–F21.

Kühnert, B., & Nolan, F. (1999). The origin of coarticulation. In W. J. Hardcastle & N. Hewlett (Eds.), Coarticulation: Theory, data and techniques (pp. 7–30). Cambridge, UK: Cambridge University Press.

Liberman, A. M. (1957). Some results of research on speech perception. Journal of the Acoustical Society of America, 29(1), 117–123.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431–461.

Liberman, A. M., Delattre, P. C., & Cooper, F. S. (1952). The role of selected stimulus-variables in the perception of the unvoiced stop consonants. American Journal of Psychology, 65(4), 497–516.

Liberman, A. M., Delattre, P. C., Cooper, F. S., & Gerstman, L. J. (1954). The role of consonant-vowel transitions in the perception of the stop and nasal consonants. Psychological Monographs: General and Applied, 68(8), 1–13.

Liberman, A. M., & Mattingly, I. (1985). The motor theory of speech perception revised. Cognition, 21(1), 1–36.

Liberman, A. M., & Whalen, D. H. (2000). On the relation of speech to language. Trends in Cognitive Sciences, 4(5), 187–196.

Lotto, A. J., & Holt, L. L. (2006). Putting phonetic context effects into context: A commentary on Fowler (2006). Perception & Psychophysics, 68(2), 178–183.

Lotto, A. J., & Kluender, K. R. (1998). General contrast effects in speech perception: Effect of preceding liquid on stop consonant identification. Perception & Psychophysics, 60(4), 602–619.

Mann, V. A. (1980). Influence of preceding liquid on stop-consonant perception. Perception & Psychophysics, 28(5), 407–412.

Mann, V. A., & Repp, B. H. (1980). Influence of vocalic context on perception of the [ʃ]-[s] distinction. Perception & Psychophysics, 28(3), 213–228.

Marslen-Wilson, W. (1990). Activation, competition, and frequency in lexical access. In G. Altmann (Ed.), Cognitive models of speech processing: Psycholinguistic and computational perspectives (pp. 148–172). Cambridge, MA: MIT Press.

Martin, J. G., & Bunnell, H. T. (1981). Perception of anticipatory coarticulation effects. The Journal of the Acoustical Society of America, 69(2), 559–567.

Massaro, D. W. (1998). Perceiving talking faces: From speech perception to a behavioral principle. Cambridge, MA: MIT Press.

McGowan, R. S. (1994). Recovering articulatory movement from formant frequency trajectories using task dynamics and a genetic algorithm: Preliminary model tests. Speech Communication, 14(1), 19–48.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.

McQueen, J. M., Cutler, A., & Norris, D. (2006). Phonological abstraction in the mental lexicon. Cognitive Science, 30(6), 1113–1126.

Mermelstein, P. (1967). Determination of the vocal‐tract shape from measured formant frequencies. The Journal of the Acoustical Society of America, 41(5), 1283–1294.

Miller, R. M., Sanchez, K., & Rosenblum, L. D. (2010). Alignment to visual speech information. Attention, Perception, & Psychophysics, 72(6), 1614–1625.

Mitterer, H. (2006). On the causes of compensation for coarticulation: Evidence for phonological mediation. Perception & Psychophysics, 68(7), 1227–1240.

Nasir, S. M., & Ostry, D. J. (2009). Auditory plasticity and speech motor learning. Proceedings of the National Academy of Sciences, 106(48), 20470–20475.

Norris, D., McQueen, J. M., & Cutler, A. (2003). Perceptual learning in speech. Cognitive Psychology, 47(2), 204–238.

Pardo, J. S. (2006). On phonetic convergence during conversational interaction. The Journal of the Acoustical Society of America, 119(4), 2382–2393.

Port, R. (2007). How are words stored in memory? Beyond phones and phonemes. New Ideas in Psychology, 25(2), 143–170.

Pouplier, M., & Goldstein, L. (2010). Intention in articulation: Articulatory timing in alternating consonant sequences and its implications for models of speech production. Language and Cognitive Processes, 25(5), 616–649.

Prince, A., & Smolensky, P. (2004). Optimality theory: Constraint interaction in generative grammar. Malden, MA: Blackwell.

Rączaszek-Leonardi, J. (2012). Language as a system of replicable constraints. In H. H. Pattee & J. Rączaszek-Leonardi (Eds.), Laws, language and life (pp. 295–333). Dordrecht: Springer.

Read, C., & Szokolszky, A. (2016). A developmental ecological study of novel metaphoric language use. Language Sciences, 53(Part A), 86–98.

Reed, E. S. (1996). Encountering the world: Toward an ecological psychology. Oxford: Oxford University Press.

Rosenblum, L. D. (2008). Speech perception as a multimodal phenomenon. Current Directions in Psychological Science, 17(6), 405–409.

Rosenblum, L. D. (2010). See what I’m saying: The extraordinary power of our five senses. New York: Norton.

Saltzman, E., & Kelso, J. A. (1987). Skilled actions: A task-dynamic approach. Psychological Review, 94(1), 84–106.

Saltzman, E. L., & Munhall, K. G. (1989). A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1(4), 333–382.

Sams, M., Möttönen, R., & Sihvonen, T. (2005). Seeing and hearing others and oneself talk. Cognitive Brain Research, 23(2–3), 429–435.

Samuel, A. (2011). Speech perception. Annual Review of Psychology, 62, 49–72.

Schroeder, M. R. (1967). Determination of the geometry of the human vocal tract by acoustic measurements. Journal of the Acoustical Society of America, 41(4), 1002–1010.

Shaw, R., Turvey, M. T., & Mace, W. (1982). Ecological psychology: The consequence of a commitment to realism. In W. Weimer & D. Palermo (Eds.), Cognition and the symbolic processes (Vol. 2, pp. 159–226). Hillsdale, NJ: Erlbaum.

Silverman, K. E. A. (1987). The structure and processing of fundamental frequency contours (Unpublished doctoral dissertation). University of Cambridge.

Sondhi, M. (1979). Estimation of vocal-tract areas: The need for acoustical measurements. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(3), 268–273.

Stevens, K. N., & Blumstein, S. E. (1981). The search for invariant acoustic correlates of phonetic features. In P. Eimas & J. Miller (Eds.), Perspectives on the study of speech (pp. 1–38). Hillsdale, NJ: Erlbaum.

Stoffregen, T. A., & Bardy, B. G. (2001). Specification in the global array. Behavioral and Brain Sciences, 24(2), 246–254.

Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America, 26(2), 212–215.

Takahashi, M., Kamibayashi, K., Nakajima, T., Akai, J., & Nakazawa, K. (2008). Changes in corticospinal excitability during observation of walking. NeuroReport, 19, 727–731.

Turvey, M. T., & Carello, C. (2011). Obtaining information by dynamic (effortful) touching. Philosophical Transactions of the Royal Society B, 366, 3123–3132.

Viswanathan, N., Magnuson, J. S., & Fowler, C. A. (2010). Compensation for coarticulation: Disentangling auditory and gestural theories of perception of coarticulatory effects in speech. Journal of Experimental Psychology: Human Perception and Performance, 36(4), 1005–1015.

Viswanathan, N., & Stephens, J. D. (2016). Compensation for visually specified coarticulation in liquid–stop contexts. Attention, Perception, & Psychophysics, 78, 2341–2347.

Vitevitch, M. S., & Luce, P. A. (1998). When words compete: Levels of processing in spoken word recognition. Psychological Science, 9(4), 325–329.

Werker, J. F., & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7(1), 49–63.

Whalen, D. H. (1984). Subcategorical phonetic mismatches slow phonetic judgments. Perception & Psychophysics, 35(1), 49–64.

Xie, X., Theodore, R. M., & Myers, E. (2017). More than a boundary shift: Perceptual adaptation to foreign-accented speech reshapes the internal structure of phonetic categories. Journal of Experimental Psychology: Human Perception and Performance, 43(1), 206–217.

Yehia, H., & Itakura, F. (1996). A method to combine acoustic and morphological constraints in the speech production inverse problem. Speech Communication, 18(2), 151–174.

Yehia, H., Rubin, P., & Vatikiotis-Bateson, E. (1998). Quantitative association of vocal-tract and facial behavior. Speech Communication, 26(1), 23–43.


(1.) But contrast motor theorists, e.g., Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967, for whom the acoustic speech signal is special in being “encoded” by coarticulation. That specialness, for motor theorists, necessitates articulatory access in perception.

(2.) “Equifinality” refers to achievement of the same goal (say, lip closure in production of [b], [p], or [m]) in a variety of ways (in the example, with different relative contributions of jaw, upper lip, and lower lip motions), often depending on contextual constraints. What matters to talkers and listeners are the goal-directed actions (e.g., lip closure), not the specific articulatory movements that compose them.

(3.) Coarticulation is also sometimes defined as the adjustment of a phonetic segment’s production to its context (e.g., Kühnert & Nolan, 1999). Although there is certainly some such context-sensitivity, in the present account it is minor compared to the fundamental nature of coarticulation (captured by its name): the temporal overlap of phonetic gestures.

(4.) For example, talkers make consonant and vowel misordering or substitution errors when they talk (e.g., saying “lork yibrary” when intending to say “York library,” or “beef needle soup” instead of intended “beef noodle soup”; examples from Dell, 1986), suggesting that these language forms are discrete components of a plan to speak. For evidence that listeners extract such units, see, e.g., McQueen, Cutler, & Norris, 2006; Norris, McQueen, & Cutler, 2003.

(5.) [ɹ] is the symbol for American English “r” as in “rabbit.”

(6.) Keep in mind that infants learn to talk by hearing the speech of others. Presumably, they extract articulatory information from acoustic speech signals.

(7.) This behavior is not restricted to that of bloodhounds. See, for example (from Rosenblum, 2010). Retrieved from

(8.) This does not mean, of course, that the acoustic signal does not specify vocal tract actions. Recovery is only as good as the acoustic and articulatory measures taken.

(9.) See Fowler & Xie, 2016, for a review of additional evidence of embodiment in speech.

(10.) Density is a frequency-weighted measure of the number of words having phonetic properties overlapping with one another. Words from dense neighborhoods have many high-frequency, similar-sounding neighbors; words from sparse neighborhoods have few.
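As an illustration (not from the source), one common way to operationalize this measure is to count a word’s phonological neighbors (words differing by a single phoneme substitution, addition, or deletion) and weight each neighbor by its log corpus frequency. The mini-lexicon, transcriptions, and function names below are hypothetical, a minimal sketch of that operationalization:

```python
import math

# Hypothetical mini-lexicon: rough phonemic transcription -> corpus frequency.
LEXICON = {
    "kat": 500,   # "cat"
    "bat": 300,   # "bat"
    "hat": 200,   # "hat"
    "kab": 40,    # "cab"
    "kast": 20,   # "cast"
    "dog": 400,   # "dog"
}

def is_neighbor(a, b):
    """True if b differs from a by exactly one phoneme
    substitution, addition, or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Substitution: exactly one mismatched position.
        return sum(x != y for x, y in zip(a, b)) == 1
    # Addition/deletion: removing one symbol from the longer
    # string must yield the shorter one.
    short, long_ = sorted((a, b), key=len)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))

def weighted_density(word, lexicon):
    """Frequency-weighted neighborhood density: sum of log
    frequencies of the word's phonological neighbors."""
    return sum(math.log(freq)
               for other, freq in lexicon.items()
               if is_neighbor(word, other))

# "kat" has neighbors "bat", "hat", "kab", "kast" -> a dense neighborhood;
# "dog" has no neighbors in this lexicon -> sparse.
```

On this sketch, a word like “kat” gets a high density score because it has several high-frequency neighbors, while “dog” scores zero here; published density measures differ in details (e.g., raw vs. log frequency weighting), but share this overlap-counting logic.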