
PRINTED FROM the OXFORD RESEARCH ENCYCLOPEDIA, LINGUISTICS (linguistics.oxfordre.com). (c) Oxford University Press USA, 2016. All Rights Reserved. Personal use only; commercial use is strictly prohibited. Please see applicable Privacy Policy and Legal Notice (for details see Privacy Policy).


Speech Perception in Phonetics

Summary and Keywords

In their conversational interactions with speakers, listeners aim to understand what a speaker is saying, that is, they aim to arrive at the linguistic message, which is interwoven with social and other information, being conveyed by the input speech signal. Across the more than 60 years of speech perception research, a foundational issue has been to account for listeners’ ability to achieve stable linguistic percepts corresponding to the speaker’s intended message despite highly variable acoustic signals. Research has especially focused on acoustic variants attributable to the phonetic context in which a given phonological form occurs and on variants attributable to the particular speaker who produced the signal. These context- and speaker-dependent variants reveal the complex—albeit informationally rich—patterns that bombard listeners in their everyday interactions.

How do listeners deal with these variable acoustic patterns? Empirical studies that address this question provide clear evidence that perception is a malleable, dynamic, and active process. Findings show that listeners perceptually factor out, or compensate for, the variation due to context yet also use that same variation in deciding what a speaker has said. Similarly, listeners adjust, or normalize, for the variation introduced by speakers who differ in their anatomical and socio-indexical characteristics, yet listeners also use that socially structured variation to facilitate their linguistic judgments. Investigations of the time course of perception show that these perceptual accommodations occur rapidly, as the acoustic signal unfolds in real time. Thus, listeners closely attend to the phonetic details made available by different contexts and different speakers. The structured, lawful nature of this variation informs perception.

Speech perception changes over time not only in listeners’ moment-by-moment processing, but also across the life span of individuals as they acquire their native language(s), non-native languages, and new dialects and as they encounter other novel speech experiences. These listener-specific experiences contribute to individual differences in perceptual processing. However, even listeners from linguistically homogenous backgrounds differ in their attention to the various acoustic properties that simultaneously convey linguistically and socially meaningful information. The nature and source of listener-specific perceptual strategies serve as an important window on perceptual processing and on how that processing might contribute to sound change.

Theories of speech perception aim to explain how listeners interpret the input acoustic signal as linguistic forms. A theoretical account should specify the principles that underlie accurate, stable, flexible, and dynamic perception as achieved by different listeners in different contexts. Current theories differ in their conception of the nature of the information that listeners recover from the acoustic signal, with one fundamental distinction being whether the recovered information is gestural or auditory. Current approaches also differ in their conception of the nature of phonological representations in relation to speech perception, although there is increasing consensus that these representations are more detailed than the abstract, invariant representations of traditional formal phonology. Ongoing work in this area investigates how both abstract information and detailed acoustic information are stored and retrieved, and how best to integrate these types of information in a single theoretical model.

Keywords: acoustic variation, sociophonetic variation, coarticulation, perceptual compensation, normalization, gestural perception, auditory perception, exemplar models, individual differences, sound change

In their conversational interactions with speakers, listeners aim to understand what a speaker is saying, that is, they aim to arrive at the linguistic message, which is interwoven with social and other information, being conveyed by the input speech signal. Correspondingly, a broad goal of theories of speech perception is to provide an account of how listeners extract linguistic forms—in particular, phonological and lexical forms—from the acoustic structure of the input.

Across the more than 60 years of speech perception research, a foundational issue has been to account for listeners’ ability to achieve stable linguistic percepts corresponding to the speaker’s intended message despite highly variable acoustic signals. There are many sources of variation, but at the forefront of investigations have been the acoustic variants that are attributable to the phonetic context in which a given phonological form occurs or to the particular speaker who produced the signal. The theme that unifies this overview is its exploration of empirical perceptual investigations of, and theoretical approaches to, these two sources of variation. Speech perception is shown to be a malleable and dynamic process. It is argued that this active processing enables listeners not only to recognize and adjust for the context- and speaker-dependent realizations of the same phonological form but, as the acoustic signal unfolds, to use these variants to inform and facilitate perceptual decisions.1

1 Acoustic Variation

The acoustic speech signal is the result of passing the wave generated by a sound source—in voiced sounds, the vibrating larynx—through the filter of the supralaryngeal vocal tract. During speech production, the resonant characteristics of the vocal tract filter are modified over time due to the dynamic movements of the speech articulators (e.g., tongue tip raising, lip rounding). As is readily seen via vocal tract imaging methods, such as magnetic resonance imaging or ultrasound, the motions of different articulators overlap with each other, or coarticulate, and these overlapping articulations cut across traditional phonological units such as consonants and vowels. Even without imaging data, some overlapping articulations can be readily evident. For example, in English, the lip gesture for the rounded vowel in suit co-occurs not only with the back tongue body gesture for the vowel but also with the tongue tip constriction for the fricative. (Compare the rounded lips for /s/ in suit to the spread lips in seat.) Acoustically, the fricative spectrum is shaped both by the alveolar constriction and lip rounding, with lip rounding lengthening the resonating vocal tract and thereby lowering the peak noise frequencies of /s/. Perceptually, the relatively low-frequency fricative noise provides listeners with information about an upcoming rounded vowel. However, this lower frequency energy also renders rounded /s/ somewhat /ʃ/-like, potentially making the /s/-/ʃ/ contrast perceptually less distinct in rounded (suit, shoot) than in unrounded (seat, sheet) vowel contexts.
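The effect of vocal tract length on resonant frequencies can be sketched with the textbook uniform-tube (quarter-wavelength) approximation of a neutral vocal tract. The tube lengths and speed-of-sound value below are illustrative assumptions, not measurements:

```python
def tube_resonances(length_cm, n_formants=3, c=35000.0):
    """Resonant frequencies (Hz) of a uniform tube closed at one end
    (the classic schwa-like approximation of the vocal tract):
    F_n = (2n - 1) * c / (4L), with c the speed of sound in cm/s."""
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, n_formants + 1)]

# An assumed 17.5 cm vocal tract yields resonances at 500, 1500, 2500 Hz.
print(tube_resonances(17.5))

# Lip rounding/protrusion lengthens the tube, lowering every resonance,
# which is why rounded /s/ has lower-frequency noise than unrounded /s/.
print(tube_resonances(18.5))
```

The same relation explains both effects described above: any articulation that lengthens the resonating tube, whether context-dependent rounding or a speaker's longer vocal tract, shifts the resonances downward.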

The context-dependent nature of the acoustic characteristics of speech sounds illustrated by the coarticulatory lip rounding example is pervasive. That phonetic context influences the spectral and temporal properties of consonants and vowels has been recognized from the earliest work in speech acoustics and perception (e.g., Cooper, Delattre, Liberman, Borst, & Gerstman, 1952; House & Fairbanks, 1953; Stevens & House, 1963). Unsurprisingly, then, among the first questions posed by speech perception researchers was to ask how listeners arrive at stable percepts despite the variation introduced by different contexts (Cooper et al., 1952). Many of these questions persist: “One of the longstanding, and as yet unresolved, puzzles in research on speech perception and spoken word recognition is how listeners rapidly decode highly variable acoustic signals articulated by a speaker’s vocal tract into seemingly discrete and invariant phonemes and words” (McMurray, Tanenhaus, & Aslin, 2002, p. B33).

Because the resonant characteristics of the vocal tract depend on that acoustic filter’s size and shape, those characteristics differ across speakers as well as across contexts. For example, not only does context-dependent lip rounding during a fricative lengthen the vocal tube and thereby lower the fricative’s frequency spectrum, but so does a speaker with a relatively long vocal tract more generally produce speech sounds with relatively low resonant frequencies. In ways that go beyond anatomical factors, an individual’s age, gender, characteristic coarticulatory strategies, and other traits also contribute to systematic, speaker-specific acoustic patterns. Classic early research on acoustic variation across speakers showed that the resonant, or formant, frequencies of vowels of different phonological categories (e.g., English /a/ and /ɔ/ or /u/ and /ʊ/) overlap with each other when compared across speakers (Peterson & Barney, 1952). More recent study of vowels shows massive overlap of different speakers’ vowel categories in the acoustic space defined by the frequencies of the first (F1) and second (F2) formants, yet highly accurate identification of these same vowels by listeners (Hillenbrand, Getty, Clark, & Wheeler, 1995). Although overlapping formant values are expected given what is known about acoustic-articulatory relations (e.g., Fant, 1960; Stevens, 1998), findings such as these have nonetheless given rise to long-standing questions about how listeners accommodate across-speaker variation (e.g., Joos, 1948; Ladefoged & Broadbent, 1957). More generally, the acoustic consequences of coarticulated speech produced by different speakers in different prosodic contexts, at different speaking rates, and in words of different lexical frequencies reveal the variable yet informationally rich patterns that bombard listeners in their everyday conversational interactions.
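One common way to factor out gross vocal-tract-size differences of the kind just described is speaker-intrinsic z-scoring of formant values (Lobanov-style extrinsic normalization). The formant values below are invented for illustration; two hypothetical speakers produce the same vowel pattern at different absolute frequencies:

```python
from statistics import mean, stdev

def lobanov_normalize(formants_hz):
    """Z-score a speaker's formant values against that speaker's own
    mean and standard deviation, removing overall scale differences."""
    mu, sd = mean(formants_hz), stdev(formants_hz)
    return [(f - mu) / sd for f in formants_hz]

# Hypothetical F1 values (Hz) for the same five vowels from two speakers;
# the second speaker's values are a uniformly scaled-up version of the first's.
long_tract  = [300, 400, 500, 650, 800]
short_tract = [390, 520, 650, 845, 1040]

print(lobanov_normalize(long_tract))
print(lobanov_normalize(short_tract))  # identical after normalization
```

After normalization the two speakers' vowel categories coincide, even though their raw F1 values overlap different categories: an idealization of what extrinsic normalization accounts claim listeners do.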

2 Malleable Perception

As characterized in the previous section, and as approached by some researchers in the early (and in some cases, not so early) years of speech perception experimentation, the variable acoustic signal might be regarded as not ideal for the listener. How do listeners’ perceptual mechanisms resolve what appears to be the many-to-one and one-to-many ‘problem’ of the acoustics-to-percept relation? The preponderance of evidence from decades of research shows that listeners are highly sensitive to the phonetic details made available by different contexts and different speakers, and that the structured, lawful nature of this variation informs perception.

2.1 Perception of Contextual Variation

Investigations relating to two classic perceptual phenomena, categorical perception and compensation for coarticulation, serve as illustrations of listeners’ sensitivity to the phonetic variation introduced by context and listeners’ ability to resolve that variation.

Categorical perception experiments comprise two types of tasks, typically involving stimuli that vary in equal physical steps along an acoustic dimension. An especially well-studied dimension is stop voice onset time (VOT), which is the time from the release of stop closure to onset of voicing for a following vowel. In categorical perception studies, listeners’ tasks are to identify the stimuli (e.g., as /b/ or /p/) and, for pairs of stimuli that are close to each other along the acoustic continuum, judge whether those stimuli are the same or different. These experiments tend to yield two striking outcomes, especially for stop consonants: (i) despite the acoustic continuum, identification shifts abruptly rather than gradually from one category (e.g., /b/) to the next (/p/), and (ii) discrimination is not uniform across the continuum but is accurate for stimulus pairs that straddle the identification category boundary and poor for within-category comparisons (Liberman, Harris, Hoffman, & Griffith, 1957; see Repp, 1984, for a review).
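The abrupt identification shift in (i) is conventionally summarized with a logistic function over the acoustic continuum. The boundary and slope values below are invented for illustration, not fitted to any data set:

```python
import math

def p_voiceless(vot_ms, boundary=25.0, slope=0.8):
    """Probability of a voiceless (/p/) response as a logistic
    function of VOT; boundary and slope are illustrative values."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary)))

# Equal 10 ms steps along the continuum yield a sharply S-shaped
# identification function: near 0 at short VOTs, near 1 at long VOTs,
# crossing 0.5 at the category boundary.
for vot in range(0, 61, 10):
    print(f"VOT {vot:2d} ms -> P(/p/) = {p_voiceless(vot):.3f}")
```

On this view, the flexibility of the boundary (e.g., its shift with speaking rate, discussed below) amounts to the boundary parameter moving with context while the overall categorical shape is preserved.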

Abrupt category shifts and discrimination that is closely tied to categorization reveal that listeners group different acoustic events into perceptual equivalence classes. This pair of findings influenced the early development of theories of speech perception (see Section 5.1) and continues to inform the study of speech and non-speech perceptual processing. Despite their discrete, ‘categorical’ judgments, though, listeners are also aware of fine-grained phonetic variants. One manifestation of this awareness is that category boundaries are highly flexible. For example, listeners require longer voicing delays (i.e., longer positive VOT) to report hearing /p/ rather than /b/ when overall syllable duration is relatively long (Miller & Volaitis, 1989), presumably because, in production, voiceless stop VOT is longer at slower speaking rates (Miller, Green, & Reeves, 1986). In general, these and other findings (reviewed by Repp & Liberman, 1987) show that equivalence classes are context specific in ways that closely mirror the production patterns to which listeners are exposed.

Listeners’ sensitivity to differences within equivalence classes emerges in perceptual tasks that measure processing time, stimulus goodness, and more. For example, in discrimination tasks, listeners’ reaction times when responding ‘same’ to within-category pairs of stimuli are longer for acoustically distinct than for acoustically identical pairs (Pisoni & Tash, 1974). In identification tasks, response times are slowed for stimuli that approach a category boundary (Blumstein, Myers, & Rissman, 2005). Response times are slowed as well for stimuli manipulated to include a subcategorical mismatch between co-varying properties that underlie a target contrast (e.g., a mismatch between VOT and fundamental frequency information for stop voicing; Whalen et al., 1993). Given subcategorical effects on response times, it is unsurprising that, when asked to judge the relative goodness of members of the same category, listeners regularly rate some stimuli as better category exemplars than others (Miller, 1997). Patterns of neural activation (Blumstein et al., 2005) and eye movements in a visual world paradigm (McMurray et al., 2002) also demonstrate that listeners are sensitive to gradient information in the input signal. Clearly, achieving perceptual stability across acoustic variants does not incur the cost of discarding that variation; rather, the phonetic details remain available to the listener.

Studies relating to the perceptual phenomenon of compensation for coarticulation similarly demonstrate that listeners factor out subcategorical—in this case, coarticulatory—variation in making perceptual decisions yet also use those same variants as linguistic information. Numerous studies show that listeners accommodate the acoustic effects of coarticulation, perceptually reducing, and in some cases eliminating, those influences. For example, as introduced in Section 1, anticipatory lip rounding for an upcoming rounded vowel lowers the noise frequency of a preceding /s/, rendering /s/ acoustically more /ʃ/-like in rounded contexts. Listeners resolve the potential ambiguity by attributing the relatively low-frequency noise of the fricative to its coarticulatory source. This resolution is illustrated by responses to identification tasks in which listeners hear fricatives, drawn from a (synthetic) /s/ to /ʃ/ continuum, embedded in different vowel contexts. When the fricative’s noise frequencies are not fully informative, and fall between those characteristic of /s/ and /ʃ/, listeners tend to identify the fricative as /s/ if that fricative is followed by a rounded vowel (as in suit). However, that same fricative is typically heard as /ʃ/ when followed by an unrounded vowel (as in sheet), that is, when the relatively low-frequency noise cannot be due to anticipatory lip rounding (Mann & Repp, 1980; Smits, 2001; Whalen, 1981). These and many other findings suggest that listeners compensate for coarticulation, ‘undoing’ (or nearly so) the variation introduced by coarticulation.

Other findings demonstrate that the acoustic variation that is perceptually undone by compensation, and attributed to its coarticulatory source, nonetheless remains available to the listener. Anticipatory lip rounding again serves as an illustration: vowel context influences listeners’ judgments of fricative identity and fricative context influences judgments of vowel identity. Dutch listeners, for example, are more likely to report hearing an unrounded vowel after higher-frequency (more /s/-like) fricative noise than after lower-frequency (more /ʃ/-like) noise (Mitterer, 2006). Along similar lines, English-speaking listeners compensate for coarticulatory vowel nasalization, perceiving nasalized vowels as being relatively oral when followed by a nasal consonant (Beddor & Krakow, 1999; Kawasaki, 1986), and they also use vowel nasality in real-time processing to anticipate an upcoming nasal consonant (Beddor, McGowan, Boland, Coetzee, & Brasher, 2013).

Although the theoretical interpretation of such complementary findings, whereby listeners perceptually factor out the variation due to coarticulation and yet use that same variation to inform perception, has been controversial (e.g., Fowler, 2006; Lotto & Holt, 2006; Mitterer, 2006; see Section 5.1), the overall findings point toward structured contextual variation being perceptually advantageous.

2.2 Perception of Speaker-Specific Variation

Listeners also arrive at stable percepts for the acoustic variation that is introduced by speakers who differ in their anatomical and socio-indexical characteristics. In a seminal investigation of perception of speaker-specific variation, Ladefoged and Broadbent (1957) showed that listeners identify vowels in relation to a (synthetic) speaker’s overall vowel space. In that study, a target word ambiguous between English bit and bet, for example, was labeled as bet when the vowels of a precursor sentence had relatively low F1 frequencies but as bit when the precursor vowel F1 frequencies were high. Because the acoustic differences between the precursor sentences roughly mimicked the acoustic consequences of across-speaker vocal tract differences, the results were interpreted as evidence that listeners adjust or normalize for speaker-specific variation on the basis of information external to the target acoustic signal. This finding of ‘extrinsic normalization,’ replicated for natural speech by Ladefoged (1989), offered an early, preliminary answer to the question of how listeners are able to accurately identify phonologically different but acoustically overlapping speech sounds—particularly vowels—produced by different speakers (see Section 1).

Subsequent study of perception of the acoustic variation introduced by different speakers has contributed to delineating the relative perceptual contributions of acoustic information (i) extrinsic to the vowel (Nearey, 1989; Sjerps & Smiljanić, 2013), (ii) intrinsic to the vowel nucleus (Syrdal & Gopal, 1986), and (iii) intrinsic to the vowel but spread across the dynamic acoustic pattern (Morrison & Nearey, 2007; Strange, Jenkins, & Johnson, 1983). All three approaches (reviewed by Johnson, 2005) continue to be fruitful lines of inquiry. Subsequent study has also established that across-speaker normalization holds for consonants as well as vowels. Fricatives produced by—or thought to be produced by—men and women are a well-studied example. The spectra of fricatives produced by women tend to have a higher center of gravity than those produced by men (Jongman, Wayland, & Wong, 2000), likely due to both anatomical and sociophonetic factors (Fuchs & Toda, 2010). Listeners adjust for these speaker-dependent differences. In findings that parallel the vowel-dependent compensatory effects for fricatives (Section 2.1), an ambiguous (synthetic) fricative drawn from an /s/-/ʃ/ continuum is more likely to be perceived as /s/ if spliced into an utterance produced by a man than an utterance produced by a woman (Johnson, 1991; Mann & Repp, 1980; Munson, Jefferson, & McDonald, 2006). That is, listeners assess the relatively low noise frequencies as due to the (assumed) characteristics of that particular speaker.
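The spectral ‘center of gravity’ measure invoked above is simply the amplitude-weighted mean frequency of a spectrum. A minimal sketch, using invented toy spectra rather than measured fricative data:

```python
def spectral_centroid(freqs_hz, amplitudes):
    """Amplitude-weighted mean frequency ('spectral center of gravity')."""
    return sum(f * a for f, a in zip(freqs_hz, amplitudes)) / sum(amplitudes)

freqs   = [2000, 4000, 6000, 8000, 10000]
s_amps  = [0.1, 0.3, 0.8, 1.0, 0.6]   # toy /s/-like spectrum: energy higher
sh_amps = [0.6, 1.0, 0.8, 0.3, 0.1]   # toy /ʃ/-like spectrum: energy lower

print(spectral_centroid(freqs, s_amps))   # higher centroid
print(spectral_centroid(freqs, sh_amps))  # lower centroid
```

A listener normalizing for speaker gender is, in effect, shifting the centroid value that counts as the /s/-/ʃ/ dividing line depending on the (assumed) speaker.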

A more recent line of inquiry is the study of listeners’ informed use of socially structured variation in making linguistic decisions. Matched guise studies, in which the same target acoustic signal is variously attributed to speakers from different social groups (e.g., through visual representations or more overt attribution), illustrate how social factors alter phonetic judgments. Listeners report hearing more /ʃ/ for fricatives along an /s/-/ʃ/ continuum, for example, not only when the remainder of the utterance is produced by a woman, but also when the auditory context is held constant and the manipulated gender information is visual (Munson, 2011; Strand, 1999). Listeners’ beliefs about a speaker’s age and socioeconomic class (Hay et al., 2006), dialect (Niedzielski, 1999), and race (Staum Casasanto, 2010) also influence phonetic decisions. Such findings demonstrate that the perceptual consequences of socially conditioned variation go beyond adjustments for the actual acoustic properties of a given speaker’s productions and extend to listeners’ expectations concerning these properties. (See Drager, 2010, for an assessment of some of the major findings.)

Listeners presumably bring these social expectations to their conversational interactions in ways that enhance perception. Recent findings, for example, indicate that social expectations that match the acoustic signal (e.g., visual information that the speaker is Asian paired with a Chinese-accented English-speaking voice) improve the accuracy of phonetic decisions of listeners experienced with the targeted social variation, whereas a mismatch reduces accuracy for inexperienced listeners (McGowan, 2015). Further complicating the emerging picture, though, is that listeners’ social expectations and experiences interact with their social weightings. For example, for listeners who speak a rhotic dialect of American English, a non-rhotic realization of slender [slɛndə] primes (i.e., speeds reaction time to) semantically related thin when produced by a speaker of the prestigious Southern Standard British English dialect but not when produced by a non-rhotic speaker from New York City (Sumner & Kataoka, 2013). Accounting for this growing body of research requires a theoretical framework that simultaneously encodes linguistic and social information. Docherty and Foulkes (2014) provide an overview of the essential characteristics of an integrated approach; Sumner, Kim, King, and McGowan (2014), for example, offer a preliminary model. (See also Section 5.2.)

Although the discipline remains in the relatively early stages of understanding how social categories and phonetic categories interact in perception, important parallels between the processing of contextual and socially conditioned variation are evident. Variation structures the acoustic signal in ways that inform the source of that signal: contextual variation informs listeners about vocal tract dynamics and social variation informs other speaker characteristics. Listeners closely attend to this structured variation, attributing the variation to its source and using it to determine what the speaker is saying.

3 Dynamic Perception

Up to this point, speech perception has been shown to be malleable in that listeners adjust for multiple sources of variation. Speech perception is also malleable in that it changes over time. Perception evolves in listeners’ moment-by-moment processing as the acoustic signal unfolds (e.g., Dahan, Magnuson, Tanenhaus, & Hogan, 2001). Moreover, perception evolves over the life span as individuals acquire their native language(s) (e.g., Werker & Gervain, 2013), non-native languages (Best & Tyler, 2007; Flege, 1995; Strange, 2011), and new accents and dialects (Bradlow & Bent, 2008; Clopper, 2014) and as they are exposed to other perceptual learning experiences (Samuel & Kraljic, 2009).

Consider how listeners’ attention to, and use of, contextual and speaker-specific information evolves. Even young infants perceive categorically and are also sensitive to within-category variants. For example, infants four months old and younger show evidence of categorical-like perception in that they discriminate certain differences between stimuli along a stop VOT continuum but respond to other same-sized VOT differences—differences that fall within the same voicing category for adults—as though they were equivalent (Eimas, Siqueland, Jusczyk, & Vigorito, 1971). (See Werker et al., 1998, for an overview of methods used to test infant speech perception.) However, like adults, infants can discriminate within-category VOT differences under more sensitive testing conditions (McMurray & Aslin, 2005). Such within-category sensitivity is important to some current theoretical accounts of perceptual development, which postulate that infants’ emerging phonological categories are modified over time as these young language learners attend to the phonetically gradient distributional properties of the ambient language (e.g., Maye et al., 2002; Pierrehumbert, 2003). Although these language-specific categories emerge during the first year of life (Werker & Tees, 1984), aspects of phonological categorization continue to develop throughout childhood (e.g., Hazan & Barrett, 2000; Nittrouer & Miller, 1997).
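Distributional learning of the kind these accounts postulate can be caricatured as clustering over a bimodal exposure set: if the ambient input clusters into two VOT modes, two categories emerge. The VOT means, standard deviations, and token counts below are invented for illustration, and the 1-D two-cluster routine is a deliberately minimal stand-in for such learning models:

```python
import random

random.seed(0)
# Toy bimodal exposure: short-lag (/b/-like) and long-lag (/p/-like) VOT tokens.
tokens = ([random.gauss(10, 5) for _ in range(200)] +
          [random.gauss(45, 5) for _ in range(200)])

def two_means(data, iters=20):
    """Minimal 1-D k-means (k=2): split at the midpoint of the two
    current cluster means, then update each mean from its own side."""
    lo, hi = min(data), max(data)
    for _ in range(iters):
        mid = (lo + hi) / 2
        left = [x for x in data if x <= mid]
        right = [x for x in data if x > mid]
        lo, hi = sum(left) / len(left), sum(right) / len(right)
    return lo, hi

print(two_means(tokens))  # recovers cluster means near 10 and 45 ms
```

With a unimodal exposure set, the same routine would return two nearly identical means, i.e., no evidence for a category split: the qualitative contrast at the heart of distributional learning experiments.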

Among the subphonemic properties to which infants are sensitive are the context-specific coarticulated variants due to overlapping articulatory movements. Eight-month-olds, for example, differentiate sequences of syllables whose vowels provide appropriate place information for following consonants from sequences whose vowels, through splicing techniques, contain inappropriate coarticulatory information (e.g., appropriate [Vpp] compared to inappropriate [Vgp], where the subscript indicates the stop context in which the vowel was originally produced). Evidence of this sensitivity emerges in infants’ responses to familiar vs. novel syllable sequences. Specifically, after being familiarized with coarticulatorily appropriate strings of syllables, eight-month-olds look longer (in a preferential listening procedure) when hearing test stimuli that repeat a familiarized syllable sequence than when hearing a novel sequence—but only if the repeated sequence contains appropriate coarticulatory information (Curtin, Mintz, & Byrd, 2001). Eight-month-olds also respond in ways that indicate that they use appropriate coarticulatory cues to segment the speech stream (Johnson & Jusczyk, 2001). Discrimination responses of even younger infants are consistent with their perceptually compensating for coarticulation (Fowler, Best, & McRoberts, 1990). Taken together, these and other findings indicate that infants use and adjust for structured contextual variation, although the full range of flexibility characteristic of adult perception takes time to emerge.

Comparable developmental findings are reported for speaker-specific variation. Early findings demonstrated that six-month-olds are able to categorize vowels produced by different talkers (Kuhl, 1979). Subsequent studies replicated this basic finding of speaker normalization across a range of stimuli and infant ages, although these studies showed as well that variation due to different speakers introduces some perceptual disruption for young infants—disruption that lessens across the first year of life (e.g., Houston & Jusczyk, 2000; Werker & Curtin, 2005). (However, the disruption does not completely disappear. Under some conditions, across-speaker variation can come at a cost for adult perceivers as well; Mullennix, Pisoni, & Martin, 1989.) For older infants being tested in challenging experimental tasks, such as the discrimination of minimal pair words, the addition of across-speaker variation can facilitate learning the contrast (Rost & McMurray, 2010), presumably by reinforcing—through subphonemic variation—the systematic phonetic properties crucial to the contrast being acquired.

The dynamic processing of context- and speaker-specific variation as the acoustic signal unfolds in real time is a more recent area of investigation. This work rests on the well-founded assumption that listeners are actively processing the rich information afforded by variation; of interest is the time course of how this information is processed as listeners make linguistic decisions. Results from studies using a visual world paradigm, in which eye movements to a visual display are monitored as participants hear coarticulated speech, indicate that fixations to a target (as opposed to competitor) image increase over time as the coarticulatory information for that target word becomes available. For example, when hearing CVC words, participants fixate a target image (e.g., fixate the image for tap rather than tack) more quickly when the vowel gives appropriate anticipatory place information for the final consonant ([tæpp]) than when the vowel contains inappropriate place information ([tækp]; Dahan et al., 2001). In addition, the detailed time course of target image fixations when listeners hear early compared to late onset of (appropriate) coarticulatory cues indicates that listeners attend to that information very soon after it becomes available (Beddor et al., 2013). That is, listeners closely track the evolving acoustic effects of overlapping articulations in real-time processing.

Similarly, study of the time course of listeners’ adaptation to socio-indexical properties points towards rapid use of new information about a speaker to anticipate a target word and rule out competitors. For example, speakers of some American English dialects produce a raised version of /æ/ before voiced, but not voiceless, velar codas. For listeners familiar with this dialect, hearing [ɛ] or [eɪ] (rather than [æ]) has the potential to signal that the upcoming stop is /g/ (e.g., bag) rather than /k/ (back). Listeners newly exposed to this dialect have been shown to adapt to this vowel variant over the course of an eye-tracking task: participants fixate words ending in /k/ (with unraised [æ]) more quickly, and over time more often, after hearing pre-/g/ raised vowels than before this exposure (Dahan, Drucker, & Scarborough, 2008). Moreover, this rapid adaptation is talker-specific, with even stronger perceptual facilitation emerging when participants are given visual information about talker identity (Trude & Brown-Schmidt, 2012).

That listeners dynamically adjust for the acoustic variation introduced by different contexts and speakers is, of course, expected. However, the time course, both in perceptual development and over the short span of an experimental setting, demonstrates that listeners are not simply accommodating the variation but are active participants who recruit linguistic and other cognitive resources in achieving malleable perception. Learners use the signal variation to extract the regularities that help them formulate the relevant categories and language users more generally attend to the time-varying properties that enable them to be efficient perceivers.

4 Individual Differences in Perceptual Flexibility

Different listeners use the multiple sources of information in the speech signal in different ways. Individual differences emerge in listeners’ attention to specific acoustic properties, the magnitude of their accommodation to contextual and speaker-specific variation, and their achievement of native-like perception in a second language. These and other individual differences have been recognized for decades (e.g., Liberman et al., 1957; Mann & Repp, 1980; Miyawaki et al., 1975). In more recent years, given increasing evidence of the systematic nature of listener differences combined with growing interest in their theoretical, pedagogical, and clinical implications, study of listener-specific perceptual strategies has become an area of inquiry in its own right. That is, just as acoustic variation is now regarded by many researchers as information rather than signal noise, so listener variation is now studied as an important window on perceptual processing.

Of the factors that underlie individual differences in perception, the linguistic experiences that listeners bring to the task of perception are among the more obvious and widely investigated. The influence of experience emerges perhaps most clearly when listeners are learning a new language. There is a sizable literature (reviewed by Best & Tyler, 2007) showing that the extent to which language learners achieve native-like perception in a second language depends on both their first-language (L1) and second-language (L2) experiences. Less sizable, but growing, is the literature demonstrating influences of native-language experience on perception of varieties of the L1. Experimentally, these native-language experiential influences tend to emerge when speech is presented under speeded or noisy listening conditions, which reveal greater processing difficulty for less familiar speech patterns such as an unfamiliar regional variety (e.g., Adank, Evans, Stuart-Smith, & Scott, 2009; Floccia, Goslin, Girard, & Konopczynski, 2006; see Clopper, 2014, for a review).

However, even for listeners from linguistically homogeneous backgrounds, there are systematic individual differences in perception. The relative importance that listeners assign to the multiple acoustic properties that differentiate phonological categories offers clear illustrations. For example, voiceless stops differ from voiced stops in having longer voicing lag (voice onset time, VOT; Section 2.1) and higher fundamental frequency (f0) at the onset of the vowel following the stop (Löfqvist, Baer, & McGarr, 1989). The relative contribution of VOT and f0 to phonation distinctions differs across languages and dialects, but even within a single variety individual listeners differ in how heavily they weight f0 when judging whether a stop is, say, /b/ or /p/ (Haggard, Ambler, & Callow, 1970; Shultz, Francis, & Llanos, 2012). Similarly, listeners differ in the relative importance given to coarticulatory information, such as the usefulness of vowel nasalization as information for an upcoming nasal consonant (Beddor et al., 2013), and in the extent to which they perceptually compensate for coarticulatory influences (Kataoka, 2011; Yu, 2010). Listeners also differ in their adjustments for speaker-specific properties (Yu, 2010) and in their sensitivity to these properties in learning to identify different voices (Nygaard & Pisoni, 1998). These listener-specific perceptual strategies are robust; they are consistent over time (Idemaru, Holt, & Seltman, 2012), across tasks (Yu & Lee, 2014), and across different sources of variation (Baese-Berk, Bent, Borrie, & McKee, 2015).
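The idea that listeners combine VOT and f0 with listener-specific weights can be illustrated with a minimal weighted cue-integration sketch. This is not a model from the studies cited above; the reference points, weights, and cue values below are hypothetical illustration values.

```python
import math

def p_voiceless(vot_ms, f0_onset_hz, w_vot, w_f0, bias=0.0):
    """Probability that a listener judges a stop as voiceless (e.g., /p/),
    combining VOT and onset-f0 cues with listener-specific weights.
    Reference points and scaling are hypothetical, for illustration only."""
    z_vot = (vot_ms - 25.0) / 15.0        # ~25 ms treated as an ambiguous VOT
    z_f0 = (f0_onset_hz - 200.0) / 30.0   # ~200 Hz treated as an ambiguous onset f0
    score = w_vot * z_vot + w_f0 * z_f0 + bias
    return 1.0 / (1.0 + math.exp(-score))  # logistic cue combination

# An ambiguous-VOT token with a relatively high onset f0:
token = dict(vot_ms=25.0, f0_onset_hz=230.0)

# Two hypothetical listeners with the same VOT reliance but different f0 weights:
low_f0_weight = p_voiceless(**token, w_vot=2.0, w_f0=0.2)
high_f0_weight = p_voiceless(**token, w_vot=2.0, w_f0=1.5)
# The listener who weights f0 more heavily is more likely to report /p/
# for the very same acoustic token.
```

The point of the sketch is only that identical input can yield different category judgments when listeners assign different weights to co-varying cues, which is the pattern the individual-differences findings above describe.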

For populations of listeners whose linguistic experiences do not differ in obvious ways, one likely source of individual variation is that different weights for co-varying acoustic properties are often consistent with the same phonological percept; that is, there is more than one way to arrive at the ‘correct’ category (Beddor, 2009; Schertz, Cho, Lotto, & Warner, 2015). Other factors contributing to differential attention to, and flexibility for, fine-grained phonetic variation likely include differences in perceptual acuity (Perkell et al., 2004), cognitive processing style and “autistic traits” (Stewart & Ota, 2008), working memory capacity (Kong & Edwards, 2011; Tamati, Gilbert, & Pisoni, 2013), and social awareness (Garrett & Johnson, 2013). For example, neurotypical individuals with fewer autistic traits have been shown to exhibit relatively weak compensation for the coarticulatory effects of vowel context on a preceding fricative (Yu, 2010). Such correlations between processing style and perceptual compensation are of particular interest for the study of how coarticulated variants might serve as a source of new sound patterns that spread through a speech community. If a listener undercompensates for the coarticulatory rounding effects of, say, /u/ on a preceding /s/, then a speaker’s intended /su/ might be heard and re-interpreted by that listener as /ʃu/ (Ohala, 1981). If that listener is also relatively extroverted, with more social contacts (as would be consistent with few autistic traits), then that listener’s speech patterns might be especially likely to spread within their social group (Yu, 2013).

Of course, the scenario in which listener-specific perceptual patterns contribute to sound change rests on the assumption that listeners manifest their individual perceptual biases in their own productions. This assumption is consistent not only with some accounts of the phonetic precursors to sound change (Beddor, 2009; Harrington, Kleber, & Reubold, 2008; Ohala, 1981; Yu, 2013), but also with certain broader theoretical approaches to the perception-production relation. Motor theorists, for example, argue that perceiving speech involves recruitment of the listener’s motor system (Liberman & Mattingly, 1985; see Section 5). Approaches to exemplar theory often assume a perception-production loop in which the phonetic details of the perceived input are reflected in production (Pierrehumbert, 2001; see Section 5). However, to date, empirical tests of the assumption of a close perception–production relation within the individual language user have met with mixed results. That there is a link between perception and production patterns is suggested, for example, by findings that participants who more accurately identify or discriminate targeted speech stimuli also produce greater distinctions between those stimuli (e.g., Ghosh et al., 2010; Hay, Warren, & Drager, 2006; Perkell et al., 2004). Such a link is also consistent with findings that subgroups of participants (e.g., older vs. younger participants) who differ in the degree to which they produce a given coarticulatory pattern correspondingly differ in the extent to which they compensate for those coarticulatory effects (Harrington et al., 2008). Results of other studies, though, indicate that individuals who produce more extensive coarticulation do not more accurately discriminate between those coarticulated variants (Grosvald & Corina, 2012), nor do they exhibit larger perceptual adjustments for those variants (Kataoka, 2011).
Moreover, the relative weights that a listener assigns to co-varying acoustic properties, such as VOT and f0 for voicing distinctions, do not necessarily correlate with that individual’s own productions (Shultz et al., 2012, for English; see Coetzee, Beddor, & Wissing, 2014, for Afrikaans). (See Stevens & Harrington, 2014, for an overview of this literature from the perspective of sound change.)

5 Theoretical Approaches to Speech Perception

Theories of speech perception aim to explain how listeners interpret the input acoustic signal as linguistic forms. A theoretical account of perception should specify the nature of the information that listeners recover from the acoustic signal. It should, as well, specify the mechanisms that enable listeners to simultaneously attend to, yet factor out, the time-varying, context- and speaker-specific information afforded by acoustic variation. That is, a theoretical account should specify the principles that underlie accurate, stable, flexible, and dynamic perception as achieved by different listeners in different contexts. Over the past several decades, these broad goals of theoretical approaches to speech perception have given rise to overarching issues that continue to differentiate current models and drive empirical investigation. Two dominant issues are the nature of the objects of speech perception (Section 5.1) and the nature of language users’ cognitive representations in relation to speech perception (Section 5.2).

5.1 Gestural or Auditory Perception?

Theoretical approaches to speech perception differ in their conception of the nature of the information that listeners recover from the acoustic signal. One fundamental distinction is whether the objects or primitives of speech perception are gestural or auditory.

That the objects of speech perception are articulatory events—that is, the vocal tract gestures of the speaker—is a central tenet of two theoretical approaches, the motor theory (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Liberman & Mattingly, 1985) and direct realism (Fowler, 1986, 1996). For motor theory, early empirical motivation for postulating gestural perception emerged from findings of complex acoustics-to-percept mappings coupled with results consistent with perception paralleling articulation more closely than acoustics (e.g., Cooper et al., 1952; Liberman et al., 1957). Categorical perception of stop consonants (Section 2.1), for example, was interpreted as supporting “the assumption that the listener uses the inconstant sound as a basis for finding his way back to the articulatory gestures that produced it” (Liberman et al., 1967, p. 453). However, the gestures that listeners arrive at are assumed to be the speaker’s intended gestures rather than the actual articulatory events due to motor theorists’ belief that coarticulatory overlap of movements of the vocal tract organs results in gestures not being directly recoverable from the acoustic signal (Liberman & Mattingly, 1985). Motor theorists postulate as well that perception of gestures occurs in a biologically specialized phonetic module, that is, in a perceiving system adapted specifically for speech (Liberman & Mattingly, 1989; Whalen & Liberman, 1987). (See Galantucci, Fowler, & Turvey, 2006, for a detailed assessment of the main claims of motor theory.)

According to the theory of direct realism, though, speech perception is not special. For direct realism, that listeners perceive vocal tract gestures follows from the theory’s more general perspective that perceptual systems serve as the means by which perceivers know their environment. In this view, although listeners hear an acoustic signal, they perceive the event that caused the (lawful) structure in that signal; in the case of speech, the events that cause this structure are gestures of the vocal tract (Fowler, 1996). Moreover, because the causal events are claimed to be directly perceived, direct realists posit that the actual vocal tract actions, rather than intended gestures, are apprehended (Best, 1995; Fowler, 1996). Within this approach, it follows that listeners “parse the [acoustic] signal along gestural lines” (Fowler, 2006, p. 162), an expectation consistent with, for example, findings of compensation for coarticulation (Section 2.1).

However, the more widely held theoretical perspective, sometimes labeled the general auditory approach to speech perception (Diehl, Lotto, & Holt, 2004; Lotto & Holt, 2006), is that the objects of perception are auditory. In this approach, speech sounds—like all sounds—are perceived via general processes of audition and perceptual learning. For example, the many-to-one acoustics-to-percept mappings whereby different acoustic signals elicit the same percept (as illustrated in Sections 1 and 2) are taken to be due not to these sounds being the outcome of the same vocal tract gesture but rather to “the general ability of the perceiver to make use of multiple imperfect acoustic cues to categorize complex stimuli” (Diehl et al., 2004, p. 154). Also, if the perceptual processing of speech is not different from that of non-speech sounds, then speech and non-speech stimuli that share critical acoustic characteristics should give rise to similar perceptual responses. Findings demonstrating these similarities are offered as empirical support for the general auditory approach (e.g., Laing, Liu, Lotto, & Holt, 2012; Lotto & Kluender, 1998; Pisoni, 1977).

A particularly clear illustration of the debate between gestural and auditory approaches is a series of investigations of compensation for coarticulated approximant-stop sequences. Due to coarticulation, velar /g/ is more fronted in an alveolar /alga/ context than in post-alveolar /aɹga/, rendering /g/ somewhat more /d/-like after /l/ than after /ɹ/. Acoustically, this effect is seen in the onset frequency of the third formant (F3). Typically, onset F3 is lower for /g/ than for /d/. But after an /l/, which has a high-frequency F3, onset F3 for /g/ rises (i.e., becomes more /d/-like). Listeners adjust for these contextual effects: when identifying stops belonging to a /d/-/g/ continuum varying in F3, listeners are more likely to identify ambiguous stops as /g/ in an /al_a/ than in an /aɹ_a/ context (Mann, 1980). The “classic” account of the perceptual effect is a gestural interpretation (see Section 2.1), whereby listeners are compensating for coarticulation, attributing the stop’s relatively high F3 onset to the preceding context (Fowler, 2006; Fowler et al., 1990; Mann, 1986). That is, listeners appear to be gesturally parsing the signal. Auditory theorists offer an alternative account, proposing instead that the effect of the approximant is not specific to speech but is rather due to the more general auditory effect of spectral contrast. Specifically, the F3 onset frequency of the ambiguous stop sounds relatively low (i.e., relatively /g/-like) after /l/’s high-frequency F3. Consistent with a spectral contrast account, human listeners give compensatory-like responses when the stop consonant is preceded by a non-speech context that mimics critical properties of /l/ and /ɹ/ (Lotto & Kluender, 1998; see Viswanathan et al., 2010); moreover, birds trained on the relevant speech stimuli also show compensatory-like responses (Lotto, Kluender, & Holt, 1997).
However, inconsistent with an auditory explanation is the finding that when the pre-stop auditory context is held constant, and the only information for the difference between a preceding /l/ or /ɹ/ is visual, listeners show compensation for the visual coarticulatory context (Fowler, Brown, & Mann, 2000; see also Fowler, 2006, and the response by Lotto & Holt, 2006). Thus, gestural and auditory theorists have each marshaled empirical evidence that is more problematic for one account than the other, and they continue to provide new evidence that informs the debate.

5.2 Abstract and Episodic Encoding

Perceptual findings, such as those discussed in Sections 2 and 3, provide strong evidence that listeners (i) adjust for systematic subphonemic variation and thereby judge these variants as belonging to the same phonological category and (ii) use that variation in making informed linguistic decisions. These—among many other—findings raise foundational questions concerning the nature of phonological representations in relation to speech perception. There is now general recognition that these representations are more detailed than the abstract, invariant representations of traditional formal phonology, in which phonetic details are assumed to have been stripped away.

A highly influential alternative to the abstract representations of formal phonology is the proposal that listeners store memories of the physical manifestations of their speech experiences. Early empirical motivation for this proposal came, in part, from findings that listeners not only attend to the details of a particular speaker’s voice, but they retain this information over time (e.g., Goldinger, 1996). Rather than having speaker normalization and memory for speaker-specific information “peacefully coexist” (Goldinger, 1998, p. 264), exemplar models instead postulate direct storage of episodic traces without normalization (Goldinger, 1998; Johnson, 1997). Contextual variation is, of course, also retained in this approach. The episodic traces are stored, with associated labels (e.g., labels about who produced the exemplar), as instances of a linguistic category. Most models assume the linguistic categories to be lexical, although some also assume parsing of word exemplars into segment-sized units (e.g., Pierrehumbert, 2001). Listeners determine the linguistic category to which a new exemplar belongs by computing its phonetic distance from stored exemplars; the shorter the distance (taking into account attentional weights, for example; Pierrehumbert, 2001), the more probable the category. Among the attractions of an exemplar approach to phonological representations is that it provides a means for capturing listeners’ (and speakers’) integration of socio-indexical and phonetic properties in perception (and production) (e.g., Docherty & Mendoza-Denton, 2012). Another appeal has been that exemplar approaches often assume a perception-production loop in which the phonetic details of the perceived input are mirrored in production (Pierrehumbert, 2001). 
In this approach, an individual listener’s systematic perceptual biases or weightings, which include weightings based on social factors, have the potential to contribute to sound change because listeners-turned-speakers should manifest those biases in production (see Section 4).
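The exemplar categorization procedure described above, in which a new token is assigned to the category whose stored exemplars it most resembles, with distance modulated by attentional weights, can be sketched in a few lines. This is a rough illustration of the general approach (cf. Pierrehumbert, 2001), not any published model; the cue values, labels, and parameters are invented.

```python
import math

# Toy exemplar store: each exemplar pairs a cue vector (here, VOT in ms and
# onset f0 in Hz) with a category label. All values are hypothetical.
exemplars = [
    ((10.0, 190.0), "/b/"), ((12.0, 195.0), "/b/"), ((15.0, 200.0), "/b/"),
    ((45.0, 230.0), "/p/"), ((50.0, 240.0), "/p/"), ((55.0, 235.0), "/p/"),
]

def categorize(token, exemplars, attention=(1.0, 0.05), sensitivity=0.5):
    """Similarity-weighted exemplar categorization: sum the incoming token's
    similarity to every stored exemplar of each category; the category with
    the larger summed similarity is the more probable percept.
    `attention` scales each cue dimension (cf. attentional weights)."""
    sums = {}
    for cues, label in exemplars:
        # Attention-weighted city-block distance, mapped to similarity by
        # exponential decay: nearby exemplars contribute most.
        d = sum(a * abs(t - e) for a, t, e in zip(attention, token, cues))
        sums[label] = sums.get(label, 0.0) + math.exp(-sensitivity * d)
    total = sum(sums.values())
    return {label: s / total for label, s in sums.items()}

# A token close in VOT to the stored /p/ exemplars:
probs = categorize((40.0, 225.0), exemplars)
# probs["/p/"] exceeds probs["/b/"] because the token lies nearer the /p/ cloud.
```

Changing the `attention` vector shifts which cue dominates the distance computation, which is one way such models capture the listener-specific cue weightings discussed in Section 4.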

However, purely exemplar models have serious shortcomings as a theory of speech perception. For example, they have difficulty accounting for listeners’ remarkable abilities for category learning (Pierrehumbert, 2006), their flexibility in adjusting category boundaries and generalizing these new boundaries to unheard words (Cutler, Eisner, McQueen, & Norris, 2010; see also Section 2.1), and their use of abstract knowledge when encoding new instances of speech (Goldinger, 2007), to name but a few. That a hybrid model is needed has perhaps already become the generally accepted view (e.g., Docherty & Foulkes 2014; Goldinger, 2007; Nielsen, 2011; Pierrehumbert, 2006; Pisoni & Levi, 2007). It is expected that work in this area will increasingly focus on how both abstract information and detailed acoustic information are stored and retrieved (e.g., Cutler et al., 2010; Kimball, Cole, Dell, & Shattuck-Hufnagel, 2015), and how best to integrate in a single model the essential contributions of abstract and exemplar approaches (e.g., Hawkins, 2010).

6 Looking Ahead

For speech perception researchers, understanding the relation between the input signal and linguistic percept involves studying the structure of the physical speech signal, the structure of the cognitive representation of speech, and the perceptual processing that enables listeners to attend to—and learn to attend to—the complex informational structure of that highly variable signal. These issues are being addressed through theoretical developments (e.g., hybrid abstract/episodic models) and methodological innovations (e.g., for studying the time course of speech processing). They are being addressed as well through study of listeners’ integrated use of linguistic information and other (e.g., social) information that falls outside the more traditional scope of speech perception research, and through more focused attention on individual listeners as elucidating the cognitive capabilities and social and linguistic experiences that guide perceivers of speech. Although speech perception has been informed by linguistic theory from the earliest years of the discipline (Jusczyk & Luce, 2002), increasingly speech perception models and empirical findings are informing theories of phonological representation and phonological change.

Further Reading

Classic Early Findings

Eimas, P. D., Siqueland, E. R., Jusczyk, P., & Vigorito, J. (1971). Speech perception in infants. Science, 171, 303–306.

Kuhl, P. K., & Miller, J. D. (1975). Speech perception by the chinchilla: voiced-voiceless distinction in alveolar plosive consonants. Science, 190, 69–72.

Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. Journal of the Acoustical Society of America, 29, 98–104.

Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358–368.

Miyawaki, K., Strange, W., Verbrugge, R., Liberman, A. M., Jenkins, J. J., & Fujimura, O. (1975). An effect of linguistic experience: The discrimination of [r] and [l] by native speakers of Japanese and English. Perception & Psychophysics, 18, 331–340.

Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24, 175–184.

Selected Theoretical Approaches

Diehl, R. L., Lotto, A. J., & Holt, L. L. (2004). Speech perception. Annual Review of Psychology, 55, 149–179.

Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14, 3–28.

Goldinger, S. D. (1998). Echo of echoes? An episodic theory of lexical access. Psychological Review, 105, 251–279.

Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1–36.

Ohala, J. J. (1981). The listener as a source of sound change. In C. S. Masek, R. A. Hendrick, & M. F. Miller (Eds.), Papers from the parasession on language and behavior (pp. 178–203). Chicago: Chicago Linguistic Society.

Overviews (Especially of Recent Findings and Their Theoretical Implications)

Best, C. T., & Tyler, M. D. (2007). Nonnative and second-language speech perception: commonalities and complementarities. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James Emil Flege (pp. 13–34). Amsterdam: John Benjamins.

Drager, K. (2010). Sociophonetic variation in speech perception. Language and Linguistics Compass, 4, 473–480.

Gervain, J., & Mehler, J. (2010). Speech perception and language acquisition in the first year of life. Annual Review of Psychology, 61, 191–218.

Pisoni, D. B., & Remez, R. E. (Eds.) (2005). The handbook of speech perception. Malden, MA: Blackwell.

Samuel, A. G., & Kraljic, T. (2009). Perceptual learning in speech perception. Attention, Perception & Psychophysics, 71, 1207–1218.

References

Adank, P., Evans, B. G., Stuart-Smith, J., & Scott, S. K. (2009). Comprehension of familiar and unfamiliar native accents under adverse listening conditions. Journal of Experimental Psychology: Human Perception and Performance, 35, 520–529.

Baese-Berk, M., Bent, T., Borrie, S., & McKee, M. (2015). Individual differences in perception of unfamiliar speech. In The Scottish Consortium for ICPhS 2015 (Ed.), Proceedings of the 18th International Congress of Phonetic Sciences. Paper number 0460.1-5. Glasgow, U.K.: The University of Glasgow. Retrieved from http://www.icphs2015.info/pdfs/Papers/ICPHS0460.pdf

Beddor, P. S. (2009). A coarticulatory path to sound change. Language, 85, 785–821.

Beddor, P. S., & Krakow, R. A. (1999). Perception of coarticulatory nasalization by speakers of English and Thai: evidence for partial compensation. Journal of the Acoustical Society of America, 106, 2868–2887.

Beddor, P. S., McGowan, K. B., Boland, J. E., Coetzee, A. W., & Brasher, A. (2013). The time course of perception of coarticulation. Journal of the Acoustical Society of America, 133, 2350–2366.

Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 171–204). Baltimore, MD: York Press.

Best, C. T., & Tyler, M. D. (2007). Nonnative and second-language speech perception: commonalities and complementarities. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in second language speech learning: In honor of James Emil Flege (pp. 13–34). Amsterdam: John Benjamins.

Blumstein, S. E., Myers, E. B., & Rissman, J. (2005). The perception of voice onset time: An fMRI investigation of phonetic category structure. Journal of Cognitive Neuroscience, 17, 1353–1366.

Bradlow, A. R., & Bent, T. (2008). Perceptual adaptation to non-native speech. Cognition, 106, 707–729.

Clopper, C. G. (2014). Sound change in the individual: Effects of exposure on cross-dialect speech processing. Laboratory Phonology, 5, 69–90.

Coetzee, A. W., Beddor, P. S., & Wissing, D. (2014). Emergent tonogenesis in Afrikaans. Journal of the Acoustical Society of America, 135, 2421–2422 (abstract).

Cooper, F. S., Delattre, P. C., Liberman, A. M., Borst, J., & Gerstman, L. J. (1952). Some experiments on the perception of synthetic speech sounds. Journal of the Acoustical Society of America, 24, 597–606.

Curtin, S., Mintz, T. H., & Byrd, D. (2001). Coarticulatory cues enhance infants’ recognition of syllable sequences in speech. In A. H.-J. Do, L. Domínguez, & A. Johansen (Eds.), Proceedings of the 25th annual Boston University Conference on Language Development (pp. 190–201). Somerville, MA: Cascadilla Press.

Cutler, A., Eisner, F., McQueen, J. M., & Norris, D. (2010). How abstract phonemic categories are necessary for coping with speaker-related variation. In C. Fougeron, B. Kühnert, M. D’Imperio, & N. Vallée (Eds.), Laboratory phonology (Vol. 10, pp. 91–111). Berlin: De Gruyter Mouton.

Dahan, D., Drucker, S. J., & Scarborough, R. A. (2008). Talker adaptation in speech perception: Adjusting the signal or the representations? Cognition, 108, 710–718.

Dahan, D., Magnuson, J. S., Tanenhaus, M. K., & Hogan, E. M. (2001). Subcategorical mismatches and the time course of lexical access: Evidence for lexical competition. Language and Cognitive Processes, 16, 507–534.

Diehl, R. L., Lotto, A. J., & Holt, L. L. (2004). Speech perception. Annual Review of Psychology, 55, 149–179.

Docherty, G. J., & Foulkes, P. (2014). An evaluation of usage-based approaches to the modelling of sociophonetic variability. Lingua, 142, 42–56.

Docherty, G., & Mendoza-Denton, N. (2012). Speaker-related variation—sociolinguistic factors. In A. C. Cohn, C. Fougeron, & M. K. Huffman (Eds.), The Oxford handbook of laboratory phonology (pp. 43–60). Oxford: Oxford University Press.

Drager, K. (2010). Sociophonetic variation in speech perception. Language and Linguistics Compass, 4, 473–480.

Eimas, P. D., Siqueland, E. R., Jusczyk, P., & Vigorito, J. (1971). Speech perception in infants. Science, 171, 303–306.

Fant, G. (1960). Acoustic theory of speech production. The Hague: Mouton.

Flege, J. E. (1995). Second-language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 233–277). Baltimore, MD: York Press.

Floccia, C., Goslin, J., Girard, F., & Konopczynski, G. (2006). Does a regional accent perturb speech processing? Journal of Experimental Psychology: Human Perception and Performance, 32, 1276–1293.

Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14, 3–28.

Fowler, C. A. (1996). Listeners do hear sounds, not tongues. Journal of the Acoustical Society of America, 99, 1730–1741.

Fowler, C. A. (2006). Compensation for coarticulation reflects gesture perception, not spectral contrast. Perception & Psychophysics, 68, 161–177.

Fowler, C. A., Best, C. T., & McRoberts, G. W. (1990). Young infants’ perception of liquid coarticulatory influences on following stop consonants. Perception & Psychophysics, 48, 559–570.

Fowler, C. A., Brown, J. M., & Mann, V. A. (2000). Contrast effects do not underlie effects of preceding liquids on stop-consonant identification by humans. Journal of Experimental Psychology: Human Perception and Performance, 26, 877–888.

Fuchs, S., & Toda, M. (2010). Do differences in male versus female /s/ reflect biological or sociophonetic factors? In S. Fuchs, M. Toda, & M. Żygis (Eds.), Turbulent sounds: An interdisciplinary guide (pp. 281–302). Berlin: De Gruyter Mouton.

Galantucci, B., Fowler, C. A., & Turvey, M. T. (2006). The motor theory of speech perception reviewed. Psychonomic Bulletin & Review, 13, 361–377.

Garrett, A., & Johnson, K. (2013). Phonetic bias in sound change. In A. C. L. Yu (Ed.), Origins of sound change: Approaches to phonologization (pp. 51–97). Oxford: Oxford University Press.

Ghosh, S. S., Matthies, M. L., Maas, E., Hanson, A., Tiede, M., Ménard, L., et al. (2010). An investigation of the relation between sibilant production and somatosensory and auditory acuity. Journal of the Acoustical Society of America, 128, 3079–3087.

Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 1166–1183.

Goldinger, S. D. (1998). Echo of echoes? An episodic theory of lexical access. Psychological Review, 105, 251–279.

Goldinger, S. D. (2007). A complementary-systems approach to abstract and episodic speech perception. In J. Trouvain & W. J. Barry (Eds.), Proceedings of the 17th International Congress of Phonetic Sciences (pp. 49–54). Saarbrücken, Germany.

Grosvald, M., & Corina, D. (2012). The production and perception of sub-phonemic vowel contrasts and the role of the listener in sound change. In M.-J. Solé & D. Recasens (Eds.), The initiation of sound change: Production, perception, and social factors (pp. 77–100). Amsterdam: John Benjamins.

Haggard, M., Ambler, S., & Callow, M. (1970). Pitch as a voicing cue. Journal of the Acoustical Society of America, 47, 613–617.

Harrington, J., Kleber, F., & Reubold, U. (2008). Compensation for coarticulation, /u/-fronting, and sound change in standard southern British: An acoustic and perceptual study. Journal of the Acoustical Society of America, 123, 2825–2835.

Hawkins, S. (2010). Phonetic variation as communicative system: Perception of the particular and the abstract. In C. Fougeron, B. Kühnert, M. D’Imperio, & N. Vallée (Eds.), Laboratory phonology (Vol. 10, pp. 479–510). Berlin: De Gruyter Mouton.

Hazan, V., & Barrett, S. (2000). The development of phonemic categorization in children aged 6–12. Journal of Phonetics, 28, 377–396.

                                                                                                                    Hay, J., Warren, P., & Drager, K. (2006). Factors influencing speech perception in the context of a merger-in-progress. Journal of Phonetics, 34, 458–484.Find this resource:

                                                                                                                      Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97, 3099–3111.Find this resource:

                                                                                                                        House, A. S., & Fairbanks, G. (1953). The influence of consonant environment upon the secondary acoustical characteristics of vowels. Journal of the Acoustical Society of America, 25, 105–113.Find this resource:

                                                                                                                          Houston, D. M., & Jusczyk, P. W. (2000). The role of talker-specific information in word segmentation by infants. Journal of Experimental Psychology: Human Perception and Performance, 26, 1570–1582.Find this resource:

                                                                                                                            Idemaru, K., Holt, L. L., & Seltman, H. (2012). Individual differences in cue weights are stable across time: The case of Japanese stop lengths. Journal of the Acoustical Society of America, 132, 3950–3964.Find this resource:

                                                                                                                              Johnson, E. K., & Jusczyk, P. W. (2001). Word segmentation by 8-month-olds: when speech cues count more than statistics. Journal of Memory and Language, 44, 548–567.Find this resource:

                                                                                                                                Johnson, K. (1991). Differential effects of speaker and vowel variability on fricative perception. Language and Speech, 34, 265–279.Find this resource:

                                                                                                                                  Johnson, K. (1997). Speech perception without speaker normalization: An exemplar model. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 145–165). San Diego, CA: Academic Press.Find this resource:

                                                                                                                                    Johnson, K. (2005). Speaker normalization in speech perception. In D. B. Pisoni & R. Remez (Eds.), The handbook of speech perception (pp. 363–389). Oxford: Blackwell Publishers.Find this resource:

                                                                                                                                      Jongman, A., Wayland, R., & Wong, S. (2000). Acoustic characteristics of English fricatives. Journal of the Acoustical Society of America, 108, 1252–1263.Find this resource:

                                                                                                                                        Joos, M. (1948). Acoustic phonetics. Language, 24(suppl. 2), 1–136.Find this resource:

                                                                                                                                          Jusczyk, P. W., & Luce, P. A. (2002). Speech perception and spoken word recognition: past and present. Ear and Hearing, 23, 2–40.Find this resource:

                                                                                                                                            Kataoka, R. (2011). Phonetic and cognitive bases of sound change (Unpublished PhD diss.). University of California, Berkeley, CA.Find this resource:

                                                                                                                                              Kawasaki, H. (1986). Phonetic explanation for phonological universals: The case of distinctive vowel nasalization. In J. J. Ohala & J. J. Jaeger (Eds.), Experimental phonology (pp. 81–103). Orlando, FL: Academic Press.Find this resource:

                                                                                                                                                Kimball, A. E., Cole, J., Dell, G., & Shattuck-Hufnagel, S. (2015). Categorical vs. episodic memory for pitch accents in English. In The Scottish Consortium for ICPhS 2015 (Ed.), Proceedings of the 18th International Congress of Phonetic Sciences. Glasgow, U.K.: The University of Glasgow. Paper number 0897.1-5. Retrieved from http://www.icphs2015.info/pdfs/Papers/ICPHS0897.pdfFind this resource:

                                                                                                                                                  Kong, E. J., & Edwards, J. (2011). Individual differences in speech perception: Evidence from visual analogue scaling and eye-tracking. In W.-S. Lee & E. Zee (Eds.), Proceedings of the 17th International Congress of Phonetic Sciences (pp. 1126–1129). Hong Kong: City University of Hong Kong.Find this resource:

                                                                                                                                                    Kuhl, P. K. (1979). Speech perception in early infancy: perceptual constancy for spectrally dissimilar vowel categories. Journal of the Acoustical Society of America, 66, 1668–1679.Find this resource:

                                                                                                                                                      Ladefoged, P. (1989). A note on “Information conveyed by vowels.” Journal of the Acoustical Society of America, 85, 2223–2224.Find this resource:

                                                                                                                                                        Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. Journal of the Acoustical Society of America, 29, 98–104.Find this resource:

                                                                                                                                                          Laing, E. J. C., Liu, R., Lotto A. J., & Holt, L. L. (2012). Tuned with a tune: Talker normalization via general auditory processes. Frontiers in Psychology, 3.Find this resource:

                                                                                                                                                            Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431–461.Find this resource:

                                                                                                                                                              Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358–368.Find this resource:

                                                                                                                                                                Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1–36.Find this resource:

                                                                                                                                                                  Liberman, A. M., & Mattingly, I. G. (1989). A specialization for speech perception. Science, 243, 489–494.Find this resource:

                                                                                                                                                                    Löfqvist, A., Baer, T., & McGarr, N. S. (1989). The cricothyroid muscle in voicing control. Journal of the Acoustical Society of America, 85, 1314–1321.Find this resource:

                                                                                                                                                                      Lotto, A. J., & Holt, L. L. (2006). Putting phonetic context effects into context: A commentary on Fowler. Perception & Psychophysics, 68, 178–183.Find this resource:

                                                                                                                                                                        Lotto, A. J., & Kluender, K. R. (1998). General contrast effects in speech perception: Effect of preceding liquid on stop consonant identification. Perception & Psychophysics, 60, 602–619.Find this resource:

                                                                                                                                                                          Lotto, A. J., Kluender, K. R., & Holt, L. L. (1997). Perceptual compensation for coarticulation by Japanese quail (Coturnix coturnix japonica). Journal of the Acoustical Society of America, 102, 1134–1140.Find this resource:

                                                                                                                                                                            Mann, V.A. (1980). Influence of preceding liquid on stop-consonant perception. Perception and Psychophysics, 28, 407–412.Find this resource:

                                                                                                                                                                              Mann, V. A. (1986). Distinguishing universal and language-dependent levels of speech perception: Evidence from Japanese listeners’ perception of English “l” and “r”. Cognition, 24, 169–196.Find this resource:

                                                                                                                                                                                Mann, V. A., & Repp, B. H. (1980). The influence of vocalic context on perception of the [ʃ]-[s] distinction. Perception and Psychophysics, 28, 213–228.Find this resource:

                                                                                                                                                                                  Maye, J., Werker, J. F., & Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82, B101–B111.Find this resource:

                                                                                                                                                                                    McGowan, K. B. (2015). Social expectation improves speech perception in noise. Language and Speech, 58, 502–521.Find this resource:

                                                                                                                                                                                      McMurray, B., & Aslin, R. N. (2005). Infants are sensitive to within-category variation in speech perception. Cognition, 95, B15–B26.Find this resource:

                                                                                                                                                                                        McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2002). Gradient effects of within-category phonetic category phonetic variation on lexical access. Cognition, 86, B33–B42.Find this resource:

                                                                                                                                                                                          Miller, J. L. (1997). Internal structure of phonetic categories. Language and Cognitive Processes, 12, 856–860.Find this resource:

                                                                                                                                                                                            Miller, J. L., Green, K. P., & Reeves, A. (1986). Speaking rate and segments: A look at the relation between speech production and speech perception for the voicing contrast. Phonetica, 43, 106–115.Find this resource:

                                                                                                                                                                                              Miller, J. L., & Volaitis, L. E. (1989). Effect of speaking rate on the perceptual structure of a phonetic category. Perception & Psychophysics, 46, 505–512.Find this resource:

                                                                                                                                                                                                Mitterer, H. (2006). On the causes of compensation for coarticulation: Evidence for phonological mediation. Perception & Psychophysics, 68, 1127–1240.Find this resource:

                                                                                                                                                                                                  Miyawaki, K., Strange, W., Verbrugge, R., Liberman, A. M., Jenkins, J. J., & Fujimura, O. (1975). An effect of linguistic experience: The discrimination of [r] and [l] by native speakers of Japanese and English. Perception & Psychophysics, 18, 331–340.Find this resource:

                                                                                                                                                                                                    Morrison, G. S., & Nearey, T. M. (2007). Testing theories of vowel inherent spectral change. Journal of the Acoustical Society of America, 122, EL15–EL22.Find this resource:

                                                                                                                                                                                                      Mullennix, J. W., Pisoni, D. B., & Martin, C. S. (1989). Some effects of talker variability on spoken word recognition. Journal of the Acoustical Society of America, 85, 365–378.Find this resource:

                                                                                                                                                                                                        Munson, B. (2011). The influence of actual and imputed talker gender on fricative perception, revisited. Journal of the Acoustical Society of America, 130, 2631–2634.Find this resource:

                                                                                                                                                                                                          Munson, B., Jefferson, S. V., & McDonald, E. C. (2006). The influence of perceived sexual orientation on fricative identification. Journal of the Acoustical Society of America, 119, 2427–2437.Find this resource:

                                                                                                                                                                                                            Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85, 2088–2113.Find this resource:

                                                                                                                                                                                                              Niedzielski, N. (1999). The effect of social information on the perception of sociolinguistic variables. Journal of Language and Social Psychology, 18, 62–85.Find this resource:

                                                                                                                                                                                                                Nielsen, K. (2011). Specificity and abstractness of VOT imitation. Journal of Phonetics, 39, 132–142.Find this resource:

                                                                                                                                                                                                                  Nittrouer, S., & Miller, M. E. (1997). Predicting developmental shifts in perceptual weighting schemes. Journal of the Acoustical Society of America, 101, 2253–2266.Find this resource:

                                                                                                                                                                                                                    Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception & Psychophysics, 60, 355–376.Find this resource:

                                                                                                                                                                                                                      Ohala, J. J. (1981). The listener as a source of sound change. In C. S. Masek, R. A. Hendrick, & M. F. Miller (Eds.), Papers from the parasession on language and behavior (pp. 178–203). Chicago: Chicago Linguistic Society.Find this resource:

                                                                                                                                                                                                                        Perkell, J. S., Guenther, F. H., Lane, H., Matthies, M. L., Stockmann, E., Tiede, M., et al. (2004). The distinctness of speakers’ productions of vowel contrasts is related to their discrimination of the contrasts. Journal of the Acoustical Society of America, 116, 2338–2344.Find this resource:

                                                                                                                                                                                                                          Peterson, G. E., & Barney, H. L. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24, 175–184.Find this resource:

                                                                                                                                                                                                                            Pierrehumbert, J. B. (2001). Exemplar dynamics: Word frequency, lenition and contrast. In J. Bybee & P. Hopper (Eds.), Frequency effects and the emergence of linguistic structure (pp. 137–157). Amsterdam: John Benjamins.Find this resource:

                                                                                                                                                                                                                              Pierrehumbert, J. B. (2003). Phonetic diversity, statistical learning, and acquisition of phonology. Language and Speech, 46, 115–154.Find this resource:

                                                                                                                                                                                                                                Pierrehumbert, J. B. (2006). The next toolkit. Journal of Phonetics, 34, 516–530.Find this resource:

                                                                                                                                                                                                                                  Pisoni, D. B. (1977). Identification and discrimination of the relative onset time of two component tones: Implications for voicing perception in stops. Journal of the Acoustical Society of America, 61, 1352–1361.Find this resource:

                                                                                                                                                                                                                                    Pisoni, D. B., & Levi, S. V. (2007). Representations and representational specificity in speech perception and spoken word recognition. In M. G. Gaskell (Ed.), The Oxford handbook of psycholinguistics (pp. 3–18). Oxford: Oxford University Press.Find this resource:

                                                                                                                                                                                                                                      Pisoni, D. B., & Tash, J. (1974). Reaction times to comparisons within and across phonetic categories. Perception & Psychophysics, 15, 285–290.Find this resource:

                                                                                                                                                                                                                                        Repp, B. H. (1984). Categorical perception: Issues, methods, findings. In N. J. Lass (Ed.), Speech and language: Advances in basic research and practice (Vol. 10, pp. 243–335). New York: Academic Press.Find this resource:

                                                                                                                                                                                                                                          Repp, B. H., & Liberman, A. M. (1987). Phonetic category boundaries are flexible. In S. Harnad (Ed.), Categorical perception. The groundwork of cognition (pp. 89–112). Cambridge, U.K.: Cambridge University Press.Find this resource:

                                                                                                                                                                                                                                            Rost, G. C., & McMurray, B. (2010). Finding the signal by adding noise the role of noncontrastive phonetic variability in early word learning. Infancy, 15, 608–635.Find this resource:

                                                                                                                                                                                                                                              Samuel A. G., & Kraljic, T. (2009). Perceptual learning in speech perception. Attention, Perception & Psychophysics, 71, 1207–1218.Find this resource:

                                                                                                                                                                                                                                                Schertz, J., Cho, T., Lotto, A., & Warner, N. (2015). Individual differences in phonetic cue use in production and perception of a non-native sound contrast. Journal of Phonetics, 52, 183–204.Find this resource:

                                                                                                                                                                                                                                                  Shultz, A. A., Francis, A. L., & Llanos, F. (2012). Differential cue weighting in perception and production of consonant voicing. Journal of the Acoustical Society of America, 132, EL95–EL101.Find this resource:

                                                                                                                                                                                                                                                    Sjerps, M. J., & Smiljanić, R. (2013). Compensation for vocal tract characteristics across native and non-native languages. Journal of Phonetics, 41, 145–155.Find this resource:

                                                                                                                                                                                                                                                      Smits, R. (2001). Evidence for hierarchical categorization of coarticulated phonemes. Journal of Experimental Psychology: Human Perception and Performance, 27, 1145–1162.Find this resource:

                                                                                                                                                                                                                                                        Staum Casasanto, L. (2010). What do listeners know about sociolinguistic variation? University of Pennsylvania Working Papers in Linguistics: Selected Papers from NWAV 37, 15(2), 39–49.Find this resource:

                                                                                                                                                                                                                                                          Stevens, K. N. (1998). Acoustic phonetics. Cambridge, MA: MIT Press.Find this resource:

                                                                                                                                                                                                                                                            Stevens, K. N., & House, A. S. (1963). Perturbation of vowel articulations by consonantal context: an acoustical study. Journal of Speech and Hearing Research, 6, 111–128.Find this resource:

                                                                                                                                                                                                                                                              Stevens, M., & Harrington, J. (2014). The individual and the actuation of sound change. Loquens, 1, 1–10.Find this resource:

                                                                                                                                                                                                                                                                Stewart, M. E., & Ota, M. (2008). Lexical effects on speech perception in individuals with “autistic” traits. Cognition, 109, 157–162.Find this resource:

                                                                                                                                                                                                                                                                  Strand, E. A. (1999). Uncovering the role of gender stereotypes in speech perception. Journal of Language and Social Psychology, 18, 86–99.Find this resource:

                                                                                                                                                                                                                                                                    Strange, W. (2011). Automatic selective perception (ASP) of first and second language speech: A working model. Journal of Phonetics, 39, 456–466.Find this resource:

                                                                                                                                                                                                                                                                      Strange, W., Jenkins, J. J., & Johnson, T. L. (1983). Dynamic specification of coarticulated vowels. Journal of the Acoustical Society of America, 74, 695–705.Find this resource:

Sumner, M., & Kataoka, R. (2013). Effects of phonetically-cued talker variation on semantic encoding. Journal of the Acoustical Society of America, 134, EL485–EL491.

Sumner, M., Kim, S. K., King, E., & McGowan, K. B. (2014). The socially-weighted encoding of spoken words: A dual-route approach to speech perception. Frontiers in Psychology, 4.

Syrdal, A. K., & Gopal, H. S. (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America, 79, 1086–1100.

Tamati, T. N., Gilbert, J. L., & Pisoni, D. B. (2013). Some factors underlying individual differences in speech recognition on PRESTO: A first report. Journal of the American Academy of Audiology, 24, 616–634.

Trude, A. M., & Brown-Schmidt, S. (2012). Talker-specific perceptual adaptation during online speech perception. Language and Cognitive Processes, 27, 979–1001.

Viswanathan, N., Magnuson, J. S., & Fowler, C. A. (2010). Compensation for coarticulation: Disentangling auditory and gestural theories of perception of coarticulatory effects in speech. Journal of Experimental Psychology: Human Perception and Performance, 36, 1005–1015.

Werker, J. F., & Curtin, S. (2005). PRIMIR: A developmental framework of infant speech processing. Language Learning and Development, 1, 197–234.

Werker, J. F., & Gervain, J. (2013). Speech perception in infancy: A foundation for language acquisition. In P. D. Zelazo (Ed.), The Oxford handbook of developmental psychology, Vol. 1, Body and mind (pp. 909–925).

Werker, J. F., Shi, R., Desjardins, R., Pegg, J. E., Polka, L., & Patterson, M. (1998). Three methods for testing infant speech perception. In A. Slater (Ed.), Perceptual development: Visual, auditory, and speech perception in infancy (pp. 389–420). East Sussex, U.K.: Psychology Press.

Werker, J. F., & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior & Development, 7, 49–63.

Whalen, D. H. (1981). Effects of vocalic formant transitions and vowel quality on the English [s]-[š] boundary. Journal of the Acoustical Society of America, 69, 275–282.

Whalen, D. H., Abramson, A. S., Lisker, L., & Mody, M. (1993). F0 gives voicing information even with unambiguous voice onset times. Journal of the Acoustical Society of America, 93, 2152–2159.

Whalen, D. H., & Liberman, A. M. (1987). Speech perception takes precedence over nonspeech perception. Science, 237, 169–171.

Yu, A. C. L. (2010). Perceptual compensation is correlated with individuals’ “autistic” traits: Implications for models of sound change. PLoS ONE, 5(8), e11950.

Yu, A. C. L. (2013). Individual differences in socio-cognitive processing and the actuation of sound change. In A. C. L. Yu (Ed.), Origins of sound change: Approaches to phonologization (pp. 201–227). Oxford: Oxford University Press.

Yu, A. C. L., & Lee, H. (2014). The stability of perceptual compensation for coarticulation within and across individuals: A cross-validation study. Journal of the Acoustical Society of America, 136, 382–388.

Notes:

(1.) This article is based in part upon work supported by the National Science Foundation under Grant Number BCS-1348150 to Patrice Speeter Beddor and Andries W. Coetzee; any opinions, findings, and conclusions expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation. I thank Jelena Krivokapić and Kevin McGowan for valuable comments on an earlier draft.