PRINTED FROM the OXFORD RESEARCH ENCYCLOPEDIA, LINGUISTICS (linguistics.oxfordre.com). (c) Oxford University Press USA, 2016. All Rights Reserved. Personal use only; commercial use is strictly prohibited. Please see applicable Privacy Policy and Legal Notice (for details see Privacy Policy).

date: 20 November 2017

Corpus Phonology

Summary and Keywords

Corpus Phonology is an approach to phonology that places corpora at the center of phonological research. Some practitioners of corpus phonology see corpora as the only object of investigation; others use corpora alongside other available techniques (for instance, intuitions, psycholinguistic and neurolinguistic experimentation, laboratory phonology, the study of the acquisition of phonology or of language pathology, etc.). Whatever version of corpus phonology one advocates, corpora have become part and parcel of the modern research environment, and their construction and exploitation has been modified by the multidisciplinary advances made within various fields. Indeed, for the study of spoken usage, the term ‘corpus’ should nowadays only be applied to bodies of data meeting certain technical requirements, even if corpora of spoken usage are by no means new and coincide with the birth of recording techniques. It is therefore essential to understand what criteria must be met by a modern corpus (quality of recordings, diversity of speech situations, ethical guidelines, time-alignment with transcriptions and annotations, etc.) and what tools are available to researchers. Once these requirements are met, the way is open to varying and possibly conflicting uses of spoken corpora by phonological practitioners. A traditional stance in theoretical phonology sees the data as a degenerate version of a more abstract underlying system, but more and more researchers within various frameworks (e.g., usage-based approaches, exemplar models, stochastic Optimality Theory, sociophonetics) are constructing models that tightly bind phonological competence to language use, rely heavily on quantitative information, and attempt to account for intra-speaker and inter-speaker variation. This renders corpora essential to phonological research and not a mere adjunct to the phonological description of the languages of the world.

Keywords: corpus phonology, annotation, discreteness, generative phonology, Optimality Theory, phonetics, recording, sociolinguistics, transcription, tier, urban dialectology, usage-based grammar, variation

1. The Scope of Corpus Phonology

Corpus Phonology, as indicated by the name, is an approach to phonology that places spoken corpora at the center of phonological research. As a sub-branch of corpus linguistics it comes in two forms: a strong version that states that the study of spoken corpora should be the aim of phonology; a weaker version that stresses that corpora should occupy pride of place within the set of techniques available (for example, intuitions, psycholinguistic and neurolinguistic experimentation, laboratory phonology, the study of the acquisition of phonology or of language pathology, etc.). Whether one defends a strong or a weak version, corpora are part and parcel of the modern research environment, and their construction and exploitation has been modified by the multidisciplinary advances made within various fields.

It is first explained in section 2 what is meant by a ‘corpus’: although the term covers a variety of interrelated senses, it is now applied in linguistics mainly to bodies of data meeting certain technical requirements. It is then shown that corpora of the spoken language are by no means a new development (section 3). By examining, however briefly, the history of spoken language corpora in linguistic research, we will come across a number of debates (many of which are still alive today) concerning the nature of data relevant to phonological research. Thereafter, we examine in section 4 modern ways of building corpora, exploiting them, and maintaining them. We will rely heavily on the methods and standards defended by the contributors to the Oxford Handbook of Corpus Phonology (Durand, Gut, & Kristoffersen, 2014). However, we will also mention other types of work that, while not done under the banner of Corpus Phonology, can be considered good examples of this kind of approach. In section 5, we proceed to a conclusion and focus on the following epistemological issue: are corpora central to phonology or just an element within a large palette of techniques for exploring the phonological structures of given languages? Whether one adopts a strong or a weak version of corpus phonology, it is argued that off-the-cuff observations and intuitions should play a minimal role in phonology and that systematic data collection and the back and forth cycles between theorization, observations, and experimentation necessarily require the use of corpora.

As underlined by de Lacy [Theoretical Phonology], the term ‘phonology’ has two main meanings: a descriptive (or taxonomic) sense and a cognitive one. Descriptive phonology is concerned with the development of frameworks for analyzing the sound systems of individual languages and for establishing techniques and methods for an adequate description of spoken data, its storage, and retrieval. Theoretical phonology (at least as practiced within the generative tradition) is cognitive: it is concerned with the mental process of abstraction from physical data: how mental representations are computed from acoustic data and how these representations are linked to articulatory movements that produce speech sounds. But, in reality, these two senses of phonology are not sharply separated, and within the generative tradition, there are many differences of opinion as to the scope of phonology. For instance, some researchers posit that phonology feeds a separate phonetic module, whereas others assume that phonology is the only module responsible for the storage, production, and perception of speech sounds. For our purposes, many of the debates surrounding the nature of phonology can be ignored, but we do return to some of these issues at various points below. It should also be noted that much traditional phonology focuses on lexical generalizations, but it is arguably in the area of utterance (or post-lexical) phonology that the contribution of corpora has been most noticeable.

2. What is Meant by a ‘Corpus’?

The word ‘corpus’ from the Latin for ‘body’ appeared in the 18th century to designate a complete collection of writings. Expressions such as ‘the Shakespearian corpus’ came to be used within disciplines such as literary criticism so that, when linguists started thinking seriously about justifying their observations, the meaning of ‘corpus’ was extended to the collection of a large number of books, articles, magazines, and other relevant written documents that had been deliberately gathered. With the advent of computers, the term ‘corpus’ became restricted to collections of texts in machine-readable forms and a new field of linguistics emerged (corpus linguistics), which considers that reliable language analysis must be based on large samples of language-use in natural contexts and with minimal experimental interference. Within corpus linguistics, there are divergent views regarding corpus annotation. Some linguists such as Sinclair (1992) advocate minimal annotation and defend the idea that texts should speak for themselves; others advocate extensive annotations as the only way to achieve insights into linguistic structure. These divergent views are no longer incompatible and, as shown in section 4, many modern spoken language corpora are, in fact, complex databases that allow a body of data to be available in various formats (from sound files to aligned annotations on different representational levels or tiers).

Spoken language corpora could not be envisaged until sound recording and reproduction became technically possible. Classical sound recording technology was analog. Nowadays, digital recordings are the norm: the audio signal is stored as a series of binary digits representing samples of the amplitude of the source at equal time intervals, at a sample rate high enough to convey all audible frequencies (the sample rate must be at least twice the highest frequency to be represented). Modern spoken corpora are based on digital sound files in various formats: (i) uncompressed audio formats, such as WAV, AIFF, or AU; (ii) formats with lossless compression, such as FLAC; (iii) formats with lossy compression, such as MP3. This is a fast-moving field but one thing is certain: the conversion of older recordings to a digital format is a necessary step before a corpus can be studied by modern techniques.
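The relation between equally spaced amplitude samples and the sample rate can be illustrated with a minimal sketch. It assumes only Python's standard `wave` module; the helper names `write_tone` and `describe_wav`, the 440 Hz tone, and the 44,100 Hz rate are illustrative choices, not part of any corpus standard.

```python
import io
import math
import struct
import wave

def write_tone(buffer, freq_hz=440.0, duration_s=0.5, sample_rate=44100):
    """Write a mono 16-bit PCM sine tone into a WAV container (hypothetical helper)."""
    n_frames = int(duration_s * sample_rate)
    frames = bytearray()
    for n in range(n_frames):
        # One amplitude sample every 1/sample_rate seconds.
        value = math.sin(2 * math.pi * freq_hz * n / sample_rate)
        frames += struct.pack("<h", int(value * 32767))
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))

def describe_wav(buffer):
    """Read basic technical properties back from a WAV file (hypothetical helper)."""
    with wave.open(buffer, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            # Highest frequency the file can represent: half the sample rate.
            "nyquist_hz": rate / 2,
            "bits_per_sample": 8 * wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate,
        }

buf = io.BytesIO()   # in-memory stand-in for a file on disk
write_tone(buf)
buf.seek(0)
info = describe_wav(buf)
print(info)
```

The same `describe_wav` check could be run on an existing recording opened in binary mode, which is one simple way of verifying that a digitized file meets the technical requirements a project has set for itself.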

Speech recordings are based on speech events: each speech event (let us take as an example a one-to-one conversation) is spatio-temporally unique. It involves participants who are defined in terms of age, gender, social class, educational level, etc., and who are related in certain ways (say a boss and an employee). There is a setting (for example, the conversation is in the boss’s office), a purpose, and a topic. The recording preserves only one portion of the interaction (the sound signal) and can be treated as a replica of the speech event in that respect (leaving aside the possible loss of information at certain frequencies). A number of researchers stress that the audio signal is not sufficient and that language-use is multimodal. For such researchers, linguistics needs to consider a wider range of resources than just spoken language and must integrate visible behaviour (gesture, gaze, and the entire body) in the study of how interactions work. Some modern tools, as we shall see in section 4, allow for the treatment of visual information and catch part of the context of interaction in addition to the participants’ actions. Advocates of strongly ecological corpora (that is, corpora based on authentic interactions and not artificial situations set up by a linguist) treat this dimension as essential. We will briefly examine the relevance of these claims for phonology in section 4.

Limiting ourselves to sound recordings for the moment, which we saw are partial replicas of speech events, we will treat these as the primary data of a spoken corpus, since their ultimate source is no longer accessible. Now all modern spoken language corpora will include the primary data and additional information. Thus, Gut and Voormann (2014, p. 16) define a phonological corpus as a representative sample of language that contains:

  1. Primary data in the form of audio or video data.

  2. Secondary data in the form of phonological annotations that refer to the raw data by time information (time-alignment).

  3. Metadata about the recordings and corpus as a whole (for example, the date of the recording, the names of the participants, etc.).
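The three types of data listed above can be pictured as a simple data structure. The following is a minimal sketch, not the format of any particular annotation tool: the class names (`Interval`, `Tier`, `Recording`), the file name, the metadata fields, and the time values are all hypothetical and chosen only to show how annotations on different tiers refer back to the same recording through time information.

```python
from dataclasses import dataclass, field

@dataclass
class Interval:
    start: float  # seconds from the beginning of the recording
    end: float
    label: str

@dataclass
class Tier:
    name: str
    intervals: list = field(default_factory=list)

    def labels_between(self, start, end):
        """Return the labels of all intervals overlapping the window [start, end)."""
        return [iv.label for iv in self.intervals
                if iv.start < end and iv.end > start]

@dataclass
class Recording:
    audio_path: str                               # primary data
    tiers: dict = field(default_factory=dict)     # secondary data (time-aligned)
    metadata: dict = field(default_factory=dict)  # metadata

# Invented example values, for illustration only.
rec = Recording(
    audio_path="interview_001.wav",
    metadata={"date": "2016-01-01", "speaker_age": 54},
)
rec.tiers["words"] = Tier("words", [
    Interval(0.10, 0.30, "the"),
    Interval(0.30, 0.62, "cat"),
])
rec.tiers["phones"] = Tier("phones", [
    Interval(0.10, 0.18, "ð"), Interval(0.18, 0.30, "ə"),
    Interval(0.30, 0.38, "k"), Interval(0.38, 0.52, "æ"),
    Interval(0.52, 0.62, "t"),
])

# Time-alignment lets one tier be queried through another:
word = rec.tiers["words"].intervals[1]
phones_in_word = rec.tiers["phones"].labels_between(word.start, word.end)
print(word.label, phones_in_word)
```

Because every annotation carries start and end times, any tier can be related to any other tier and to the sound signal itself, which is what allows a transcription to be checked against its source.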

In what follows, we will adopt this terminology and separate corpus data into the three types summarized above. It should be noted that our terminological decision to require accessible audio recordings excludes from modern corpus phonology bodies of transcription constituted without a sound source that can be reexamined at will. This means that all the work done by descriptive linguists on all continents to transcribe unwritten (and often endangered) languages without the use of recorders will not be discussed here. This is not meant to disparage such work, which has been essential to the development of modern linguistics, and which paved the way for modern corpus work. Indeed, in section 3 below, we underline our debt to the past, and our brief survey of the history of spoken corpora will show how the constitution and exploitation of spoken language corpora has evolved over time and is leading to increasingly better descriptions of the phonology and phonetics of an ever expanding range of languages. We will take advantage of this section to point out the evolution of practices. For reasons of space, English will receive a great deal of attention here, but the reader is referred to Durand, Gut, and Kristoffersen (2014) for a much wider coverage and further remarks on a range of other languages.

3. Spoken Corpora: A Brief Historical Survey

While the study of written documents is an invaluable source of information in many areas of linguistics, the possibility of recording speech opened new avenues for researchers keen to understand the structure of the spoken language. Much of the early interest in the recording of the spoken language came from folklorists and dialectologists wishing to observe and preserve the speech of older speakers who were assumed to represent purer varieties of disappearing dialects. For instance, in France, in 1911 (ten years before the first radio programs) the grammarian and historical linguist Ferdinand Brunot, with the help of the engineer Emile Pathé, created the Archives de la parole (Speech archives) at the Sorbonne and, with the help of Charles Bruneau, planned a sound atlas of France. This project led to an early form of a spoken usage encyclopedia, with a strong folkloristic focus, which has been available since 1977 at the Bibliothèque Nationale de France (Cordereix, 2005; Durand, Laks, & Lyche, 2016).

In the English-speaking world, the New Zealand Broadcasting Service is well-known for having established a Mobile Disc Recording Unit, which travelled round various parts of provincial New Zealand during the period 1946–1948 (Gordon, Maclagan, & Hay, 2004; Hay, Maclagan, & Gordon, 2008). The intention was to set up a historical archive of pioneer reminiscences and stories for posterity. The speakers who were recorded were mainly elderly, although a few middle-aged informants were also included. These recordings are one of the three spoken corpora that constitute the Origins of New Zealand English corpus (ONZE). Although not all parts of rural New Zealand were covered, about 300 speakers were recorded by the Mobile Unit, singly or in groups. These recordings will serve to illustrate some of the issues that arise in constructing spoken corpora.

First, there is the issue of quality. According to Gordon, Maclagan, and Hay (2004), the quality of the mobile unit interviews (our primary data here) is generally good, although they note extraneous noises such as ticking clocks, meowing cats, or rattling tea cups. Second, the number of speakers involved in some interviews poses a problem. While many conversations are one-to-one, some discussions involve up to eight speakers, which means that it can be difficult to establish the identity of speakers and to disentangle what is being said. Third, the profile of speakers raises one of the most difficult problems in corpus construction: how can we ensure that we have a balanced sample in terms of dimensions such as gender, age, profession, or education? Even nowadays, corpora of recorded material from the radio or television present us with speakers about whom we generally have little or no information. The involvement of the speakers in these recordings is also an issue: Gordon, Maclagan, and Hay (2004) tell us that some speakers hold the floor for extended periods of time while others speak only briefly. Moreover, the interviews vary in their degree of formality as some speakers read from notes, others tell popular stories, and still others are engaged in casual conversations. The degree of relaxation also varies widely: some informants sound quite nervous at first but then relax, others do not relax at all, while some seem completely oblivious of the fact that they are being recorded. All these issues are considered crucial in the elaboration of modern corpora. The weaknesses mentioned above are seen by some as fatal to corpus studies, but trying to reconstruct the evolution of New Zealand phonology outside recordings such as these is no easy matter. Many conjectures can certainly be entertained, but the close auditory and acoustic study of a number of people from different generations, as provided by the ONZE corpus as a whole, is invaluable in historical phonology (see Trudgill, 2004 for a demonstration of this claim for New Zealand English).

For the reasons just given, many linguists came to realize that more precise ways of gathering data were required. We will examine two avenues that mutually reinforced each other: on the one hand, the constitution of large-scale general corpora; on the other, the development of sociolinguistic surveys, in particular the strand of urban dialectology (linked to the name of William Labov). Before doing so, we will make a brief mention of the work of Charles Carpenter Fries, who was one of the first linguists to use recorded data as a major source of information for systematic linguistic descriptions.

Although many linguists belonging to the so-called structuralist or descriptive era insisted on the idea that the study of a language should follow a bottom-up, inductive route, very few linguists actually carried out this program. Fries’ work with spoken corpora was novel in this respect. Thus his famous book The Structure of English (Fries, 1952) is based on approximately 50 hours of telephone conversations involving approximately 300 speakers. Once transcribed, this corpus yielded a database of 250,000 running words. The recordings were made surreptitiously, which is no longer an accepted procedure (see section 4). But the laudable intention was a desire to come as close as possible to spontaneous, unstudied linguistic interactions and not limit oneself to laboratory speech.

Fries was committed to the idea that we should work from representative samples of the language, that the studies we make should be replicable, and that the relative frequency of various features or constructions was an essential element of any linguistic analysis. Although Fries focused mainly on syntax and did not study segmental phonology, he attempted to deal with intonation and made the interesting observation that yes-no questions were more often realized with a falling intonation than a rising one. Some aspects of his methodology can be criticized (see Fries, 2010), but his work constituted a first benchmark for empirically based studies of language behavior. Unfortunately, his corpora were never made available to the community at large, which means that the verification and replication of observation cannot be carried out. This is an important issue, and many modern researchers consider that the public accessibility of corpora is central to the assessment of claims made on the basis of the data they contain.

The next most important corpus for the study of the spoken language was the Survey of English Usage (hereafter, SEU), which was launched in 1959 under the directorship of Randolph Quirk at University College London. This survey was intended to cover both written and spoken British English and provide examples of language use in a range of situations (see Figure 1). Its orthographic transcription consists of a million words, and it is made up of 200 texts of spoken and written material of 5,000 words each. A first important issue was the desire to provide a representative range of language-use. Thus the spoken samples are structured as follows:

Figure 1. Range of speech styles in the spoken half of the Survey of English Usage (SEU).

All general surveys have grappled with the issue of register distribution, and finding enough informants in comparable situations is still a major challenge for the study of natural interaction.

The next issue is the transcription of the data. The initial SEU transcriptions were made by hand on index cards, but the spoken component (nearly half a million words, with additional material from other corpora) was made electronically available by Jan Svartvik of Lund University in the 1970s, under the name London-Lund Corpus of Spoken English (see Svartvik & Quirk, 1980). The study of the spoken language has often been neglected for obvious reasons even in the electronic age. Whereas techniques for scanning written material were available from the early days of the computer revolution, automatic techniques for scanning speech and producing high quality written transcriptions have not been generally available until recently. Progress in speech technology is fast changing the picture but, at the time of writing, a high-quality automatic transcription of spontaneous speech remains a challenge (see Strik & Cucchiarini, 2014 for further discussion). This is the reason why a very large number of projects on spoken language set up in the late 20th century or the early 21st century have relied on transcriptions made by humans. Large projects now tend to opt for standard orthographic transcriptions of recordings (but often without all the punctuation symbols).

For the SEU, two types of transcriptions were devised: a full prosodic and paralinguistic transcription and a reduced one. The full transcription was the result of a fruitful collaboration between speech specialists covering all areas of linguistic structure. It provided conventions for the annotation of speech turns and overlaps between speakers, hesitations, pauses, incomplete words, etc. It also allowed for the coding of prosody, which is examined later in the article. It indicated the gulf that exists between authentic spoken language and the standard written language. One might be tempted to dismiss this type of data as reflecting performance errors that in no way impact competence, in the terminology of Chomsky (1965). But, if we study pauses, which are often viewed as a performance phenomenon, we discover that they are far from random. As pointed out by Delais and Yoo (2014, pp. 211–213), pauses are not only the reflection of breathing, but they can be related to prosodic phrasing and thus reflect syntactic and discourse features. In an early study on the topic, Goldman-Eisler (1968) had already argued that pauses reflected the planning of speech and, if we take a cognitive view of phonology, learning how phonological structure is integrated with syntactic phrasing is an important area of research (Grosjean & Deschamps, 1972, 1973, 1975).

The SEU developed a complete notation for prosody (stress, rhythm, intonation), which was integrated into the base-level transcription system. When we examine the full and the reduced transcription systems adopted within the SEU, we are clearly faced with quite complex conventions. These conventions reflect in part the assumptions adopted within what we can call the British tradition of prosody description (Crystal & Quirk, 1964; Crystal, 1969). The full and reduced transcriptions led quite naturally to simplified transcriptions that have found their way into encyclopedic grammars of English such as Quirk, Greenbaum, Leech, and Svartvik (1985, pp. 1587–1608). But even the reduced transcriptions are extremely complex. Unfortunately, the more complex an annotation system is, the more likely it is to lead to errors and inconsistencies, and the more difficult it is to explore in a mechanical way. This is why, as indicated above, many modern large-scale projects have opted for a base level that is orthographic (with no or few signs of punctuation or group division), leaving annotations such as the above to other levels or tiers of description within tools allowing for the multi-layering of annotation (see section 4).

The SEU showed the way for the constitution of very large corpora, but unfortunately, many of them, such as the Brown corpus (Francis & Kučera, 1979), have focused on the written language. One of the most famous modern corpora is the British National Corpus (BNC; Burnard & Aston, 1998). It provides a hundred-million word collection of samples of written and spoken language from a wide range of sources, designed to offer a representative cross-section of British English from the later part of the 20th century, both spoken and written. But the spoken part amounts to only 10% of the whole corpus. It consists of orthographic transcriptions of unscripted informal conversations (recorded by volunteers selected from different ages, regions, and social classes in a demographically balanced way) and spoken language collected in different contexts, ranging from formal business or government meetings to radio shows and phone-ins. A full phonetic or phonological exploitation is still awaited.

The second strand, which has contributed and still contributes to the development of corpus phonology, is the emergence of modern sociolinguistics in the wake of William Labov’s contribution to this field (inter alia Labov, 1966, 1972, 1994, 2001; Labov, Ash, & Boberg, 2006). The standard Chomskyan approach uses informal observations and intuitions as a primary method of investigation and idealizes the community through the postulation of ideal speaker-hearers. But, if we do not accept these premises, we must enter speech communities to study them. When we do so, Labov argues, we discover that variation is rife, and we observe a profound heterogeneity within usage. Formal analyses used to exclude variation by resorting to one of two strategies: (a) seeing the variants as belonging to different systems, for example, as instances of ‘dialect mixture’ or ‘code switching’; or (b) considering the variants as examples of ‘free variation.’ These claims were not usually established empirically, and classical structural or generative theories offer no way to separate cases that differ quantitatively. For example, in AAVE (Afro-American Vernacular English) word-final /ld/ clusters are often simplified by deletion of the /d/, but careful investigation shows that the final /d/ is never totally lost and deletes more often in monomorphemes such as bold than in inflected forms such as rolled (Labov, 1972, pp. 216–226). Many traditional generative accounts can only point to the fact that the /d/ is optional and have no way of indicating that the two realizations are not on an equal footing (but see section 4 on some versions of Optimality Theory [OT]).
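The kind of quantitative comparison at stake here can be sketched in a few lines. The function name `deletion_rates`, the class labels, and above all the token list are hypothetical: the tokens are invented toy values that merely show the shape of the calculation and are in no way Labov's data or measurements from any corpus.

```python
from collections import defaultdict

def deletion_rates(tokens):
    """Return the proportion of tokens with deleted final /d/ per morphological class.

    `tokens` is an iterable of (word, morph_class, deleted) tuples, where
    `deleted` is True when the final /d/ was not realized in that token.
    """
    counts = defaultdict(lambda: [0, 0])  # class -> [deleted, total]
    for _word, morph_class, deleted in tokens:
        counts[morph_class][0] += int(deleted)
        counts[morph_class][1] += 1
    return {cls: d / total for cls, (d, total) in counts.items()}

# Invented toy tokens for illustration only; NOT measurements from any corpus.
toy_tokens = [
    ("bold", "monomorphemic", True),
    ("bold", "monomorphemic", True),
    ("cold", "monomorphemic", False),
    ("old", "monomorphemic", True),
    ("rolled", "inflected", False),
    ("rolled", "inflected", True),
    ("called", "inflected", False),
    ("pulled", "inflected", False),
]
rates = deletion_rates(toy_tokens)
print(rates)
```

An account that simply marks the /d/ as optional predicts nothing about the relative size of the two proportions, whereas a quantitative description records that both variants occur in each class and that their frequencies differ.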

One of the major differences between the Labovian approach and classical generative views is that the analysis of the data is argued to require a quantitative dimension. This assumption stands in sharp contrast with the Chomskyan view that linguistic systems are based on a combinatorial device (called ‘merge’ in the latest instantiations of the Chomskyan paradigm) that is free of quantitative information. For Labov, the hypothetical-deductive model, which is generally presented as the model of scientific construction, does not rest solely on making simple and elegant hypotheses. As he puts it, “Watson’s discovery of the structure of DNA is one of the most striking cases of the role of simplicity in scientific research. Watson was convinced that the solution must be a simple one, and this conviction motivated his persistent attempt at model building (1969). But simplicity merely suggested the best approach: the validity of his model was established by the convergence of many quantitative measurements. Hafner and Presswood (1969) cite another case in the theory of weak interactions where considerations of simplicity led to a new theoretical attack; but again, as in all other cases I know, the acceptance of the theory as correct depended upon new quantitative data” (Labov, 1972, p. 202, footnote 13).

Making reliable observations while taking social parameters into account is no easy matter. Labov’s work was original and fundamental in many respects. First, it was argued that the best recording and storage techniques available must be used by linguists to obtain data of good quality (the primary data in our terminology). Second, rigorous attempts were made to calibrate the speakers in terms of social parameters such as age, gender, occupation, education, or ethnicity. Third, register was taken into account via techniques that tried to reach the vernacular defined as “the style in which the minimum attention is given to the monitoring of speech” (Labov, 1972, p. 208) and by comparing this vernacular to other styles in which some degree of self-monitoring may be involved. The most famous application of some of these principles was Labov’s study of New York City English, where he applied a series of techniques to study a wide spectrum of speakers in a range of contexts (Labov, 1966). All the techniques that were used represented an attempt to circumvent what Labov calls ‘the Observer’s paradox’: our goal is to observe the way people use language when they are not being observed, and yet we can only obtain these data by systematic observation. This entails a tension between various requirements (for example, the quality of recording imposes severe constraints on how we conduct our investigations), but the claim is that this is not impossible. Indeed the vast range of sociolinguistic corpus work done within the last 50 years is a testimony to the feasibility of the enterprise (see Labov, 1966, 1972, 1994, 2001; Sankoff, 1989, the essays in Chambers, Trudgill, & Schilling-Estes, 2002, and in Mesthrie, 2011, as well as Mallinson [Sociolinguistics]).

From the point of view of phonology, which is our main concern here, it can be objected that the focus on variables within sociolinguistics (e.g., the presence or absence of a coda [r] in New York) entails a lack of attention to phonological systems as a whole. As a result, we can only obtain an atomistic vision of phonological structure. But this criticism is somewhat unfair. For a start, the research undertaken by Labov himself has always paid attention to the system in its entirety, and this has also been true of other work that has provided the foundation of modern sociolinguistics (e.g., Trudgill, 1974). Limiting ourselves to English, a lot of descriptive work has been achieved on varieties that had been neglected or poorly described (Burridge & Kortmann, 2008; Kortmann & Upton, 2008; Mesthrie, 2008; Schneider, 2008). Moreover, because of the attention to change in progress, crucial areas such as chain shifts and mergers have been explored. One famous example is the Northern Cities Shift (NCS), which was first described by Labov, Yaeger, and Steiner (1972) and explored again in Labov (1994), Gordon (2001, 2002), Labov et al. (2006). The study of such shifts raises profound questions for phonological theory concerning distinctiveness, mergers, near-mergers, and the inter-relationship of subchanges. It cannot be achieved without considering systems as a whole. This type of research has also raised important questions on the psychological status of phonemic contrasts. The normal expectation is that perception and production work in unison. However, Labov, Yaeger, and Steiner (1972) opened up a new avenue in discussing speakers who identified two words as the same despite making a contrast between them in their production. The proposal has met with resistance in some quarters as emphasized by Labov (1994, pp. 349–370) and yet there has been mounting evidence in favor of these cases of “near” or “apparent” mergers as they are called (Gordon, 2002, pp. 248–253). 
This is important in helping us understand changes. When a new norm merging two sounds appears, speakers might first (subconsciously) accept the idea that the two sounds should be merged, but their production may lag behind and catch up with the merger later. We might also have a situation where two systems co-exist, and the system with the opposition might make a comeback giving the impression of undoing a historical change. As Gordon puts it: “Speakers appear to be able somehow to hear subtle phonetic differences well enough to reproduce them but without enough conscious attention to know that they are actually hearing them.” (2002, p. 251).

These observations and interpretations would not have been possible without the commitment of Labov and many sociolinguists to the need for corpora and for acoustic analyses of the data they were studying. In the recent literature, descriptions have been based on automatic or semi-automatic spectrographic analyses (e.g., formant analyses for vowels). There has been a convergence, therefore, between sociolinguists and phoneticians, leading to a better understanding of phonological systems. There has also been a convergence between sociolinguistics and dialectology, which we return to in section 4.

4. Current Advances in Corpus Phonology

It should be clear that no single corpus (in the sense of a collection of recordings made with a single method) is sufficient to provide data rich enough to allow us to explore the systematic phonological and phonetic features of a given variety. But if we consider that an ideal modern corpus is in fact a large database containing different types of subcorpora, we can establish a number of requirements that should ideally be met (see Gut & Voorman, 2014). Most of these requirements have been alluded to before, but they will be expanded here.

The primary data should be of good quality, which means good enough to allow for acoustic analysis (see section 2 on audio formats). Ideally a subset should be in video form to allow for a study of paralinguistic signals such as gestures and postures (Birch, 2014). Multimodality is an important dimension of modern research. Modern tools such as ANVIL (Kipp, 2014), ELAN (Sloetjes, 2014) or EXMARaLDA (Schmidt & Wörner, 2014) are multimedia annotation programs that allow for the integration of speech and other signals. Note in this connection that sign languages, which are now widely agreed to possess a phonology (Brentari, 2012), can be treated via such systems [Sign Language Phonology].

  1. Recordings from radio, television, films, plays or the internet can yield vast amounts of spoken data useful for qualitative and quantitative analysis and are therefore highly relevant for corpus phonology. However, the lack of control over the social identity of speakers requires that other methods also be used. An ideal modern database will include recordings of speakers who are socially defined (age, gender, educational qualifications, socioeconomic status, etc.).

  2. Recordings must follow ethical guidelines. There should be no underhand recording. As pointed out by Birch (2014, p. 41), if contentious material has been recorded, a good solution is to replay the recording to the speaker(s) involved and seek approval to use the data.

  3. Recordings must be as natural as possible and varied from the point of view of register. One popular method is to follow the classical Labovian method of data collection, which involves reading tasks besides conversations ranging from semi-directed to free interactions. Another method is to follow the route opened by the Survey of English Usage (see section 3) and aim at a wide range of different situations. Recent work in sociolinguistics advocates the need to construct ‘ecological’ corpora of naturally occurring speech, but obtaining truly ecological data is extremely difficult: the quality of the recordings in certain settings can render phonetic treatment impossible (for example, background noise in a shop, a restaurant, or a factory); transcribing speech accurately is very difficult when more than two or three people are involved; and finally the data comprising ecological surveys often lack true comparability (but see Groupe ICOR, 2013; and Mondada & Traverso, 2016 for a defence of this approach). In the description of a language variety, there is ultimately a trade-off between spontaneously generated speech (taken to be the gold standard in this article) and elicited samples. Given that no corpus, however large, contains all examples that may be crucial for phonological theory, many researchers are convinced that some elicitation techniques (such as the reading of sentences) are required in addition to recordings of spontaneous speech.

  4. The sound files must be accompanied by secondary data in the form of phonological annotations that refer to the raw data by time information (time-alignment). A number of tools, systems, or methods that are described in Durand, Gut, and Kristoffersen (2014) have become standard in the field and are used in a large number of research projects. A well-known example is that of Praat, devised by Paul Boersma and David Weenink at the University of Amsterdam. Praat is a computer program for analyzing, synthesizing, and manipulating speech and other sounds, and for creating publication-quality graphics (see Boersma, 2014; Brinckmann, 2014). A speech corpus typically consists of a set of sound files, each of which is paired with an annotation file (Delais-Roussarie & Post, 2014). Modern annotation files can be multilayered (for example, the tiers in Praat), which allows for simultaneous annotations of the same speech portion (for example, a segmental tier, a syllable tier, and an intonation tier). Many large-scale projects aim at having both an orthographic transcription of the recordings and a broad phonemic transcription. A great deal of current research is going into automatic recognition tools that yield data that can form the basis of further analyses (Strik & Cucchiarini, 2014).

  5. The sound files and secondary data must also be accompanied by metadata giving information on parameters such as time of recording and identity of the participants. One area that has allowed progress in the construction of corpora has been the convergence on formats for the representation of both the secondary data and the metadata (Broeder & van Uytvanck, 2014; Romary & Witt, 2014). One well-known example is the TEI (Text Encoding Initiative), which provides guidelines for the transcription of spoken language (Broeder & van Uytvanck, 2014; Romary & Witt, 2014; Sperberg-McQueen & Burnard, 1994). Common formats are necessary for archiving and disseminating corpora (Tchobanov, 2014; Wittenburg, Trilsbeek, & Wittenburg, 2014).

If the above steps are scrupulously followed, the corpus will be designed in such a way that a variety of research tools can be applied to the primary data, the secondary data, and the metadata. For instance, a good spoken corpus should allow for the integration of a morphosyntactic parser. This is necessary if we do not wish to limit phonology to the treatment of the citation forms of individual words.
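
The pairing of primary data, time-aligned multilayered annotations, and metadata described in the requirements above can be pictured with a minimal sketch. It is an illustration only: the class names, tier names, file name, labels, and timings are all invented for the example, and the structure is not the file format of Praat or of any other tool mentioned here.

```python
from dataclasses import dataclass, field


@dataclass
class Interval:
    start: float  # seconds from the start of the sound file
    end: float
    label: str


@dataclass
class Tier:
    name: str
    intervals: list = field(default_factory=list)

    def overlapping(self, start, end):
        """Return the intervals of this tier that overlap the span [start, end)."""
        return [iv for iv in self.intervals if iv.start < end and iv.end > start]


@dataclass
class AnnotatedRecording:
    sound_file: str  # primary data
    metadata: dict   # e.g., speaker, task, date of recording
    tiers: dict = field(default_factory=dict)  # secondary data, keyed by tier name

    def add_tier(self, tier):
        self.tiers[tier.name] = tier

    def labels_within(self, outer_tier, outer_label, inner_tier):
        """Labels on inner_tier that are time-aligned with outer_label on outer_tier."""
        result = []
        for outer in self.tiers[outer_tier].intervals:
            if outer.label == outer_label:
                hits = self.tiers[inner_tier].overlapping(outer.start, outer.end)
                result.extend(iv.label for iv in hits)
        return result


# Invented example: file name, metadata, labels, and times are hypothetical.
rec = AnnotatedRecording(
    sound_file="speaker01_reading.wav",
    metadata={"speaker": "S01", "task": "reading", "year": 2010},
)
rec.add_tier(Tier("words", [Interval(0.00, 0.30, "cat"), Interval(0.30, 0.65, "nap")]))
rec.add_tier(Tier("segments", [
    Interval(0.00, 0.08, "k"), Interval(0.08, 0.22, "æ"), Interval(0.22, 0.30, "t"),
    Interval(0.30, 0.38, "n"), Interval(0.38, 0.55, "æ"), Interval(0.55, 0.65, "p"),
]))

# Segments aligned with the word "cat" on the word tier.
print(rec.labels_within("words", "cat", "segments"))
```

Because every label carries its own start and end time, a query on one tier (a word) can always be traced back both to other tiers (its segments) and to the corresponding stretch of the recording.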

A number of corpora have been built that attempt to take into account the requirements outlined above. One of the most successful applications of a corpus approach has been in the areas of language acquisition and language impairment. In the area of language acquisition, for instance, the PhonBank corpus provides a unique resource that is leading to a systematic analysis of the patterns present in first language acquisition (e.g., Rose, 2014; Rose & Inkelas, 2011; Tsay, 2014) [Child Phonology]. For some time now, the use of corpora has been extended to the phonological acquisition of a second language (e.g., Detey, Kondo, Racine, & Kawaguchi, 2014; Gut, 2009, 2014a,b). We have even reached a point where new initiatives allow for systematic data sharing between existing corpora. One example is TalkBank, which is a large database including ten subcomponents: AphasiaBank, BilingBank, CABank, CHILDES, ClassBank, DementiaBank, GestureBank, PhonBank, Tutoring, and TBIBank (MacWhinney, Bird, Cieri, & Martell, 2004; Rose & MacWhinney, 2014). All of the TalkBank corpora have in common the fact that they use the CHAT data transcription format, which enables a thorough analysis with the CLAN programs (Computerized Language ANalysis).

But the use of corpora in phonology has also been extended to the more open data of ‘normal’ speech. In the United States, a number of high quality speech corpora meeting the desiderata set up in this paper have been and are being developed: see for example the Buckeye corpus (Pitt et al., 2007) and the Audio-Aligned and Parsed Corpus of Appalachian English (Tortora, Santorini, Blanchette, & Diertani, 2016). The sociolinguistic work reported in section 3 has led to the construction of further corpora, which have been important for both synchronic and diachronic analysis of a number of languages. One good example for British English is the Diachronic Electronic Corpus of Tyneside English (DECTE; Beal, Corrigan, Mearns, & Moisl, 2014), which was built between 2000 and 2005 and is an extension of the Newcastle Electronic Corpus of Tyneside English (NECTE). NECTE is what is called a legacy corpus based on data collected for two sociolinguistic surveys conducted on Tyneside, northeast England, in 1969–1971 and in 1994. The authors have been pioneers in the construction of a unique electronic corpus of vernacular English, which is aligned, tagged for parts of speech, and fully compliant with international standards for encoding text, and the continuing work on the subcorpora now included within DECTE is of interest to all projects having to deal with recordings and metadata stretching back in time. A similar example is the corpus work done at the LANCHART Centre of the Danish National Research Foundation (Gregersen, Maegaard, & Pharao, 2014). The Centre has performed re-recordings of a number of informants from earlier studies of Danish speech, thus making it possible to study variation and change in real time.

In French, the PFC program (Phonologie du français contemporain: usages, variation, et structure) has led to the recording and analysis of the speech of 450 informants using a classical Labovian method (reading aloud of a word-list and a passage, guided conversation, and free conversation). Classical areas of French phonology (segmental phonology, schwa, liaison) have been extensively studied, and analyses have been provided within frameworks ranging from Optimality Theory to usage-based accounts (Durand, 2014; Durand, Laks, Calderone, & Tchobanov, 2011; Durand, Laks, & Lyche, 2014; Durand, Laks, & Lyche, 2016; Durand & Lyche, 2008; Gess, Lyche, & Meisenburg, 2012; Laks, Calderone, & Celata, 2014). In Belgium, French has been thoroughly explored at all levels thanks to the construction of a large complex speech database at the VALIBEL centre in the Université Catholique de Louvain (Simon, Francard, & Hambye, 2014).

Much of the research going on in large cities under the banner of urban dialectology (Labov, 1966) initiated a convergence between phonology, sociolinguistics, and dialectology. This has led to new developments in the qualitative and quantitative exploration of dialectological data (Hagen & Simonsen, 2014; Kristoffersen & Simonsen, 2014; Labov et al., 2006; van Oostendorp, 2014; Wieling, 2012).

For the sake of completeness, we should also bear in mind various efforts surrounding endangered language documentation (e.g., DiCanio, Nam, Whalen, Bunnell, Amith, & Garcia, 2013) and note that national research bodies in many countries are now supporting the creation of large corpora including the spoken dimension: for one example in Asia, the reader is referred to the work commissioned by the National Institute for Japanese Language and Linguistics.

One area of phonology that has benefited enormously from the use of corpora has been the study of prosody (intonation, stress, rhythm, tempo), which we saw in section 3 was a primary object of investigation within corpora like the Survey of English Usage. Two notable examples in English have been the IViE corpus (Nolan & Post, 2014) and the Australian Map Task Corpus (Fletcher & Stirling, 2014). But other languages are receiving a comparable amount of attention (for example, in French see Lacheret et al., 2014; Lacheret & Simon, 2014; Lacheret, Simon, Goldman, & Avanzi, 2013).

There is no single theoretical approach to the analysis of the corpus data gathered within various research projects, but all stress that the quantitative dimension is paramount. If we wish to take variation into account (as underlined by sociolinguists and dialectologists, Laks, 2013), and if we also take on board the fact that many phenomena are non-discrete (as stressed by recent work in the tradition of Laboratory Phonology and by Cohn, 2010; Cohn, Fougeron, & Huffman, 2012; Pierrehumbert, Beckman, & Ladd, 2000), we need at the very least to make full use of statistical methods for exploiting large data sets (see Moisl, 2014, for extensive references, as well as John & Bombien, 2014, for a powerful toolbox). Many researchers advocating the systematic use of corpora insist that phonological competence is much closer to surface forms than often assumed in the generative tradition initiated by Chomsky and Halle (1968). In the latter, the raw data is often treated as degenerate, whereas researchers working, for instance, within usage-based frameworks (Bybee, 2001, 2007, 2010; Laks et al., 2014) emphasize the richness of data in real interactions between humans. One claim often made is that usage-based grammar coupled with exemplar models (Goldinger, 1996; Pierrehumbert, 2001) allows for a more direct representation of variation and change in cognitive systems. Exemplars, it is argued, help us track the lexical diffusion of a sound change and, in particular, provide a direct representation of the effect that frequency has on reductive sound change (Bybee & Torres Cacoullos, 2008).

In a similar but more formal vein, there are linguists who are exploring the possibility of using corpora to mechanically extract phonological generalizations. For instance, Goldsmith and Xanthos (2009) present algorithmic methods that allow them to examine the following questions: (a) Given a sample of data (transcribed symbolically) from a language, is it possible to infer which segments are vowels and which are consonants? (b) Is it possible to infer whether the language in question is characterized by processes of vowel harmony, and, if so, what patterns can we observe? (c) Can we draw inferences about the integration of segments into syllabic constituents? If we apply the suggested methods, it is argued that we can reconcile two traditions of research: the examination of phonotactics much explored within so-called ‘structural linguistics’ and the examination of alternations that has been characteristic of generative phonology (see too Goldsmith & Riggle, 2012). This type of work has been pursued in various quarters, and much of the ongoing research is an attempt to show how generalizations from large data sets can emerge on the basis of a small number of a priori assumptions about phonological structure and mental representations.
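
To give a concrete flavor of question (a), the sketch below implements Sukhotin's algorithm, a classic heuristic for separating vowels from consonants in symbolically transcribed data. It is offered purely as an illustration of the kind of mechanical inference at issue, not as the method of Goldsmith and Xanthos (2009). The heuristic assumes that vowels and consonants tend to alternate, so the symbol with the most neighbors among the remaining candidates is promoted to vowel at each step. The function name and the four toy word forms are invented for the example.

```python
from collections import defaultdict


def sukhotin(words):
    """Guess which symbols are vowels from adjacency counts alone."""
    adjacency = defaultdict(int)
    symbols = set()
    for word in words:
        symbols.update(word)
        for a, b in zip(word, word[1:]):
            if a != b:  # the diagonal of the adjacency matrix stays at zero
                adjacency[(a, b)] += 1
                adjacency[(b, a)] += 1

    # Every symbol starts out as a consonant, scored by its total adjacency count.
    sums = {s: sum(adjacency[(s, t)] for t in symbols) for s in symbols}
    vowels = set()
    while True:
        remaining = {s: v for s, v in sums.items() if s not in vowels}
        if not remaining:
            break
        # Sorting first makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=remaining.get)
        if remaining[best] <= 0:
            break
        vowels.add(best)
        # Neighbors of the new vowel become less likely to be vowels themselves.
        for s in remaining:
            if s != best:
                sums[s] -= 2 * adjacency[(s, best)]
    return vowels


# Invented toy lexicon with strict consonant-vowel alternation.
toy_words = ["pata", "kita", "tupi", "maku"]
print(sorted(sukhotin(toy_words)))
```

On real data, clusters and diphthongs blur the alternation that the heuristic relies on, which is one reason more sophisticated statistical methods are explored in this line of research.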

Within Optimality Theory (McCarthy & Prince, 1993; Prince & Smolensky, 2004), a number of researchers have attempted to make the model responsive to variation and quantitative information (see Coetzee & Pater, 2011, for an overview). Particularly interesting are approaches that posit a single grammatical system for each speaker, but where one assumes either floating constraints or partially ordered constraint rankings where the order is set only for a specific derivation (e.g., Anttila, 1997, 2007, 2012). Similar ideas can be found within the trend of Stochastic Optimality Theory (Boersma, 1998; Boersma & Hayes, 2001). For instance, constraints can be assigned a numerical index on a scale, but they have a particular range of movement along the scale. Thus, for any individual derivation, the exact location of the constraint on the scale varies and, as a result, the ranking of pairs of constraints will not always be the same (see Boersma & Hayes, 2001; Pater, 2009 for an exploration of weighted constraints). Here too, researchers are attempting to model the integration of quantitative data from various types of corpora while minimizing assumptions about the types of mechanisms required (Albright & Hayes, 2011; Hayes, 1999; Hayes & White, 2013; Hayes, Wilson, & Shisko, 2012).
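
The idea that a constraint's position on the scale varies from one derivation to the next can be made concrete with a small simulation. The sketch below covers only the noisy evaluation step: each constraint's ranking value is perturbed by Gaussian noise, the constraints are ranked by their perturbed values, and the candidate with the best violation profile under that ranking wins. It does not implement any learning algorithm. The constraint names, ranking values, candidates, and noise setting are hypothetical and chosen only to illustrate variable final-consonant deletion.

```python
import random


def evaluate(candidates, ranking_values, noise_sd=2.0, rng=None):
    """Pick a winner for one derivation under noisy constraint ranking.

    candidates: dict mapping each candidate to a dict {constraint: violations}.
    ranking_values: dict mapping each constraint to its value on the scale.
    """
    rng = rng or random
    # Perturb every ranking value independently for this derivation.
    noisy = {c: v + rng.gauss(0.0, noise_sd) for c, v in ranking_values.items()}
    # Higher perturbed value means higher rank in this derivation.
    order = sorted(noisy, key=noisy.get, reverse=True)

    def profile(candidate):
        # Violations listed from highest-ranked to lowest-ranked constraint.
        return [candidates[candidate].get(c, 0) for c in order]

    # Lexicographic comparison of profiles mirrors strict domination.
    return min(candidates, key=profile)


# Hypothetical grammar: faithfulness slightly outranks the markedness constraint.
ranking_values = {"Max": 100.0, "*ComplexCoda": 98.0}
candidates = {
    "west": {"*ComplexCoda": 1},  # keeps the final cluster
    "wes": {"Max": 1},            # deletes the final consonant
}

rng = random.Random(0)  # fixed seed so that the simulation is repeatable
outcomes = [evaluate(candidates, ranking_values, rng=rng) for _ in range(2000)]
share = outcomes.count("west") / len(outcomes)
print(round(share, 2))
```

With ranking values two units apart and a noise standard deviation of two, the retained form should win in roughly three-quarters of derivations under these assumed settings; moving the two values closer together or further apart changes the predicted rates, which is how such a grammar can be fitted to corpus frequencies.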

The dust has not settled yet on the research reported here. Some researchers are resolutely empiricist (for instance, Chater, Clark, Goldsmith, & Perfors, 2015); others adhere to two assumptions classically made within generative grammar. First, the principles and parameters defining phonological systems are not mechanically extractable from data sets however large they are and however complex the techniques involved. Second, much observable variation arises from the interaction of phonology with other modules such as a phonetic component, a paralinguistic component, or a social component (see de Lacy, 2009). One difficulty is the fact that much of the ongoing work, alluded to above, is not based on primary data (recordings) but on secondary data (symbolic transcriptions) with often no possible return to the primary data. It therefore becomes impossible to implement the iterated cycles of observation, description, and theorization, which are the hallmark of a corpus-based approach as presented here. It is nevertheless probable that there will be a convergence in the years ahead between different styles of approach, and that many more theoretical assumptions will be put to the acid test of real spoken corpora.

5. Further Reflections

The use of corpora in phonological and phonetic analysis will continue to gain ground over the next decades as reflected in the large number of conferences and workshops organized around this theme. As more and more spoken corpora become available, and as tools for the automatic treatment of language become more and more efficient, recourse to corpora will become a natural way of proceeding (Nguyen & Adda-Decker, 2013). Some of the research projects referred to in sections 3 and 4, and particularly in sociophonology and sociophonetics, are based on the assumption that investigating a corpus is the optimal way of conducting phonological research (for a trenchant defense of this point of view, see Laks, 2008). On the other hand, a great deal of work within generative phonology assumes, as argued by Scheer (2013), that spoken corpora are merely one source of information among many others (psycholinguistic and neurolinguistic experimentation, laboratory phonetics, the study of language acquisition and language impairment, the use of pronouncing dictionaries, off-the-cuff observations, intuitions, etc.).

It is difficult to deny that phonology does benefit from a multifaceted approach. For a start, since spoken corpora only give access to the acoustic output of speech events, by definition they exclude the production process. The production process can be investigated at various levels (from neural activity to articulation) that can give crucial information about the construction of speech. Work within laboratory phonology suggests that some patterns (for example, the deletion of consonants in certain clusters) may be better explained in terms of articulatory maneuvers than in auditory terms. Thus, the apparent deletion of [t]/[d] in American English could be due to mistiming in articulatory execution (Browman & Goldstein, 1990). More generally, important hypotheses concerning the functioning of speech production and perception would not have been tested if researchers had limited themselves to the study of corpora. One famous example is the McGurk effect, which demonstrates an interaction between hearing and vision in speech perception. The illusion occurs when the auditory component of one sound is paired with the visual component of another sound, which leads to the perception of a third sound. Since the discovery of this effect, reported in McGurk and MacDonald (1976), much experimental work has examined it in relation to brain damage, aphasia, dyslexia, specific language impairment, autism, or Alzheimer’s disease. Our understanding of the psychological and neurological foundation for speech depends on this kind of testing (Astésano & Jucla, 2015; Kawahara, 2011; Solé, Beddor, & Ohala, 2007).

Phonologists who advocate a corpus standpoint would not deny the complementarity of approaches. They would point out, however, that what is at stake in many issues dividing linguists is the reliability of the data. Indeed, it is argued by some that many debates in modern phonology have been based on data that are second-hand, sometimes inherited from the prescriptive tradition, or even spurious (see e.g., Durand, 2009; Laks, 2008; Morin, 1987, on French). By firming up the empirical basis of phonology, by allowing interpretations that are not purely qualitative but also quantitative, by permitting the exploration of variation and of gradient phenomena, by allowing a link-up between speech-events and their social setting, corpus phonology entails a reorientation of phonology. Time will tell whether it represents a new paradigm, or whether it will just prove to be an additional instrument in the toolkit of the modern phonologist.

Further Reading

Delais-Roussarie, E., & Yoo, H. (2014). Corpus research in phonetics and phonology: Methodological considerations. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 193–213). Oxford: Oxford University Press.

Durand, J. (2009). On the scope of linguistics: Data, intuitions, corpora. In Y. Kawaguchi, M. Minegishi, & J. Durand (Eds.), Corpus and variation in linguistic description and language education (pp. 25–52). Amsterdam: John Benjamins.

Durand, J., Gut, U., & Kristoffersen, G. (Eds.). (2014). The Oxford handbook of corpus phonology. Oxford: Oxford University Press.

Gut, U. (2009). Non-native speech: A corpus-based analysis of phonological and phonetic properties of L2 English and German. Vienna: Peter Lang.

Gut, U., & Voorman, H. (2014). Corpus design. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 13–26). Oxford: Oxford University Press.

Moisl, H. (2014). Statistical corpus exploitation. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 110–132). Oxford: Oxford University Press.

Rose, Y. (2014). Corpus-based investigations of child phonological developments: Formal and practical considerations. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 265–285). Oxford: Oxford University Press.

References

Albright, A., & Hayes, B. (2011). Learning and learnability in phonology. In J. Goldsmith, J. Riggle, & A. Yu (Eds.), Handbook of phonological theory (pp. 661–690). Chichester, U.K.: Blackwell.

Anttila, A. (1997). Deriving variation from grammar: A study of Finnish genitives. In F. Hinskens, R. van Hout, & L. Wetzels (Eds.), Variation, change and phonological theory (pp. 35–68). New York: John Benjamins.

Anttila, A. (2007). Variation and optionality. In P. de Lacy (Ed.), The Cambridge handbook of phonology (pp. 519–536). Cambridge, U.K.: Cambridge University Press.

Anttila, A. (2012). Modeling phonological variation. In A. C. Cohn, C. Fougeron, & M. Huffman (Eds.), The Oxford handbook of laboratory phonology (pp. 76–91). Oxford: Oxford University Press.

Astésano, C., & Jucla, M. (Eds.). (2015). Neuropsycholinguistic perspectives on language cognition (pp. 15–30). London: Psychology Press.

Beal, J. C., Corrigan, K. P., Mearns, A. J., & Moisl, H. (2014). The Diachronic Electronic Corpus of Tyneside English: Annotation practices and dissemination strategies. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 517–533). Oxford: Oxford University Press.

Birch, B. (2014). Data collection. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 27–45). Oxford: Oxford University Press.

Boersma, P. (1998). Functional phonology. (Doctoral diss.), University of Amsterdam.

Boersma, P. (2014). The use of Praat in corpus research. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 342–360). Oxford: Oxford University Press.

Boersma, P., & Hayes, B. (2001). Empirical tests of the gradual learning algorithm. Linguistic Inquiry, 32(1), 45–86.

Brentari, D. (2012). Phonology. In R. Pfau, M. Steinbach, & B. Woll (Eds.), Sign language: An international handbook (pp. 21–54). Berlin: Mouton De Gruyter.

Brinckmann, C. (2014). Praat scripting. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 361–379). Oxford: Oxford University Press.

Broeder, D., & van Uytvanck, D. (2014). Metadata formats. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 150–165). Oxford: Oxford University Press.

Browman, C. P., & Goldstein, L. (1990). Tiers in articulatory phonology, with some implications for casual speech. In J. Kingston & M. E. Beckman (Eds.), Papers in laboratory phonology I: Between the grammar and physics of speech (pp. 341–376). Cambridge, U.K.: Cambridge University Press.

Burnard, L., & Aston, G. (1998). The BNC handbook: Exploring the British National Corpus. Edinburgh: Edinburgh University Press.

Burridge, K., & Kortmann, B. (Eds.). (2008). Varieties of English 3. The Pacific and Australasia. Berlin/New York: Mouton de Gruyter.

Bybee, J. (2001). Phonology and language use. Cambridge, U.K.: Cambridge University Press.

Bybee, J. (2007). Frequency of use and the organization of language. Oxford: Oxford University Press.

Bybee, J. (2010). Language, usage, and cognition. Cambridge, U.K.: Cambridge University Press.

Bybee, J., & Torres Cacoullos, R. (2008). Phonological and grammatical variation in exemplar models. Studies in Hispanic and Lusophone linguistics, 1(2), 399–413.

Chambers, J. K., Trudgill, P., & Schilling-Estes, N. (Eds.). (2002). Language variation and change. Oxford: Blackwell.

Chater, N., Clark, A., Goldsmith, J., & Perfors, A. (2015). Empiricist approaches to language learning. Oxford: Oxford University Press.

Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.

Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York: Harper & Row.

Coetzee, A. W., & Pater, J. (2011). The place of variation in phonological theory. In J. Goldsmith, J. Riggle, & A. Yu (Eds.), Handbook of phonological theory (pp. 405–434). Chichester, U.K.: Blackwell/Wiley.

Cohn, A. C. (2010). Laboratory phonology: Past successes and current questions, challenges, and goals. In C. Fougeron, B. Kühnert, M. D’Imperio, & N. Vallée (Eds.), Laboratory phonology 10 (pp. 3–30). New York: Mouton de Gruyter.

Cohn, A. C., Fougeron, C., & Huffman, M. (Eds.). (2012). The Oxford handbook of laboratory phonology. Oxford: Oxford University Press.

Cordereix, P. (2005). Les fonds sonores du département de l’Audiovisuel de la Bibliothèque nationale de France. Le temps des médias, 5/2(5), 253–264.

Crystal, D. (1969). Prosodic systems and intonation in English. Cambridge, U.K.: Cambridge University Press.

Crystal, D., & Quirk, R. (1964). Systems of prosodic and paralinguistic features in English. The Hague: Mouton.

DiCanio, C., Nam, H., Whalen, D. H., Bunnell, H. T., Amith, J. D., & Garcia, R. C. (2013). Using automatic alignment to analyze endangered language data: Testing the viability of untrained alignment. Journal of the Acoustical Society of America, 134(3), 2235–2246.

de Lacy, P. (2009). Phonological evidence. In S. Parker (Ed.), Phonological argumentation: Essays on evidence and motivation (pp. 43–78). London: Equinox.

Delais-Roussarie, E., & Post, B. (2014). Corpus annotation: Methodology and transcription systems. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 46–88). Oxford: Oxford University Press.

Delais-Roussarie, E., & Yoo, H. (2014). Corpus research in phonetics and phonology: Methodological considerations. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 193–213). Oxford: Oxford University Press.

Detey, S., Kondo, M., Racine, I., & Kawaguchi, Y. (2014). A preliminary investigation of /CC/ clusters acquisition by Japanese learners of French using oral corpora: Methodological insights. In S. Ishikawa (Ed.), Learner corpus studies in Asia and the world (Vol. 2; pp. 215–225). Kobe, Japan: Kobe University Press.

Durand, J. (2009). On the scope of linguistics: Data, intuitions, corpora. In Y. Kawaguchi, M. Minegishi, & J. Durand (Eds.), Corpus and variation in linguistic description and language education (pp. 25–52). Amsterdam: John Benjamins.

Durand, J. (2014). Corpora, variation and French liaison. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 240–264). Oxford: Oxford University Press.

Durand, J., Gut, U., & Kristoffersen, G. (Eds.). (2014). The Oxford handbook of corpus phonology. Oxford: Oxford University Press.

Durand, J., Laks, B., Calderone, B., & Tchobanov, A. (2011). Que savons-nous de la liaison aujourd’hui? Langue Française, 169, 103–126.

Durand, J., Laks, B., & Lyche, C. (2014). French phonology from a corpus perspective: The PFC programme. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 486–497). Oxford: Oxford University Press.

Durand, J., Laks, B., & Lyche, C. (2016). Variation and corpora: Concepts and methods. In S. Detey, J. Durand, B. Laks, & C. Lyche (Eds.), Varieties of French (pp. 29–37). Oxford: Oxford University Press.

Durand, J., & Lyche, C. (2008). French liaison in the light of corpus data. Journal of French Language Studies, 18(1), 33–66.

Fletcher, J., & Stirling, L. (2014). Prosody and discourse in the Australian Map Task. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 562–575). Oxford: Oxford University Press.

Francis, W. N., & Kučera, H. (1979). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Providence, RI: Department of Linguistics, Brown University. Originally published in 1964, revised in 1971, revised and augmented in 1979.

Fries, C. C. (1952). The structure of English. New York: Harcourt Brace.

Fries, P. H. (2010). Linguistics and corpus linguistics. ICAME Journal, 34, 89–119.

Gess, R., Lyche, C., & Meisenburg, T. (Eds.). (2012). Phonological variation in French: Illustrations from three continents. Amsterdam: John Benjamins.

Goldinger, S. D. (1996). Word and voices: Episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology, 22, 1166–1183.Find this resource:

Goldman-Eisler, F. (1968). Psycholinguistics: Experiments in spontaneous speech. London: Academic Press.

Goldsmith, J., & Xanthos, A. (2009). Learning phonological categories. Language, 85(1), 1–35.

Goldsmith, J., & Riggle, J. (2012). Information theoretic approaches to phonology: The case of Finnish vowel harmony. Natural Language and Linguistic Theory, 30(3), 859–896.

Gordon, E., Maclagan, M., & Hay, J. (2004). The ONZE corpus. In J. Beal, K. Corrigan, & H. Moisl (Eds.), Models and methods in the handling of unconventional digital corpora, Vol. 2, Diachronic Corpora (pp. 82–104). Houndmills, U.K.: Palgrave Macmillan.

Gordon, M. J. (2001). Small-town values, big-city vowels: A study of the northern cities shift. Publications of the American Dialect Society 84, Durham, NC: Duke University Press.

Gordon, M. J. (2002). Investigating chain shifts and mergers. In J. K. Chambers, P. Trudgill, & N. Schilling-Estes (Eds.), The handbook of language variation and change (pp. 244–266). Oxford: Blackwell Publishing.

Gregersen, F., Maegaard, M., & Pharao, N. (2014). The LANCHART corpus. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 534–545). Oxford: Oxford University Press.

Grosjean, F., & Deschamps, A. (1972). Analyse des variables temporelles du français spontané. Phonetica, 26(3), 129–157.

Grosjean, F., & Deschamps, A. (1973). Analyse des variables temporelles du français spontané II. Comparaison du français oral dans la description avec l’anglais (description) et avec le français (interview radiophonique). Phonetica, 28(3–4), 191–226.

Grosjean, F., & Deschamps, A. (1975). Analyse contrastive des variables temporelles de l’anglais et du français: Vitesse de parole et variables composantes, phénomènes d’hésitation. Phonetica, 31, 144–184.

Groupe ICOR (H. Baldauf-Quilliatre, S. Bruxelles, S. Diao-Klaeger, E. Jouin-Chardon, S. Teston-Bonnard, V. Traverso). (2013). Oh là là: the contribution of the multimodal database CLAPI to the analysis of spoken French. In H. Tyne, V. André, A. Boulton, C. Benzitoun, & Y. Greub (Eds.), Ecological and data-driven perspectives in French language studies. Newcastle upon Tyne, U.K.: Cambridge Scholars.

Gut, U. (2009). Non-native speech: A corpus-based analysis of phonological and phonetic properties of L2 English and German. Vienna: Peter Lang.

Gut, U. (2014a). Corpus phonology and second language acquisition. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 286–301). Oxford: Oxford University Press.

Gut, U. (2014b). The LeaP corpus. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 166–190). Oxford: Oxford University Press.

Gut, U., & Voorman, H. (2014). Corpus design. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 13–26). Oxford: Oxford University Press.

Hafner, E. M., & Presswood, S. (1969). Strong inference and weak interactions. Science, 149, 503–509.

Hagen, K., & Gram Simonsen, H. (2014). Two Norwegian speech corpora: NoTa-Oslo and Taus. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 498–508). Oxford: Oxford University Press.

Hay, J., Maclagan, M., & Gordon, E. (2008). New Zealand English. Edinburgh: Edinburgh University Press.

Hayes, B. (1999). Phonetically-driven phonology: The role of optimality theory and inductive grounding. In M. Darnell, E. A. Moravcsik, M. Noonan, F. Newmeyer, & K. Wheatley (Eds.), Functionalism and formalism in linguistics (Vol. 1; pp. 243–285). Amsterdam: John Benjamins.

Hayes, B., & Albright, A. (2011). Learning and learnability in phonology. In J. Goldsmith, J. Riggle, & A. Yu (Eds.), Handbook of phonological theory (pp. 661–690). Chichester, U.K.: Blackwell/Wiley.

Hayes, B., & White, J. (2013). Phonological naturalness and phonotactic learning. Linguistic Inquiry, 44, 45–75.

Hayes, B., Wilson, C., & Shisko, A. (2012). Maxent grammars for the metrics of Shakespeare and Milton. Language, 88(4), 691–731.

John, T., & Bombien, L. (2014). EMU. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 321–341). Oxford: Oxford University Press.

Kawahara, S. (2011). Experimental approaches in theoretical phonology. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), The Blackwell companion to phonology (pp. 2283–2303). Oxford: Blackwell-Wiley.

Kipp, M. (2014). ANVIL: The video annotation research tool. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 420–436). Oxford: Oxford University Press.

Kortmann, B., & Upton, C. (Eds.). (2008). Varieties of English 1: The British Isles. New York: Mouton de Gruyter.

Kristoffersen, G., & Simonsen, H. G. (2014). A corpus-based study of apicalization of /s/ before /l/ in Oslo Norwegian. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 214–239). Oxford: Oxford University Press.

Lacheret, A., Simon, A., Goldman, J.-P., & Avanzi, M. (2013). Prominence perception and accent perception in French from phonetic processing to grammatical advance. Language Sciences, 39, 95–106.

Lacheret, A., Kahane, S., Beliao, J., Dister, A., Gerdes, K., Goldman, J.-P., et al. (2014). Rhapsodie: un Treebank annoté pour l’étude de l’interface syntaxe-prosodie en français parlé. In SHS Web of Conferences (Vol. 8; pp. 2675–2689). Paris: EDP Sciences.

Lacheret, A., & Simon, A. C. (2014). Annotation prosodique et bases de données phonologiques: Approche basée sur l’usage. In J. Durand, B. Laks, & G. Kristoffersen (Eds.), La phonologie du français: normes, périphéries, modélisation (pp. 301–317). Paris: Presses Universitaires de Paris Ouest.

Labov, W. (1966). The social stratification of English in New York City. Washington, DC: Center for Applied Linguistics.

Labov, W. (1972). Sociolinguistic patterns. Philadelphia: University of Pennsylvania Press.

Labov, W. (1994). Principles of linguistic change: Internal factors. Oxford: Blackwell.

Labov, W. (2001). Principles of linguistic change: Social factors. Oxford: Blackwell.

Labov, W., Ash, S., & Boberg, C. (2006). Atlas of North American phonology and phonetics. Berlin: Mouton de Gruyter.

Labov, W., Yaeger, M., & Steiner, R. (1972). A quantitative study of sound change in progress. Philadelphia: US Regional Survey.

Laks, B. (2008). Pour une phonologie de corpus. Journal of French Language Studies, 18, 3–32.

Laks, B. (2013). Why is there variation instead of nothing? Language Sciences, 39, 31–53.

Laks, B., Calderone, B., & Celata, C. (2014). French liaison and the lexical repository. In C. Celata & S. Calamai (Eds.), Advances in sociophonetics (pp. 29–54). Amsterdam: Benjamins.

MacWhinney, B., Bird, S., Cieri, C., & Martell, C. (2004). TalkBank: Building an open unified multimodal database of communicative interaction. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (pp. 525–528). Paris: European Language Resources Association.

McCarthy, J. J., & Prince, A. (1993). Prosodic morphology I: Constraint interaction and satisfaction. Rutgers Technical Report TR-3. New Brunswick, NJ: Rutgers University Center for Cognitive Science.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746–748.

Mesthrie, R. (Ed.). (2008). Varieties of English 4. Africa, South and Southeast Asia. New York: Mouton de Gruyter.

Mesthrie, R. (Ed.). (2011). The Cambridge handbook of sociolinguistics. Cambridge, U.K.: Cambridge University Press.

Moisl, H. (2014). Statistical corpus exploitation. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 110–132). Oxford: Oxford University Press.

Mondada, L., & Traverso, V. (2016). Beyond orality. In S. Detey, J. Durand, B. Laks, & C. Lyche (Eds.), Varieties of French (pp. 108–119). Oxford: Oxford University Press.

Morin, Y.-C. (1987). French data and phonological theory. Linguistics, 25, 815–843.

Nguyen, N., & Adda-Decker, M. (Eds.). (2013). Analyse phonétique des grands corpus oraux: méthodes et outils. Paris: Hermès-Lavoisier.

Nolan, F., & Post, B. (2014). The IViE corpus. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 475–485). Oxford: Oxford University Press.

van Oostendorp, M. (2014). Phonological and phonetic research at the Meertens Institute. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 546–551). Oxford: Oxford University Press.

Pater, J. (2009). Weighted constraints in generative linguistics. Cognitive Science, 33, 999–1035.

Pierrehumbert, J. (2001). Exemplar dynamics: Word frequency, lenition and contrast. In J. Bybee & P. Hopper (Eds.), Frequency and the emergence of linguistic structure (pp. 137–158). Amsterdam: John Benjamins.

Pierrehumbert, J., Beckman, M. E., & Ladd, D. R. (2000). Conceptual foundations of phonology as a laboratory science. In N. Burton-Roberts, P. Carr, & G. Docherty (Eds.), Phonological knowledge: Conceptual and empirical issues (pp. 273–303). Oxford: Oxford University Press.

Pitt, M. A., Dilley, L., Johnson, K., Kiesling, S., Raymond, W., Hume, E., et al. (2007). Buckeye corpus of conversational speech. Columbus: Department of Psychology, Ohio State University.

Poplack, S. (1989). The care and handling of a mega corpus. In R. Fasold & D. Schiffrin (Eds.), Language change and variation (pp. 411–451). Amsterdam: Benjamins.

Prince, A., & Smolensky, P. (2004). Optimality theory: Constraint interaction in generative grammar. Malden, MA: Blackwell. Also available at Rutgers Optimality Archive 537.

Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A comprehensive grammar of the English language. London: Longman.

Romary, L., & Witt, A. (2014). Data formats for phonological corpora. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 166–190). Oxford: Oxford University Press.

Rose, Y. (2014). Corpus-based investigations of child phonological developments: Formal and practical considerations. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 265–285). Oxford: Oxford University Press.

Rose, Y., & Inkelas, S. (2011). The interpretation of phonological patterns in first language acquisition. In C. Ewen, E. Hume, M. van Oostendorp, & K. Rice (Eds.), The Blackwell companion to phonology (pp. 2414–2438). Malden, MA: Wiley-Blackwell.

Rose, Y., & MacWhinney, B. (2014). The PhonBank project: Data and software-assisted methods for the study of phonology and phonological development. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 380–401). Oxford: Oxford University Press.

Sankoff, G. (1989). A quantitative paradigm for the study of communicative competence. In R. Bauman & J. Sherzer (Eds.), Explorations in the ethnography of speaking (2d ed.; pp. 18–49). Cambridge, U.K.: Cambridge University Press.

Scheer, T. (2013). The corpus: A tool among others. CORELA (Cognition, représentation, langage).

Simon, A. C., Francard, M., & Hambye, P. (2014). The VALIBEL speech database. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 552–561). Oxford: Oxford University Press.

Sloetjes, H. (2014). ELAN: Multimodal annotation application. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 305–320). Oxford: Oxford University Press.

Schmidt, T., & Wörner, K. (2014). EXMARaLDA. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 402–419). Oxford: Oxford University Press.

Schneider, E. W. (Ed.). (2008). Varieties of English 2: The Americas and the Caribbean. Berlin: Mouton de Gruyter.

Sinclair, J. (1992). The automatic analysis of corpora. In J. Svartvik (Ed.), Directions in corpus linguistics (Proceedings of Nobel symposium 82). Berlin: Mouton de Gruyter.

Solé, M.-J., Beddor, P. S., & Ohala, M. (2007). Experimental approaches to phonology. Oxford: Oxford University Press.

Sperberg-McQueen, C. M., & Burnard, L. (Eds.). (1994). Guidelines for electronic text encoding and interchange. TEI P3 Text Encoding Initiative. Revised reprint: Oxford, May 1999.

Strik, H., & Cucchiarini, C. (2014). On automatic phonological transcription of speech corpora. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 89–109). Oxford: Oxford University Press.

Svartvik, J., & Quirk, R. (Eds.). (1980). A corpus of English conversation. Lund, Sweden: CWK Gleerup.

Tchobanov, A. (2014). Web-based archiving and sharing of phonological corpora. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 437–472). Oxford: Oxford University Press.

Tortora, C., Santorini, B., Blanchette, F., & Diertani, C. E. A. (2016–). The Audio-Aligned and Parsed Corpus of Appalachian English (AAPCAppE).

Trudgill, P. (1974). The social differentiation of English in Norwich. Cambridge, U.K.: Cambridge University Press.

Trudgill, P. (2004). New dialect formation: The inevitability of colonial Englishes. Edinburgh: Edinburgh University Press.

Tsay, J. (2014). A phonological corpus of L1 acquisition of Taiwan Southern Min. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 562–575). Oxford: Oxford University Press.

Watson, J. D. (1969). The double helix. New York: New American Library.

Wieling, M. (2012). A quantitative approach to social and geographical variation (Doctoral diss.), Rijksuniversiteit Groningen: Groningen Dissertations in Linguistics, 103, Printed by Off Page, Amsterdam.

Wittenburg, P., Trilsbeek, P., & Wittenburg, F. (2014). Corpus archiving and dissemination. In J. Durand, U. Gut, & G. Kristoffersen (Eds.), The Oxford handbook of corpus phonology (pp. 133–149). Oxford: Oxford University Press.