Learnability and Learning Algorithms in Phonology
Summary and Keywords
Phonological learnability deals with the formal properties of phonological languages and grammars, which are combined with algorithms that attempt to learn the language-specific aspects of those grammars. The classical learning task can be outlined as follows: Beginning at a predetermined initial state, the learner is exposed to positive evidence of legal strings and structures from the target language, and its goal is to reach a predetermined end state, where the grammar will produce or accept all and only the target language’s strings and structures. In addition, a phonological learner must also acquire a set of language-specific representations for morphemes, words and so on—and in many cases, the grammar and the representations must be acquired at the same time.
Phonological learnability research seeks to determine how the architecture of the grammar, and the workings of an associated learning algorithm, influence success in completing this learning task, i.e., in reaching the end-state grammar. One basic question is about convergence: Is the learning algorithm guaranteed to converge on an end-state grammar, or will it never stabilize? Is there a class of initial states, or a kind of learning data (evidence), which can prevent a learner from converging? Next is the question of success: Assuming the algorithm will reach an end state, will it match the target? In particular, will the learner ever acquire a grammar that deems grammatical a superset of the target language’s legal outputs? How can the learner avoid such superset end-state traps? Are learning biases advantageous or even crucial to success?
In assessing phonological learnability, the analysist also has many differences between potential learning algorithms to consider. At the core of any algorithm is its update rule, meaning its method(s) of changing the current grammar on the basis of evidence. Other key aspects of an algorithm include how it is triggered to learn, how it processes and/or stores the errors that it makes, and how it responds to noise or variability in the learning data. Ultimately, the choice of algorithm is also tied to the type of phonological grammar being learned, i.e., whether the generalizations being learned are couched within rules, features, parameters, constraints, rankings, and/or weightings.
1. The Questions and Goals of Phonological Learnability Research
This article summarizes several strands of research in phonological theory and adjacent theories, all revolving around the question of how it is possible to learn a phonology. To lay the groundwork, this section begins with three basic questions: (i) What is a phonology? (ii) What is the phonological learning task? (iii) What phonological data are used in learning? The following sections summarize some of the biggest challenges in phonological learning (Section 2), and provide a very short history of learnability research in the eras of rule- and parameter-based phonology (Section 3). The subsequent two sections delve somewhat deeper into learnability in Optimality Theory (Section 4) and other constraint-based phonologies (Section 5), and the final section concludes by drawing connections between phonological learning research and adjacent fields of study.
1.1 Two Descriptions of Language and Phonology
There are at least two conceptualizations of what a language is (including what a phonology is), which we will begin by distinguishing. In some sense both go back at least to Chomsky’s early work on formal languages; see especially Chomsky (1963) on the concepts of weak and strong generative capacity in language description.
All three languages illustrated here allow #fl and #fr onsets; Spanish does not allow any #s-stop clusters while English and Polish do; both English and Spanish avoid #vC onsets, while Polish allows several such onsets.1 While this table includes only real words, construing a language as its set of legal strings means including all possible strings, whether they happen to be real words or not; in this way, each language can be thought of as a total possible lexicon, as in (2):
From the second perspective, typically used in the linguistic and formal learning literatures, a language is the set of grammatical mechanisms—rules, processes, constraints, or the like—that generate a particular subset of the possible strings and structures. In our example, such mechanisms could include constraints as in (3):
In Optimality Theoretic terms (Prince & Smolensky, 2004; McCarthy & Prince, 1995), phonologies use constraints like (3) to generate legal strings and structures, given any possible input string or structure, through their relative ranking. Very roughly speaking, those constraints that are obeyed are ranked above those that enforce faithfulness (i.e., identity to the input); those that can be disobeyed are ranked below. Sample such rankings are shown in (4), where the indicates strings that this grammar can generate:
Under this view of a language’s phonology, there must also be mechanisms to determine how illegal strings and structures are repaired—while (4) will not generate a [vd] onset cluster, it also does not uniquely choose a surface form if a word-initial [vd] is attempted (but see much more in what follows).
At the outset of generative grammar, language was defined in this ‘strongly generative’ second way (Chomsky, 1956). A phonology is thus construed as a system of rewrite rules that generate a particular set of strings; all the strings it generates are grammatical, and anything it doesn’t generate is ungrammatical. From this perspective in what follows, the basic learning task is to acquire that language-specific system—of rules, constraints or the like.
1.2 The Learning Task and the Learnability Question(s)
One way to summarize the learning problem is as follows. The phonological learner begins in some universal initial state, with primitives in some configuration (both to be defined), and receives evidence about the target language. To be successful, the learner must use this evidence to reach a stable end state that mirrors the target.
Strictly speaking, the true ‘learnability’ question is whether or not succeeding at this learning is possible, and whether it can occur in some reasonable amount of time. Here we will frame this question in the terms of Valiant (1984)’s Probably Approximately Correct (PAC) model of learning (for PAC applied to phonological learning problems, see Lin, 2002; Riggle, 2006). In PAC and other computational learning approaches, there is a data set of ‘attributes,’ which can be thought of as input strings, each associated with either a + or – (i.e., grammatical or ungrammatical). Learning is aimed at a fixed target concept, which includes the end state’s associations between all the attributes and their + or – values (i.e., the grammar), and the learning options are set of hypotheses as to the true concept. The learnability question is whether there is an efficient algorithm that gets from some class of set of attributes to a hypothesis about the concept that is ‘probably approximately correct’—meaning that the hypothesis will usually make fewer than some threshold of errors when categorizing the data, as compared to the actual concept. To be ‘efficient,’ an algorithm must solve the learning problem in polynomial time (meaning that the upper bound on its running time is related via a polynomial to the size of the input data, i.e., the number of attributes).
In the phonological literature, the goal of a learning proposal is sometimes to prove learnability in this strict sense. The majority of such works, however, are somewhat more broad in their aims, or rather driven more by linguistic theorizing. Some aim to not just show that a particular language is learnable, but argue that the results of learning matches what adults actually know. Others aim to capture not just what the end state of learning looks like, but to provide a trajectory for learning that is observed longitudinally in human learners (see section 6). With such very different goals come very different assumptions, and different benchmarks for success—compare discussion in Prince and Tesar (2004), Heinz (2010), and Tessier (2009). Some phonological work rejects particular theoretical analyses with the claim that they are unlearnable, or at least that their learnability is unknown, e.g., Bermudez-Otero (2003), Ota (2004); other work argues that unattested consequences of theoretical analyses are unproblematic precisely because they are unlearnable (see especially Alderete, 2008). A related rhetorical tactic is to argue that a pattern is unstable in language change because it is unlearnable given assumptions about the learner, e.g., Bowers (2015).
1.3 Data and Evidence for the Phonological Learner
In the linguistic literature the traditional baseline view is that the learning data is all and only the observed target forms of the language, standardly referred to as positive evidence. As the name suggests, the alternative type of data that learners are assumed not to use as learning evidence would be negative evidence. Such negative evidence could either come from noticing a gap in the observed data—if the Spanish learner were to consider the fact they have never heard a [sk] or [vz] onset—or from being informed or corrected as to the Spanish method of pronouncing an input with an illegal cluster.2
The question of how much data the learner receives can be a rather fraught one, for multiple reasons. A first note is that the outputs seen by the learner will always represent just a sample of the full target language. Even if the learner is somehow guaranteed to hear all existing lexical items sufficiently many times before their grammar stabilizes, there will always be nonce words that the learner does not observe—some of which should be treated as legal (i.e., as ‘accidental gaps’) and others as illegal.
In some approaches the raw learning data will get some further processing before it is fed to the learner. For example, each observed output might be combined with others to provide more generalized or abstract data about the target; outputs might also be stored for comparison with later evidence, particularly to help backtrack or otherwise amend the learner’s grammatical hypotheses. In particular, the OT learning literature has developed a certain method of data processing that has been central to learning and that results in what might be characterized as easily-calculable negative evidence (see Section 4.1).
2. What Makes Phonological Learning Hard?
Perhaps the most basic learnability result with respect to formal languages is that without structure to the search space, learning is probably guaranteed to fail. Heinz and Riggle (2011, p. 62) provide a clear discussion of how even the most restricted class of languages in the Chomsky hierarchy—the regular languages such as described by classic SPE rules (Chomsky & Halle, 1968)—are not learnable using most computational learning methods (see also their article for pointers to the suite of original literature). As Heinz and Riggle discuss, what this can tell the linguist is that the learning search will fail if its hypothesis space is simply an unstructured list of all possible languages of the relevant type; if there is no inherent structure to the attributes within the fixed target concept, there is no guarantee that learning will succeed. Heinz (2010, p. 625) cites Gleitman’s (1990, p. 12) insight on the matter: “The trouble is that an observer who notices everything can learn nothing for there is no end of categories known and constructable to describe a situation.”
Of course, no phonologist has ever suggested that phonologies were truly unordered collections of random prohibitions on input strings; and indeed once certain types of structure are introduced to the language search space, many regular languages (even an infinite number thereof) and even more complex language types become learnable. In this way, phonological learnability research is ultimately concerned with determining those aspects of a grammatical framework that helpfully structure the learning space.
Once the search space has been sufficiently structured, however, the learner’s central difficulty can be determining which of the many possible hypotheses should be chosen. Ellison (1998) draws a useful distinction between algorithms whose goal is to find any solution compatible with the observed data, and those that seek an optimal solution. In the first case (PAC learning being an example), the learner will only be driven to change its current grammar hypothesis if it actively makes errors on observed data; if it’s not making errors (or making fewer than the learner’s threshold tolerates), learning is done.
In an optimizing scenario, however, biases beyond the grammar’s performance on observed data are also used to choose between hypotheses. In creating such biases, the linguist’s biggest consideration is usually the grammar’s treatment of data that the learner cannot reasonably expect to get—i.e., how to learn to reject the unobserved outputs without negative evidence. A heuristic for this bias can be characterized as follows: When choosing between grammar hypotheses that each accept all of the observed data as legal, the learner should adopt the one that rejects as much of the unobserved data as possible. This is usually characterized in Berwick (1985)’s terminology as the Subset Principle (see also especially Angluin’s, 1980, ‘tell-tale property’).
Looking back at the mini-typology illustrated in (2): What should a learner assume when observing onset [fl] and [fr]? They might be learning Polish, in which case many other cluster types will appear over time, and this will easily falsify their hypothesis that only Spanish-type clusters are allowed. But if the learner is in fact learning Spanish, there are no other fricative-onset clusters on the way. If the learner guesses they are learning Polish, they will never get any evidence that they are wrong, because observing that all the Spanish word onsets are legal in Spanish will not provide positive evidence that an English #[st] or Polish #[vd] would be illegal. Thus, the safest bet is to assume the most restrictive language—the one that deems legal the smallest set of total possible strings—meaning the grammar which describes Spanish, the smallest oval in (2), which will continue to rule out any other as-yet-unobserved onset clusters. (It should be noted that probability theory, and more specifically the well-established technique of Bayesian reasoning, does provide a way of approaching learning from the absence of data; see especially Jarosz, 2006, 2009, on expectation maximization; also Dunbar, Dillon, & Idsardi, 2013, and references therein; see also Section 5.2.)
As Prince and Tesar (2004) observe, the Subset Principle does not actually explain how the learner should proceed; rather the analyst’s goal is to identify a set of explicit procedures that will have the result of adhering as closely to the Principle as possible. Practically speaking, the first tactic has been to start with a very tightly controlled initial state, in a language which accepts as few surface strings as possible. Many different threads of research, however, emphasize that the search for subsets is a persistent problem (already raised by Ito & Mester, 1999); “the learner must guess the smallest possible language compatible with the input at each stage of the learning procedure” (Clark & Roberts, 1993, pp. 304–305).
Another central learning problem comes from ambiguity and ambiguous structure in the learning data. To the extent that generalizations about segments and structure are not directly encoded in the observed data, the learner will have to infer structure and connect it with their observed data simultaneously. Much of this discussion of ambiguous input considers the acquisition of footing and other structure associated with stress assignment. For example, what should the learner conclude from a word stressed like banana, containing three syllables with medial stress [σˈσσ] (example from, e.g., Tesar, 2004)? If the output is parsed into binary feet, this stress pattern could represent a word-initial iamb [(σˈσ)σ] or a word-final trochee [σ(ˈσσ)]; either assumption could mislead the learner, depending on whether a coherent analysis of the language’s full phonology requires an iambic or trochaic parse.
A different potential difficulty is the frequency or sparsity of input available to the learner. Since every set of observations is necessarily only a sample of the target, when can the learner trust they have a sufficiently large sample to draw generalizations? To use a real-life English example: Moreton (2002) discusses adult English listeners’ knowledge of onset clusters [bw, dw, gw]. In a corpus of 18.5 million spoken and written words, Moreton reports onset [bw] in zero English words (by type), onset [dw] in16 words and onset [gw] in 59. So what should the learner conclude about this set of clusters? That only [bw] should be ruled out, but that the other two voiced stops plus [w] are accepted? Or is it possible that zero [bw] represents an accidental gap in the lexicon, and not an absence to be maintained and enforced in the phonology? As a point of comparison, the same corpus also contains zero onset [dl], compared to the much more frequent 890 onset [bl] and 292 onset [gl]. From Moreton’s data on perceptual biases, it appears that adult English listeners have strongly learned that [dl] is an illegal English onset cluster, but are much more accepting of onset [bw]. (This example also raises the question of how the end state of learning is predetermined, i.e., using which aspects of human knowledge or behavior; see Section 6.)
A related concern is the order in which learning data are received. A central subquestion is whether the learner is supposed to create grammar hypotheses as data is received incrementally, or can simply wait to see all the relevant learning data before it starts. Learning algorithms are often referred to as batch learners if they treat the entire data set at once and online learners if they process one data point at a time. However, many batch learners can be used in an online fashion by simply storing errors incrementally—first feeding the algorithm one piece of data, then rerunning it with that datum and another, then rerunning with the first two plus a third, and so on.3
Finally, the learner will eventually need to acquire the relationships between their target phonology and the underlying representations (URs) of their target lexicon.4 To the extent that morphemes are underlyingly different from their surface output forms, URs provide another source of hidden structure that complicates the learner’s search for the correct grammar hypothesis—and so too it can complicate the attempt to learn restrictively. A fairly straightforward problem is derived contrast, in which predictable phonological structure (e.g., two allophones in complementary distribution) can appear to be contrastive (i.e., two different phonemes) if morphological context is not taken into account (see Section 4.3).
3. A Selective History of Learnability and Learning Algorithms in Phonology
3.1 The Learning Task in Rule-Based Phonology
In classic rule-based phonology—as popularly begins with The Sound Pattern of English (SPE: Chomsky & Halle, 1968)—the universals in phonological learning are a set of primitive features, and a rule-writing syntax with the general format A ➔ B / C __ D. Working with these universals, the full set of possible languages contains all the possible feature matrices (i.e., segments), in all of the configurations allowed by the rule syntax, and with each possible rule ordered either before or after every other possible one. Depending on the flavor of rule-based grammar adopted, the learner may also have to find the language’s phonemic inventory, from which to construct URs, as well as additional rules or restrictions on featural matrices in those URs (e.g., Karttunen, 1969). This is undoubtedly a very large search space, one which a learner would surely need structure to navigate.
In chapter 8 of SPE, Chomsky and Halle discuss learning under the assumption that all the target language evidence is processed by the learner instantaneously, with the simultaneous acknowledgment that this cannot be an accurate description of the human language learning experience. Both chapters 8 and 9 deal with methods of evaluating and comparing the merits of rules, either with explicit measures of simplicity, and/or via ‘marking conventions’ intended in part to impose some phonetic naturalness. These developments might constitute structure to guide the learning process, but their precise contribution(s) were not clear. Meanwhile, the rigorous SPE formalism also led to a flourishing of phonological analyses of child phonology, and thus to speculation on the development of phonological rule systems during L1 acquisition (most notably Smith, 1973).
A somewhat different conceptualization of rule-based phonology was developed as Natural Phonology (Stampe, 1969; Donegan & Stampe, 1979), which deals directly with the question of learning. In this approach, the initial state is a universal set of ‘natural’ rules which produce a very early set of simplified, reduced outputs. The basic learning task is a process of rule suppression, with positive evidence informing the learner which rules must be overcome; thus, the language search space is in fact all the subsets of the initial state grammar. (In practice, however, additional rules must be added to account for language-specific phonological patterns of many types.) A novel aspect of Natural Phonology is that it could make predictions about how the grammar should change from one learning stage to the next, when natural rules that were observed to be violated in the target grammar were suppressed—but not in fact about how and why the learner uses their positive evidence to suppress. It should also be noted that Natural Phonology was inherently interested in capturing not only the eventual learnability of grammars but also their expression in the phonological behaviors of children, who begin with a very simple, highly ‘restricted’ set of phonological production shapes (as originally emphasized by Jakobson, 1968).
3.2 Learning a Principle and Parameters Phonology
The Principles and Parameters framework, as applied to phonology, reconfigures the grammatical lay-out of the learning problem considerably. This approach sees the initial state’s universals as a set of fully formed, preset parameters, which can be set to multiple language-specific positions. The learner’s task is therefore to accurately set each parameter and related subparameter, for each module or level of phonological representation (syllable structure, segmental phonotactics, stress, and so on.)
With this degree of structure, Dresher (1999) (see also Dresher & Kaye, 1990) provides an algorithmic attempt to explain how the learner moves from stage to stage on the basis of observed data. Each parameter begins in a default setting, chosen to avoid superset language traps, from which it can be shifted by a certain type of positive evidence known as a cue. Each parameter’s cue describes a configuration of the target language, which can be observed by examining one or perhaps more surface forms. For instance, in Dresher (1999)’s metrical stress learning algorithm, the parameter that determines whether stress is assigned with reference to syllable weight or not defaults to quantity-insensitivity, and the cue for changing that setting is observing that words with the same numbers of syllables are assigned different stress patterns. Thus, observing that English trisyllabic words can bear main stress on their first, medial, or last syllable (ˈCa.na.da vs. Aˈman.da vs. kan.ga.ˈroo) will lead the learner to flip this parameter to the quantity-sensitivity setting. This flip of the parameter switch sets up other subparameters, (i.e., one which control what counts as a heavy syllable in the target language), which again default to one setting and which can be reset by a particular data cue.
Despite this degree of honed grammatical structure and the initial state’s defaults, this cue-based parameter learning can still be quite complex—because setting one parameter may have ramifications for others, in ways that are not immediately detectable. To adapt an example from Dresher (1999), suppose the learner has determined that their target language has set the weight-based parameter to Quantity-Sensitive. What should be the default setting for categorizing a syllable as heavy? If the parameter begins with coda consonants as heavy, this may mislead the learner elsewhere—in particular, when trying to set the parameter for whether the main stress foot is left- vs. right-aligned. If VC] counts as heavy, the words in (5a) will look like they demonstrate evidence for setting the Main Stress parameter to Left—however, if VC] counts as light, these words are too short to demonstrate anything about main stress (5b):
In Dresher’s account, this interdependence of parameters and triggers means that when a default setting is changed, the learner must return all other dependent parameters settings to their default. In this way, the learning mechanism can self-correct by undoing decisions for which it now knows its data could have been faulty—even though that data itself is not stored anywhere in the learning procedure. This reliance on what we might call grammatical ‘memory’ rather than data memory will contrast with some of the algorithms to follow.
4. Learnability and Learning Algorithms in Optimality Theory
The founding document of Optimality Theory (OT: Prince & Smolensky, 2004) already introduces the learnability of OT grammars as a central topic for research within the framework. From that beginning, learning and learning algorithms have remained very prominent in the development of constraint-based phonologies; for a recent comprehensive discussion, see Jarosz (2016). The remainder of this summary article will be primarily concerned with learning results in the OT literature and its more recent developments. A reader without a basic familiarity with OT is encouraged to consult an introductory text such as McCarthy (2002) or Kager (1999).
In classic OT, the primitives available to the learner are universal constraints and two mechanisms: GEN for creating possible output strings to evaluate, and EVAL for using any ranking of the constraints to choose a winning output given any possible input (in the learnability context, see Heinz, Kobele, & Riggle, 2009, and references therein.) With this grammatical structure, the learning task is to choose from among all the possible constraint rankings to find one that matches the target.5 Note that the set of crucially-different rankings—i.e., those that define different languages in our initial legal-surface-strings sense—is a rather smaller subset of all the possible rankings, because most constraints do not crucially conflict.
4.1 Errors and OT Constraint Reranking
How does the OT learner process data to treat as learning evidence? The standard format is a distillation of the comparison between two different strings: an observed output from the target language (called the winner), compared with the output that is chosen by the current grammar hypothesis (called the loser, given that in the target grammar it will eventually lose.) When the target output and learner’s current output are not the same, there is something to be learned from that difference. In that sense these learners are error-driven (as in Tesar, 1995) as opposed to Dresher’s cue-driven learner (see also Yang, 2002). Notice that EVAL can only choose an output with reference to an input, so an input must be hypothesized; a useful starting point is to assume the identity map (put this way in Prince & Tesar, 2004), meaning that the target language output is also adopted as the inferred input.
To see this in action, suppose that the learner’s current ranking imposes onset cluster simplification. The tableau in (6) shows how this ranking maps an English [sk] onset cluster, using two of the constraints from (3) along with two constraints that regulate epenthesis (Dep) and deletion (Max).6 The resulting learning data is represented in (7), comparing the intended winner (i) [skul] with the intended loser and current error (ii) [kul].
The format in (7)—which here will be called a comparative mark-data pair—compares the winner and loser’s violations of each constraint. In the case of [skul] vs. [kul], the first constraint *CC-Onset is violated by the winner (which has a complex onset) but is satisfied by the loser (which has only a singleton onset), so *CC-Onset ‘prefers the Loser,’ indicated with an L. Conversely, Max is satisfied by the winner (because its mapping from /skul/ ➔ [skul] deletes nothing) but is violated by the loser, so Max prefers the Winner and is labeled W. Notice the Dep is satisfied equally by both (i) and (ii), so it is marked with an equivocal e.
With a comparative mark-data pair, the learner can now reason about moving from the current grammar in some way towards the target. Prince and Smolensky (1993, p. 148) provide the goal: As rephrased by Prince and Tesar (2004), if at least one of the constraints preferring the winner dominates all of the constraints preferring the loser, then the winner will be optimal. Thus, a variety of learning algorithms have been proposed that can rearrange constraints like the ones in (7) so that (whether immediately or eventually) at least one of the W-preferring constraints (here meaning just Max) dominate all the L-preferring constraints (here, *CC-Onset and *s[stop]-Onset).
Foundational work in this regard produced a Recursive Constraint Demotion algorithm (RCD: Tesar & Smolensky, 1998, 2000; see also Tesar, 1997) and much subsequent work has analyzed the formal properties of OT learning using this and similar algorithms, and the conditions under which they are guaranteed to converge on the target (e.g., Tesar, 1995; Magri, 2012). A large part of this work focuses on the concomitant problem of assigning hidden structure to the winner. Recalling the stress assignment discussion of Section 3.2, if the learner is constructing a mark-data pair for the observed form Amanda, is that winner to be footed as [(A.ˈman.)da]? or [A(ˈman.da)]? Or indeed should the syllable boundaries be [A.(ˈma.nda)] ? The approach in Tesar and Smolensky (1998, 2000) is to interweave two processes: The constraint demotion algorithm that works from mdps to rankings, and another algorithm called Robust Interpretive Parsing (RIP), which uses the current grammar hypothesis to assign the optimal structure to the observed output, to be fixed again if the algorithm so requires (see also Jarosz, 2013, and Section 5.3).
What of the initial state and the quest for restrictiveness? The original RCD procedure is not inherently tied to any particular initial state ranking; it is only a method to ‘resolve’ mark-data pairs, ensuring that their winners best their losers. It is not, however, guaranteed to find an optimal solution, meaning a ranking that is the most restrictive in rejecting unseen data. This task—of imposing a Subset Principle heuristic on the OT learner—has been cashed out by placing biases on the different families of constraints found in CON. The first such bias is to rank Markedness constraints higher than Faithfulness constraints, both at the initial state (Smolensky, 1996), but more generally at every stage of learning (especially Prince & Tesar, 2004; Hayes, 2004). The general problem of finding the most restrictive ranking consistent with a set of mark-data pairs turns out to be more complicated, however, always due to varying species of Faithfulness constraints (see especially Hayes, 2004; Tessier, 2009; Magri, 2014; also McCarthy, 1998; Alderete & Tesar, 2002).
Given the tight coupling between constraint structure and the learning algorithms described thus far, it should not be surprising that changes to the OT framework can have serious consequences for its learning success. As one example, a serial variant known as Harmonic Serialism has been used in several empirical domains to better capture phonological typologies than the original massively-parallel OT framework (see, e.g., McCarthy, 2000, 2008; Jesney, 2009; Pruitt, 2010). But Harmonic Serialism changes the way that surface forms can be related to inputs, with the potential for multiple derivations along the way; this turns out to significantly complicate aspects of the search for a restrictive ranking that RCD can easily handle in parallel OT (Tessier & Jesney, 2014).
4.2 OT Learning and Variation
Variation can be defined as inconsistency across mark data pairs, as shown in (8). This configuration either means that some meta-error has been made (e.g., the comparative mark-data pairs are from two different dialects), or that the target is inherently variable as to whether it deletes or maintains onset /sk/ clusters.
Multiple OT approaches have been taken to the question of how variability affects, or rather can be overcome by, the phonological learner. One technique is to use the locus of inconsistency as evidence to learn underlying representations—i.e., to overturn the identity map assumption made thus far and edit the lexicon so that it contains unfaithful underlying forms (Tesar, Alderete, Horwood, Merchant, Nishitani, & Prince, 2003). Another method is to let such incompatible mark data pairs teach the learner that more constraints are necessary—in particular, that some existing constraints should be cloned and then relativized to subparts of the lexicon (Pater, 2005; Becker, Ketrez, & Nevins, 2011).
A different method for dealing with variation in an OT framework was introduced by Boersma (1998) and Boersma and Hayes (2001): to reimagine constraint rankings not just as dominance hierarchies but as a series of points on a number line, and with each such point representing not a constraint’s unique location above or below every other one, but the mean of a normal distribution. This Stochastic OT of Boersma (1998) is for one thing a theory of variation in end state grammars, whereby the distance between the means of any two constraints captures how likely they are to be ranked in one or the other order on any run-time use of the grammar. From the learning perspective, however, adding this numerical component to the representation of phonological constraints has had a fundamental impact on learning and on phonological theory more generally: on which, see section 5 below.
4.3 OT Learning and Underlying Representations
One of the many simplifying assumptions made above is that the observed outputs of the target language are as identical to their inputs as they can be—i.e., perhaps not syllabified or assigned metrical feet but otherwise fully faithful. An important alternative to this aspect of the learning problem is pursued in Jarosz (2006), which presents an Expectation Maximization procedure for learning OT rankings and underlying forms simultaneously, using a probabilistic grammar (see Section 5) and an explicit reliance on the entire Rich Base of possible inputs when learning.
The assumption that observed outputs are simply faithful surface versions of their inputs may represent a useful and even necessary first approximation of the target language, as it means data can be processed before any morphological alternations are identified—but once morphology is discovered, the assumption will break down. The learner of a language with final obstruent devoicing (Dutch, for example) will initially learn via the identity map that voiced obstruents are in general permitted (as in the mdp below), but not yet have any knowledge of additional voiced obstruents in underlying forms.
If it turns out that [-ə] is a suffix, the learner will realize that some of the surface forms in (10b) share a common morpheme. If [bat] and [badə] are identified as the singular and plural of the same noun, for instance, then the learner knows that one of the mappings /bat/ ➔ [bat] or /bad‑ə/ ➔ [badə] has to be wrong.7 A general strategy to resolve this UR inconsistency is to use the existing phonotactic grammar, and the power of EVAL, to try out UR hypotheses looking for a consistent system (as illustrated in, e.g., Kager, 1999; see also Alderete, Brasoveanu, Merchant, Prince, & Tesar, 2005; Tesar & Prince, 2007). One simple and perhaps insightful starting point is to try each surface allomorph in turn and assess whether the current grammar can predict its unfaithful variant in context (also as in Albright, 2002). In this case, hypothesizing /bat/ as the UR will not work, because the current grammar will not be able to generate a plural mapping /bat‑ə/ ➔ [bad‑ə]; but hypothesizing /bad/ will correctly succeed in understanding why a potential voiced coda in the isolation form is devoiced as in (11):
While this illustration works easily, the consequences of morphophonological alternations for both UR learning and the grammar itself are much more complicated. For one thing, while the phonology of a language’s morphologically complex words is often in keeping with its static phonotactics, it also often diverges (see examples in, e.g., Tessier, 2016), making necessary more subtle search mechanisms for UR discovery. For another, phonotactic learning before morphological awareness can actually lead to the adoption of an overly permissive grammar. A frequently discussed example (Hayes, 2004, and many references therein) comes from the interaction of Canadian raising (which applies to diphthongs before a voiceless obstruent like [t]) and flapping at a morpheme boundary (which turns some /t/s into [d]s). In some dialects, this interaction creates a minimal pair with raised and unraised diphthongs (given in bold below), illustrated for simplicity in (12) with a rule-based derivation:
In the absence of insight into the suffix [əɹ] the minimal pair in (12) will necessarily teach the learner that [aɪ] and [ʌɪ] are separate phonemes and that their contrast should be maintained faithfully in the target language. Thus, it must somehow be the case that learners can undo grammatical assumptions as morphology is acquired—see especially McCarthy (1998), Hayes (2004) on ranking biases to accomplish this. More broadly, Tessier (2009) argues that the unavoidable need for grammatical backtracking over the course of learning requires that the learner store their errors, so that they can pinpoint the evidence for parts of their current grammar hypothesis and learn from its edits.
5. Learning Constraint-Based Phonologies Beyond Classic OT
5.1 GLA Learning
As mentioned above, Stochastic OT (Boersma, 1998; Boersma & Hayes, 2001) allows the learner to respond to mark-data pairs not with wholesale rerankings but with a numerical change in constraint ranking values: increasing the value of constraints which prefer the winner and decreasing those that prefer the loser. This Gradual Learning Algorithm (GLA) allows a learner to respond to each error with a very small update, so that observable changes in the grammar’s output may not emerge until a large body of evidence for a particular reranking has accumulated; this is sketched in (13):
As it turns out, this GLA learner is equivalent to a well-studied algorithm in the psychological statistical learning literature, known as the Perceptron (Rosenblatt, 1958), and very closely related to a machine learning approach described in Jäger (2007) as Stochastic Gradient Ascent (typically used to learn Maximum Entropy grammars; see Section 5.3).
Notice that the GLA and their ilk are not designed to use their previous errors in any way—they are learned from once and cast aside, or at least are not available for backtracking as discussed above. Rather it is designed to move cautiously, being resilient to the occasional incorrect example or misparsing. Its parameters include the number of learning trials it is fed, the speed with which constraints are demoted or promoted by its update rule (also known as its plasticity) and the rate at which plasticity is reduced over time (aimed at creating increased fossilization of grammar as learning progresses).
As it is implemented above, Pater (2008) demonstrates that the GLA for learning stochastic OT is not guaranteed to converge; it is quite possible to design a set of consistent mdps that will never teach the GLA to stabilize on its target grammar; see Magri (2012) for a revised update rule that converges reliably. Boersma and Pater (2016) also demonstrate the GLA can fail when combined with Tesar and Smolensky’s (1998) RIP procedure for hypothesizing input structure, though Jarosz (2013, in press) provides two significantly-improved such procedures. In the meantime, GLA-type learning has also been combined with other evaluation methods, with perhaps greater success—to which we now turn.
5.2 Learning with Weighted Constraints
The main alternative approach to constraint-based phonological grammar uses a different procedure for evaluating candidates to select the optima, whereby constraints are not just hierarchically ranked but instead weighted. In Harmonic Grammar (HG: Legendre, Miyata, & Smolensky, 1990; Smolensky & Legendre, 2006; Pater, 2009; Potts, Pater, Jesney, Bhatt, & Becker, 2010), each candidate is assigned a harmony score, calculated by multiplying the number of each constraint’s violations by its weight, and taking their sum. With each violation represented as −1, the winning output is the candidate with the highest score (i.e., the negative number closest to zero):
HG learning algorithms proceed along the lines already discussed, using observed errors to assign constraint weights to eventually reach a targetlike end state. Much of this work has adapted an HG-GLA learner (although cf. Potts et al., 2010) and found improvements on previous learners: For example, Pater (2008) and Boersma and Pater (2016) show how the HG-GLA will converge in the cases where the OT-GLA is trapped; Jesney and Tessier (2011) demonstrate how the HG-GLA needs fewer biases on specific families of constraints to learn restrictively. In a different vein, see Bane, Riggle, and Sonderegger (2010) for the computational result that the complexity of learning a Harmonic Grammar increases with the number of constraints at the same rate as in OT. On the other hand, Magri (2016) argues that HG is inferior to OT in that some of its computational robustness (e.g., its efficiency or resilience to errors) is only maintained under certain assumptions about the constraint set.
Another well-studied approach to learning a constraint-based phonology is to capture phonological generalizations about the observed language as a function of their entropy. Goldwater and Johnson (2003) introduced Maximum Entropy grammars into the phonological learning literature by demonstrating how GLA learning could be adapted to MaxEnt learning, as the latter was already mathematically well studied (see also Jäger, 2007). This type of model uses weighted sums of violations to calculate scores as in HG; then for each candidate, the constant e (the base of the natural log) is raised to that candidate’s Harmony score—see (15):
These values can be compared to assign relative probabilities for each candidate, where the bigger the value, the more likely the surface form. If choosing only one output, these constraints weights combine to choose the same winner [kul] in a MaxEnt grammar as they did in the OT and HG approaches, but this grammar has other properties which may make it useful for capturing phonological knowledge. As an example, see Moore-Cantwell and Pater (2016) on using MaxEnt to learn lexically-specific patterns that extend probabilistically to nonce words; see also Hayes and Wilson (2008) discussed in section 5.3, and other arguments in Daland (Forthcoming). On the other hand, MaxEnt grammars do represent a fundamental shift in a phonology’s architecture (e.g., the loss of harmonic bounding) and so may yet demonstrate other differences in their learning and typological consequences.
5.3 Constraint Induction in Learning
The discussion of learning with phonological constraints has thus far assumed that the constraints themselves are provided as primitives to the learner, defining the search space at the outset. However, various approaches to learning also tackle constraint construction. One reason for this is the observation that the phonologies of both children and adults appear to obey constraints that are unlikely to be universal (for adults, this discussion goes back at least to the ‘crazy rules’ of Bach & Harms, 1972; for children, see, e.g., Becker & Tessier, 2011, and references therein). Another rationale follows from the idea that any structure that can be successfully deduced by the learner should not be prewired, leading to the question of how much constraint induction is possible from positive evidence alone.
A highly influential take on phonological constraint induction is introduced in Hayes and Wilson (2008), to be used in ‘purely phonotactic’ model of phonological knowledge. The resulting ‘UCLA Phonotactic Learner (PL)’ is not building a grammar that maps underlying representations onto surface ones, but rather one that evaluates surface forms as to their well-formedness in the target language. In classic OT terms, this means that the UCLA-PL is attempting to induce (and later weight) markedness constraints whose restrictions characterize legal surface forms. For this procedure, the learning data is again simply a set of observed surface forms; the learning primitives are SPE-style features and a highly-restricted structural template for constraints, which focuses the learner on n-grams of adjacent segments (or featurally-defined subsets of segments on adjacent tiers; for a different approach to phonotactic learning with non-adjacent segments see Heinz, 2010).
The UCLA-PL works from natural classes of features, comparing their observed distribution in the sampled language data to what their expected distribution would be if they were combined at random. Of the very many combinatorial options, the learner prefers constraints that are general and yet accurate, i.e., constraints that ban relatively large natural classes with relatively few features, that would have been expected to be frequent but are observed to be rare. Recent work with similar aims is reported in Doyle, Bicknell, and Levy (2014), which uses Bayesian reasoning to induce markedness constraints from distributional data—although with predefined input–output mappings and faith constraints, and the restriction that constraints be binary in their violations profile.
6. Conclusions: Connections Between Learning Algorithms and Other Domains
The discussion throughout this article has been focused on learning primarily from the perspective of phonological theory itself—that is, beginning with the theoretical frameworks proposed by phonologists and assessing successes and failures in learning each. By way of conclusion, it is worth noting that parallel versions of this overview could be written (or indeed have already been written) from at least two other perspectives.
The first viewpoint treats phonological learnability as a topic of computational study. While many of the insights and methods discussed herein originate in the domains of machine learning, statistical learning and mathematical modeling, there is much more to be said about the necessary and sufficient properties of a phonological theory that makes its learning task possible, provable, feasible, efficient, and so on.
One common point of contact throughout generative phonology’s development has been the use of finite state machines, used to represent the workings of both rule-based and constraint- based regular languages. Such machines come in different types: Finite state acceptors are machines that simply parse a string if it meets the conditions of the language and rejects it otherwise; finite state transducers read strings from one ‘tape’ to another, and so can be used to generate an output string from an input one. As mentioned above, Johnson (1972), and also Kaplan and Kay (1994), pointed out that SPE rules (at least prior to the phonological ‘cycle’) can in fact be captured with finite state transducers; see also Karttunen (1991). Within OT, Eisner (2000) and especially Riggle (2004) use finite state machines to understand the properties of OT constraints (‘rational’ i.e., regular or otherwise) as well as the structure and complexity of OT’s language and candidate space. As another example, Heinz (2006) describes a finite state acceptor that induces surface phonotactic constraints with reference to natural classes of segments (as discussed above with respect to Hayes & Wilson, 2008, whose implementation of their UCLA-PL also uses finite state machines for one aspect of constraint evaluation).
A second parallel discussion deals with the intersection between learning algorithms for theoretical linguistic purposes and evidence of learning collected from the behavior of real humans, child or adult. As noted in section 3, there was much vigorous research on child speech and its phonological rule-governed nature (e.g., Smith, 1973; Ingram, 1974; Macken, 1980). With the advent of OT came fresh efforts to capture child phonologies using typologically-motivated constraints (e.g., Gnanadesikan, 2004; Goad, 1997; Pater & Barlow, 2003), and to model children’s development over time, using and adapting learning algorithms discussed above (e.g., Curtin & Zuraw, 2001; Levelt & van de Vijver, 2004; Tessier, 2009, 2013; Jesney & Tessier, 2011; Becker & Tessier, 2011; Jarosz, 2010). As computational methods become more integrated with grammatical theorizing, learning algorithms and the learnability consequences of particular frameworks may play an increasingly large role in the development of phonological theory.
Boersma, P., & Hayes, B. (2001). Empirical tests of the Gradual Learning Algorithm. Linguistic Inquiry, 32, 45–86.Find this resource:
Dresher, B. E. (1999). Charting the learning path: Cues to parameter setting. Linguistic Inquiry, 30(1), 27–67.Find this resource:
Hayes, B. (2004). Phonological acquisition in Optimality Theory: The early stages. In R. Kager, J. Pater, & W. Zonneveld (Eds.), Fixing priorities: Constraints in phonological acquisition. Cambridge, U.K.: Cambridge University Press.Find this resource:
Hayes, B., & Wilson, C. (2008). A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3), 379–440.Find this resource:
Heinz, J., & Riggle, J. (2011). Learnability. In M. van Oostendorp, C. Ewen, B. Hume, & K. Rice (Eds.), Blackwell companion to phonology. Wiley-Blackwell.Find this resource:
Jarosz, G. 2016. Learning with violable constraints. In J. Lidz, W. Snyder, & J. Pater (Eds.), The Oxford handbook of developmental linguistics. Oxford: Oxford University Press.Find this resource:
Jesney, K., & Tessier, A.-M. (2011). Biases in Harmonic Grammar: The road to restrictive learning. Natural Language and Linguistic Theory, 29, 251–290.Find this resource:
Pater, J. (2009). Weighted constraints in generative linguistics. Cognitive Science, 33, 999–1035.Find this resource:
Prince, A., & Tesar, B. (2004). Learning phonotactic distributions. In R. Kager, J. Pater, & W. Zonneveld (Eds.), Fixing priorities: Constraints in phonological acquisition. Cambridge, U.K.: Cambridge University Press.Find this resource:
Smolensky, P. (1996). On the comprehension/production dilemma in child language. Linguistic Inquiry, 27, 720–731.Find this resource:
Albright, A. (2002). The identification of bases in morphological paradigms. PhD diss., UCLA.Find this resource:
Albright, A., & Hayes, B. (2003). Rules vs. analogy in English past tenses: A computational/experimental study. Cognition, 90(2), 119–161.Find this resource:
Alderete, J. (2008). Using learnability as a filter on computable functions: A new approach to Anderson and Browne’s generalization. Lingua, 118, 1177–1220.Find this resource:
Alderete, J., Brasoveanu, A., Merchant, N., Prince, N., & Tesar, B. (2005). Contrast analysis aids the learning of phonological underlying forms. In J. Alderete, A. Kochetov, & C.‑H. Han (Eds.), Proceedings of the 24th West Coast Conference on Formal Linguistics (pp. 34–42). Somerville, MA: Cascadilla Proceedings Project.Find this resource:
Alderete, J., & Tesar, B. (2002). Learning covert phonological interaction: An analysis of the problem posed by the interaction of stress and epenthesis. Report no. RuCCS-TR-72. Piscataway, NJ: Rutgers Center for Cognitive Science.Find this resource:
Angluin, D. (1980). Inductive inference of formal languages from positive data. Information and Control, 45, 117–135.Find this resource:
Bach, E., & Harms R. T. (1972). How do languages get crazy rules? In R. Stockwell & R. Macaulay (eds.), Linguistic change and generative theory (pp. 1–21). Bloomington: Indiana University Press.Find this resource:
Bane, M., Riggle, J., & Sonderegger, M. (2010). The VC dimension of constraint-based grammars. Lingua, 120, 1194–1208.Find this resource:
Barlow, J. (2005). Sonority effects in the production of consonant clusters by Spanish-speaking children. In D. Eddington (Ed.), Selected proceedings of the 6th Conference on the Acquisition of Spanish and Portuguese as First and Second Languages. Somerville, MA: Cascadilla.Find this resource:
Bermúdez-Otero, R. (2003). The acquisition of phonological opacity. In J. Spenader, A. Eriksson, & Ö. Dahl (Eds.), Variation within Optimality Theory: Proceedings of the Stockholm Workshop on ‘Variation within Optimality Theory’ (pp. 25–36). Stockholm: Department of Linguistics, Stockholm University.Find this resource:
Becker, M., & Gouskova, M. (2016). Source-oriented generalizations as grammar inference in Russian vowel deletion. Linguistic Inquiry, 47(3), 391–425.Find this resource:
Becker, M., Ketrez, N., & Nevins, A. (2011). The surfeit of the stimulus: Analytic biases filter lexical statistics in Turkish laryngeal alternations. Language, 87(1), 84–125.Find this resource:
Becker, M., & Tessier, A.-M. (2011). Trajectories of faithfulness in child-specific phonology. Phonology, 28, 163–196.Find this resource:
Berwick, R. (1985). The acquisition of syntactic knowledge. Cambridge, MA: MIT Press.Find this resource:
Boersma, P. (1998). Functional phonology: Formalizing the interactions between articulatory and perceptual drives. PhD diss., University of Amsterdam.Find this resource:
Boersma, P., & Hayes, B. (2001). Empirical tests of the Gradual Learning Algorithm. Linguistic Inquiry, 32, 45–86.Find this resource:
Boersma, P., & Pater, J. (2016). Convergence properties of a gradual learning algorithm for Harmonic Grammar. In J. McCarthy & J. Pater (Eds.), Harmonic Grammar and Harmonic Serialism (pp. 389–434). London: Equinox Press.Find this resource:
Bowers, D. (2015). A system for morphophonological learning and its implications for language change. PhD diss., UCLA.Find this resource:
Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory (pp. 113–124).Find this resource:
Chomsky, N. (1963). Formal properties of grammars. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology, Volume II (pp. 323–418). New York: Wiley.Find this resource:
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York: Harper & Row.Find this resource:
Clark, R., & Roberts, I. (1993). A Computational Model of Language Learnability and Language Change. Linguistic Inquiry, 24, 299–345. Available at http://groups.lis.illinois.edu/amag/langev/pubtype/article_LinguisticInquiry.html.Find this resource:
Curtin, S., & Zuraw, K. (2001). Explaining constraint demotion in a developing system. In B. Skarabela, S. Fish, & A. H.‑J. Do (Eds.), Proceedings of the 26th Annual Boston University Conference on Language Development. Somerville, MA: Cascadilla.Find this resource:
Daland, R. (Forthcoming). Long words: The problem they pose for phonotactic grammars, and the solution for maxent Harmonic Grammars. Phonology.Find this resource:
Donegan, P., & Stampe, D. (1979). The study of natural phonology. In D. Dinnsen Current Approaches to Phonological Theory (pp. 126–173). Bloomington, IN: IU Press.Find this resource:
Doyle, G., Bicknell, K., & Levy, R. (2014). Nonparametric learning of phonological constraints in Optimality Theory. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 1094–1103). Retrieved from http://acl2014.org/acl2014/P14-1/index.html.Find this resource:
Dresher, B. E. (1999). Charting the learning path: Cues to parameter setting. Linguistic Inquiry, 30(1), 27–67.Find this resource:
Dresher, B. E., & Kaye, J. (1990). A computational learning model for metrical phonology. Cognition, 34, 137–195.Find this resource:
Dunbar, W., Dillon, B., & Idsardi, W. (2013). A Bayesian evaluation of the cost of abstractness. In Sanz, Laka, & Tanenhaus (Eds.), Language down the garden path: The cognitive and biological basis for linguistic structure (pp. 360–383). Oxford: Oxford University Press.Find this resource:
Eisner, J. (2000). Easy and hard constraint ranking in optimality theory: Algorithms and complexity. In J. Eisner, L. Karttunen, & A. Thériault (Eds.), Finite-state phonology: Proceedings of the 5th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON) (pp. 22–33). Conference held in Luxembourg, August 2000.Find this resource:
Ellison, M. (1998). The iterative learning of phonological constraints. Computational Linguistics, 20(3): 1–32.Find this resource:
Gleitman, L. (1990). The structural sources of verb meanings. Language Acquisition, 1, 3–55.Find this resource:
Gnanadesikan, A. (2004). Markedness and faithfulness constraints in child phonology. In R. Kager, J. Pater, & W. Zonneveld (Eds.), Fixing priorities: Constraints in phonological acquisition. Cambridge, U.K.: Cambridge University Press.Find this resource:
Goldwater, S., & Johnson, M. (2003). Learning OT constraint rankings using a maximum entropy model. In J. Spenader, A. Eriksson, & Ö. Dahl (Eds.) Proceedings of the Stockholm Workshop on Variation within Optimality Theory, Stockholm University (pp. 111–120).Find this resource:
Goad, H. (1997). Consonant harmony in child language: An Optimality-Theoretic account. In S. J. Hannahs & M. Young-Scholten (Eds.), Focus on phonological acquisition (pp. 113–142). Amsterdam: Benjamins.Find this resource:
Hayes, B. (2004). Phonological acquisition in Optimality Theory: The early stages. In R. Kager, J. Pater, & W Zonneveld (Eds.), Fixing priorities: Constraints in phonological acquisition. Cambridge, U.K.: Cambridge University Press.Find this resource:
Hayes, B., & Wilson, C. (2008). A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3), 379–440.Find this resource:
Heinz, J. (2006). Learning phonotactic grammars from surface forms. In D. Baumer, D. Montero, & M. Scanlon (Eds.), Proceedings of WCCFL25 (pp. 186–194). Somerville, MA: Cascadilla.Find this resource:
Heinz, J. (2010). Learning long-distance phonotactics. Linguistic Inquiry, 41(4), 623–661.Find this resource:
Heinz, J., Kobele, G., & Riggle, J. (2009). Evaluating the complexity of Optimality Theory. Linguistic Inquiry, 40, 277–288.Find this resource:
Heinz, J., & Riggle, J. (2011). Learnability. In M. van Oostendorp, C. Ewen, B. Hume, & K. Rice (Eds.), Blackwell companion to phonology. Hoboken, NJ: Wiley-Blackwell.Find this resource:
Ingram, D. (1974). Phonological rules in young children. Journal of Child Language, 1, 49–64.Find this resource:
Ito, J., & Mester, A. (1999). The structure of the phonological lexicon. In N. Tsujimura (Ed.), The handbook of Japanese linguistics (pp. 62–100). Malden, MA: Blackwell Publishers.Find this resource:
Jäger, G. (2007). Maximum entropy models and Stochastic Optimality Theory. In A. Zaenen, J. Simpson, T. Holloway King, J. Grimshaw, J. Maling, & C. Manning (Eds.), Architectures, rules, and preferences: Variation on themes by Joan Bresnan (pp. 467–479). Stanford, CA: CSLI Publications.Find this resource:
Jakobson, R. (1968). Child language, aphasia and phonological universals. Translated by A. R. Kuler. Original first published in 1941. The Hague: Mouton.Find this resource:
Jarosz, G. (2006). Rich lexicons and restrictive grammars: Maximum likelihood learning in Optimality Theory. PhD diss, Johns Hopkins University, Baltimore, MD.Find this resource:
Jarosz, G. (2009). Learning phonology with stochastic partial orders. Talk given at Third North East Computational Phonology Meeting. Cambridge, MA: MIT.Find this resource:
Jarosz, G. (2010). Implicational markedness and frequency in constraint-based computational models of phonological learning. Journal of Child Language, 37(3), 565–606.Find this resource:
Jarosz, G. (2013). Learning with hidden structure in Optimality Theory and Harmonic Grammar: Beyond robust interpretive parsing. Phonology, 30(1), 27–71.Find this resource:
Jarosz, G. (2016). Learning with violable constraints. In J. Lidz, W. Snyder, & J. Pater (Eds.), The Oxford handbook of developmental linguistics. Oxford: Oxford University Press.Find this resource:
Jarosz, B. (in press). Investigating the efficiency of parsing strategies for the Gradual Learning Algorithm. To appear in Dimensions of Stress. Cambridge University Press.Find this resource:
Jesney, K. (2009). Positional faithfulness, non-locality, and the Harmonic Serialism solution. In S. Lima, K. Mullin, & B. Smith (Eds.), Proceedings of NELS 39 (pp. 403–416). Amherst, MA: GLSA.Find this resource:
Jesney, K., &, Tessier, A.‑M. (2011). Biases in Harmonic Grammar: The road to restrictive learning. Natural Language and Linguistic Theory, 29, 251–290.Find this resource:
Johnson, C. D. (1972). Formal aspects of phonological description. The Hague: Mouton.Find this resource:
Kager, R. (1999). Optimality Theory. Cambridge University Press.Find this resource:
Kaplan, R., & Kay, M. (1994). Regular models of phonological rule systems. Computational Linguistics, 20(3), 331–378.Find this resource:
Karttunen, F. (1969). A note on morpheme structure in generative phonology. Proceedings of the 1969 conference on Computational Linguistics (pp. 1–10).Find this resource:
Karttunen, L. (1991). Finite-state constraints. Manuscript, Stanford University. Available at http://web.stanford.edu/~laurik/fsmbook/LSA-207/Articles/fsc91.pdf.
Legendre, G., Miyata, Y., & Smolensky, P. (1990). Harmonic Grammar—a formal multilevel connectionist theory of linguistic wellformedness: An application. In Proceedings of the twelfth annual conference of the Cognitive Science Society (pp. 884–891). Cambridge, MA: Lawrence Erlbaum.Find this resource:
Levelt, C., & van der Vijver, R. (2004). Syllable types in cross-linguistic and developmental grammars. In R. Kager, J. Pater, & W Zonneveld (Eds.), Fixing priorities: Constraints in phonological acquisition. Cambridge, U.K.: Cambridge University Press.Find this resource:
Lin, Y. (2002). Probably approximately correct learning of constraint ranking. MA thesis, University of California, Los Angeles.Find this resource:
Macken, M. (1980). The child’s lexical representation: The “puzzle–puddle–pickle” evidence. Journal of Linguistics, 16(1), 1–17.Find this resource:
Magri, B. (2012). Convergence of error-driven ranking algorithms. In Phonology, 29(2), 213–269.Find this resource:
Magri, G. (2014). The error-driven ranking model of the acquisition of phonotactics: How to keep the faithfulness constraints at bay. Proceedings of MORPHFSM.Find this resource:
Magri, G. (2016). Error-driven learning in OT and HG: A comparison. Phonology, 33(3), 493–532.Find this resource:
MacWhinney, B. (1975). Rules, rote and analogy in morphological formations by Hungarian children. Journal of Child Language, 2, 65–77.Find this resource:
McCarthy, J. J. (1998). Morpheme structure constraints and paradigm occultation. In M. C. Gruber, D. Higgins, K. Olson, & T. Wysocki (Eds.), Proceedings of the Chicago Linguistic Society 5, Vol. II: The Panels. Chicago: CLS.Find this resource:
McCarthy, J. J. (2000). Harmonic Serialism and Parallelism. In M. Hirotani, A. Coetzee, N. Hall and J. Y. Kim (Eds.), Proceedings of the North East Linguistic Society (Vol. 30). Amherst, MA: GLSA.Find this resource:
McCarthy, J. J. (2002). A thematic guide to Optimality Theory. Cambridge, U.K.: CUP.Find this resource:
McCarthy, J. J. (2008). The gradual path to cluster simplification. Phonology, 25, 271–319.Find this resource:
McCarthy, J. J., & Prince, A. (1995). Faithfulness and reduplicative identity. In J. Beckman, S. Urbanczyk, & L. Walsh Dickey (Eds.), University of Massachusetts Occasional Papers in Linguistics 18: Papers in Optimality Theory (pp. 249–384). Amherst, MA: GLSA.Find this resource:
Moore-Cantwell, C., & Pater, J. (2016). Gradient exceptionality in Maximum Entropy Grammars with lexically-specific constraints. Catalan Journal of Linguistics, 15.Find this resource:
Moreton, E. (2002). Structural constraints in the perception of English stop-sonorant clusters. Cognition, 84, 55–71.Find this resource:
Ota, M. (2004). The learnability of the stratified phonological lexicon. Journal of Japanese Linguistics, 20, 19–40.Find this resource:
Pajak, B., & Bakovic, E. (2010). Assimilation, anti-gemination and contingent optionality: The phonology of monoconsonantal proclitics in Polish. Natural Language and Linguistic Theory, 28(3), 643–680.Find this resource:
Pater, J. (2005). Learning a stratified grammar. In A. Brugos, M. R. Clark-Cotton, & S. Ha (Eds.), Proceedings of the 29th Boston University Conference on Language Development (pp. 482–492). Somerville, MA: Cascadilla.Find this resource:
Pater, J. (2008). Gradual learning and convergence. Linguistic Inquiry, 39(2), 334–345.Find this resource:
Pater, J. (2009). Weighted constraints in generative linguistics. Cognitive Science, 33, 999–1035.Find this resource:
Pater, J., & Barlow, J. (2003). Constraint conflict in cluster reduction. Journal of Child Language, 30, 487–526.Find this resource:
Pater, J., Staubs, R., Jesnsey, K., & Smith, B. (2012). Learning probabilities over underlying representations. In Proceedings of the Twelfth Meeting of ACL-SIGMORPHON: Computational Research in Phonetics, Phonology, and Morphology. Retrieved from http://aclweb.org/anthology/sigmorphon.html#2012_0.Find this resource:
Peperkamp, S., Le Calvez, R., Nadal, J. P., & Dupoux, E. (2006). The acquisition of allophonic rules: Statistical learning with linguistic constraints. Cognition, 101(3), B31–B41.Find this resource:
Potts, C., Pater, J., Jesney, K., Bhatt, R., & Becker, M. (2010). Harmonic Grammar with linear programming: From linear systems to linguistic typology. Phonology, 27, 77–117.Find this resource:
Prince, A., & Smolensky, P. (2004). Optimality Theory: Constraint interaction in generative grammar. Originally published in 1993. Oxford: Blackwell.Find this resource:
Prince, A., & Tesar, B. (2004). Learning phonotactic distributions. In R. Kager, J. Pater, & W Zonneveld (Eds.), Fixing priorities: Constraints in phonological acquisition. Cambridge, U.K.: Cambridge University Press.Find this resource:
Pruitt, K. (2010). Serialism and locality in constraint-based metrical parsing. Phonology, 27(3), 481–526.Find this resource:
Riggle, J. (2004). Generation, Recognition and Learning in Finite State Optimality Theory. Ph. Diss., University of Chicago.Find this resource:
Riggle, J. (2006). Probably approximately correct learning in Optimality Theory. Manuscript, University of Chicago.Find this resource:
Rosenblatt, M. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.Find this resource:
Smith, N. V. (1973). The acquisition of phonology: A case study. Cambridge, U.K.: Cambridge University Press.Find this resource:
Smolensky, P. (1996). On the comprehension/production dilemma in child language. Linguistic Inquiry, 27, 720–731.Find this resource:
Smolensky, P., & Legendre, G. (2006). The harmonic mind. MIT Press.Find this resource:
Stampe, D. (1969). The acquisition of phonetic representation. Papers from the Fifth Regional Meeting, Chicago Linguistics Society (pp. 443–454). Chicago: Chicago Linguistic Society.Find this resource:
Tesar, B. (1995). Computational Optimality Theory. PhD diss, University of Colorado, Boulder, CO.Find this resource:
Tesar, B. (1997). An iterative strategy for learning metrical stress in Optimality Theory. In The Proceedings of the 21St Annual Boston University Conference on Language Development (pp. 615–626). Somerville, MA: Cascadilla.Find this resource:
Tesar, B. (2004). Using inconsistency detection to overcome structural ambiguity. Linguistic Inquiry, 35(2), 219–253.Find this resource:
Tesar, B., Alderete, J., Horwood, G., Merchant, N., Nishitani, K., & Prince, A. (2003). Surgery in language learning. In G. Garding & M. Tsujimura (Eds.) Proceedings of the Twenty-Second West Coast Conference on Formal Linguistics (pp. 477–490). Somerville, MA: Cascadilla.Find this resource:
Tesar, B., & Prince, A. (2007). Using phonotactics to learn phonological alternations. CLS, 39(2), 209–237.Find this resource:
Tesar, B., & Smolensky, P. (1998). Learnability in Optimality Theory. Linguistic Inquiry, 29, 229–268Find this resource:
Tesar, B., & Smolensky, P. (2000). Learnability in Optimality Theory. Cambridge, MA: MIT Press.Find this resource:
Tessier, A.‑M. (2009). Frequency of violation and constraint-based phonological learning. Lingua, 119, 6–38.Find this resource:
Tessier, A.‑M. (2013). The nature of regressions in the acquisition of phonological grammars. Proceedings of Phonology 2013. Retrieved from http://journals.linguisticsociety.org/proceedings/index.php/amphonology/issue/view/1.Find this resource:
Tessier, A.‑M. (2016). Morpho-phonological acquisition. In J. Lidz, W. Snyder, & J. Pater (Eds.), The Oxford handbook of developmental linguistics. Oxford: Oxford University Press.Find this resource:
Tessier, A.‑M., & Jesney, K. (2014). Learning in Harmonic Serialism and the necessity of a richer base. Phonology, 31(1), 155–178.Find this resource:
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 1984, 1134–1142.Find this resource:
Yang, C. (2002). Knowledge and learning in natural language. Oxford: OUP.Find this resource:
(2.) For instance, by being hearing the name of the Star Trek character in both Spanish—“es.Scottie”—and English.
(3.) A necessarily batch learner might be one which needs to calculate some expected measures across a large training data set before it can generate anything fruitful; on this point see Jaeger (2007), and also discussion in Hayes and Wilson (2008) about their minimal training data set.
(4.) Because of space considerations, this article will not discuss the many impressive algorithms that aim to capture morphophonological alternations themselves—see for example MacWhinney (1975), Albright and Hayes (2003), Peperkamp, Le Calvez, Nadal, and Dupoux (2006), and Becker and Gouskova (2016), among many others.
(5.) The question of the initial state for constraint ranking is discussed at length below.