PRINTED FROM the OXFORD RESEARCH ENCYCLOPEDIA, LINGUISTICS. (c) Oxford University Press USA, 2018. All Rights Reserved. Personal use only; commercial use is strictly prohibited (for details see Privacy Policy and Legal Notice).

Subscriber: null; date: 15 November 2018

Connectionism in Linguistic Theory

Summary and Keywords

Connectionism is an important theoretical framework for the study of human cognition and behavior. Also known as Parallel Distributed Processing (PDP) or Artificial Neural Networks (ANN), connectionism holds that learning, representation, and processing of information in the mind are parallel, distributed, and interactive in nature. It argues for the emergence of human cognition as the outcome of large networks of interactive processing units operating simultaneously. Inspired by findings from neuroscience and artificial intelligence, connectionism is a powerful computational tool, and it has had a profound impact on many areas of research, including linguistics. Since the beginning of connectionism, many connectionist models have been developed to account for a wide range of important linguistic phenomena observed in monolingual research, such as speech perception, speech production, semantic representation, and early lexical development in children. Recently, the application of connectionism to bilingual research has also gathered momentum. Connectionist models are often precise in the specification of modeling parameters and flexible in the manipulation of relevant variables to address theoretical questions; they can therefore provide significant advantages in testing mechanisms underlying language processes.

Keywords: connectionism, neural networks, parallel distributed processing (PDP), language development, language processing, bilingualism, simple recurrent network (SRN), self-organizing map (SOM), DevLex models

1. Organization of the Article

Computational models have played a vital role in understanding human cognitive and linguistic behaviors. They offer particular advantages in dealing with complex interactions between variables that are often intertwined in natural language learning situations, because researchers can systematically bring target variables under tight experimental control to test theoretically relevant hypotheses (McClelland, 2009). Since the 1980s, a specific class of computational models has proven particularly useful for language studies. This is the class of models inspired by connectionism, Parallel Distributed Processing (PDP), or artificial neural networks.

The goal of this article is to guide the readers through some important issues in connectionism, and to give an integrative review on how connectionist models can be used effectively in language studies. In what follows, I will first introduce the basic concepts and philosophy of connectionist models and then describe some basic types of models along with some methodological considerations for constructing them. An overview of several connectionist language models will then be given, and the focus will be on a couple of influential ones in both monolingual and bilingual areas. I will also provide a critical analysis of a couple of challenges that face researchers in the field. The article will conclude with some further readings and online resources that will be useful for new researchers.

2. Basic Concepts

Some basic ideas of connectionism can be traced back to the 1940s, at the dawn of both neuroscience and computer science. However, it was not until the mid-1980s that connectionist perspectives became popular in the study of human cognition and language, after the publication of the two volumes on the Parallel Distributed Processing (PDP) framework (McClelland, Rumelhart, & PDP research group, 1986; Rumelhart, McClelland, & PDP research group, 1986). Today connectionism has become a powerful tool as well as a conceptual framework for researchers to understand many important issues in human cognition and language.

Particularly, connectionism emphasizes ‘brain-style computation,’ in that connectionist networks process information in a way similar to that of the real brain, albeit in a simplified form. The human brain is a huge network made up of billions of nerve cells (neurons) and trillions of connections among them. A neuron has dendrites, which receive signals from other neurons, and an axon, which sends signals to other neurons in the form of action potentials (spikes). Depending on the amount of signal it receives, a single neuron can be either ON (firing) or OFF (not firing), and it is the rate of firing of individual neurons that defines their neural activity. Neurons are “connected” through synapses, which are actually tiny gaps across which chemicals called neurotransmitters pass to transmit signals. Different synapses often have different strengths and different levels of effectiveness in transmitting signals between neurons. Metaphorically, they are like doors between neurons, which can be wide open, open to a certain extent, or even closed, in which case no information is transmitted through the synapse. At the level of the individual neuron, the electrochemical processes for information transmission are relatively simple, going from cell bodies to axons and passing through synapses to reach the dendrites of another neuron, involving action potential propagation and neurotransmitter trafficking along the chain. It is at the network level that information processing becomes more interesting, given that a single neuron is usually connected to thousands of other neurons. The current combination of synaptic strengths determines the neuronal teamwork, though the synaptic strengths are not fixed and can change dynamically in response to the learning environment. The ability of the human brain to derive the ‘optimal’ strengths for a neural network in solving any given problem is the basis of neural information processing that has inspired connectionist theories of learning, memory, and language.

With these considerations of brain features, connectionists can build artificial neural networks with two fundamental components: simple processing elements (often labeled by synonyms like units, nodes, or artificial neurons), and connections among these processing elements. Like real neurons, a node receives input from other nodes and can send output to other nodes. The input signals are accumulated and further transformed via a mathematical function to determine the activation value of the node. A given connectionist network can have varying numbers of nodes, many of which are connected, so that activations can spread from node to node via the corresponding connections. Like real synapses, the connections can have different levels of strength (weights), which can be adjusted according to certain learning algorithms, thereby modulating how strongly the activation of a source node influences a target node. In this way, the network can develop unique combinations of weights and activation patterns of nodes in representing different input patterns from the learning environment. Unlike traditional computer programs that are dedicated to specific tasks and are fixed a priori, the weights and activation patterns in most connectionist networks are allowed to adapt continuously during learning. It is these adaptive changes that make connectionist networks interesting as models of human behavior. Each individual neuron in the brain (or a node in the model) is not very powerful, but a simultaneously activated neural network makes human cognition possible and makes connectionist models powerful in simulating human cognition. It is worth noting, though, that the real brain is an extremely complicated system, and the process described above omits many biological details. Connectionist models are simulations rather than “reduplications” of human cognitive functions.
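As a rough illustration of how a single node turns its inputs into an activation value, the following Python sketch sums the weighted inputs and passes the result through a logistic function. The logistic function, the function names, and the numerical values are all assumptions chosen for illustration; the text above does not commit to a particular activation function.

```python
import math

def logistic(x):
    """Squash a net input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def node_activation(inputs, weights, bias=0.0):
    """Activation of one node: the weighted sum of its inputs,
    transformed by a (here logistic) activation function."""
    net_input = sum(a * w for a, w in zip(inputs, weights)) + bias
    return logistic(net_input)

# Three sending nodes with illustrative activations and connection weights.
# Positive weights act as excitatory connections, negative ones as inhibitory.
sending_activations = [1.0, 0.0, 1.0]
connection_weights = [0.8, -0.5, 0.3]
activation = node_activation(sending_activations, connection_weights)
print(round(activation, 3))
```

Changing a weight changes how strongly the corresponding sending node influences the receiving node, which is the quantity that learning algorithms adjust.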

There are two basic concepts closely related to connectionist networks. A key idea is emergentism, which indicates that thoughts (e.g., percepts, concepts, semantics) emerge from parallel interactions among the computing neurons in our brain. Another key feature of connectionist networks is their “dynamic” nature, which refers to the “non-static” status of a network, in that the network’s internal representation changes dynamically in response to the complexity and demands of the learning environment. Extending these two ideas into linguistics, connectionist linguistic models often embrace the philosophy that static linguistic representations (e.g., words, concepts, grammatical structures) are emergent properties, and can be dynamically acquired from the input environment (e.g., the speech data received by the learner).

3. Network Structure and Learning Rules

To build connectionist models one needs to select the architecture of the network and determine what learning algorithms to use to adjust connection weights. The learning algorithms can be classified roughly into two big groups: supervised and unsupervised learning. Supervised learning algorithms often calculate explicit error signals to adjust the weights of the connectionist networks while unsupervised learning does not rely on explicit error signals.

In linguistic research, a popular connectionist architecture with supervised learning is a network with information feeding forward through multiple layers of nodes (usually three layers corresponding to input, hidden, and output layers; see Figure 1a). The input layer receives information from input patterns (e.g., representations of acoustic features of phonemes), the output layer provides output patterns produced by the network (e.g., classifications of phonemes according to their sound), and the hidden layer forms the network’s internal representations as a result of the network’s learning to map input to output (e.g., the phonological similarities between phonemes like /b/ and /p/). The most influential supervised learning algorithm in psychological and cognitive studies is “backpropagation” or “backward propagation of the errors” (Rumelhart, Hinton, & Williams, 1986). According to backpropagation, each time the network is trained with an input-to-output mapping, the discrepancy (or error, δ‎) between the actual output (produced by the network based on the current connection weights) and the target output (provided by the researcher) is calculated, and is propagated back to the network so that the relevant connection weights can be changed relative to the amount of error. It is as if an external teacher informs the network about the error it makes and helps it to correct the error (hence supervised learning). Continuous weight adjustments in this way lead the network to fine-tune its connection weights in response to regularities in the input–output relationships. At the end of learning, the network derives a set of weight values that allows it to take on any pattern in the input and produce the desired pattern in the output.
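The training cycle described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in any of the cited studies: the logistic activation function, the network size, the learning rate, the random seed, and the toy input-output patterns are all hypothetical choices.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(pattern, w_ih, w_ho):
    """Feed activation forward: input -> hidden -> output.
    The last weight in each row is a bias weight on a constant input of 1."""
    x = pattern + [1.0]
    hidden = [logistic(sum(w * a for w, a in zip(row, x))) for row in w_ih]
    h = hidden + [1.0]
    output = [logistic(sum(w * a for w, a in zip(row, h))) for row in w_ho]
    return hidden, output

def train_step(pattern, target, w_ih, w_ho, rate):
    """One backpropagation step for a single input-to-output mapping."""
    hidden, output = forward(pattern, w_ih, w_ho)
    # Error at the output: (target - actual), scaled by the logistic slope.
    delta_out = [(t - o) * o * (1 - o) for t, o in zip(target, output)]
    # Propagate the error back to the hidden units through the old weights.
    delta_hid = [h * (1 - h) * sum(delta_out[k] * w_ho[k][j] for k in range(len(w_ho)))
                 for j, h in enumerate(hidden)]
    h_b, x_b = hidden + [1.0], pattern + [1.0]
    for k, row in enumerate(w_ho):          # adjust hidden-to-output weights
        for j in range(len(row)):
            row[j] += rate * delta_out[k] * h_b[j]
    for j, row in enumerate(w_ih):          # adjust input-to-hidden weights
        for i in range(len(row)):
            row[i] += rate * delta_hid[j] * x_b[i]

def total_error(data, w_ih, w_ho):
    """Summed squared error over all training patterns."""
    return sum((t - o) ** 2
               for p, tgt in data
               for t, o in zip(tgt, forward(p, w_ih, w_ho)[1]))

rng = random.Random(0)                      # fixed seed, for repeatability only
n_in, n_hid, n_out = 2, 3, 1
w_ih = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
w_ho = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid + 1)] for _ in range(n_out)]

# Hypothetical toy mapping: two binary input features, one target output.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]

initial_error = total_error(data, w_ih, w_ho)
for epoch in range(2000):
    for pattern, target in data:
        train_step(pattern, target, w_ih, w_ho, rate=0.5)
final_error = total_error(data, w_ih, w_ho)
print(initial_error, final_error)
```

The target values in `data` play the role of the external teacher: they are the only source of the error signal that drives the weight changes.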

Another important connectionist architecture based on supervised learning is the Simple Recurrent Network (SRN; Elman, 1990). A typical SRN is like a classic three-layer backpropagation (BP) network, but with a recurrent layer of context units (see Figure 1b). During training, temporally extended sequences of input patterns (e.g., a sequence of phonemes within a word) are sent to the network, and the goal or target of the network is often to predict the upcoming items in the sequences (i.e., predicting the next phoneme given the current phoneme input in the word). With the use of context units, an SRN model can keep a copy of the hidden-unit activations at a prior point in time, and then provide this copy along with the new input to the current stage of learning (hence “recurrent” connections). This method brings into the system a dynamic memory buffer and enables the connectionist networks to effectively capture the temporal order of information. Given that language unfolds in time, the SRN therefore provides a simple but powerful mechanism for identifying structural constraints in continuous streams of linguistic input. It is important to know that, other than predicting the upcoming items in a sequence, an SRN can also be trained to associate different linguistic aspects through the network’s input-output pairs (e.g., a sentence and the scene/situation that it describes, Frank, Haselager, & van Rooij, 2009; see next two sections for more on linguistic models based on SRN).
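The role of the context units can be sketched as follows. This Python fragment covers only the forward pass from input and context to the hidden layer; the output layer and the backpropagation training are omitted. The weights, the one-hot input codes, and the zero initial context are hypothetical values chosen for illustration.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def srn_step(input_vec, context, w_in, w_ctx):
    """Hidden activations computed from the current input plus the
    copy of the previous hidden state held in the context units."""
    return [logistic(sum(w * a for w, a in zip(w_in[j], input_vec)) +
                     sum(w * c for w, c in zip(w_ctx[j], context)))
            for j in range(len(w_in))]

def run_sequence(sequence, w_in, w_ctx):
    """Present a sequence item by item, copying hidden to context each step."""
    context = [0.0] * len(w_in)        # assumed initial context: all zeros
    states = []
    for input_vec in sequence:
        hidden = srn_step(input_vec, context, w_in, w_ctx)
        states.append(hidden)
        context = list(hidden)         # the copy that makes the network recurrent
    return states

# Hypothetical fixed weights for two hidden units and two input units.
w_in = [[1.0, -1.0], [-1.0, 1.0]]     # input-to-hidden weights
w_ctx = [[0.5, -0.5], [0.3, 0.8]]     # context-to-hidden weights

A, B = [1.0, 0.0], [0.0, 1.0]         # one-hot codes for two items
states_aa = run_sequence([A, A], w_in, w_ctx)
states_ba = run_sequence([B, A], w_in, w_ctx)
# The final input is A in both cases, but the hidden state differs
# because the context units carry information about the preceding item.
print(states_aa[-1], states_ba[-1])
```

Because the hidden state for the same current input depends on what came before, a trained SRN can use that state to predict the next item in a sequence.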

Figure 1. Two basic connectionist network architectures based on supervised learning. (a) A multilayer network trained by backpropagation; (b) A Simple Recurrent Network with a context layer, which keeps a copy of the hidden unit activations at a prior point in time.

Unlike supervised learning models, unsupervised learning models use no explicit error signal at the output level to adjust the weights. Several different types of unsupervised learning algorithms have been developed (see Hinton & Sejnowski, 1999). Among them, a popular type is the self-organizing map (or SOM; Kohonen, 2001).

The SOM usually consists of a two-dimensional topographic map for the organization of input representations, where each node is a unit on the map that receives input via the input-to-map connections (see Figure 2). At each training step of the SOM, an input pattern (e.g., the phonological or semantic information of a word) is randomly picked and presented to the network, which activates many units on the map, initially randomly. The SOM algorithm starts by identifying all the incoming connection weights to each unit on the map and, for each unit, comparing the combination of weights (called the “weight vector”) with the combination of values in the input pattern (the “input vector”). The unit whose weight vector is most similar to the input vector (at the start of training, only by chance) receives the highest activation and is declared the best matching unit (BMU). Once a unit becomes highly active for a given input, its weight vector and those of the units in its neighborhood are adjusted, such that they become more similar to the input and hence will respond to the same or similar inputs more strongly the next time. The size of the neighborhood of a BMU usually decreases as learning progresses, based either on a predetermined schedule (i.e., standard SOM) or on a specific function of learning outcome (i.e., the self-adjusted neighborhood function introduced in the DevLex-II model by Li, Zhao, & MacWhinney, 2007). This process continues until all the input patterns elicit specific response units in the map. As a result of this self-organizing process, the statistical structure implicit in the input is captured by the topographic structure of the SOM and can be visualized on a 2-D map as meaningful clusters. This topographic structure is a salient feature of SOM-based models since several areas in the cortex (e.g., the primary motor and somatosensory areas) are known to form topographic maps, which means that nearby locations in the brain represent adjacent areas on the body when stimulated.
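The training step described above can be sketched in Python as follows. This is a simplified illustration with several assumptions: similarity is measured by squared Euclidean distance, the neighborhood is a square region that shrinks on a fixed schedule, and the feature vectors, learning rate, and random seed are hypothetical.

```python
import random

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def best_matching_unit(som, input_vec):
    """Return (row, col) of the unit whose weight vector is closest to the input."""
    cells = ((r, c) for r in range(len(som)) for c in range(len(som[0])))
    return min(cells, key=lambda rc: squared_distance(som[rc[0]][rc[1]], input_vec))

def som_update(som, input_vec, rate, radius):
    """Move the BMU and its neighbors (within `radius` grid steps) toward the input."""
    bmu = best_matching_unit(som, input_vec)
    for r in range(len(som)):
        for c in range(len(som[0])):
            if max(abs(r - bmu[0]), abs(c - bmu[1])) <= radius:
                w = som[r][c]
                for i in range(len(w)):
                    w[i] += rate * (input_vec[i] - w[i])
    return bmu

rng = random.Random(0)                       # fixed seed, for repeatability only
rows, cols, dim = 9, 9, 5                    # a 9 x 9 map, as in Figure 2
som = [[[rng.random() for _ in range(dim)] for _ in range(cols)] for _ in range(rows)]

# Hypothetical binary features: feathers, two legs, flies, four legs, hunts.
words = {
    "dove": [1.0, 1.0, 1.0, 0.0, 0.0],
    "hen":  [1.0, 1.0, 0.0, 0.0, 0.0],
    "dog":  [0.0, 0.0, 0.0, 1.0, 0.0],
    "wolf": [0.0, 0.0, 0.0, 1.0, 1.0],
}

n_epochs = 40
for epoch in range(n_epochs):
    radius = max(0, 4 - epoch // 10)         # neighborhood shrinks: 4, 3, 2, 1
    rate = 0.5 * (1 - epoch / n_epochs)      # learning rate decays toward zero
    names = list(words)
    rng.shuffle(names)                       # present inputs in random order
    for name in names:
        som_update(som, words[name], rate, radius)

bmus = {name: best_matching_unit(som, vec) for name, vec in words.items()}
print(bmus)
```

No target output appears anywhere in the procedure; the map is shaped only by the inputs and by the neighborhood-based weight adjustment.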

Figure 2. A connectionist network architecture based on unsupervised learning: A self-organizing map (SOM) with 81 units (9 x 9). Here the black nodes indicate the Best Matching Unit (BMU) corresponding to the input (the semantic or phonological representation of a word), and the gray nodes are the BMU’s neighbors.

Unsupervised learning may also be realized in the brain, as captured by Hebbian learning, which is clearly related to long-term potentiation (LTP) in biological systems and emphasizes the principle that “neurons that fire together wire together” (Hebb, 1949). The Hebbian learning rule can be expressed simply as $\Delta w_{kl} = \beta \, \alpha_k \, \alpha_l$, where $\beta$ is a constant learning rate, $\Delta w_{kl}$ is the change in the weight of the connection from neuron k to neuron l, and $\alpha_k$ and $\alpha_l$ are the activations of neurons k and l. The equation indicates that the connection strength between neurons k and l will be increased as a function of their concurrent activities. Although Hebbian learning is not part of the SOM model, different self-organizing maps can be linked via adaptive connections trained by the Hebbian learning rule to simulate complex cognitive or linguistic behaviors.
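The Hebbian rule above translates almost directly into code. In this Python sketch the function name, the activation values, and the learning rate are hypothetical; the weight change for each connection is simply the learning rate multiplied by the two activations.

```python
def hebbian_update(weights, pre, post, rate):
    """Apply the Hebbian rule to every connection: the change in
    weights[k][l] is rate * pre[k] * post[l]."""
    for k, a_k in enumerate(pre):
        for l, a_l in enumerate(post):
            weights[k][l] += rate * a_k * a_l
    return weights

# Two sending units and three receiving units, all weights starting at zero.
weights = [[0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]
pre = [1.0, 0.0]           # only sending unit 0 is active
post = [0.0, 1.0, 1.0]     # receiving units 1 and 2 are active

for _ in range(3):         # the same co-activation occurs three times
    hebbian_update(weights, pre, post, rate=0.1)
print(weights)
```

Only the connections between co-active units grow; connections involving an inactive unit stay at zero, which is the sense in which units that fire together wire together.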

In the following two sections, I give a bird’s-eye view of several connectionist models of various linguistic aspects in both monolingual and bilingual settings. A couple of classic models that have been influential in the field are discussed in detail.

4. Connectionist Monolingual Models

4.1 Models of Word Processing

Many connectionist models have been applied to account for language processing in one language. McClelland and Rumelhart (1981) introduced an Interactive Activation (IA) model of visual word perception. In their model, there are three levels of nodes: (1) features of a letter such as curves, straight lines, or crossbars; (2) individual letters; and (3) words. Information at all three levels can interact during word recognition, in both “bottom-up” (features to letters to words) and “top-down” (words to letters to features) fashions. Within levels, nodes compete for activation (thus inhibiting each other); across levels, nodes either inhibit or excite each other. These inhibitory and excitatory connections give rise to the appropriate activation of patterns that capture the word recognition process. The IA model was able to account for several empirical findings, such as the Word Superiority effect. However, the weights in this model are fixed, and the model thus lacks connectionist learning mechanisms. Extending features of the IA model to speech, McClelland and Elman (1986) developed the TRACE model, which was the first large-scale interactive activation model of speech perception that provided a framework for investigating the dynamic interactions across levels of acoustic features, phonemes, and words as auditory signals unfold in the linguistic context. Since these early models, a large number of connectionist models have emerged to simulate both normal and impaired reading of English words. For example, in 1989, Seidenberg and McClelland introduced the Triangle Model of word learning, which has the orthography of words mapped to both the phonology and the meaning of words (hence the triangle) via layers of hidden units. A detailed discussion of connectionist models of language processing can be found in Rohde and Plaut (2003).

4.2 Models of Learning Morphology

Morphology learning has received special treatment in connectionism since Rumelhart and McClelland’s pioneering model in 1986 on the acquisition of the English past tense. Empirical studies have shown that young children tend to follow a “U-shaped” trajectory in learning the irregular past tenses in English: at first, they seem to have mastered a small number of correct inflectional forms of verbs (e.g., went, broke); at the second stage, they start to produce errors that were not seen at the earlier stage (e.g., saying goed as the past tense of go, or breaked as the past tense of break); finally, children recover from these errors and correctly use both regular and irregular past tenses. This type of U-shaped phenomenon has been traditionally interpreted by investigators as indicating two totally separate mechanisms underlying children’s acquisition of verb past tenses: one a dictionary-like rote mapping for irregular verbs, and the other the explicit representation of a rule of adding the suffix -ed to regular verbs (Pinker & Price, 1988). According to this “dual mechanisms” theory, at first, children learn past tenses only by rote memory, hence the correct use of a few irregular forms such as went and broke. The errors at the second stage represent the child’s discovery of a general rule controlling the formation of English regular past tenses; at this point they apply this rule over-generally to any new verb, including irregular verbs, thus leading to the “over-generalized” errors. In the end, children realize that there are rules and exceptions to the rule, finally acquiring both the regular and irregular forms by using the dual mechanisms.

While this dual-mechanism hypothesis seemed to account for the empirical data, Rumelhart and McClelland (1986) introduced a simple two-layer feed-forward connectionist model (without a hidden layer) to simulate this U-shaped pattern of acquisition. The R&M model was a pattern associator that, given a verb stem as input, learns a strong connection between the stem and the phonological form of its past tense. There were no dual mechanisms built into the model, only statistical associations between input (verb stems) and output (past tenses of verbs). Yet the R&M model was able to show the “U-shaped” patterns of learning, based on a single mechanism with no explicit rules. The R&M model provided a new way of thinking about linguistic knowledge, that is, that the notion of a rule is not a necessary component of language representation, only a useful way of language description. In 1991, Plunkett and Marchman introduced a three-layer (with a hidden layer) feed-forward network trained by a back-propagation algorithm to simulate the past tense problem. In this network, the input layer receives input patterns representing the phonological information of verb stems, the output layer provides output patterns representing phonological information of past tenses, and the hidden layer forms the network’s internal representations as a result of the network’s learning. Using this model, they were able to identify the “U-shaped” trajectory and simulate over-generalization errors produced by children. More important, the model captured the essential role of type versus token frequencies of verbs in the learning of English past tenses, avoiding many of the problems of the R&M model criticized by some researchers (e.g., staged input for regular vs. irregular forms, which was unrealistic in child learning; see Pinker & Price, 1988).

Several other connectionist models have also been developed to simulate the learning of other types of grammatical morphology, for example, plural formation of nouns (Plunkett & Juola, 1999), grammatical and lexical aspect (Zhao & Li, 2009) and reversive prefixes of verbs (Li & MacWhinney, 1996). These models all demonstrate that single mechanisms, such as those embodied in connectionist learning, can account for the acquisition of complex grammatical structures, without recourse to the existence of separate mechanisms of linguistic rules (e.g., regular past tense formation) versus associative exceptions (e.g., irregular past tense mappings) as suggested by Pinker and colleagues.

4.3 Models of Learning Syntax

Learning of syntax holds a special status in linguistics, as syntactic structures (e.g., hierarchical recursive structures) have formed the backbone of modern linguistic theories (e.g., generative linguistics). Many connectionist researchers believe that linguistic representations could emerge from the processing of linguistic input in ways that are learned in neural networks (see the R&M model above on linguistic rules). However, it has been challenging for connectionists to demonstrate that their models can learn syntactic structures. To address this issue, Elman (1993) successfully applied the Simple Recurrent Network (SRN) to learn the hierarchical recursive structure of English sentences. In his simulations, he provided simple sentences to the SRN first (e.g., the mice quickly ran away), followed by complex ones (the mice the cat chases quickly ran away), and then the most complex structures (the mice the cat the dog bites chases quickly ran away). The model’s success relied on (1) the recurrent architecture in which temporal order information can be recorded (see section 3, “Network Structure and Learning Rules,” above for details), and (2) the ‘starting small’ principle, in which the network was trained on incrementally more complex sentences, with hierarchical recursive structures seen in relative clauses with center embedding. These characteristics of the model show that connectionist models can develop temporal representations based on mechanisms similar to human working memory, and at the same time can simulate stages of syntactic acquisition by separating the contributions of memory and language in realistic learning.

Another important aspect related to learning syntax is the famous “systematicity” debate. Fodor and Pylyshyn (1988) have argued that connectionism cannot serve as a valid architecture of human cognition because connectionist networks lack “systematicity,” which is claimed to be a fundamental feature of the human mind and refers to our ability to apply correct generalizations from a limited number of seen examples to possibly infinite unseen instances. Since the start of the debate, many attempts have been made to demonstrate that connectionist linguistic models can show systematicity. For example, Frank, Haselager, & van Rooij (2009) applied an SRN to learn to comprehend sentences. Specifically, they created a “microlanguage” and a “microworld,” and trained the network to map the sentences from the microlanguage (as the input) to the corresponding events from the microworld (as the output target). The authors then systematically tested their trained network with novel sentences describing new situations. The results showed that the network had learned the syntax of the microlanguage and could generalize the knowledge to comprehend new sentence-event pairs previously unseen by it (a sign of systematicity). Similarly, Farkaš and Crocker (2008) also demonstrated systematicity in a connectionist model of sentence processing, based on a self-organizing map equipped with a recurrent component. These simulation results have weakened Fodor and Pylyshyn’s criticism of lacking systematicity in connectionism.

4.4 Models of Learning Semantic Structures

The learning of semantic and lexical categories or structures in the mental lexicon is another important research topic in connectionism, closely related to the learning of syntactic structures. Elman (1990) successfully used an SRN to capture semantic categories like nouns, verbs, and adjectives as language unfolds in time. Particularly, he applied hierarchical clustering analyses to the activation patterns of units in the hidden layer and demonstrated the emergence of semantic categories in the internal representation of the network. Another classic study is that of Ritter and Kohonen (1989), which showed how self-organizing maps can extract topographically structured semantic categories from the linguistic input. The authors tested a single SOM, with inputs representing the meanings of words generated by two methods. The first is a feature-based method, according to which a word’s meaning is represented by a vector and each dimension of this vector represents a possible descriptive feature or attribute of the concept. The value of each dimension can be zero or one, indicating the absence (0) or presence (1) of a particular feature for the target word. For example, the representations of dove and hen are very similar, except for the dimension that represents the flying feature (dove = 1, hen = 0). Specifically, Ritter and Kohonen generated a detailed representation of 16 animals based on 13 attributes, trained a SOM with the 16 animal words, and found that the network was able to form topographically organized representations of semantic categories associated with the 16 animal words. Their second method of representing word meanings is a statistics-based method, according to which the researchers generated a corpus consisting of three-word sentences randomly formed from a list of nouns, verbs, and adverbs (e.g., Dog drinks fast). A trigram window is applied to the corpus, and the co-occurrence frequencies of the word in the middle of the trigram with its two closest neighbors are calculated. This generates a co-occurrence matrix, which forms the basis of each word’s “average context vector,” a combination of the average of all the words preceding the target word and that of all the words following it. The researchers then used these vectors as input to the SOM, and training the SOM again yielded topographically structured semantic and grammatical categories on the map.
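The statistics-based method can be illustrated with a small Python sketch of the average context vector. It simplifies the procedure described above in ways that should be read as assumptions: each word is coded by a one-hot vector, every word position (not only the middle of a trigram) receives a context, and the four-sentence corpus is invented for illustration.

```python
from collections import defaultdict

# Hypothetical corpus of three-word sentences.
corpus = [
    ["dog", "drinks", "fast"],
    ["cat", "drinks", "slowly"],
    ["dog", "runs", "fast"],
    ["cat", "runs", "slowly"],
]

vocab = sorted({word for sentence in corpus for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

def average_context_vectors(corpus, index):
    """For each word, average the one-hot codes of the words that precede it
    and of the words that follow it, then concatenate the two averages."""
    n = len(index)
    before = {w: [0.0] * n for w in index}   # counts of preceding words
    after = {w: [0.0] * n for w in index}    # counts of following words
    n_before, n_after = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        for pos, word in enumerate(sentence):
            if pos > 0:
                before[word][index[sentence[pos - 1]]] += 1
                n_before[word] += 1
            if pos < len(sentence) - 1:
                after[word][index[sentence[pos + 1]]] += 1
                n_after[word] += 1
    vectors = {}
    for w in index:
        avg_before = [x / n_before[w] if n_before[w] else 0.0 for x in before[w]]
        avg_after = [x / n_after[w] if n_after[w] else 0.0 for x in after[w]]
        vectors[w] = avg_before + avg_after
    return vectors

vectors = average_context_vectors(corpus, index)
print(vectors["drinks"])
print(vectors["runs"])
```

In this toy corpus the two verbs occur in the same contexts and therefore receive identical vectors, as do the two nouns; this is the kind of distributional similarity that a SOM can then turn into neighboring regions on the map.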

4.5 Models of Lexical Development

Children’s early lexical development is another actively debated area in language development. Several connectionist models have been developed to simulate the phenomenon of the vocabulary spurt, the sudden acceleration of word learning when a child is about 18–22 months old. Plunkett, Sinha, Møller, and Strandsby (1992) presented a multi-layered auto-associative network that showed a vocabulary spurt in both comprehension and production. Regier (2005) provided a connectionist model that accounted for the occurrence of rapid vocabulary learning by highlighting children’s increasing capacity for selective attention as vocabulary grows.

A number of connectionist models of lexical development have also been based on SOM. Specifically, Li and colleagues have introduced a series of DevLex (Developmental Lexicon) models, which have provided a SOM-based developmental framework of language acquisition in accounting for patterns of early lexical development (Li, Farkaš, & MacWhinney, 2004; Li, Zhao, & MacWhinney, 2007). The original DevLex model was introduced in 2004, and it includes two growing SOMs, connected by links trained through Hebbian learning. The model was able to capture the emergence of semantic categories and age of acquisition effects in incremental lexical learning. The newest development of the DevLex models is DevLex-II (Li, Zhao, & MacWhinney, 2007), which successfully simulates patterns of early lexical development, including the vocabulary spurt. As seen in Figure 3, DevLex-II includes three basic levels for the representation and organization of linguistic information: phonological content, semantic content, and the output sequence of the lexicon. The core of the model is a SOM that handles lexical-semantic representation. This SOM is connected to two other SOMs, one for input (auditory) phonology, and another for articulatory sequences of output phonology. During training, the semantic representation, input phonology, and output phonemic sequence of a word are simultaneously presented to and processed by the network. This process can be seen as analogous to a child’s analysis of a word’s semantic, phonological, and phonemic information upon hearing a word. The associative connections between maps are trained via the Hebbian learning rule, which states that the weights of the associative connections between co-activated nodes on two maps will become increasingly stronger as learning and training progress (i.e., if they are frequently co-activated). The interconnected SOMs, along with the Hebbian associative weights, allow the researchers to simulate word comprehension (the activation of a word meaning via links from the phonological SOM to the semantic SOM) and word production (the activation of phonetic sequences via links from the semantic SOM to the output sequence SOM).

Figure 3. A sketch of the DevLex-II model.

Trained with inputs derived from a real linguistic environment (CHILDES parental corpus; MacWhinney, 2000), the DevLex-II model has been able to account for a wide variety of patterns in early lexical development, including the vocabulary spurt. Specifically, in DevLex-II, developmental changes such as the vocabulary spurt are modulated by lexical organization along with input characteristics. The onset of the vocabulary spurt is triggered by structured representations in the semantic, phonological, and phonemic organizations. Once these patterns have consolidated, the associative connections between maps can be reliably strengthened through learning to capture the systematic structural relationships between forms and meanings. At this point, word learning is no longer hampered by uncertainty and confusion on the maps, and the vocabulary spurt occurs. This approach is highly consistent with recent perspectives in developmental psychology and cognitive neuroscience, according to which early learning has a cascading effect on later development and learning itself shapes the cognitive and neural structures.

5. Connectionist Bilingual Models

In contrast to the flourishing research on connectionist modeling of a single language, only a handful of neural network models have been designed specifically to account for bilingualism, although this number is growing rapidly, as can be seen in a recent special issue of Bilingualism: Language and Cognition, edited by Li (2013). In this section, I will briefly review a few key connectionist models in bilingual processing, bilingual development, and the interactions of bilinguals’ two languages.

5.1 Models of Bilingual Processing

One of the best-known computational models of bilingualism is the Bilingual Interactive Activation (BIA) model (Dijkstra & van Heuven, 1998). The BIA model is a localist model based on the Interactive Activation (IA) model that we discussed in the last section. Incorporating the design features of the IA model into the study of bilingualism, the BIA model consists of four levels of nodes: features, letters, words, and languages. Special to the BIA model, apart from the modeling of two different lexicons, are the pre-determined language nodes (e.g., one for English and one for Dutch). Language nodes in BIA function as an important mechanism for the selection or inhibition of words in one or the other language. BIA argues for and implements the language-independent access hypothesis, according to which words from different languages are simultaneously activated during word recognition. The BIA model was later expanded as the BIA+ model (Dijkstra & van Heuven, 2002), in which nonlinguistic properties such as decision and task schemas were added to the original BIA model to account for a variety of empirical patterns. It is worth noting that, although the BIA models have received a great deal of empirical support in bilingual processing, they do not incorporate dynamic learning mechanisms that would allow for effects of developmental change, because they were designed to capture proficient bilingual speakers’ mental representation.

Based on the famous Triangle model (Seidenberg & McClelland, 1989), Filippi, Karaminis, and Thomas (2014) presented a connectionist model of language switching in bilingual production. Their work was based on simulations of behavioral experiments in which Italian-English adult bilinguals were asked to name visually presented words in the two languages (and their language switching costs were measured). The results from both their modeling and experimentation demonstrated a pattern of asymmetry: bilinguals showed a greater cost in reaction time when naming involved switching from their less fluent language to their dominant language than switching in the reverse direction. The authors argued that this switching asymmetry arises from the competition between the bilingual’s two languages during production and the resolution of this competition.

5.2 Models of Bilingual Development

Recently, developmental issues in bilingualism (such as second language development and age of acquisition (AoA) effects) have attracted the attention of many connectionists. Below, I discuss some connectionist models of bilingual development.

Thomas (1997) used a Bilingual Single Network (BSN) model to learn the orthography-to-semantics mapping in visual word recognition. The BSN used a standard three-layer network with the back-propagation algorithm to transform a word’s orthography (input) to a word’s semantic representation (output). It was trained on vocabulary sets from two simplified artificial languages, and the network was exposed to both languages, either in a balanced condition (equal amount of training) or an unbalanced condition (L1 trained three times as often as L2). The language membership of these pseudo words was explicitly marked in both the input and output layers of the network. After learning, the network’s internal representation (the activation pattern of the hidden units) of each pseudo word was analyzed by Principal Component Analysis (PCA). This analysis indicated that under both conditions, the network was able to develop distinct internal representations for L1 vs. L2, although in the unbalanced condition the L2 words were less clearly represented, compared with those in the balanced condition.
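The two training conditions and the explicit language marking can be sketched as follows. This is only an illustration of the idea, not the BSN code: the two-unit language marker, the binary orthographic codes, and the helper names `tag_with_language` and `build_schedule` are hypothetical.

```python
import random

def tag_with_language(pattern, language):
    """Append a two-unit language marker: L1 -> [1, 0], L2 -> [0, 1] (assumed coding)."""
    marker = [1, 0] if language == "L1" else [0, 1]
    return list(pattern) + marker

def build_schedule(l1_items, l2_items, l1_ratio=1, seed=0):
    """One training epoch: each L1 item appears l1_ratio times, each L2 item once."""
    epoch = [("L1", item) for item in l1_items] * l1_ratio
    epoch += [("L2", item) for item in l2_items]
    random.Random(seed).shuffle(epoch)   # present items in random order
    return epoch

# Hypothetical orthographic codes for two tiny artificial vocabularies.
l1_words = [(1, 0, 1, 0), (0, 1, 1, 0)]
l2_words = [(1, 1, 0, 0), (0, 0, 1, 1)]

balanced = build_schedule(l1_words, l2_words, l1_ratio=1)     # equal exposure
unbalanced = build_schedule(l1_words, l2_words, l1_ratio=3)   # L1 three times as often

# Input vectors as they would be presented to the network in one epoch.
inputs = [tag_with_language(item, lang) for lang, item in unbalanced]
```

Each element of `inputs` would then be fed to a three-layer back-propagation network. The hidden-unit activations collected after training are what a PCA would be run on.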

Incorporating sentence-level input, French (1998) tested a Bilingual SRN (BSRN) model that was trained on artificially generated sentences of the N-V-N structure in two artificial languages. The network was exposed to bilingual input as in the BSN model, but with the two artificial languages intermixed at the sentence rather than the word level, and with the input having a certain probability of switching from one language to the other. In addition, no explicit language membership marker was included in the model (i.e., no “language nodes” as in the BIA model). The model adopted the architecture of the SRN model, and the task was to predict the next word given the current word input in the sentence. Simulations with the BSRN model showed that distinct patterns of the two target languages emerged after training: words from the two languages became separated in the network’s internal representations (the hidden-node activations) as analyzed by PCA. The model provided support for the hypothesis that the bilingual input environment itself (mixed bilingual sentences in this case) is a sufficient condition for the development of a distinct mental representation of each language, without invoking separate processing or storage mechanisms for the different languages.
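The kind of input stream described here can be sketched in a few lines. The vocabulary, the switching probability, and the function names below are hypothetical and do not reproduce French’s (1998) materials. The `A_` and `B_` prefixes are only there to make the example readable; in a network each word would be an arbitrary code with no language marker.

```python
import random

# Hypothetical toy vocabularies for two artificial languages.
LEXICON = {
    "A": {"N": ["A_n1", "A_n2"], "V": ["A_v1", "A_v2"]},
    "B": {"N": ["B_n1", "B_n2"], "V": ["B_v1", "B_v2"]},
}

def make_stream(n_sentences, switch_prob, seed=0):
    """Concatenate N-V-N sentences, switching language between sentences
    with probability switch_prob."""
    rng = random.Random(seed)
    language = "A"
    stream = []
    for _ in range(n_sentences):
        if rng.random() < switch_prob:
            language = "B" if language == "A" else "A"
        words = LEXICON[language]
        stream += [rng.choice(words["N"]), rng.choice(words["V"]),
                   rng.choice(words["N"])]
    return stream

def prediction_pairs(stream):
    """(current word, next word) pairs: the network's input and its target."""
    return list(zip(stream, stream[1:]))

stream = make_stream(20, switch_prob=0.1)
pairs = prediction_pairs(stream)
```

Because language switches occur only between sentences, words of the same language tend to predict one another. This is the kind of distributional regularity that a simple recurrent network can pick up without being given language labels.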

The BSN and BSRN models were based on supervised learning algorithms. To explore unsupervised learning for bilingualism, Li and Farkaš (2002) proposed a self-organizing model of bilingual processing (SOMBIP), in which training data derived from an actual linguistic corpus were used for the model. Through Hebbian learning, the SOMBIP model connects two SOMs: one trained on phonological representations, and the other on semantic representations. The phonological representations of words were based on articulatory features of phonemes, whereas the semantic representations were derived from the extraction of co-occurrence statistics in child-directed, bilingual, parental speech. Chinese and English were the two target languages, and the activation patterns (the distribution of BMUs corresponding to the words) on the two SOMs were analyzed. Simulation results from SOMBIP indicated that the simultaneous learning of Chinese and English led to distinct lexical representations for the two languages, as well as structured semantic and phonological representations within each language. Consistent with the general patterns of BSN and BSRN, these results suggest that natural bilingual input contains sufficient information for the learner to differentiate the two languages. An interesting aspect is that SOMBIP provides a different way to assess proficiency. By having the network exposed to fewer sentences in L2, the model simulates a novice learner having limited linguistic experience, and this differs from a pre-determined, more artificial training schedule in simulating learner differences (e.g., BSN’s balanced vs. unbalanced training schedule, in which L1 was trained three times as often as L2 for the latter). This more natural way of modeling proficiency, interestingly, yielded results comparable to those from the unbalanced BSN: the “novice” network’s representation of the L2 was more compressed and less clearly delineated, compared to the “proficient” network. Specifically, compared with L1 words, the L2 words occupied only a small portion of the semantic map, and many confusions and overlaps of BMUs occurred within the L2 area.
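For readers unfamiliar with the notion of a best matching unit (BMU), the following sketch shows one simplified SOM training step. It is not the SOMBIP implementation: it uses a fixed learning rate and a square neighbourhood, whereas SOMs are usually trained with a decaying learning rate and a shrinking neighbourhood, and the map and input vectors are made-up toy values.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def best_matching_unit(som, x):
    """Return (row, col) of the unit whose weight vector is closest to x."""
    cells = ((r, c) for r in range(len(som)) for c in range(len(som[0])))
    return min(cells, key=lambda rc: squared_distance(som[rc[0]][rc[1]], x))

def train_step(som, x, rate=0.5, radius=1):
    """Move the BMU and its grid neighbours (within radius) toward input x."""
    br, bc = best_matching_unit(som, x)
    for r in range(len(som)):
        for c in range(len(som[0])):
            if max(abs(r - br), abs(c - bc)) <= radius:
                som[r][c] = [w + rate * (xi - w) for w, xi in zip(som[r][c], x)]
    return br, bc

# Hypothetical 1 x 4 map with 2-dimensional weight vectors.
som = [[[0.0, 0.0], [0.3, 0.3], [0.7, 0.7], [1.0, 1.0]]]
word_l1 = [0.0, 0.1]   # toy vector standing in for an L1 word
word_l2 = [1.0, 0.9]   # toy vector standing in for an L2 word

bmu_l1 = train_step(som, word_l1)
bmu_l2 = train_step(som, word_l2)
```

Here the two dissimilar inputs end up with BMUs at opposite ends of the map. Overlapping BMUs for different words, as reported for the L2 area of the novice network, would correspond to several inputs returning the same `(row, col)` pair.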

Zhao and Li (2010) extended their DevLex-II model to study the development of bilinguals’ lexical representations under different ages of acquisition (AoA) of L2. The model was trained to learn 1,000 English and Chinese words, with different onset times for the learning of the second language (L2, either English or Chinese) relative to that of the first language (L1, either Chinese or English). There were three learning scenarios: simultaneous learning of L1 and L2; early L2 learning; and late L2 learning. For simultaneous learning, the two lexicons were presented to the network and trained in parallel. For early L2 learning, the onset time of L2 input to the model was slightly delayed relative to that of L1 input. For late L2 learning, the onset time of L2 input was significantly delayed relative to that of L1. Specifically, the simultaneous learning situation is analogous to a situation in which children are raised in a bilingual family and receive linguistic inputs from the two languages simultaneously. The early learning situation could be compared to the situation in which bilinguals acquire their L2 early in life (e.g., in early childhood), while the late learning situation compares to that of a bilingual’s learning of L2 later in life (e.g., after puberty). The modeling results indicate that, when the learning of L2 is early relative to that of L1, functionally distinct lexical representations may be established for both languages. However, when the learning of L2 is significantly delayed relative to that of L1, the structural consolidation of the L1 lexicon becomes a hindrance to the L2 for establishing its distinct and independent lexical representation. These findings from DevLex-II provide a computational account of the age of acquisition effect by referencing the dynamic interaction and competition between the two languages in the bilingual mind. The reduced plasticity in L2 learning could be seen as the result of structural changes due to the learning system’s experience with L1, in that L1 consolidation adversely impacts the system’s ability to reorganize and restructure L2 relative to L1.
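The three learning scenarios differ only in when L2 input begins, which can be expressed as a simple exposure schedule. The number of epochs and the onset values below are hypothetical placeholders, not the settings used by Zhao and Li (2010).

```python
def exposure_schedule(n_epochs, l2_onset):
    """For each epoch, the lexicons presented to the network."""
    return [("L1", "L2") if epoch >= l2_onset else ("L1",)
            for epoch in range(n_epochs)]

# Hypothetical onset epochs for the three scenarios.
SCENARIOS = {"simultaneous": 0, "early_l2": 2, "late_l2": 8}

schedules = {name: exposure_schedule(10, onset)
             for name, onset in SCENARIOS.items()}

# Number of epochs in which the network sees L1 alone before L2 arrives.
l1_only_epochs = {name: sum(1 for langs in schedule if langs == ("L1",))
                  for name, schedule in schedules.items()}
```

The longer the L1-only phase, the more the map can consolidate around L1 before any L2 word is presented. In the account above, this consolidation is what hinders a late-learned L2 from establishing its own representation.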

DevLex-II was also extended to simulate cross-language semantic priming in connection with the age of acquisition effect (Zhao & Li, 2013). Cross-language priming has been a vital empirical method in the literature for testing semantic representations in bilinguals, and many studies have shown that in such a paradigm bilinguals respond faster to translation equivalents or semantically related words across languages than to unrelated pairs of words from the two languages (termed translation priming and semantic priming, respectively). In addition, translation priming effects are generally larger than semantic priming effects. It has also been observed that priming effects are stronger if participants are presented with L1 words as primes and L2 words as targets (i.e., the L1-to-L2 direction of priming), as compared with the situation in which participants are presented with L2 words as primes and L1 words as targets (i.e., the L2-to-L1 direction of priming). More interestingly, this “priming asymmetry” varies as a function of age of acquisition; for example, it is larger in the late L2 learning situation than in the early L2 learning situation (see Dimitropoulou, Duñabeitia, & Carreiras, 2011 for a review). Zhao and Li implemented a spreading activation mechanism in DevLex-II. As a result, the model clearly displayed cross-language priming patterns consistent with the empirical literature mentioned above, namely stronger translation priming than semantic priming, and the “priming asymmetry.”

6. Critical Issues

Objections to connectionism as a valid model of human language and learning have been raised since its resurgence in the 1980s. As a result, connectionists have identified some critical issues for neural network modeling and have paid special attention to them. In the following, I will introduce several of them.

6.1 Representing Linguistic Features

Some early connectionist models were criticized as “toy models” that lacked linguistic and psychological reality. Critics raised the issue of whether results from such models could make direct contact with the statistical properties of natural linguistic input to which the learner or language user was exposed. Connectionist language researchers have taken this issue seriously. Many of them agree that “input representativeness” is crucial for computational modeling of language (Christiansen & Chater, 2001), and they have been concerned with how to accurately represent various linguistic aspects in their models. A crude way to represent lexical entries is to use the so-called ‘localist’ representation, according to which a single, unitary processing unit in the network, randomly picked by the modeler, is assigned a numerical value to represent a linguistic item (e.g., the meaning, sound, or other linguistic property of a word).

A different way, embraced by the PDP models in general, is to represent lexical entries as distributed representations, according to which a given lexical item is represented by multiple nodes and their weighted connections, as a distributed pattern of activation of relevant micro-features. For example, methods to derive distributed semantic representations of words can be roughly classified into two groups. One is the feature-based representation, in which empirical data are often used to help generate the features describing the meaning of words (e.g., McRae, Cree, Seidenberg, & McNorgan, 2005). The other is the corpus-based representation, which derives the meaning of words through co-occurrence statistics in large-scale linguistic corpora. Hyperspace Analogue to Language (HAL; Burgess & Lund, 1997) and Latent Semantic Analysis (LSA; Landauer & Dumais, 1997) are two widely used corpus-based semantic representation methods. Zhao, Li, and Kohonen (2011) recently developed the Contextual Self-Organizing Map, a software package that can derive corpus-based semantic representations based on word co-occurrences in multiple languages, and the method has been shown to capture unique linguistic features of different languages.

It is important that researchers choose an appropriate method to represent linguistic features based on careful evaluations of their simulation goals. The localist representation offers simplicity and efficiency for modeling, given its one-to-one mapping between linguistic entities and units. However, if one wants to simulate effects like similarity-based semantic priming, distributed representations might be a better choice because they can easily capture the similarities among concepts (see Zhao & Li, 2013 for an example).
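The contrast between the two kinds of representation can be made concrete with a small sketch. The corpus below is invented, and the co-occurrence counting is a deliberately simplified stand-in for corpus-based methods; it is not an implementation of HAL, LSA, or the Contextual Self-Organizing Map.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def localist(vocab):
    """One unit per word: any two distinct words are orthogonal."""
    return {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
            for i, w in enumerate(vocab)}

def cooccurrence(sentences, vocab, window=1):
    """Count which vocabulary words occur within `window` positions of each word."""
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for sentence in sentences:
        for pos, word in enumerate(sentence):
            lo, hi = max(0, pos - window), min(len(sentence), pos + window + 1)
            for other in sentence[lo:pos] + sentence[pos + 1:hi]:
                vectors[word][index[other]] += 1.0
    return vectors

# Invented toy corpus: "dog" and "cat" share a context word, "car" does not.
corpus = [["dog", "barks"], ["cat", "meows"], ["dog", "runs"],
          ["cat", "runs"], ["car", "stops"]]
vocab = sorted({w for s in corpus for w in s})

loc = localist(vocab)
dist = cooccurrence(corpus, vocab)

sim_localist = cosine(loc["dog"], loc["cat"])       # 0.0: no similarity encoded
sim_distributed = cosine(dist["dog"], dist["cat"])  # shared context "runs"
sim_unrelated = cosine(dist["dog"], dist["car"])    # no shared context
```

With localist codes every pair of words is equally dissimilar, so graded similarity has to be built into the connections by hand. With the distributed codes, "dog" is closer to "cat" than to "car" simply because of the contexts they share.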

6.2 Plasticity and Stability of the Models

Many classic connectionist models have been challenged for their plasticity (or lack of it) in learning new knowledge. When faced with a sequential learning task (such as learning a second language), they may encounter a critical issue called the “plasticity-stability” dilemma (see Mermillod, Bugaiska, & Bonin, 2013). In particular, a connectionist network needs plasticity to learn new knowledge, but too much plasticity often causes it to lose its stability for old knowledge; conversely, a network that is too stable often cannot adapt itself very well to the new learning task. Taking second language learning as an example, if we train a network to acquire an L1 lexicon with 500 words and then train it on another 500 words in L2, in many traditional networks, the addition of L2 words may disrupt the network’s knowledge of L1. In other words, the network will lose its representation of L1 words because of learning new words. Such “catastrophic interference” (French, 1999) is, of course, unlike human learning. Researchers should be aware that remedies may be needed for this problem depending on their research goals. For example, some connectionist models have adjustable network structures, or involve dynamic unit growth, such as the cascade-correlation network (Shultz & Fahlman, 2010), the constructivist neural network (Ruh & Westermann, 2009), and the Growing Neural Gas model (Fritzke, 1995). Our DevLex models (Li, Farkaš, & MacWhinney, 2004; Li, Zhao, & MacWhinney, 2007; Zhao & Li, 2010) also integrated certain features (like the self-adjusted neighborhood function and the growing SOMs) to resolve the plasticity-stability problem. A similar example can also be found in the interconnected growing SOMs model from Cao, Li, Fang, Kaufmann, and Kröger (2014).
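Interference of this kind can be reproduced in a very small system. The sketch below uses a single linear unit trained with the delta rule, which is far simpler than the networks discussed in this article. The two patterns, their targets, the learning rate, and the number of epochs are arbitrary choices made for illustration, and the interleaved condition is included only as a point of comparison.

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train(w, patterns, epochs=200, rate=0.1):
    """Delta-rule training of one linear unit; returns the updated weights."""
    w = list(w)
    for _ in range(epochs):
        for x, target in patterns:
            error = target - predict(w, x)
            w = [wi + rate * error * xi for wi, xi in zip(w, x)]
    return w

def squared_error(w, patterns):
    return sum((t - predict(w, x)) ** 2 for x, t in patterns)

# Hypothetical patterns: the "L1" and "L2" items share one input unit.
l1_items = [([1.0, 1.0, 0.0], 1.0)]
l2_items = [([0.0, 1.0, 1.0], -1.0)]

w_l1 = train([0.0, 0.0, 0.0], l1_items)                      # learn L1 only
w_sequential = train(w_l1, l2_items)                         # then L2 only
w_interleaved = train([0.0, 0.0, 0.0], l1_items + l2_items)  # both mixed

l1_error_before = squared_error(w_l1, l1_items)
l1_error_after = squared_error(w_sequential, l1_items)
l1_error_interleaved = squared_error(w_interleaved, l1_items)
```

In this toy case the unit answers the L1 item correctly after L1 training, but its error on that same item becomes large once it has been trained on the L2 item alone, because both items rely on the shared weight. When the two items are interleaved, a single weight setting that satisfies both is found.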

6.3 Network Structure Selection and Parameters Setup

From the previous discussions, it should be clear that choosing an appropriate network structure is key to successfully simulating a specific linguistic question. For example, if the researcher is interested in semantic representation and organization, a SOM-based architecture might be highly appropriate and relevant given its topography-preserving feature. But if the researcher is interested in simulating the processing of temporally ordered components (e.g., syntax), a network with a recurrent portion resembling the SRN structure might be a better candidate. Other connectionist network structures, such as “hybrid” models that combine both supervised and unsupervised learning methods, and models that have adjustable structures or involve dynamic unit growth (see discussion in section 6.2 “Plasticity and Stability of the Models”), may also be considered based on the researchers’ simulation needs.

Regardless of the type of learning algorithm a researcher uses, every model will involve a set of “parameters” that need to be considered and set up. Among them, there are research variables whose effects on the network’s output are of interest to the researchers. A benefit of computational modeling is that researchers can easily bring a research variable under tight experimental control by systematically manipulating different levels of the variable, and they can see the effects while holding other variables constant. In this way, we can investigate the outcome caused by the use of different levels of a specific research variable, which may be difficult to manipulate in the natural environment.

Other than the research variables, there are also some “free parameters” in a model that often need to be adjusted by the modelers. For example, how large should the network be to adequately learn the target aspect of the second language? What value should be assigned to the learning rate? Researchers need to make decisions on such practical questions before a simulation is run. Often, there are no correct answers to these questions, as each model involves a different degree of complexity and task difficulty, and the researcher needs to draw on experience with previous models and on conventional wisdom in setting appropriate values for the free parameters. Inevitably, criticism may arise toward such practice in a particular model. One caution here is that the researcher should avoid introducing too many free parameters in a simulation project. Although more free parameters usually mean a better fit of the model to the target data, they may compromise the external validity of the network in relating to the phenomena being simulated. As in empirical studies, findings from overly tightly controlled experiments with too many variables under consideration may not generalize to other situations.

7. Theoretical Linguistics Inspired by Connectionism

In the previous sections, the usefulness of connectionist network models for language studies has been demonstrated. At this point, it should be clear that computational models, especially connectionist models, have much to offer to the understanding of linguistic behavior. Connectionism argues in favor of a computational approach to the complexities associated with the dynamical interactions of the learning system with the linguistic environment. Through the many connectionist language studies of the last three decades, researchers have demonstrated the biological plausibility and psychological reality of the idea that complex human linguistic behavior can emerge from simple neural-chemical processes. In this perspective, the rules of language can be considered as the product of neural network processes in the long human evolutionary history, in much the same way as a hexagonal structure emerges from the simple acts of honeybees packing small amounts of honey into the honeycomb from multiple directions (Bates, 1984).

Many influential modern linguistic theories, such as Optimality Theory (OT), can find their roots in connectionism. Indeed, the core of OT is the computation of the optimal representation with maximum harmony, which is similar to the search for minimum error in connectionist models (see Smolensky & Legendre, 2006). Bybee (2001) argues that phonological patterns can be understood as emergent consequences of language usage (e.g., frequency of a word and its co-occurrence with other words), and she argues against domain-specific modules in the brain for language processing. These ideas are highly consistent with connectionism. With the profound impact of connectionism in the last three decades, it has become increasingly clear that language can no longer be studied as a purely innate symbolic system, as advocated in classic works on linguistics, and it is expected that more linguistic theories and studies will be inspired by connectionism.

Further Reading

Comparison of Neural Network Simulators (2015, May 20). Emergent simulation environment.

Elman, J., Bates, A., Johnson, A., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethinking innateness: A connectionist perspective on development. Cambridge, MA: MIT Press.

Kohonen, T. (2001). Self-organizing maps (3d ed.). Berlin, Germany: Springer.

Li, P. (2009). Lexical organization and competition in first and second languages: Computational and neural mechanisms. Cognitive Science, 33, 629–664.

Li, P., & Zhao, X. (2012). Connectionism. In M. Aronoff (Ed.), Oxford Bibliographies Online: Linguistics. New York: Oxford University Press.

Li, P., & Zhao, X. (2013). Self-organizing map models of language acquisition. Frontiers in Psychology, 4, 828.

MacWhinney, B., & Li, P. (2008). Neurolinguistic computational models. In B. Stemmer & W. Whitaker (Eds.), Handbook of the neuroscience of language (pp. 229–236). London: Elsevier Science Publisher.

McClelland, J. (2009). The place of modeling in cognitive science. Topics in Cognitive Science, 1, 11–28.

McClelland, J. (2014). Explorations in parallel distributed processing: A handbook of models, programs, and exercises.

Ritter, H., & Kohonen, T. (1989). Self-organizing semantic maps. Biological Cybernetics, 61, 241–254.

Rumelhart, D. (1989). The architecture of mind: A connectionist approach. In M. Posner (Ed.), Foundations of cognitive science. Cambridge, MA: MIT Press.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation. In D. Rumelhart, J. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructures of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.

Spencer, J. P., Thomas, M. S. C., & McClelland, J. L. (Eds.) (2009). Toward a unified theory of development: Connectionism and dynamic systems theory re-considered. Oxford: Oxford University Press.

Spitzer, M. (1999). The mind within the net: Models of learning, thinking, and acting. Cambridge, MA: MIT Press.

Westermann, G., Ruh, N., & Plunkett, K. (2009). Connectionist approaches to language learning. Linguistics, 47, 413–452.

References


                            Bates, E. (1984). Bioprograms and the innateness hypothesis. Behavioral and Brain Sciences, 7, 188–190.Find this resource:

                              Burgess, C., & Lund, K. (1997). Modeling parsing constraints with high-dimensional context space. Language and Cognitive Processes, 12, 177–210.Find this resource:

                                Bybee, J. (2001). Phonology and language use. Cambridge, U.K.: Cambridge University Press.Find this resource:

                                  Cao, M., Li, A., Fang, Q., Kaufmann, E., & Kröger, B. J. (2014). Interconnected growing self-organizing maps for auditory and semantic acquisition modeling. Frontiers in Psychology, 5, 236.Find this resource:

                                    Christiansen, M. H., & Chater, N. (2001). Connectionist psycholinguistics: Capturing the empirical data. Trends in Cognitive Sciences, 5, 82–88.Find this resource:

                                      Dijkstra, T., & van Heuven, W. (1998). The BIA model and bilingual word recognition. In J. Grainger & A. M. Jacobs (Eds.), Localist connectionist approaches to human cognition (pp. 189–225). Mahwah, NJ: Erlbaum.Find this resource:

                                        Dijkstra, T., & Van Heuven, W. (2002). The architecture of the bilingual word recognition system: From identification to decision. Bilingualism: Language and Cognition, 5(3), 175–197.Find this resource:

                                          Dimitropoulou, M., Duñabeitia, J. A., & Carreiras, M. (2011). Two words, one meaning: Evidence of automatic co-activation of translation equivalents. Frontiers in Psychology, 2, 188.Find this resource:

                                            Elman, J. (1990). Finding structure in time. Cognitive Science, 14, 179–211.Find this resource:

                                              Elman, J. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48(1), 71–99.Find this resource:

                                                Farkaš, I., & Crocker, M. W. (2008). Syntactic systematicity in sentence processing with a recurrent self-organizing network. Neurocomputing, 71(7), 1172–1179.Find this resource:

                                                  Filippi, R., Karaminis, T., & Thomas, M. (2014). Language switching in bilingual production: Empirical data and computational modelling. Bilingualism: Language and Cognition, 17, 294–315.Find this resource:

                                                    Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3–71.Find this resource:

                                                      Frank, S. L., Haselager, W. F., & van Rooij, I. (2009). Connectionist semantic systematicity. Cognition, 110(3), 358–379.Find this resource:

                                                        French, R. M. (1998). A simple recurrent network model of bilingual memory. In M. A. Gernsbacher & S. J. Derry (Eds.), Proceedings of the 20th Annual Conference of the Cognitive Science Society (pp. 368–373). Mahwah, NJ: Erlbaum.Find this resource:

                                                          French, R. M. (1999). Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3, 128–135.Find this resource:

                                                            Fritzke, B. (1995). A growing neural gas network learns topologies. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems 7: Proceedings of the 1994 conference (pp. 625–632). Cambridge, MA: MIT Press.Find this resource:

                                                              Hebb, D. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.Find this resource:

                                                                Hinton, G. E., & Sejnowski, T. J. (1999). Unsupervised learning: Foundations of neural computation. Cambridge, MA: MIT press.Find this resource:

                                                                  Kohonen, T. (2001). Self-organizing maps (3d ed.). Berlin, Germany: Springer.Find this resource:

                                                                    Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.Find this resource:

                                                                      Li, P. (2013). Computational modeling of bilingualism. [Special issue]. Bilingualism: Language and Cognition, 16(2), 241–366.Find this resource:

                                                                        Li, P., & Farkaš, I. (2002). A self-organizing connectionist model of bilingual processing. In R. Heredia & J. Altarriba (Eds.), Bilingual sentence processing (pp. 59–85). North-Holland, Netherlands: Elsevier Science Publisher.Find this resource:

                                                                          Li, P., Farkaš, I., & MacWhinney, B. (2004). Early lexical development in a self-organizing neural network. Neural Networks, 17, 1345–1362.Find this resource:

                                                                            Li, P., & MacWhinney, B. (1996). Cryptotype, overgeneralization, and competition: A connectionist model of the learning of English reversive prefixes. Connection Science, 8(1), 3–30.Find this resource:

                                                                              Li, P, Zhao, X., & MacWhinney, B. (2007). Dynamic self-organization and early lexical development in children. Cognitive Science: A Multidisciplinary Journal, 31, 581–612.Find this resource:

                                                                                MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk, transcription, format and programs (Vol. 1). Mahwah, NJ: Lawrence Erlbaum.Find this resource:

                                                                                  McClelland, J., & Rumelhart, D. (1981). An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88, 375–407.Find this resource:

                                                                                    McClelland, J., Rumelhart, D., & the PDP Research Group (1986). Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 2). Cambridge, MA: MIT Press.Find this resource:

                                                                                      McClelland, J. L. (2009). The place of modeling in cognitive science. Topics in Cognitive Science, 1, 11–28.Find this resource:

                                                                                        McClelland, J. L., & Elman, J. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.Find this resource:

                                                                                          McRae, K., Cree, G. S., Seidenberg, M. S., & McNorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4), 547–559.Find this resource:

                                                                                            Mermillod, M., Bugaiska, A., & Bonin, P. (2013). The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in Psychology, 4, 504.Find this resource:

                                                                                              Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73–193.Find this resource:

Plunkett, K., & Juola, P. (1999). A connectionist model of English past tense and plural morphology. Cognitive Science, 23(4), 463–490.

Plunkett, K., & Marchman, V. (1991). U-shaped learning and frequency effects in a multi-layered perceptron: Implications for child language acquisition. Cognition, 38, 43–102.

Plunkett, K., Sinha, C., Møller, M. F., & Strandsby, O. (1992). Symbol grounding or the emergence of symbols? Vocabulary growth in children and a connectionist net. Connection Science, 4(3–4), 293–312.

Regier, T. (2005). The emergence of words: Attentional learning in form and meaning. Cognitive Science, 29(6), 819–865.

Ritter, H., & Kohonen, T. (1989). Self-organizing semantic maps. Biological Cybernetics, 61, 241–254.

Rohde, D. L. T., & Plaut, D. C. (2003). Connectionist models of language processing. Cognitive Studies, 10, 10–28.

Ruh, N., & Westermann, G. (2009). Simulating German verb inflection with a constructivist neural network. In J. Mayor, N. Ruh, & K. Plunkett (Eds.), Connectionist models of behavior and cognition (Vol. 2, pp. 313–324). London: World Scientific.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation. In D. Rumelhart, J. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.

Rumelhart, D., & McClelland, J. L. (1986). On learning the past tenses of English verbs. In J. L. McClelland, D. E. Rumelhart, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 2), Psychological and biological models (pp. 216–271). Cambridge, MA: MIT Press.

Rumelhart, D. E., McClelland, J. L., & the PDP Research Group (Eds.) (1986). Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1), Foundations. Cambridge, MA: MIT Press.

Seidenberg, M. S., & McClelland, J. L. (1989). A distributed developmental model of word recognition and naming. Psychological Review, 96, 523–568.

Shultz, T. R., & Fahlman, S. E. (2010). Cascade-correlation. In C. Sammut & G. I. Webb (Eds.), Encyclopedia of machine learning (Part 4/C, pp. 139–147). Heidelberg, Germany: Springer-Verlag.

Smolensky, P., & Legendre, G. (Eds.) (2006). The harmonic mind: From neural computation to optimality-theoretic grammar (2 vols.). Cambridge, MA: MIT Press.

Thomas, M. S. C. (1997). Connectionist networks and knowledge representation: The case of bilingual lexical processing. Doctoral diss., Oxford University.

Zhao, X., & Li, P. (2009). The acquisition of lexical and grammatical aspect in a developmental lexicon model. Linguistics, 47, 1075–1112.

Zhao, X., & Li, P. (2010). Bilingual lexical interactions in an unsupervised neural network model. International Journal of Bilingual Education and Bilingualism, 13, 505–524.

Zhao, X., & Li, P. (2013). Simulating cross-language priming with a dynamic computational model of the lexicon. Bilingualism: Language and Cognition, 16, 288–303.

Zhao, X., Li, P., & Kohonen, T. (2011). Contextual self-organizing map: Software for constructing semantic representation. Behavior Research Methods, 43, 77–88.