date: 20 March 2018

Languages of the World

Summary and Keywords

About 7,000 languages are spoken around the world today. The actual number depends on where the line is drawn between language and dialect—an arbitrary decision, because languages are always in flux. But specialists applying a reasonably uniform criterion across the globe count well over 2,000 languages in Asia and Africa, while Europe has just shy of 300. In between are the Pacific region, with over 1,300 languages, and the Americas, with just over 1,000. Languages spoken natively by over a million speakers number around 250, but the vast majority have very few speakers. Something like half are thought likely to disappear over the next few decades, as speakers of endangered languages turn to more widely spoken ones.

The languages of the world are grouped into perhaps 430 language families, based on their origin, as determined by comparing similarities among languages and deducing how they evolved from earlier ones. As with languages, there is considerable disagreement about the number of language families, reflecting our meager knowledge of many present-day languages and even sparser knowledge of their history. The figure 430 comes from Glottolog, which actually lists them all. While the world’s language families may well go back to a smaller number of original languages, even to a single mother tongue, scholars disagree on how far back current methods permit us to trace the history of languages.

While it is normal for languages to borrow from other languages, occasionally a totally new language is created by mixing elements of two distinct languages to such a degree that we would not want to identify one of the source languages as the mother tongue. This is what led to the development of Media Lengua, a language of Ecuador formed through contact among speakers of Spanish and speakers of Quechua. In this language, practically all the word stems are from Spanish, while all of the endings are from Quechua. Just a handful of languages have come into being in this way, but less extreme forms of language mixture have resulted in over a hundred pidgins and creoles currently spoken in many parts of the world. Most arose during Europe’s colonial era, when European colonists used their language to communicate with local inhabitants, who in turn blended vocabulary from the European language with grammar largely from their native language.

Also among the languages of the world are about 300 sign languages used mainly in communicating among and with the deaf. The structure of sign languages typically has little historical connection to the structure of nearby spoken languages.

Some languages have been constructed expressly, often by a single individual, to meet communication demands among speakers with no common language. Esperanto, designed to serve as a universal language and used as a second language by some two million, according to some estimates, is the prime example, but it is only one among several hundred would-be international auxiliary languages.

This essay surveys the languages of the world continent by continent, ending with descriptions of sign languages and of pidgins and creoles. A set of references grouped by section appears at the very end. The main source for data on language classification, numbers of languages, and speakers is the 19th edition of Ethnologue (see Resources), except where a different source is cited.

Keywords: languages, language family, language history, language classification, sign language, pidgin, creole

1. Europe

1.1 Indo-European

Most of Europe’s languages belong to the Indo-European family, which has the following branches: Celtic, Germanic, Italic, Greek, Albanian, Balto-Slavic, Armenian, Indo-Iranian, Anatolian, and Tocharian.

1.1.1 Celtic

Celtic, which extended across much of Europe as far east as present-day Turkey 2,000 years ago, has undergone gradual contraction since the ascendance of the Romans in Europe, and with the spread of English and French the Celtic languages have long been confined to parts of Britain, Ireland, and western France. The two main branches of modern Celtic are Brythonic and Goidelic. In the Brythonic branch are Welsh, Cornish, and Breton; the Goidelic branch includes Irish, Scottish Gaelic, and Manx.

Gaulish, a third branch, went extinct but has recently undergone restoration attempts, as have Manx and Cornish, which also were extinct. In fact, all present-day Celtic languages have seen revitalization efforts. This is happening even with Welsh—hardly an endangered language with 562,000 speakers in the 2011 census. Currently, Wales has school programs aimed at getting a greater proportion of ethnic Welsh, who number nearly 2,400,000, to learn to speak the language. The same is happening with Breton, spoken by over 200,000 in Brittany in northwestern France, but “no longer exclusively, predominately, or even commonly used by the population in any city, town, or village in Brittany,” according to Adkins (2013). As in Wales, school programs in Brittany since at least the 1970s have aimed to get young people speaking a variety of their ethnic tongue.

1.1.2 Germanic

Germanic’s two living branches, North and West, have been grouped into a superbranch called Northwest Germanic, paired with a Gothic branch that went extinct largely in the Middle Ages, though isolated traces of Crimean Gothic remained until the late 18th century. The North Germanic languages are Swedish, Danish, Norwegian, Icelandic, and Faroese. West Germanic includes English, German, Dutch, and Frisian. Two of these are paired with a sister language that is also spoken by significant numbers: Dutch with Afrikaans, and German with Yiddish.

1.1.3 Italic

This is the ancestral branch of the modern Romance languages, all descended from a colloquial form of Latin. About 2,500 years ago, the Italic branch included not just Latin but also Oscan, Umbrian, and Faliscan, but these languages have no modern descendants. The modern descendants of Latin include French, Catalan, Spanish, Portuguese, Italian, Romanian, Sardinian, Romansch, Ladin, Friulian, Occitan, and Judeo-Spanish.

1.1.4 Greek and Albanian

Modern Greek is the only living descendant of the Greek branch, also called Hellenic. Albanian, similarly, is the only descendant of the Albanian branch.

1.1.5 Balto-Slavic

This group has Baltic and Slavic subbranches. The official languages of Baltic countries Lithuania and Latvia make up the Baltic subbranch. Slavic has three divisions: Eastern (Russian, Ukrainian, and Belarusian), Southern (Serbo-Croatian, Macedonian, Slovenian, and Bulgarian), and Western (Polish, Czech, Slovak, and Sorbian).

1.1.6 Indo-Iranian

The languages of this branch are spoken in Asia. See section 3.1.

1.1.7 Armenian

Armenian is here considered a language of Europe, though a good case could be made for including it in Asia. Like Greek and Albanian, the Armenian branch has just one language, with a major division between Eastern and Western dialects. The standard language of Armenia is in the Eastern Armenian group, which also includes the dialects of Armenian communities in Iran, Russia, Georgia, and their environs. Texts from Armenian Cilicia from the 11th to the 14th centuries ce are the first to show a differentiated Western dialect. Many dialects of Western Armenian were obliterated by the Armenian genocide, but the Western Armenian standard and its dialects are found in Turkey (especially Istanbul), the Levant, and émigré communities in the West. Armenian is of special interest to linguists because of retentions from Indo-European, notably all seven of its noun cases and the irregular retention of initial laryngeals.

1.1.8 Anatolian and Tocharian

The languages of these branches were spoken in Asia. See section 3.1.1.

1.2 Uralic

Three important languages in this family are Finnish, Estonian, and Hungarian. These three are traditionally grouped into a branch called Finno-Ugric. But while Finnish and Estonian are closely related members of the Finnic branch of Uralic, Hungarian’s membership in a sister branch to Finnic is under challenge; Ethnologue has dropped Finno-Ugric from its listing and now casts Hungarian as a separate Uralic entity. See Salminen (2002) for arguments. The remaining languages of Uralic are smaller ones found in northern parts of Europe and Asia.

1.3 Caucasus Area

The area of the Caucasus Mountains and its environs between the Caspian and Black Seas includes Georgia, Armenia, and Azerbaijan and parts of neighboring countries. This relatively small region may have up to around 40 highly diverse languages, falling into three families: Nakho-Dagestanian, Abkhaz-Adyghean, and Kartvelian. The most important Nakho-Dagestanian language is Chechen. Abkhaz-Adyghean is made up of Abkhaz and Adyghe and is best known among linguists for systems with 60 or more contrasting consonants but very few vowels. The major Kartvelian language is Georgian, with four million speakers. Ethnologue combines Nakho-Dagestanian and Abkhaz-Adyghean into a single family, North Caucasian. Nakho-Dagestanian and Abkhaz-Adyghean are also known by the respective names Northeast (or simply East) Caucasian and Northwest (or West) Caucasian.

1.4 Basque

Basque is an isolate spoken in the Western Pyrenees by about half a million, some in France but most in Spain. Its history in this location is widely thought to go back several millennia, antedating the more recent Indo-European migrations to the region. There have been attempts to identify Basque with a wide variety of groups, including Kartvelian, Afro-Asiatic, and Iberian, but none has attracted much support. Recent DNA evidence reinforces the notion of Basque descent from an ancient population of farmers and hunters (Günther et al., 2015).

1.5 Turkish

Turkish, a language of Europe and Asia, belongs to the Turkic group, described in the section on Asia.

2. Africa

Africa’s extraordinary linguistic diversity is threatened by the possible extinction of half or more of its languages, which some predict by the end of the century due to competition from other languages. The current count exceeds 2,000 languages, grouped into just a few families.

The most revolutionary aspects of Greenberg’s (1955, 1963) classification of African language families largely stand, though with many adjustments by later experts in the different languages. Many other questions still remain open. For example, Greenberg recognized Khoisan as a family, but later scholars have tended to set a higher bar for establishing genetic relationships, leading most to reject it as a family and to defer judgment on particular groupings into branches. The unity of Nilo-Saharan is also called into question, and despite detailed comparative work by Bender (1996–1997) and Ehret (2001), some reject Nilo-Saharan as a valid genetic unit. For Niger-Congo, the status of some member branches—Kordofanian, Mande, Dogon, and Ijoid—has been challenged, though Niger-Congo itself is widely recognized as a valid family.

The Afro-Asiatic family is well established, though there are debates about subgrouping. For example, do Semitic, Berber, and Cushitic together form a separate branch, as Bender (1996–1997) contends? Within Cushitic, Greenberg’s classification included Omotic, which many now regard as a distinct branch, while Glottolog fails to recognize Omotic as an established group at all. Within Niger-Congo, there are a number of unanswered questions, many revolving around the constituency of its most complex branch, Benue-Congo, which uncontroversially includes all the Bantu languages and many more. Among the changes, the Kwa languages are now reduced to what Greenberg called Western Kwa, and the remaining languages have been moved from Greenberg’s Kwa into distinct branches, though experts still differ on their precise classification. For details and references, see Bendor-Samuel and Hartell (1989) and the references in Nordhoff et al. (2013).

2.1 Afro-Asiatic

This is the northernmost family, with a few hundred languages spanning all of North Africa and the Middle East, as well as two smaller areas of sub-Saharan Africa. The six branches of Afro-Asiatic are Semitic, Berber, Chadic, Cushitic, Omotic, and Egyptian. The Semitic branch has 78 languages, including Arabic, the first language of up to 300 million throughout North Africa and widely spoken in the Middle East. Among the world’s languages, Arabic ranks fourth in the number of speakers. Other important Semitic languages are Hebrew, which shares official status in Israel with Arabic, and several Ethiopic languages. Amharic, the official language of Ethiopia and the first language of 21 million, is a South Ethiopic language. In the North Ethiopic branch is Tigrigna, an official language of Eritrea spoken by 7 million.

The term Afro-Asiatic was used by Joseph Greenberg to replace the designation Hamito-Semitic, which posited a division between the Semitic branch (named for Biblical figure Shem) and a putative branch named for Biblical figure Ham. The notion that Hamitic languages formed a unified branch seemingly reflected factors like speakers’ typical occupations and a lighter skin color than black Africans to the south. Greenberg argued that extraneous factors like these had no place in language classification, which should be based solely on linguistic data. Comparing languages from the different groups classed as Hamitic, Greenberg concluded that the evidence did not support their grouping into a single branch.

The Berber branch of Afro-Asiatic is spoken in the foothills of the Atlas Mountains in Morocco and Algeria and, spottily, in neighboring countries. Cushitic gets its name from Cush, the son of Ham. The several dozen languages of this group are spoken mainly in Ethiopia and Somalia, with a few in Kenya and Tanzania. Chadic languages, numbering close to 200 in all, are mainly spoken in the countries surrounding Lake Chad and are dominant in northern Nigeria. By far the most widely spoken is Hausa, with 25 million native speakers. The languages of the Omotic branch, numbering over two dozen, are all spoken in southwestern Ethiopia. The Egyptian branch, thanks to hieroglyphs, can be traced back before 3000 bce. Ancient Egyptian was the ancestor of Coptic, which was spoken in Egypt but was gradually replaced by Arabic until it died out as a spoken language roughly 400 years ago. Since then Coptic has survived as a liturgical language.

2.2 Nilo-Saharan

The approximately 200 languages of this family occupy a band extending from the Nile region to the Sahara desert. For a relatively small family, they are quite diverse typologically, leaving some doubt as to whether the Nilotic and Saharan branches really deserve to be grouped into a family. Reflecting this, Glottolog divides them into two separate families, Nilotic and Saharan.

2.3 Niger-Congo

The great majority of languages in sub-Saharan Africa are members of the Niger-Congo family. Its 1,538 languages make it the world’s largest language family, and only the Indo-European and Sino-Tibetan language families have more speakers than Niger-Congo. Ideas about the respective genetic affiliations of well-known groups within Niger-Congo have changed substantially over the last half-century. This has been the case with Kwa, Mande, Gur, Atlantic, and Benue-Congo, among others. To date, the truly remarkable event in the classification of this family remains Greenberg’s (1955, 1963) demonstration that Bantu—a group of 538 languages covering most of Central and Southern Africa—was, along with other languages called Bantoid, a subgroup within a group now called East Benue-Congo, most of whose other languages are spoken in Nigeria and Cameroon. This discovery—which took ten years before gaining the wide acceptance it has today—not only challenged earlier assumptions about linguistic classification but also opened the door to hypotheses about Bantu origins. The currently accepted view is that Bantu originated in southeastern Nigeria and expanded east and south from there.

2.4 Khoisan

Among the languages of the world, some are poorly studied and go back so far in time that it is hard to trace their genetic origins. This is the case with Khoisan, which is generally not recognized as an established family but as a set of 27 languages—some with just a handful of speakers—that are likely not to belong to the other three established families of African languages. Ermisch (2008) presents what is known, along with the residual problems.

2.5 Austronesian

Off the southeastern coast of Africa is the island of Madagascar, home to Malagasy, a Malayo-Polynesian language brought over by the island’s earliest settlers maybe 1,500 years ago. For more on Malayo-Polynesian, see the subsection on Austronesian in the section on Oceania.

3. Asia

Asia is home to 60% of the world’s population and nearly 30% of the world’s languages. These are grouped into just a handful of major families, leaving out several important isolates, and due to long periods of contact, there’s less diversity than one might expect. The downside is that the contact situation has made it difficult to classify genetic relationships with certainty in some important cases. And it’s worth mentioning some areal features for various subregions.

3.1 Indo-European

The Indo-European languages of Europe were discussed in section 1.1. This section describes the Indo-European languages of Asia.

3.1.1 Anatolian and Tocharian

Both of these branches are long extinct. Anatolian’s replacement by Greek is linked to the conquests of Alexander the Great. The Tocharian branch became extinct with the expansion of Turkic Uyghur tribes in the 9th century ce. Tocharian manuscripts from a few centuries prior to extinction, uncovered in the early 20th century, provided information that led scholars to reassess key assumptions about Proto-Indo-European and its descendants. Anatolian inscriptions from a much earlier era, about two millennia prior, similarly reshaped what had been known. Gamkrelidze and Ivanov (1990) offer a highly readable synthesis and summary of their research.

3.1.2 Indo-Iranian

Indo-Iranian has two large branches, Indo-Aryan and Iranian. Among the over two hundred Indo-Aryan languages, Hindi and Urdu are official languages of India and Pakistan, respectively, and many consider them dialects of a single language. Kachru’s (2008) linguistic sketch describes Hindi and Urdu as closely related, mentioning the special case of Hindustani, an essentially colloquial language that has been called a co-dialect of Hindi and Urdu. Hindustani is the language once promoted by Gandhi and the Indian National Congress as a tool of national unity. For the Hindustani controversy, see Kachru (2008).

The largest language of the Iranian component of Indo-Iranian is Persian, with estimates exceeding 50 million native speakers in Iran. Written records of Old Persian go back to the 6th century bce. Other important languages in the Iranian branch are Pashto, mainly spoken in Afghanistan and Pakistan, and Kurdish, mainly spoken in Turkey, Iraq, and Iran.

3.2 Turkic

The approximately 40 languages of this family extend from Macedonia to Siberia, Central Asia, and western China. Despite the vastness of this area, the languages themselves are typologically quite similar: agglutinative, with vowel harmony involving both backness and rounding.

3.3 Mongolic

The Mongolic languages are a group of about a dozen spoken in Mongolia and in adjacent areas of the Russian Federation and China. Mongolian, with over six million speakers, is by far the largest language in the family and the official language both of Mongolia and of the Inner Mongolian Autonomous Region of China.

3.4 Tungusic

The 11 languages of this family are scattered through Siberia, the Far East of Russia, and northwestern China, but most are endangered and some are nearly extinct. That includes Manchu, the language of the founders of the Qing Dynasty, which ruled China for nearly three centuries up to 1912. The 2016 edition of Ethnologue lists only 20 speakers for Manchu, though over ten million are ethnically Manchu.

3.5 Altaic Area

The Altaic area extends from Turkey across the Altai Mountain area of Central and East Asia to Siberia. Altaic has been regarded by some as a family comprising Turkic, Mongolic, and Tungusic, and for a few even including as distant members Japonic and Korean. Versions of the Altaic hypothesis still have adherents, even though this notion has been cast into doubt as criteria have been challenged and evidence has been rejected as based largely on shared typological similarities, a position summarized in Unger (1990). Despite this, adherents continue to make a case, among them Miller (1991), Georg et al. (1999), and Robbeets (2005). The more conservative consensus is that many resemblances among languages in this linguistic area could have come from language contact rather than a shared ancestor. This view is reflected in Ethnologue and Glottolog, among others.

3.6 Dravidian

Dravidian languages are spoken primarily in southern India, though some are also found further north in the Indian subcontinent. The major literary languages are Tamil, Malayalam, Kannada, and Telugu, each one the first language of tens of millions. More is known about the history of Dravidian than about many other language families, thanks to the long literary periods of the four major languages.

Questions have been raised about Dravidian similarities to Uralic and Altaic, among several others. Austerlitz (1971) dismissed these, and Krishnamurti (2003), briefly surveying archeological and DNA literature along with linguistic evidence in his foundational work on Dravidian, seconds the conclusion that the linguistic arguments behind the proposed genetic relationships are tenuous and speculative.

Dravidian morphology is mainly agglutinative but lacks the long strings of affixes found in other agglutinative languages. The typical word order is SOV. Dravidian’s three-way contrast in coronal stops (dental, alveolar, and retroflex) can be traced back to proto-Dravidian. Sanskrit, an Indo-Aryan language, owes its retroflex consonants to Dravidian, from which they are thought to have spread by diffusion.

3.7 Sino-Tibetan

The languages of this family are spoken in China, the Himalayas, and Burma. The division into Chinese and Tibeto-Burman branches is customary, as espoused by Matisoff (2003), though a few experts, including van Driem (2007), still question the grouping of Sinitic as a separate sister branch to Tibeto-Burman, along with many particulars. Tibeto-Burman, with well over 400 languages, is especially problematic because of the inaccessibility of many languages in the Himalayas, not to mention that van Driem (2015, p. 141) finds them “endangered with imminent extinction.” Overall, the lower-level groupings within Tibeto-Burman are more certain than the higher-level ones, leading van Driem (2001) to posit a “Fallen Leaves” model that recognizes clumps of closely related languages without identifying where on the family tree they fell from. Still, Ethnologue offers a full family tree. Sino-Tibetan was at one time thought to include languages farther south, such as the Tai-Kadai languages and the Hmong-Mien (Miao-Yao) languages, but the similarities among these languages are probably better attributed to areal diffusion, including massive lexical borrowing from Chinese.

3.7.1 Chinese

Member languages of the Chinese (or Sinitic) branch are sometimes called dialects, especially in China, but this stretches the normal meaning of the term “dialect” too far, since the 14 languages that make up Chinese are far from mutually intelligible, even though they share the same writing system and many grammatical properties. Each of the Chinese languages of course has dialects. Ethnologue lists five major dialects for Mandarin (which also goes by the name Guanhua): Huabei Guanhua (Northern Mandarin), Xibei Guanhua (Northwestern Mandarin), Xinan Guanhua (Southwestern Mandarin), Jinghuai Guanhua (Eastern Mandarin), and Jiangxia Guanhua (Lower Yangtze Mandarin). Other sources divide the dialects differently, due not only to differences of linguistic and geographical criteria but also to centuries of diffusion of linguistic features. For discussion, see Kurpaska (2010) and Yan (2006). With over a billion speakers total, Mandarin’s dialects have many subdialects as well.

Linguistic diffusion is the general pattern in the historical development of Chinese, due to over a dozen massive population movements going back to the 7th century bce and continuing to the present, each migration involving hundreds of thousands and often millions of people. Complicating these scenarios is the fact that in most cases, the migrations were to areas already settled by speakers of Chinese or other languages, often resulting in language mixture. The history of these migrations and their linguistic effects is traced by LaPolla (2001).

3.7.2 Tibeto-Burman

As already noted, most of the languages of this branch are endangered. As a group, they have many linguistic traits in common, including SOV order and agglutinative verb structure. Two word-order exceptions are the Karenic languages (Myanmar) and Bai (China), which have the SVO order characteristic of Sinitic, though unlike Sinitic, Karen and Bai are also relatively agglutinative. Karen and Bai both stand out enough from the rest of Tibeto-Burman to inspire attempts to classify them outside of Tibeto-Burman proper. Benedict’s (1976) proposed sister to Sinitic, labeled Tibeto-Karenic, with Tibeto-Burman as a daughter, has been ruled out, while more recently several scholars have taken up the case for linking Bai with Sinitic. See Wang (2005) for a brief survey with references.

3.8 Austro-Asiatic

The Austro-Asiatic family extends across south Asia from India to Vietnam. The Munda branch is found in northeastern India, surrounded by Indo-European and Dravidian languages that have influenced its languages greatly over the ages. The Munda languages are agglutinative, with SOV word order, making them typologically very different from the rest of the family. Austro-Asiatic includes two important national languages, Vietnamese and Khmer (Cambodian). These two languages were grouped, along with many others, into a branch called Mon-Khmer, a grouping still accepted by Ethnologue but challenged by Sidwell (2009).

Vietnamese has borrowed massively from Chinese and was originally written with Chinese characters. Vietnamese and a few others in this family have developed phonological tones, and still others are thought to be in the process of developing them.

3.9 Hmong-Mien (Miao-Yao) and Tai-Kadai

These two families were once regarded as branches of Sino-Tibetan, and the languages of both families show many influences from Chinese. The Hmong-Mien (Miao-Yao) languages are spoken in scattered areas across southern China and nearby countries of Southeast Asia. The Tai-Kadai languages extend from China south to Thailand, Laos, Myanmar, and Vietnam and include the national languages Thai and Lao. Both families share a number of typological traits: most of their languages are SVO with isolating morphology and contrastive tone that is associated with creaky or breathy voice quality.

3.10 Paleosiberian Area

The name Paleosiberian applies to a set of four languages or language groups of Siberia with no established genetic relationship but sharing some typological features—agglutinative word structure and, with exceptions, SOV word order.

One of these is Ket, reduced to about 200 speakers and unrelated to any other extant language. It is the last surviving member of the Yeniseian family and is unlike the rest of Paleosiberian in several respects. It is tonal and has a highly agglutinative verbal system with complex agreement patterns, features that make it look like Na-Dene in North America. The case for a genetic relationship between the two has been made by Vajda (2010, 2011). For arguments pro and con, see Kari and Potter (2010), Campbell (2011), and Kiparsky (2014, pp. 65–67). Implications of this finding for Beringian migrations are pursued by Sicoli and Holton (2014).

Also in the Paleosiberian area are the Chukotko-Kamchatkan and Yukaghir families and Nivkh, a language with perhaps 200 speakers.

3.11 Korean and Japanese

Two of the major languages of East Asia, Korean and Japanese, are widely considered isolates, or nearly so in the case of Japanese, by far the dominant language in Japonic, a family of twelve languages. The remaining 11 languages of Japonic are the Ryukyuan group of the Ryukyu Islands. Some versions of the Altaic hypothesis include Korean and Japanese in a family with Turkic, Mongolic, and Tungusic. Another isolate of Asia is Burushaski (northeastern Pakistan).

4. Oceania

Oceania, which includes Australia and most of the island territories of the central and southern Pacific and Indian oceans, is home to the Austronesian family and to two very large language groups, the Australian and the Papuan groups.

4.1 Austronesian

The 1,250+ languages of this family are distributed across Oceania from Madagascar to Easter Island and total well over 350 million speakers. All but 25 of these languages are Malayo-Polynesian; the rest are aboriginal languages of Taiwan.

The dominant category, Central-Eastern Malayo-Polynesian, has well over half of the languages classified as Malayo-Polynesian but only a few million speakers total, and it is not generally accepted as a valid linguistic grouping. The remaining Malayo-Polynesian languages are found in 17 smaller groups, some of whose languages are widely spoken and highly important politically. Among these are:

  • Javanese, the language of nearly 90 million, centered in Java, Indonesia.

  • Filipino, the national language and an official language of the Philippines, used by close to 50 million, including L2 speakers. The variety associated with native speakers, who number over 20 million, is called Tagalog.

  • Sundanese, the language of about 34 million in Java.

  • Malay, an official language of Malaysia along with Mandarin and English, and the language of more than 50 million.

  • Malagasy, spoken by 18 million.

Blust (2013) offers a recent and comprehensive account of the linguistic and anthropological aspects of this family, including internal linguistic groupings, the linguistic structure of its languages, sociolinguistic considerations, and archeological evidence backing up the linguistic groupings. Adelaar and Himmelmann (2005) cover a similar range of topics.

4.2 Papuan Languages

Estimates run to as many as a thousand languages in an area about a quarter of the size of India, making New Guinea the most linguistically diverse region in the world (Foley, 2000, p. 357). Major groupings have been proposed by Greenberg (1971), Wurm (1982), and Ross (2005). Greenberg put all the languages into a single family and included some others from outside New Guinea, but the evidence for this has not generally been deemed credible. Wurm (1982) posited 10 Papuan phyla plus isolates, based entirely on lexicostatistic and typological evidence that others found unconvincing (Foley, 1986). A more recent grouping by Ross (2005), based essentially on evidence from pronouns, has also failed to find wide acceptance. One is left for now with Foley’s (1986) classification, with several dozen families and a similar number of isolates. Correlated with this is extreme typological variation across the families, with morphological types ranging from isolating to polysynthetic. Foley’s Papuan families average about 25 members in size, with the exception of Trans-New Guinea, with 482 member languages in Ethnologue, a figure that experts agree is subject to much revision because the family’s boundaries with others remain unclear. The uncertainty is reflected in Glottolog, which lists only Nuclear Trans New Guinea, with 315 languages.

4.3 Australia

This continent has been inhabited for 50,000 years, but the time frame for language classification is limited to the last 5,000 or so. As a result, we know very little about the historical connections among Australia’s languages. Worse, the number of vigorous Aboriginal languages today is a fraction of what it was before Europeans settled there in the 18th century. Of the 250-odd languages of Australia in 1788, more than half are extinct, and of the remainder, fewer than two dozen are used and learned by the youngest generation.

Beginning with Hale (1966), many sources divide the continent’s original languages into two groups, Pama-Nyungan and Non-Pama-Nyungan, but even this rudimentary grouping is complicated by large-scale phonological and grammatical diffusion. Dixon, author of many standard reference works on Australian languages, among them Dixon (2002), diverges markedly from the others by simply dividing the languages into 50 groups representing different areas, though among them some genetic clusters may be found. For Dixon, Pama-Nyungan “cannot be supported as a genetic group. Nor is it a useful typological grouping” (Dixon, 2002, p. 53). The problem with applying standard methods toward reconstructing a language tree for Australia, as Dixon sees it, is that Australia is unique, in part due to widespread diffusion, whereby a language “will tend to become more like its neighbors” (Dixon, 2002, p. 448). For alternative studies from a vantage point that differs markedly, see Bowern and Koch (2004).

Phonologically, Australian languages tend to be simple in some ways (usually with three-vowel systems) and complex in others, with as many as four contrasting articulations among the coronal consonants. Morphologically, Non-Pama-Nyungan languages have noun class systems and verbal concord prefixes, and some have extensive noun incorporation constructions. In Pama-Nyungan, morphology, especially in nouns, is of a simpler agglutinative type, with suffixes but no prefixes. Most Australian languages have split ergativity, a common pattern being ergative-absolutive alignment for nouns but nominative-accusative alignment for pronouns. Word order tends to be very free, but there is evidence that clauses are best analyzed as verb-final; see Mushin and Baker (2008).

5. The Americas

The past and present states of indigenous languages in the Americas are entirely different as a result of colonization by Europeans. North America is estimated to have been host at one time to nearly 300 distinct languages (Mithun, 1999, p. 1). Since then, over a hundred have gone extinct, and practically all of the rest are endangered. The 2010 U.S. Census Bureau report found 169 Native North American languages to be spoken in the home, with a total speaking population of less than half a million. By far the largest is Navajo, with nearly 170,000.

Central and South America are home to a few much larger languages, spoken by several million. Still, language endangerment is also the rule there. Of perhaps 1,700 pre-Columbian languages, fewer than 700 remain (Campbell, 1997) and of these, most are spoken by populations of several thousand or fewer.

The languages of the Americas are often divided into three geographical areas: North America, Mesoamerica, and South America. Greenberg’s (1987) classification grouped the languages into three “super-families” that he called Eskimo-Aleut, Na-Dene, and Amerind. Of these, the most controversial is Amerind, a grouping widely contested for reasons summarized by Campbell (2012, p. 19), drawing on Paul Rivet’s classification of South American languages in the first half of the 20th century: “Greenberg’s subgroups have been met with skepticism for a number of reasons, including the underanalyzed nature of the presented data, the perpetuation of old misunderstandings [. . .], and the fact that recent findings may suggest entirely different groupings.”

5.1 North America

The approximately 300 languages of native North America are grouped by Golla et al. (2007) into 14 major families and 19 minor families, with an additional 25 isolates. The major families are Eskimo-Aleut, Na-Dene, Algic, Wakashan, Salishan, Utian, Plateau, Cochimi-Yuman, Uto-Aztecan, Kiowa-Tanoan, Siouan-Catawba, Caddoan, Muskogean, and Iroquoian. These and the remaining groupings in Golla et al. (2007) represent a compromise rather than a consensus, and it is unclear whether any individual, including the authors themselves, accepts them in toto.

5.1.1 Eskimo-Aleut

The Aleut branch has just one language, variously called Aleut or Unangax̂ and spoken by 155 in the Aleutian and Pribilof islands (Alaska) and the Commander Islands (Siberia). Eskimo has two branches, Inuit and Yupik. Because the term Eskimo is deemed offensive by many, especially in Canada and Greenland, Yupik-Inuit is sometimes used instead.

5.1.2 Na-Dene

The name Na-Dene is perhaps on its way to being phased out, having been replaced in Ethnologue by Eyak-Athabaskan and in Glottolog by Athabaskan-Eyak-Tlingit. Along with two small languages of Alaska, the family includes Athabaskan, a group of 42 languages widely distributed across the western United States and western Canada. At one time Na-Dene was thought to include Haida (Sapir, 1915), but this view has been abandoned by most (Schoonmaker et al., 1997).

The largest Athabaskan language is Navajo, a member of the Apachean group. Its morphology is widely studied for its complex prefix system, which might lead it to be classified as agglutinative, were it not for complex, overlapping dependencies that are more characteristic of fusional languages. Like many Athabaskan languages, Navajo is tonal, yet proto-Athabaskan lacked tone, and tone seems to have developed independently in many Athabaskan languages from constricted vowels (Campbell, 1997, p. 113).

5.1.3 Algic

This family has some three dozen to forty languages, all but two in the Algonquian branch, distributed across a wide expanse of eastern Canada and the northeastern United States. The two outliers are in California, Yurok and the now-extinct Wiyot. Algonquian languages extend from eastern Canada and the eastern United States to the Rocky Mountains. The largest languages of this group are Cree, spoken by well over 100,000 and spanning a vast area of Canada from Labrador to Alberta and the Northwest Territories, and Ojibwa, with more than 50,000 speakers, extending across southern Canada from Ontario to the Rocky Mountains and south into the United States, especially Minnesota.

5.1.4 Wakashan

Wakashan, a family of seven languages in British Columbia, was assigned by Edward Sapir (in a 1929 Encyclopedia Britannica entry) to a putative stock called Mosan that also included the Salishan family (section 5.1.5). Sapir’s conjecture was based on a long list of grammatical similarities. But Beck (2000), echoing Campbell (1997), finds little lexical similarity and concludes that one is dealing with a Sprachbund (Thomason & Kaufman, 1992), a set of languages whose common features have arisen from contact rather than from shared genetic origins.

5.1.5 Salishan

The 26 languages of this family are spoken in the coastal regions and in the region immediately to the east in British Columbia and in nearby areas in the United States. One of the typological distinctions of Salishan languages is an extremely rich set of consonant contrasts: up to six pharyngeal consonants, contrasting velars and uvulars, and a full set of ejectives.

5.1.6 Utian

Approximately a dozen languages in the Utian family of central and northern California are divided into two branches, Miwok and Costanoan.

5.1.7 Plateau

Also known as Plateau Penutian, this group of four languages in the Pacific Northwest includes Klamath and Nez Percé.

5.1.8 Cochimi-Yuman

Also called Yuman, this group of eight small languages, which also includes the extinct Cochimi, is spoken in Arizona and neighboring parts of California and Mexico.

5.1.9 Uto-Aztecan

About 60 languages make up this family. The 13 languages of the Northern branch are spoken in the western United States. Among them is Hopi, spoken by 6,700 in and around northeastern Arizona. The Southern branch has 48 languages, almost all of them in Mexico.

5.1.10 Kiowa-Tanoan

Speakers of the five languages making up this family live in the southwestern United States.

5.1.11 Siouan-Catawba

This family, also called Siouan, includes Catawba, a language of South Carolina, which lost its last native speaker in the 20th century but is being revived as a second language by ethnic Catawbas. Total speakers for the Siouan family number under 35,000, but among its 14 languages is Dakota, the third largest indigenous language of North America and nearly tied for second place with Yupik, with close to 19,000 speakers. Dakota is spoken in North and South Dakota and neighboring areas.

5.1.12 Caddoan

This group of five languages, each with just a handful of speakers, may form a super-family with Iroquoian and Siouan, based on comparative work (Chafe, 1976), but the relationship is not considered established (Mithun, 1999, p. 305).

5.1.13 Muskogean

Traces of this family of six languages, with an estimated 15,000 speakers, are still found in the southeastern United States, but forced relocations by the U.S. government in the 1830s drove many Muskogean tribes from their homeland. Included were the Choctaw and Chickasaw Nations, now situated in Oklahoma.

5.1.14 Iroquoian

Seven members of this family are severely endangered. Of the remaining two, Mohawk is estimated to have 540 speakers in the Canadian provinces of Ontario and Quebec, and Cherokee has over 11,500 speakers according to the 2010 U.S. Census Bureau report, mainly in Oklahoma but also near their pre-relocation lands in North Carolina.

5.2 Mexico and Central America

5.2.1 Uto-Aztecan

The Southern branch of this family includes 28 varieties of Nahuatl in Mexico and one in El Salvador that altogether number 1.5 million speakers, according to the 2010 census. Nahuatl traces its origins to the Aztecs, who dominated the area for many centuries.

5.2.2 Mayan

The approximately 30 languages comprising Mayan are spoken mainly in Guatemala and Mexico, as well as in Belize and Honduras. Estimates of the number of speakers of Mayan languages run to six million, with well over half that number in Guatemala. The most important Mayan languages of Guatemala are K’iche’, with 2,330,000 speakers; Q’eqchi’ with 800,000; Mam with 530,000; and Kaqchikel with 450,000. In Mexico, Yucatec Maya is spoken by more than 700,000, and a few others are spoken by well over a hundred thousand. The languages are still centered around the original Maya homeland in Guatemala and on the Yucatan Peninsula.

Among the noteworthy achievements of early Maya civilization were temples, pyramids, and the only writing system developed in the Americas before the coming of the European explorers. Decipherment of the writing system has offered a direct glimpse into the Mayan protolanguage and makes a fascinating story, recounted by Coe (1999).

5.2.3 Otomanguean

This is a large family of 177 languages spoken in central and southern Mexico. In the Eastern Otomanguean branch are the Mixtecan languages, including Trique and 52 varieties of Mixtec listed in Ethnologue, and 63 Zapotecan languages, including Chatino and 57 varieties of Zapotec listed in Ethnologue. Recent census estimates for both Mixtec and Zapotec are in the area of 500,000 speakers. The Western Otomanguean branch numbers 37 languages, among them 14 distinct varieties of Chinantec and nine varieties of Otomi. The 2010 census gives 130,000 native speakers for Chinantec and 290,000 for Otomi.

5.2.4 Totonacan

This is a family of 12 small languages spoken in and around Puebla State in Mexico. The largest is Sierra Totonacan.

5.2.5 Mixe-Zoquean

This family groups the ten Mixean languages with the seven Zoquean languages. All are spoken on the narrow strip of southern Mexico between the Gulf of Mexico and the Pacific Ocean.

5.3 South America

With 55 language isolates and 53 families of two or more languages, South America has about a quarter of the language families of the world (Campbell, 2012, p. 59). While most are endangered and a large number nearly extinct, there are some very healthy exceptions, including Quechua, Tupi-Guarani, and Aymara, all discussed in this section. Especially since 1960, efforts have been under way to reverse some of the declines in language populations of earlier eras. Particularly active in this area is the Andean region, where several bilingual school programs have incorporated Quechua and Aymara into the curriculum. The past 25 years have also seen a surge in interest by linguists in documenting and analyzing the tremendously diverse languages of this continent.

Among the 108 language families Campbell (2012) finds in South America, larger groupings still remain to be firmly established. Of the hypotheses advanced to date, including Greenberg’s (1987) classification that puts them all in Amerind, none have been proved to general satisfaction.

5.3.1 Intermediate Area: Between Central America and South America

The area between the site of the Mayan civilization to the north and the Inca civilization to the south covers the northwestern part of South America, extending into Central America. Among the language families here are Chocoan, spoken in Colombia and Panama; Barbacoan, spoken in Colombia and Ecuador; and Chibchan, spoken from Honduras to Venezuela. Chibchan may be related to Misumalpan, spoken in Honduras and Nicaragua.

5.3.2 Arawakan

The family with the greatest geographical reach, spreading from Honduras down to Bolivia and as far east as Suriname, is Arawakan, with 40 languages, not including about two dozen extinct ones. Some reserve the name Arawakan for a slightly larger group with 11 additional languages, but their genetic connection to the core family is unproven (Campbell, 2012, p. 71). For this reason, Campbell uses Arawakan (which includes the language Arawak) for the core group that also goes by the names Maipurean and Maipuran, as listed in Ethnologue.

Three Arawakan languages—Wayuu (Colombia), Garifuna (Honduras), and Asháninka (Peru)—account for more than 85% of the 645,000-odd speakers of languages in the family.

5.3.3 Arawan

The Arawan family of western Brazil, with six languages, and Guajiboan, with five languages in eastern Colombia and southwestern Venezuela, comprise the group of 11 sometimes classed with Arawakan.

5.3.4 Cariban

Cariban is a family of 31 languages (as well as around two dozen extinct ones) in Brazil and Venezuela as well as in Guyana, Suriname, and Colombia. Most have just a few hundred speakers; some have a few thousand. The largest is Macushi, with 18,000 speakers in Brazil.

5.3.5 Tucanoan

Tucanoan includes 25 languages in Colombia, Ecuador, Peru, and Brazil. A few are extinct or very severely endangered. The two largest, with just over 6,000 speakers each, are Cubeo (Colombia) and Tucano (Brazil).

5.3.6 Aymaran

Aymaran has just two languages. One of them is Aymara, spoken by a million in Bolivia and several hundred thousand in Peru.

5.3.7 Quechuan

Quechuan languages are spoken natively by a greater number than any other language family indigenous to the Americas, a result of the spread of the Inca Empire in pre-Columbian times. The total speaking population is 8.5 million, mainly in Peru, Ecuador, and Bolivia. The designations of all but two of the 44 Quechuan languages include the name Quechua along with a geographical identifier, reflecting a close relationship, though in most cases not mutual intelligibility. Most are small, with a few thousand speakers. About a dozen others range from the tens of thousands to around 100,000, and a few more are spoken by several hundred thousand. Larger than these are South Bolivian Quechua (1,600,000 speakers in Bolivia), Ayacucho Quechua (900,000 speakers in Peru, including Lima), and Chimborazo Highland Quichua (800,000 in Ecuador). All three belong to what is known as Peripheral Quechua, a sister branch to Central Quechua. These two branches constitute the major break in the Quechuan family. Quechua is, along with Spanish, an official language of Peru.

Phonological, structural, and lexical similarities between Quechua and Aymara have raised the possibility that the two are related, as discussed by Orr and Longacre (1968) and Kaufman and Berlin (2007), but Adelaar (1992, 2012) argues instead that the many similarities must have resulted from intense contact predating the protolanguages along with subsequent diffusion. Part of the reasoning is that the lexical correspondences are in fact too close where they occur and extend to only about a quarter of the vocabulary, while the rest is highly different.

5.3.8 Tupian

Jensen and Grimes (2003), Kaufman and Berlin (2007), and Rodrigues and Cabral (2014) regard the Tupian languages of Central Amazonia as a language stock, that is, a grouping of language families not fully established but thought to be distantly related. Here it is listed as an established family, following Kaufman (1990), Campbell (2012), and Ethnologue.

This set of 76 languages is grouped into 11 small branches and isolates and one major branch, Tupi-Guarani, which some recognize as a family in and of itself (Michael et al., 2015). Its 51 languages are found in parts of Paraguay, Brazil, and Bolivia but once covered a much larger expanse of South America, from the eastern coast to the west and from northern Argentina up to French Guiana. Ten languages of this group are varieties of Guaraní that together are spoken by five million, principally in Paraguay, where it is an official language (along with Spanish) and is widely used as a second language as well.

5.3.9 Northern Foothills

In this Andean region, we find Jivaroan, Cahuapanan, Zaparoan, and Witotoan, among a few others. Yagua is known to have belonged to the Peba-Yaguan family, whose other two members are extinct.

Beyond what is presented here, Campbell (2012) discusses many plausible and possible genetic relationships within South America. Campbell and Grondona (2012, p. 29) cite a dozen other works on this topic.

6. Sign Languages

As with spoken languages, it is impossible to trace back to the time when the first sign languages were used. Still, McBurney (2012) documents early reports on signing by the deaf, including an Ancient Egyptian text from around 1200 bce: “Thou art one who is deaf and does not hear, to whom men make (signs) with the hand.” From Plato’s Cratylus she quotes: “should we not, like the deaf and dumb, make signs with the hands and head and the rest of the body?” And from a collection on Jewish oral law from the late 2nd century ce: “A deaf-mute may communicate by signs and be communicated with by signs.”

Signing systems developed into languages as communities of users grew and the communicative needs of the deaf were recognized by governments, educators, and the general public. In parts of Europe, emerging deaf communities were developing sign languages well before the 18th century, and in 1817 Thomas Gallaudet established the first permanent deaf school in the United States, basing his methods on practices already in place in France and Britain.

Ethnologue lists 138 sign languages for the deaf, each one named for the location where it is used. Many are adaptations of signing systems already used in other regions, as illustrated by American Sign Language (ASL), which Thomas Gallaudet directly based on French Sign Language. ASL has become the most widely used sign language of the deaf, with 250,000 users in North America, the Caribbean, the Philippines, and Africa. ASL and other sign languages are not closely connected to the spoken languages of the regions where they are used. For example, British Sign Language and American Sign Language are not mutually intelligible.

Sign languages also develop in response to other needs. A famous case is Plains Indian Sign Language, once used as a lingua franca by Native Americans over a vast expanse of North America and still in use in some regions (Davis, 2010). Sign languages that have arisen in Aboriginal Australia in response to speech taboos and ritual observance have been described by Kendon (1988).

7. Pidgins and Creoles

7.1 Pidgins

Pidgins are simplified languages that arise out of a need to communicate among speakers lacking a common language, typically in colonial situations where one group is dominant. Members of the dominated group fuse grammatical features, often simplified, of their native language (called the substrate) with vocabulary from the dominant, or superstrate, language. The resulting language serves restricted purposes, such as trade.

There are not many pidgins. Ethnologue lists only 16, six of them in Africa and five in Oceania, if Indonesia is included. Hiri Motu, an official language of Papua New Guinea, is noteworthy because it goes against some typical views of pidgins. This language developed between the Motu and their trading partners nearby before any European contact. After colonization, its use spread, though the colonizers themselves had little if any knowledge of it. More usual are the cases of the original Chinese Pidgin English, once known as Pigeon English, which arose in 17th-century China for trade with the British, and Nigerian Pidgin, which developed in the same era, again due to trade contact with the British, notably the slave trade.

Hiri Motu, Chinese Pidgin English, and Nigerian Pidgin illustrate three different types of situation. Hiri Motu and Chinese Pidgin English exemplify pidgins that originate when trade partners are equal (Hiri Motu) or unequal (Chinese Pidgin English). The two had similar outcomes, eventually fading away—Hiri Motu in favor of Tok Pisin, a widely spoken creole of New Guinea, and Chinese Pidgin English in favor of Standard English, which came to be commonly taught in schools. (Since then, a different language called Chinese Pidgin English has arisen on the Pacific island of Nauru, for communicating with Chinese-speaking merchants and traders.) By contrast, Chinese Pidgin English and Nigerian Pidgin had analogous origins (for communicating with traders in a dominant position), yet different outcomes, since the first has died out, while the second has vastly expanded its uses and its speaking population. Currently Nigerian Pidgin is learned by many children at an early age for communication with peers in virtually any informal situation.

7.2 Creoles

Creoles are first languages of members of speech communities but originate from types of language contact resembling, if not always identical to, situations that give rise to pidgins. Being acquired as a first language gives creoles a stability that pidgins lack, and so it is not surprising that many more creoles are in current use—93 listed in Ethnologue—than pidgins. Thirty-two creoles are spoken around Latin America and the Caribbean, 26 in Oceania, and 22 in Africa. Like pidgins, creoles have a substrate and a superstrate. English is the superstrate for 33 creoles, Malay for 14, Portuguese for 13, and French for 11.

Probably the most vigorously debated topic in current pidgin and creole studies is how creoles form and evolve. Bickerton (1981, 1988) interpreted creolization in terms of what is known as the bioprogram hypothesis. This would see creoles as developing from a pidgin that learners were exposed to at an early age. The hypothesis was that acquisition is guided by an innate bioprogram that supplies structure to complement and modify the pidgin’s substrate and superstrate. This idea excited those who saw its potential to shed light on the human language faculty in general. At the same time, among creolists, the bioprogram hypothesis gave rise to a literature that almost universally sought to disprove it. Viewed more positively, it engendered lots of new thinking on how creoles come about.

Veenstra (2008) surveys some of the progress made during this period. Early commenters found reason to assign a greater role to the superstrate language than would be the case under Bickerton’s hypothesis, which leaned heavily on universal grammar. Another criticism cited the fact that some creoles develop without having a pidgin as a source. Bickerton’s explanation, relying on acquisition by a generation of speakers with no other first language, implied that a creole would always develop in a single generation, yet this has been falsified by Nicaraguan Sign Language, which took two generations (Kegl, Senghas, & Coppola, 1999). For many more counterproposals and refinements, see DeGraff (1999), Mufwene (1996), Singler (1996), and McWhorter (2005). One area of agreement is that neither pidgins nor creoles are homogeneous types, as earlier work seemed to assume. There are many varieties, as is found with the rest of the languages covered in this essay.

