Homo sapiens is the only living species that can communicate via language. What do I mean by language? For the purposes of this review, I don’t think a formal definition is necessary. Instead, we can take it as the thing that humans use when they talk (or sign in the case of sign languages) to each other, or write book reviews, or solve math problems in their heads. I think it’s clear that our closest relatives, chimpanzees, don’t have a language in the same sense.

How, then, did we get here? What were the language-relevant evolutionary steps in the 6 million years since our last common ancestor with chimpanzees? What might a million-year-old proto-language sound like? And, most importantly for this review, what sort of evidence could we use to try to piece together the answers to these questions? Steven Mithen has collected ideas from archeology, psychology, linguistics, anthropology, genetics and philosophy in his book The Language Puzzle (2023) to solve the mystery of language evolution (a problem which he, shockingly, calls the language puzzle). The main goal of the book is to craft a coherent story of the path that language took as it evolved from something like what chimpanzees have to a human version. In this review, I want to think about what kinds of evidence we might use to solve the language puzzle.

The different types of evidence

The best kind of evidence I can imagine would be finding some cassettes of recordings, complete with English translations, from hominids who lived millions of years ago. Unfortunately, we haven’t yet found those. We also can’t ask a member of the Homo erectus gang to tell us about their language, seeing as all of them are dead. So we are left with guesswork and indirect evidence.

I think it would be interesting to go through the exercise of trying to come up with a hypothesis for how language evolved, using three steps. (This is my own formulation. In the book, Mithen approaches the question as if it were a jigsaw puzzle, with pieces from different disciplines that have to be fitted together to form a complete picture.) Here are the steps:

  1. You can use modern languages (the ones spoken today and the ones for which we have written or audio records). You can do (reasonable) experiments on adults and children. You can study other animals, especially apes and monkeys. You can also assume a rough timeline for H. sapiens evolution: you know when our last common ancestor with chimpanzees lived, when H. sapiens as a species arose, and what other hominid species existed in the 6-million-year gap. Now, what can you piece together as a possible explanation for how language evolved?
  2. Then you magically find a whole skeleton of an individual of an extinct human species, together with the tools that the individual used and various other items that might have survived until today. You can assume that this one specimen is representative of his species. How should you update your hypothesis?
  3. You now randomly stumble across a museum that houses a perfectly labelled collection of skeletal remains and other archeological findings, one that accurately summarises the different species that existed between our last common ancestor with apes and modern humans. How should you further update your hypothesis?

If you want a spoiler on what the final hypothesis looks like, check out the figure at the end of this review.

Step 1: No archeology

A good starting point would be to compare human languages to the vocalisations and communication systems of other primates. Perhaps we can find some traits in monkey calls that could be the primitive forms of some features of human language – traits which have slowly, over the 6 to 8 million years that the lineages have been apart, evolved into the modern human versions. Mithen highlights word-like and syntax-like features of monkey vocalisations. For example, vervet monkeys give acoustically distinct alarm calls for different predators. When a vervet monkey spots a snake, it lets out an *aahh SNAKE* alarm call, causing the other monkeys in the group to look down. Similarly, a leopard alarm makes the others climb up trees, and a raptor alarm makes them look up. These calls are word-like because they carry semantic information, because the sounds they are composed of are arbitrary (i.e. not obviously descriptive of the threat if you didn’t already know the meaning of the call), and because they must be partially learned from others (younger monkeys give eagle alarms in response to harmless birds, for example, before learning the correct associations from older monkeys). The calls also seem to be under voluntary control – the monkeys are more likely to give out warnings if they have close kin nearby. Chimpanzee vocalisations share these word-like features as well.

What about syntax-like features? Here, the evidence isn’t as convincing. For example, some chimpanzee groups combine calls into bigrams and trigrams (strings of length two and three) such that some components are statistically more likely to come before others. That is essentially the extent to which monkeys have syntax. In the majority of cases, monkey vocalisations are holistic – the parts that make up a call don’t carry meanings of their own.
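To make “statistically more likely to come before others” a bit more concrete, here is a minimal sketch (my own, not from the book) of how one could check for such an ordering bias in recorded call sequences. The call labels and sequences are invented purely for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical call types and sequences, invented purely for illustration.
sequences = [
    ["hoo", "waa"], ["hoo", "waa"], ["hoo", "bark"], ["waa", "hoo"],
    ["hoo", "waa", "bark"], ["bark", "waa"], ["hoo", "waa"],
]

# Count every ordered pair of adjacent calls (bigrams).
bigrams = Counter()
for seq in sequences:
    for a, b in zip(seq, seq[1:]):
        bigrams[(a, b)] += 1

# For each pair of call types, compare how often each order occurs.
call_types = sorted({call for seq in sequences for call in seq})
for x, y in combinations(call_types, 2):
    forward, backward = bigrams[(x, y)], bigrams[(y, x)]
    total = forward + backward
    if total:
        print(f"{x} before {y}: {forward}/{total}; {y} before {x}: {backward}/{total}")
```

An ordering bias would show up as one of the two orders dominating. Whether a consistent preference like that deserves to be called syntax is, as noted above, another matter.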

In summary, if we assume that our last common ancestor with apes had a similar “language” to that of monkeys alive today (which doesn’t seem like a completely crazy assumption, given that they lived in similar environments), the starting point of human language evolution was a small collection of holistic, arbitrary calls carrying specific meanings – meanings which, although the monkeys are genetically predisposed to know them, need a little bit of reinforcement learning to fully master.

Next, let’s think about modern human languages. The large majority of our words are arbitrary – their labels aren’t naturally connected to their meanings. At the other end of the spectrum lie iconic words: words that mimic one or more of the referent’s properties – the way it sounds or moves, its size, shape, or texture. The most familiar examples are onomatopoeias, words that sound like what they describe, e.g. cuckoo or buzz. Iconic words are ubiquitous in all languages, and there is even a universal propensity to associate specific sounds with specific meanings. Speakers of very different languages consistently agree on the meanings of certain made-up words because of iconicity (the classic example is Bouba/Kiki, but there is also a very similar _maluma/takete_ effect). But more importantly, a statistical analysis has shown that “a considerable proportion of 100 basic vocabulary items show persistent sound-meaning associations irrespective of language families, environment or culture, […] such as that of /i/ with words for small, and /r/ with words for round.” (citation from the book)

Another reason why iconic words are interesting is that they are easier to learn than arbitrary words – the lexicons of two- to six-year-olds are dominated by iconic words. This also just makes sense: it should be easier to remember labels that are naturally associated with their meanings than labels that sound completely random. Iconic labels are also easier to learn even when we’re not aware of the iconicity (which is most of the time). Although iconic words are easier to learn and understand, they have a drawback: if your lexicon is made up purely of iconic words, then the space of things you can talk about is seriously constrained. It would be very difficult to distinguish between two small black birds of different species, for example, if the words we used to label them weren’t allowed to have any arbitrariness. On the other hand, remembering tens of thousands of arbitrary labels requires a lot of memory. All of this together suggests that iconic words might have been one of the first milestones in language evolution.

We also have a pretty good understanding of how iconic words become arbitrary over time. Speakers of any language are always motivated to make their speech more efficient – to get away with shorter words and less effort in pronunciation. Over time, the harder-to-pronounce parts of words therefore fade away or get replaced by easier sounds. After enough time has passed, speakers no longer recognise the core of the original iconic word within the new arbitrary word.

Mithen thinks that iconic words are strongly associated with synaesthesia – the condition in which a sensation experienced in one modality, such as vision, causes the stimulation of another, such as hearing. This could explain how the first iconic word-like sounds arose – someone saw a teeny tiny fluffy bird, which happened to also trigger a neural pathway in the brain corresponding to the /i/ sound, which they then voiced. Why would this weird skill become widespread? It must bring with it some gain in fitness. For example, imagine that you’re hunting with a group, and that your group leader has climbed to the edge of a hill to get a good view of what lies ahead, while you and the others remain hidden below. If the leader now lets out a sound which everyone naturally associates with big scary animals, then everyone knows to be careful and not storm ahead carelessly. If the leader manages to voice a sound which everyone naturally associates with tasty prey, then everyone knows to start preparing for an attack. People with enhanced abilities to understand and express these multi-sensory associations have an advantage – they’re less likely to get eaten and more likely to get access to more nutritious food.

It is also interesting to note that children exhibit much higher levels of synaesthesia than adults, before the ability is reduced by the pruning of neuronal connections during development. This is perhaps weak evidence that some of our ancestors experienced more synaesthesia than modern adult humans, but I don’t find it very convincing. Experiments on whether monkeys also experience synaesthesia are, according to Mithen, still inconclusive.

Let’s now turn to how children learn languages. According to Mithen, children acquire languages using only general-purpose learning mechanisms. There’s no need for a special ‘language learning’ toolkit. I’m under the impression that this question is still widely debated within the research community, so it’s probably best to take the following with a pinch of salt.

One of the most important strategies that children use is statistical learning: the ability to pick up on statistical regularities in our sensory environment, typically without intention or conscious awareness. When 8-month-old babies were exposed to streams of nonsense words, they were able to distinguish, after just a few minutes of listening, between words and non-words from the stream (source). The streams consisted of four different three-syllable words in a random order. After listening to the streams, the infants’ listening times differed between words from the stream and non-words (which were generated from the same syllables but in a different order), showing that they had learned the statistical relationships between syllables in a pseudo-language within just a few minutes. We also have some evidence that chimpanzees use statistical learning to an extent. Human children supplement this method with other tools, such as an intuitive understanding that the world contains objects, properties, events and processes. They are also able to “read minds”: if a parent points at a rabbit and says “look, it’s a rabbit!”, the child can usually infer that the parent is talking about the entire animal, not just its tail or paws. Additionally, children develop a theory of mind by around age 4, which further helps them infer the intentions of others. All of these skills are generally useful – beyond learning languages – for finding better food, surviving in a predator-rich environment, and co-existing with others in social contexts.
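To make the regularity in this experiment concrete: within a word, each syllable almost always predicts the next one, while across a word boundary the next syllable is far less predictable. Below is a minimal sketch of that idea – the syllables and words are my own invention rather than the original stimuli, and the code only illustrates the transitional probabilities that the infants are thought to track.

```python
import random
from collections import Counter

# Four made-up three-syllable "words" (illustrative, not the original stimuli).
words = ["tupiro", "golabu", "bidaku", "padoti"]

def syllabify(word):
    # Split a six-letter word into three two-letter syllables.
    return [word[i:i + 2] for i in range(0, len(word), 2)]

# Build a continuous syllable stream from the words in random order,
# mimicking the monotone speech stream the infants heard.
random.seed(0)
stream = [s for _ in range(300) for s in syllabify(random.choice(words))]

# Transitional probability TP(b | a) = count(a followed by b) / count(a).
pair_counts = Counter(zip(stream, stream[1:]))
first_counts = Counter(stream[:-1])

def tp(a, b):
    return pair_counts[(a, b)] / first_counts[a]

print("within a word, tu -> pi:", round(tp("tu", "pi"), 2))      # close to 1.0
print("across a boundary, ro -> go:", round(tp("ro", "go"), 2))  # around 0.25
```

The word/non-word distinction then comes down to exactly this difference: “words” are syllable triples whose internal transitional probabilities are high, while “non-words” string together syllables that rarely follow each other in the stream.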

In general, cognitive advancements go hand in hand with larger brains. Given that the brains of modern humans are roughly three times the size of chimpanzee brains, and that language is acquired using general learning methods, we can say that developing a larger brain probably played a role in developing the ability to use language. Are there any other features of the human brain, different from the chimpanzee brain, that we can directly associate with language capabilities? For example, Broca’s area – a brain region associated with speech production – is six times larger in humans than in monkeys. There are many other minor anatomical differences that I could mention, but I don’t think they are very informative without more detail than I can go into here. An important difference, however, seems to be that human brains have areas with increased functional specialisation that are, additionally, interconnected via more long-distance connections. From brain-imaging studies, we also know that language processing takes place in different locations throughout the brain, and that many abstract concepts are stored as spatially distributed networks. We might then hypothesise that brain-wide connections evolved together with language, and that a less-connected brain would imply a more primitive language.

In addition to anatomical differences within the brain, humans and monkeys differ in their vocal tracts and auditory systems. One reason why monkeys can’t talk like us is that they don’t have as much control over the muscles used to expel air from the lungs. In order to utter a long sentence, we need to expel air steadily, keeping the pressure in the lungs high throughout – something monkeys aren’t capable of. This means that enhanced voluntary control over the breathing muscles was required for spoken language to evolve. The human vocal tract also allows for a greater variety of vowels and consonants, and for producing the same sounds consistently. And although hearing systems are highly conserved across mammals, the audiograms of human ears show a heightened sensitivity to the frequencies prominent in speech compared to those of monkeys.

I have now covered some of the most important differences between monkeys and humans that are relevant to language. So far we know that many anatomical changes needed to take place to allow for human language production and processing, that iconic words seem fundamental and are easier to learn than arbitrary words (implying that they appeared earlier in evolution), and that synaesthesia might have something to do with the creation of iconic words. And, if Mithen is right, general learning methods are all that’s needed to acquire a language.

Let’s now see what more we can learn if we add archeology to the mix.

Step 2: What do we gain from a single fossil?

What if you had a full skeleton of a member of an extinct human species, perhaps Homo heidelbergensis or Homo neanderthalensis, together with some of its belongings? Of course, this is unrealistic. What we usually have are partial skeletons, mostly decomposed, and maybe a stone tool. But the point of this exercise is to think about which ingredients give us more information about the language puzzle.

Here’s a figure from the book summarising the different species and when they lived:

Let’s start with some straightforward anatomy. We can infer the shape of this individual’s vocal tract from the shape of his face (the degree of elongation being the most important variable). We can get some additional clues about the vocal tract by looking at the shape of the hyoid bone, a small bone situated in the front part of the neck. This bone has a slightly different shape in monkeys than in humans, and its shape is believed to be an indicator of what the vocal tract behind it must have been like. Unfortunately, it is such a small bone that not many of them exist in the fossil record. However, we have a 3.3-million-year-old specimen from a member of Australopithecus afarensis which shows a chimpanzee-style hyoid. The next known fragment comes from a Homo heidelbergensis from 450,000 years ago and shows a modern human shape; the same is true for the few Neanderthal samples we have. Similarly, we can infer the degree of voluntary control an individual had over his breathing from the cross-sectional area of his vertebral canals – the passages where the nerves would have been located. We have samples from different species up until 1.6 million years ago showing non-human dimensions, followed by a lack of specimens until 100,000 years ago, when we see much larger canals in H. sapiens and Neanderthals. Therefore, a human-like vocal tract had mostly evolved by the time of H. heidelbergensis.

We already know that brain size is related to linguistic abilities, and brain size can be inferred from fossilised skull parts. In general, brains tended to get bigger throughout human evolution. We can also look for clues on whether brain areas known to be more associated with language were enlarged or not. Broca’s area leaves a bulge on the inside of the human skull, and the fossil record shows this bulge developing between 1.7 and 1.5 million years ago in Homo erectus. Using similar reasoning, we think that the Neanderthal brain had a smaller cerebellum, which is also an area important for language.

We also guessed earlier that increased long-distance connectivity in the brain is associated with language. Sadly, we are unable to infer the level of connectivity within brains based on the skull alone.

The fossil findings also include information on the tools that the individual used. In general, a more complex tool implies higher intelligence, which implies better language skills. Additionally, it would make sense that in order to manufacture more complex tools, people needed better ways of teaching others how to do so. Being able to use words to describe your thought process when making a stone tool probably makes the information transfer much more efficient, and the more precise the words you can use, the more complicated the procedures you can teach. For this reason too, the tools and items found with a fossil are evidence of language ability.

In addition to tools, you might find evidence of visual art. Modern humans liked to make paintings on cave walls and to carve geometric images into soft rocks. Here, the author emphasises the importance of visual symbols – images that are arbitrarily related to their referents, often requiring cultural knowledge to understand their meaning. For example, a cross is a symbol of Christianity, although crosses have no literal association with this specific version of god and a couple of old books. The concept is analogous to that of arbitrary words. We might then guess that if an individual produced visual art with symbolic meaning, he probably knew how to use arbitrary words as well. We’ll come back to this thought later.

In summary, a fossil specimen provides us with some information on the intermediate stages of human evolution. Fossilised skulls can be examined for clues about how developed some language-relevant parts of the brain were. Other bones offer information on the anatomy of the vocal tract and the level of voluntary control over air expulsion from the lungs. The complexity of the tools that the individual carried also implies something, indirectly, about his language capabilities. But overall, these are fairly noisy clues with minimal information value on their own. Perhaps, taken together and placed in their appropriate historical and evolutionary contexts, they can help us say something about language.

Step 3: The entire hominin fossil record

What if we now have access to the entire fossil record? Firstly, we can now observe trends in stone tool technologies. Chimpanzees use sticks and twigs, stone hammers and anvils. Although the skill with which chimpanzees use hammerstones appears similar to that of H. habilis, the latter had developed the skill of detaching flakes from nodules of stone. Broadly, stone tool technologies came in four waves (modes). First was the Oldowan era, during the time of H. habilis, in which sharp-edged flakes were produced by hitting suitable rocks with hammerstones. Roughly coinciding with the appearance of H. erectus (and thus the enlargement of the brain), Acheulean tool technology became widespread starting 1.6 million years ago; now the flakes were further shaped to produce hand axes. Both eras lasted for more than a million years, showing incredible stability through time – or, in other words, a lack of technological progress. The third mode, Levallois technology, emerged between 400,000 and 350,000 years ago: a flake is first cleverly shaped while still attached to the bigger rock and then struck off (see here for a cool gif). This is the technology that sapiens and Neanderthals used up until 40,000 years ago, although sapiens became much more inventive from around 150,000 years ago, showing much more geographical variation in technologies and clear adaptations to specific environmental conditions.

Mithen believes that stone tool technology and language co-evolved, mainly because stone tool making and speaking have very similar requirements: fine muscular control, forward planning, and hierarchical processing of information. It would also make sense for them to co-evolve if language was used to transmit the knowledge of tool making between generations. We have no direct evidence for this (except for some experiments of suspicious value1), but I think it seems reasonable. The implication is that some linguistic transitions might have coincided with major developments in stone tool technology. A more far-fetched conclusion is that a lack of technological innovation implies a language that is not fully modern. If this were true, it would mean that sapiens before ~150,000 years ago and Neanderthals lacked some fundamental features of modern language. Coupling this with the absence of visual symbols, Mithen hypothesises that the last feature missing in the most recent non-modern humans (i.e. Neanderthals and early H. sapiens) was abstract thought and abstract words. He argues that abstract thoughts require the use of analogies and metaphors to convey meanings that are not directly tangible/visible/audible/etc. Many of the words we use to denote abstract concepts first arose as metaphors – e.g. field in ‘field of study’ or move in ‘move on to the next topic’. This, Mithen argues, is what was missing from Neanderthal language, and the reason why Neanderthals were unable to rapidly develop new technologies and, ultimately, never went through an agricultural revolution.

I personally think this hypothesis sounds nice, but I need further convincing. Visual symbols seem more closely analogous to arbitrary words, not abstract ones. Why, then, didn’t Mithen claim that everyone before 150,000 years ago spoke using only iconic words? Because he believes that the earlier Acheulean -> Levallois transition (which, as mentioned earlier, happened between 400,000 and 350,000 years ago) was itself a major shift corresponding to the adoption of arbitrary words: it coincided with the start of the controlled use of fire and with the emergence of the bigger-brained H. heidelbergensis, who also had a modern vocal tract. H. heidelbergensis had a larger brain than its predecessors, implying more capacity for remembering arbitrary labels for concepts. And with only iconic words to work with, it would have been more difficult to invent a new technology and have your descendants understand it and transmit the knowledge even further. So the Acheulean -> Levallois shift corresponded with the emergence of arbitrary words, and therefore the final development in language 150,000 years ago couldn’t have been the invention of arbitrary words, but something else. He proposes abstract words, and it makes for a nice story. Hopefully it is also sort of true.

Summary

Here’s a picture summary (from the book) of most of the above, including a few aspects that I didn’t cover in this review:

Footnotes

  1. For example, Morgan et al. (2015) show that allowing modern humans to speak when teaching others how to make stone tools facilitates the acquisition of difficult techniques. This doesn’t seem very surprising.