DS0803 - Education and learning

Mechanisms of Early Lexical Acquisition – MechELex

How do babies learn words? A big data approach

If you are reading these words, then you have learned language. Among humans, learning language is trivial. In fact, it is surprising when someone struggles to learn language. But when you stop to think about it, learning language is quite a feat. Think about babies’ vocabulary development. By the time they are three years of age, they will have learned hundreds of words, often at a rate of 10 words a day! So how do they do it?

Building a vocabulary: An everyday marvel

Over the past century, researchers have developed cunning methods to look into this process, trying to be incredibly precise. For example, they would try to teach children a word in the laboratory, to make sure that they controlled the specific experience provided to the children. In this way, we learned a lot about the potential ways in which babies could learn words, but we were faced with a conundrum at the same time. Since these experiments take place in the lab, in an isolated and controlled environment, we could not claim that children actually do something similar in the messy outside world. This is where the present project came in. Our main goal was to find ways to study word learning by using real-life data.

The challenge we were facing was: How can we link up messy life experiences with knowledge? When one steps into the real world, measurements become noisy. To make up for this, we increased the amount of data we considered, and used novel techniques. For instance, to capture children’s everyday experiences, we used recorders that the child could wear in a breast pocket throughout the day. This way, we captured everything children said and heard on a large scale. To measure what words children learned, we created a fun game on a touchscreen. Finally, to understand how children go from experiences to knowledge, we used computer models: we wrote software that did the same things we thought babies were doing, and checked whether it learned the same things babies did.

We found that children’s experiences are actually a lot more variable than we thought. For instance, Tsimane children probably only get about 1 minute per hour of speech directed to them, whereas American children of professional parents get 10 times that. Are these enormous differences in experience reflected in equally enormous vocabulary differences? The answer for now appears to be no: Tsimane children do not know 10 times fewer words than American children. This means that children’s mechanisms for word learning must be incredibly robust!

The MechELex project is a fundamental research project coordinated by Alejandrina Cristia, who is based at the Laboratoire de Sciences Cognitives et Psycholinguistique, Dept d'Etudes Cognitives, ENS, PSL University, EHESS, CNRS. The project started in October 2014 and lasted 4 years. It benefited from ANR funding amounting to 252,969.60 euros, with a global cost of 1,034,710.36 euros.

The project resulted in 13 publications in well-established international peer-reviewed journals (such as Child Development), 15 papers in competitive engineering conferences (Interspeech, ACL) and 28 communications in international conferences without proceedings. Other articles are in progress and will be published in the coming months and years. The results of the project have also been presented to the general public (radio, online press; see www.lscp.net/babylab).

The great majority of children learn their native language effortlessly, and exhibit surprising linguistic knowledge even at a young age. By 6 months, infants already know a few words: when hearing the word “cookie”, they look longer at a picture of a cookie than at a picture of a hand. In order to learn those words, the 6-month-old must have been able to extract and store the sound component of words, the “wordform”. In fact, it has been estimated that 1-year-olds have stored and are able to recognize as many as 500 wordforms. The characteristics of the input and mechanisms allowing this surprising development in natural language acquisition have not been studied before. The general goal of the present project is to shed light on how infants achieve, and caregivers promote, early wordform learning in the real world.

We will combine theories and methods from linguistics, experimental psychology, automatic speech recognition, and natural language processing as follows. In a first phase, we will use a novel technology allowing daylong recordings to gather a rich and realistic corpus representing infants’ input. We will describe the wordforms present in this input by capitalizing on state-of-the-art wordform extraction algorithms. These algorithms vary in terms of the operations they carry out (e.g., extracting repeated sequences, additionally learning the language’s grammar) and the type of signal they operate on (e.g., raw acoustic speech, phonemic units). As a result, each makes some unique predictions with respect to the wordforms infants can find in the rich corpus just mentioned.
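To make the idea of "extracting repeated sequences" concrete, here is a minimal Python sketch. It is an illustration only, not one of the algorithms used in the project: the function name `repeated_sequences`, its parameters, and the toy utterances are all hypothetical, and ordinary letters stand in for phonemic units. The sketch counts every substring of a given length range across unsegmented utterances and keeps those that recur.

```python
from collections import Counter


def repeated_sequences(utterances, min_len=3, max_len=5, min_count=2):
    """Return symbol sequences that occur at least `min_count` times.

    `utterances` is a list of strings with no word boundaries; each
    character stands in for one phonemic unit (an assumption made for
    this toy example).
    """
    counts = Counter()
    for utterance in utterances:
        for length in range(min_len, max_len + 1):
            # Slide a window of the current length over the utterance.
            for start in range(len(utterance) - length + 1):
                counts[utterance[start:start + length]] += 1
    # Keep only the sequences that recur often enough to be candidates.
    return {seq: n for seq, n in counts.items() if n >= min_count}


# Hypothetical toy input: three utterances without spaces.
toy_input = ["thedoggy", "adoggy", "doggygo"]

# Restricting to sequences of exactly 5 symbols isolates one candidate.
print(repeated_sequences(toy_input, min_len=5, max_len=5))
# prints {'doggy': 3}
```

With a wider length range, the same call also returns fragments such as "dog" or "oggy", which illustrates why different extraction strategies can make different predictions about the wordforms a learner might store.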

In a second phase, we will check these predictions against infants’ perception, by “reverse engineering” the wordforms they succeed in finding. Previous work has shown that infants prefer frequent wordforms (which they recognize) over others that are low in frequency. A preference for a given wordform is thus a sign that infants have extracted that wordform and stored it for subsequent recognition. Given that many such wordforms need to be tested, we will develop a novel method: the “preference toy”. The toy plays a sound each time the child shakes it. Laboratory-based research with comparable conditions (e.g., preferential listening) suggests that the child will shake the toy more when this results in wordforms he/she recognizes over unrecognized wordforms. By embodying it in an age-appropriate toy, we can provide it to the child to use at home for much longer periods of time. Repeated testing should boost precision, allowing us to check our multiple competing predictions. Given that the algorithms from phase 1 vary in terms of how much knowledge they assume in the learner, we expect different predictions to be true at different ages.

In the third phase, we will assess to what extent each child follows a unique path during early lexical acquisition. Since wordform learning necessarily depends on the input presented to the child, unique aspects in that input could explain individual variation across children. To understand the contributions of infant-specific versus common aspects of the input to infants' learning, we will combine our innovations from the previous two phases: the child's input will be captured through daylong recordings and processed to generate specific predictions, and that same child will be tested on those predictions. In addition to providing a deeper understanding of the acquisition process, this phase paves the way for applied work to be carried out in the future.

Project coordination

Alejandrina CRISTIA (Laboratoire de Sciences Cognitives et Psycholinguistique)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.


ENS, LSCP Laboratoire de Sciences Cognitives et Psycholinguistique

ANR funding: 252,969 euros
Beginning and duration of the scientific project: September 2014 - 36 Months
