Artificial intelligence learns language through a child’s eyes and ears

Children learn their first words between the ages of six and nine months and begin to relate them to objects and concepts in the real world. By the age of one and a half to two, most can understand an average of 300 words. However, it is not known exactly how they acquire these words and relate them to their visual counterparts.

A better understanding of this process could inform next-generation artificial intelligence (AI) systems that develop connections between words and visual representations.

Current AI systems, such as chatbots based on GPT-4, can already learn and use human language, but they do so from astronomical amounts of linguistic data, far more than children receive when they learn to understand and speak. The best AI systems are trained on texts containing billions of words, while children are exposed to only a few million words per year.

Because of this huge data gap, researchers have been skeptical that recent advances in AI can tell us much about how humans learn and develop language.

To make progress in this area, a team at New York University (NYU) decided to develop a new machine learning model that is not based on huge amounts of data but instead draws on the experience of a single child, whom they call Baby S, learning to speak. The results of the study are now published in Science.

The authors designed an experiment in which a multimodal AI system was trained through Baby S’s eyes and ears. To do this, they used video recordings from a head-mounted camera, collected from when he was six months old until his second birthday, and examined whether the model could learn the words and concepts that occur in a child’s everyday experience.

Wai Keen Vong, a researcher at the American university and first author of the study, explains to SINC that their experiment used the SAYCam dataset, “a very rich and interesting resource consisting of videos recorded with head-mounted cameras worn by developing children.”

“We focused on just one child (Baby S), as it was the one with the largest amount of transcribed speech data, which made it easier for us to model. Every person has to learn to speak from their own input, not from that of others. Therefore, studying whether a computational model can acquire aspects of language from the sensory input of a single child is a unique way to address this problem,” emphasizes the data scientist and AI expert.

The conclusions of the study show that the model, a neural network, can learn a significant number of words and concepts from limited fragments of the child’s experience. Vong clarifies that the videos captured only about 1% of Baby S’s waking hours, but that this was enough to train their model.

It’s the first work to really look at this type of learning with real, natural data and in a way that more accurately reflects what babies see and hear.

Wai Keen Vong, first author (New York University)

Words and their visual equivalents

“In our research, we are interested in a system that learns the relationships between words and their visual counterparts – for example, how to recognize that the word ‘ball’ refers to images of round, bouncy things. Although many computational models of word learning have already been proposed, they have often been trained on simplified inputs or with certain built-in assumptions that don’t work very well (or at all!) when applied to real images or natural language,” says Vong.

The expert emphasizes that this is “the first work that really looks at this type of learning with real, natural data, in a way that more accurately reflects what babies see and hear.” In his opinion, the study’s findings show “how recent algorithmic advances, coupled with the experience of a single child, have the potential to reshape our understanding of early language and concept acquisition.”

The team analyzed Baby S’s learning process through weekly video sessions recorded between six and 25 months of age, amounting to more than 60 hours of footage. The audio contained approximately a quarter of a million spoken words, many of them repeated and linked to video frames of what the child was seeing as they were said, while he went about everyday activities such as eating, reading books, and playing.

Next, the NYU researchers trained a multimodal neural network with two separate modules: one processed individual video frames (the vision encoder) and the other processed the child’s transcribed speech (the language encoder).
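As an illustration of this kind of dual-encoder design, the sketch below shows, in Python (PyTorch), one way such a pair of modules could be written. The backbone, vocabulary size, and embedding dimension are assumptions made for clarity; this is not the authors’ actual implementation.

```python
# Illustrative dual-encoder sketch (not the authors' code): one module embeds
# video frames, the other embeds transcribed utterances, into a shared space.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class VisionEncoder(nn.Module):
    """Maps a batch of video frames to fixed-size embeddings."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        backbone = resnet18(weights=None)  # any image backbone would do here
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.backbone(frames)  # shape: (batch, embed_dim)


class LanguageEncoder(nn.Module):
    """Maps transcribed utterances (padded token ids) to fixed-size embeddings."""
    def __init__(self, vocab_size: int = 5000, embed_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Average the word embeddings in each utterance, ignoring padding.
        mask = (token_ids != 0).unsqueeze(-1).float()
        summed = (self.embedding(token_ids) * mask).sum(dim=1)
        return summed / mask.sum(dim=1).clamp(min=1)  # shape: (batch, embed_dim)
```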

Contrastive learning algorithm

The two encoders were combined and trained with an algorithm called contrastive learning, which aims to learn useful input features and their cross-modal associations. For example, when a parent says something in front of the child, some of the words used are likely to refer to something the child can see, meaning that understanding is built by linking visual and linguistic cues.
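A minimal sketch of a contrastive objective of this kind follows, assuming the two illustrative encoders above: frames and the utterances spoken at the same moment are treated as matching pairs and pulled together, while mismatched pairs within a batch are pushed apart. The temperature value is an arbitrary placeholder.

```python
# CLIP-style contrastive loss sketch: co-occurring frame/utterance pairs are
# the positives; every other pairing in the batch serves as a negative.
import torch
import torch.nn.functional as F


def contrastive_loss(frame_emb: torch.Tensor,
                     utterance_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Normalize so the dot product is a cosine similarity.
    frame_emb = F.normalize(frame_emb, dim=-1)
    utterance_emb = F.normalize(utterance_emb, dim=-1)

    # Similarity of every frame with every utterance in the batch.
    logits = frame_emb @ utterance_emb.t() / temperature  # (batch, batch)

    # Row i should match column i (the utterance heard with frame i).
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_f2u = F.cross_entropy(logits, targets)      # frame -> utterance
    loss_u2f = F.cross_entropy(logits.t(), targets)  # utterance -> frame
    return (loss_f2u + loss_u2f) / 2
```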

After training the model, the researchers tested it with the same type of assessment used to measure word learning in infants: they presented the model with a target word and a set of four different images and asked it to choose the image that matched the target word.
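In code, this four-alternative test boils down to a similarity comparison, roughly as in the sketch below; the function name and tensor shapes are illustrative assumptions, and the evaluation reported in the paper involves more bookkeeping than this.

```python
# Four-alternative forced-choice sketch: the model "picks" the candidate image
# whose embedding is most similar to the embedding of the target word.
import torch
import torch.nn.functional as F


def choose_image(word_emb: torch.Tensor,
                 candidate_image_embs: torch.Tensor) -> int:
    """word_emb: (embed_dim,); candidate_image_embs: (4, embed_dim)."""
    word_emb = F.normalize(word_emb, dim=-1)
    candidates = F.normalize(candidate_image_embs, dim=-1)
    similarities = candidates @ word_emb  # cosine similarity with each candidate
    return int(similarities.argmax())     # index of the chosen picture
```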

The results showed that the multimodal AI model was able to learn a significant number of words and concepts that occur in the child’s everyday life. In addition, the system could generalize some of the learned concepts to visual instances very different from those seen during training, which is also what happens with children when they are tested in the laboratory.

“These results suggest that this aspect of word learning is possible using the kind of real-world data that children receive, combined with relatively generic learning mechanisms such as those found in neural networks,” notes Brenden Lake, also from NYU and senior author of the study.

Improving AI language

On the impact the work could have on improving the language abilities of AI systems, Vong notes: “Children are extremely adept at learning language; at two years old, they already outperform our model. On the other hand, as adults we are exposed to only hundreds of millions of words, while the latest AI systems require billions or even trillions to become fluent. I think a deeper study of human language acquisition could shed light on how we manage to learn so efficiently from limited data, and it is to be hoped that this knowledge will also transfer to artificial intelligence,” he emphasizes.

A deeper study of human language acquisition could shed light on how we can learn so efficiently from limited data

Wai Keen Vong

The researcher also tells SINC that the study took ethical aspects into account, since it used first-person video footage of a child.

“Due to the sensitivity of the information, conducting research with this dataset – hosted on the NYU Databrary website – requires prior ethical approval from the university. This is something I kept in mind throughout the research. The other important consideration was preserving the privacy of the parents and the child. But I can tell you that his name is Sam, he is now 11 years old, he is doing very well and he is in the 6th grade,” says Vong.

Reference:

Wai Keen Vong et al. “Grounded language acquisition through the eyes and ears of a single child.” Science (2024).
