Google can create videos of people talking, gesturing and moving from a photo

Google researchers have developed an artificial intelligence system that can create realistic videos of people talking, gesturing and moving from a single photo.

The technology, called VLOGGER, relies on advanced machine learning models to synthesize strikingly realistic video, opening up a range of applications while raising concerns about deepfakes.

Described in a research paper titled "VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis," the AI model can take a photo of a person and an audio clip as input and then produce a video that matches the audio, showing the person saying the words and making corresponding facial expressions, head movements and hand gestures.

The videos aren’t perfect, with some artifacts, but they represent a significant advance in the ability to animate still images.


The researchers, led by Enric Corona at Google Research, used a type of machine learning model called diffusion models to achieve the novel result. Diffusion models have recently demonstrated remarkable performance in generating highly realistic images from text descriptions.

By extending them to the video domain and training them on a huge new dataset, the team was able to develop an AI system that brings photos to life in an extremely compelling way.
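The diffusion process these models rely on can be illustrated with a toy sketch (this is an assumption-laden simplification, not Google's actual model): training gradually corrupts data with noise, and generation runs the process in reverse, denoising step by step.

```python
import numpy as np

# Toy illustration of the diffusion idea -- NOT the VLOGGER architecture.
# Forward process: mix signal with Gaussian noise; more noise at larger t.
# Reverse process: subtract the model's noise estimate to recover signal.

rng = np.random.default_rng(0)


def forward_noise(x, noise, t, num_steps=100):
    """Corrupt x at step t: keep sqrt(alpha) of signal, add sqrt(1-alpha) noise."""
    alpha = 1.0 - t / num_steps
    return np.sqrt(alpha) * x + np.sqrt(1.0 - alpha) * noise


def denoise_step(x_noisy, predicted_noise, t, num_steps=100):
    """One reverse step: remove the estimated noise and rescale."""
    alpha = 1.0 - t / num_steps
    return (x_noisy - np.sqrt(1.0 - alpha) * predicted_noise) / np.sqrt(alpha)


# With a "perfect" noise predictor, the reverse step recovers the signal exactly.
x = np.ones(4)                       # stand-in for an image
noise = rng.normal(size=x.shape)
x_noisy = forward_noise(x, noise, t=50)
x_rec = denoise_step(x_noisy, noise, t=50)
print(np.allclose(x_rec, x))         # True
```

In a real diffusion model the noise predictor is a learned neural network rather than an oracle, and hundreds of such reverse steps turn pure noise into an image or, as in VLOGGER's case, video frames conditioned on the input photo and audio.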

As the researchers write: "Unlike previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesizing humans who communicate."

A key element was the curation of a massive new dataset called MENTOR, which contains more than 800,000 different identities and 2,200 hours of video – an order of magnitude larger than what was previously available. This allowed VLOGGER to learn how to create unbiased videos of people of different ethnicities, ages, clothing, poses, and backgrounds.

The technology opens up a number of use cases. The paper demonstrates VLOGGER's ability to automatically translate videos into other languages by simply swapping the audio track, to seamlessly edit and fill in missing frames in a video, and to create entire videos of a person from a single photo.


One could imagine actors licensing detailed 3D models of themselves to generate new performances. The technology could also be used to create photorealistic avatars for virtual reality and games. And it could enable the development of AI-powered virtual assistants and chatbots that are more engaging and expressive.

Google sees VLOGGER as a step toward "embodied conversational agents" that can interact with people naturally through speech, gestures and eye contact.

VLOGGER can be used as a standalone solution for presentations, education, storytelling, low-bandwidth online communication and as an interface for purely text-based human-computer interaction.

However, the technology also has the potential to be misused, for example, to create deepfakes – synthetic media in which a person in a video is replaced by another person's likeness. As these AI-generated videos become more realistic and easier to create, challenges related to misinformation and digital counterfeiting could increase.

VLOGGER still has limitations. The videos generated are relatively short and have a static background. People don’t move in a 3D environment. And their behaviors and speech patterns, while realistic, are not indistinguishable from those of real people.
