Sora, OpenAI’s new video tool, still makes mistakes

Late last week, OpenAI announced Sora, a new generative AI system that creates short videos from text prompts. Although Sora is not yet available to the public, the high quality of the sample videos released so far has drawn both enthusiastic and worried responses.

Sample videos released by OpenAI, which the company says were created directly by Sora without modification, show results from prompts such as “a photorealistic close-up of two pirate ships battling each other as they sail inside a cup of coffee” and “historical footage of California during the gold rush.”

Thanks to the videos’ high quality (realistic textures, dynamic scenes, smooth camera movement and good temporal coherence), it is often hard to tell at first glance that they were generated by AI.

OpenAI CEO Sam Altman also posted several videos on X (formerly Twitter), generated in response to prompts suggested by users, to demonstrate Sora’s capabilities.

How does Sora work?

Sora combines elements of text- and image-generation tools in what is called a “diffusion transformer model.”

Transformers are a type of neural network first introduced by Google in 2017. They are best known for their use in large language models such as ChatGPT and Google Gemini.

Diffusion models, meanwhile, are the basis of many AI image generators. They start with random noise and iteratively refine it toward a “clean” image that matches a text prompt.

Diffusion models (in this case, Stable Diffusion) create images from noise through many iterations. (Stable Diffusion/Benlisquare/Wikimedia, CC BY-SA)
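The iterative denoising idea can be illustrated with a deliberately simplified sketch. A real diffusion model uses a trained neural network to predict and subtract noise at each step, guided by the text prompt; in the toy loop below, the “model” is replaced by a fixed pull toward an all-zeros target, purely to show the loop structure (all names and numbers are illustrative, not Sora’s):

```python
import random

def toy_denoise(noisy, steps=20):
    """Deliberately simplified sketch of the diffusion loop: start from
    random noise and repeatedly move toward a 'clean' target image.
    A real diffusion model replaces the fixed pull below with a trained
    neural network that predicts the noise to remove at each step."""
    x = list(noisy)
    for _ in range(steps):
        # Remove half of the remaining 'noise' on every iteration.
        x = [v + 0.5 * (0.0 - v) for v in x]  # the target image is all zeros here
    return x

random.seed(0)
start = [random.gauss(0, 1) for _ in range(16)]  # pure random noise
result = toy_denoise(start)
print(max(abs(v) for v in result))  # nearly zero: the 'clean' image
```

After 20 iterations, only about one millionth of the original noise remains, which is why diffusion outputs look clean despite starting from pure static.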

A video could be built from such a sequence of images. In a video, however, coherence and consistency between frames are crucial.

Sora uses the transformer architecture to manage how frames relate to one another. While transformers were originally designed to find patterns in tokens representing text, Sora instead uses tokens that represent small patches of space and time.
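The spacetime-token idea can be sketched as follows. OpenAI has not published Sora’s actual tokenizer or patch sizes, so the function below is only a conceptual illustration: a video, treated as a 3-D grid of pixel values, is cut into small non-overlapping blocks spanning a few frames and a few pixels, and each block becomes one token, analogous to how a language model tokenizes text:

```python
def spacetime_patches(video, t=2, h=2, w=2):
    """Split a video, given as a 3-D nested list [frame][row][col],
    into non-overlapping t*h*w blocks ('spacetime patches'). Each
    patch becomes one token for the transformer. The patch shape
    here is illustrative; Sora's real patch size is not public."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    patches = []
    for ti in range(0, T, t):
        for hi in range(0, H, h):
            for wi in range(0, W, w):
                # Flatten one t-frame, h-row, w-column block into a token.
                patch = [video[ti + dt][hi + dh][wi + dw]
                         for dt in range(t) for dh in range(h) for dw in range(w)]
                patches.append(patch)
    return patches

# A tiny 4-frame, 4x4-pixel 'video' where every pixel equals its frame index
video = [[[f] * 4 for _ in range(4)] for f in range(4)]
tokens = spacetime_patches(video)
print(len(tokens))  # 2 * 2 * 2 = 8 patches
```

Because each token spans several frames as well as several pixels, the transformer can learn relationships across time, not just within a single image, which is what gives the output its frame-to-frame consistency.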

In the lead, but with mistakes

Sora is not the first text-to-video model. Previous models include Meta’s Emu, Runway’s Gen-2, Stability AI’s Stable Video Diffusion and, most recently, Google’s Lumiere.

Lumiere, which launched just a few weeks ago, claimed to produce better videos than its predecessors. But Sora seems to be more powerful than Lumiere, at least in some ways.

Sora can output videos with a resolution of up to 1920 × 1080 pixels and in various aspect ratios, while Lumiere is limited to 512 × 512 pixels. Lumiere’s videos are around 5 seconds long, while Sora creates videos up to 60 seconds long.

Lumiere can’t create multi-shot videos, but Sora can. Sora, like other models, is reportedly capable of performing video editing tasks such as creating videos from images or other videos, combining elements from different videos, and extending videos in time.

Both models produce broadly realistic videos, but both can suffer from hallucinations. Lumiere’s videos are more easily recognized as AI-generated, while Sora’s appear more dynamic, with more interaction between elements.

However, upon closer inspection, many sample videos reveal discrepancies.

Promising applications

Currently, video content is produced by filming the real world or using special effects, which can be expensive and time-consuming. If Sora becomes available at a reasonable price, people could start using it as prototyping software to visualize ideas at a much lower cost.


From what we know about Sora’s capabilities, it could even be used to create short videos for some entertainment, advertising, and educational applications.

OpenAI’s white paper on Sora is titled “Video Generation Models as World Simulators.” The paper argues that larger versions of video generators like Sora can be “capable simulators of the physical and digital worlds and the objects, animals and people inhabiting them.”

If true, future versions could have scientific applications for physical, chemical, and even social experiments. For example, one could test the effects of tsunamis of different sizes on different types of infrastructure, and on the physical and mental health of the people nearby.

Achieving this level of simulation is challenging, and some experts say that a system like Sora is fundamentally incapable of doing so.

A complete simulator would need to compute physical and chemical interactions at the most detailed levels of the universe. In the coming years, however, it may well become possible to simulate a rough approximation of the world and create videos that look realistic to the human eye.

Ethical risks and concerns

The main concerns with tools like Sora revolve around their social and ethical implications. In a world already plagued by misinformation, tools like Sora can make the situation even worse.

It’s easy to see how the ability to create realistic videos of any scene you can describe could be used to spread compelling fake news or cast doubt on real footage. It could jeopardize public health measures, be used to influence elections, or even overwhelm the justice system with potentially false evidence.

Video generators also enable direct threats against specific individuals through deepfakes, especially pornographic ones. These can have a devastating impact on the lives of those affected and their families.

Beyond these concerns, there are also issues of copyright and intellectual property. Generative AI tools require large amounts of data for training, and OpenAI has not revealed where Sora’s training data comes from.

Large language models and image generators face similar criticism. In the US, a group of well-known authors has sued OpenAI over possible misuse of their material, arguing that large language models and the companies that use them steal authors’ work to create new content.

It’s not the first time recently that technology has gotten ahead of the law. For example, the issue of requiring social media platforms to moderate content has sparked heated debates over the past two years, particularly in the context of Section 230 of the US Code.

Although these concerns are real, experience shows that they will not slow down the development of video production technology.

OpenAI says it is “taking several important security measures” before releasing Sora to the public, including working with experts in “disinformation, hate speech and bias” and “developing tools to detect misleading content.”
