OpenAI has launched an amazing new text-to-video AI model called Sora, and it’s more impressive than we expected. This model represents a huge advance, as it can generate high-definition video clips of up to 60 seconds from basic text prompts. Sadly, it’s currently limited to a small group of trusted beta testers and creators.
The new model was unveiled this Thursday with Sam Altman, OpenAI’s CEO, asking for prompts on X and replying with the resulting videos. The videos displayed Sora’s ability to generate high-definition, complex, vivid scenes involving multiple characters, backgrounds, motions, and emotions purely from written instructions.
If you weren’t carefully looking for inconsistencies, you could easily be fooled into thinking that some of them were real videos.
Sora Represents Major Leap for AI Video Generation
While other companies like Google, Meta Platforms (META), and smaller startups have shown advancements in early-stage text-to-video models, Sora appears to represent a massive leap forward in length, realism, coherence, resolution, and attention to detail.
The videos shared by OpenAI featured an animated fluffy monster, a countryside scene of California during the gold rush, and a variety of cute animals in their natural environments. They showed seamless transitions, dramatic viewpoints, realistic movements, and lively character expressions—all generated automatically from brief text descriptions.
“I didn’t expect this level of sustained, coherent video generation for another two to three years”, said Ted Underwood, a tech expert and professor from the University of Illinois, to The Washington Post. The academic further commented: “it seems like there’s been a bit of a jump in capacity”, referring to the eye-popping output that the model managed to generate.
For context, an early AI text-to-video model called ModelScope generated an absurd video of actor Will Smith eating spaghetti less than a year ago. The progress since then is nothing short of astonishing.
How Does Sora Work?
The Microsoft-backed AI company said that Sora builds on the breakthrough image generation models DALL-E 2 and 3 by unifying how visual data is processed, allowing it to generate video frames, images, and other formats from the same foundation.
This also lets Sora reuse innovations like DALL-E 3’s recaptioning technique, which involves creating highly detailed text descriptions of visual data to train the models. OpenAI says that this helps Sora follow prompt instructions more accurately.
The company describes Sora as a “diffusion model”, beginning with random noise that is gradually transformed over many steps into the final output video. Sora also uses a transformer architecture, which OpenAI says allows superior scalability compared to previous video AI models.
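The "start from noise, refine over many steps" idea can be illustrated with a toy sketch. This is not OpenAI's implementation and says nothing about how Sora is actually built: the names `toy_denoiser` and `sample` are hypothetical, the "video" is just a short list of numbers, and the denoiser cheats by knowing the target, whereas a real diffusion model learns to predict noise from training data.

```python
import random


def toy_denoiser(x, target):
    """Hypothetical stand-in for a trained network.

    Returns the "noise" in x, i.e. how far each value is from the clean
    target. A real model would estimate this without seeing the target.
    """
    return [xi - ti for xi, ti in zip(x, target)]


def sample(target, steps=50, seed=0):
    """Turn random noise into the target over a fixed number of steps."""
    rng = random.Random(seed)
    # Step 0: pure Gaussian noise, one value per "pixel".
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        predicted_noise = toy_denoiser(x, target)
        # Remove a growing fraction of the predicted noise; on the final
        # step the fraction is 1.0, so all remaining noise is removed.
        fraction = 1.0 / (steps - step)
        x = [xi - fraction * ni for xi, ni in zip(x, predicted_noise)]
    return x


if __name__ == "__main__":
    clean = [0.2, 0.8, 0.5]  # assumed "clean" output for illustration
    print(sample(clean))
```

Each pass through the loop leaves the output slightly less noisy than before, which is the gradual transformation the article describes, only at a vastly smaller scale.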
OpenAI Claims That Breakthrough Understanding of Physics Empowers Sora
A key innovation highlighted is Sora’s deep understanding of the physical world – allowing realistic rendering of motions, expressions, backgrounds, and scene changes in adherence to real-world constraints. It can make even the most absurd scenes, like golden retrievers doing a podcast on a mountain, look incredibly realistic.
https://t.co/uCuhUPv51N pic.twitter.com/nej4TIwgaP
— Sam Altman (@sama) February 15, 2024
Subjects stay consistent as camera angles shift, characters move logically in relation to objects and each other, and atmospheric effects like weather persist throughout scenes.
“The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world”, OpenAI stated in a blog post published yesterday that showcased the model’s capabilities.
Sora is Not Perfect… Yet
OpenAI showed off Sora’s strengths, but it also made sure to note areas where the model still struggles. Complex multi-character interactions can break down at times, with odd physics slip-ups or disappearing objects, depending on the complexity of the input. Moreover, directions and sequences of events are not always followed precisely.
The model also does not inherently understand the broader context or consequences of actions – a person may take a bite of food but the food itself may fail to show the bite mark afterward, the OpenAI team explained.
“The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory”, the blog post reads.
Finally, Sora struggles with complex interactions and movements for certain things, especially hands. They usually look like normal hands but hand movements are often a bit off, causing an ‘uncanny valley’ effect that’s hard to ignore.
Sora is obviously really good, but it hasn't crossed uncanny valley yet. Just look at the woman's hands in the back. pic.twitter.com/IzltjLJefh
— Stephen Flanders (@SteveFlanders22) February 15, 2024
In any case, Sora’s output is still impressive, mind-blowing, and a bit scary considering the fast pace at which new models are being released. These mixed feelings are shared by members of the general public, professionals, regulators, politicians, and AI experts.
The Toughest Task Ahead: Balancing Sora’s Creative Potential and Associated Risks
As expected, OpenAI states that it will conduct extensive safety testing and build guardrails before any public release.
This includes adversarial testing around potential misuse cases like fake news and political misinformation, developing tools to automatically detect Sora-generated video, and requiring detailed prompt explanations from professional users to access certain types of dangerous or unethical content.
At the same time, OpenAI voiced excitement about the creative possibilities that Sora opens up, including easily generating storyboards, animations, and mock film scenes to assist video creators and artists.
The company says that it cannot predict all beneficial and harmful uses but wants to enable positive applications while increasing safety. Striking this balance will likely prove challenging amid the technology’s rapid progress.
OpenAI Calls Sora “An Important Milestone” Towards AGI
OpenAI described Sora as “an important milestone for achieving AGI [artificial general intelligence]” as the model manages to connect multiple modes of understanding and turn them into coherent actions.
The firm expects that new models building on top of Sora will push further toward the grand challenge of a true general AI, which is to create a human-like perception of reality that can observe, understand, reason, and act much like us in the midst of complex and open-ended real-world environments.
Concerns Around Deepfakes and Job Losses are Gaining More Traction as More Advanced AI Models Get Released
Sora’s impressive launch deepens existing concerns about the proliferation of high-quality fake videos that most people cannot easily identify as AI-generated “deepfakes”.
This has raised red flags around the potential to enable new floods of misinformation, scams, impersonation, political manipulation, and more.
OpenAI stated that it is working closely with policymakers to ensure the responsible deployment of models like Sora. However, many experts say that the technology is moving too rapidly for regulators and policies to keep pace.
Some creative professionals also voiced their worries about the economic impact of these tools, which could eliminate thousands of human jobs in industries like marketing, animation, visual effects, and more.
Recent demonstrations have included AI generating scene layouts, character movements, subtitles, scripts, and other support tasks that currently provide employment for hundreds of thousands of people globally. For example, these kinds of models could entirely replace the stock footage industry in just a few years. Why pay someone hundreds or thousands of dollars to create a basic scene when you can generate it in a minute or two for a small subscription fee?
FTC Cracks Down On AI Impersonation
As the buzz around Sora took off Thursday, the Federal Trade Commission (FTC) proposed expanding its impersonation rule to cover the impersonation of individuals, a move aimed at fraudsters who use AI tools to digitally impersonate real people.
1. Fraudsters are using voice cloning & other AI tools to impersonate individuals with eerie precision and at scale. @FTC proposes to expand its impersonation rule to cover impersonation of individuals, so these fraudsters would pay hefty penalties. https://t.co/8ON0G63ZjL
— Lina Khan (@linakhanFTC) February 15, 2024
Citing surging complaints and public concern over new technologies like hyper-realistic deepfakes, the proposed rules aim to curb the malicious use of generative AI for fraud, scams, and reputational damage.
This seeks to extend existing protections against identity theft and misrepresentation into the AI era as technology rapidly blurs the lines between what is real and what’s not.
OpenAI Says Early Feedback Critical
OpenAI stated that, while risky, releasing Sora research early is critical to enable feedback on safety and ethics before any applications are built using the technology.
From the public’s perspective, the cat is already out of the bag. There are going to be good and bad uses. OpenAI, regulators, and legislators will have to do their best to prevent harmful misuse of the cutting-edge technology.
The company emphasized its intentions to proceed gradually and responsibly despite intense public pressure and media hype. The next steps include expanded trials among researchers, creative professionals, policy experts, and public interest groups.
“Despite extensive research and testing, we cannot predict all beneficial and harmful uses”, OpenAI conceded. “Learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time”, the tech company further emphasized.
With generative AI models continuing to demonstrate new capabilities almost weekly, expect much more turbulence ahead at the intersection of technology, ethics, and governance.