Brando Koch

ML Engineer & CEO privatesynapse.ai

Sora text-to-video released by OpenAI

OpenAI has recently released Sora, a text-to-video model capable of generating one-minute videos. Sora represents a significant advancement over ChatGPT, and here's why.

What is Sora?

Sora is a groundbreaking advance in video understanding, although it is not yet marketed as such. Its primary function today is text-to-video generation: Sora can craft high-quality videos up to one minute long, featuring intricate scenes and accurate backgrounds. The level of detail and realism in these generated videos marks a milestone, making it increasingly difficult to tell AI-generated content from real footage.

Sora video example. Courtesy of OpenAI

To create Sora, OpenAI employed their signature approach: a massive engineering endeavor combined with access to some of the world's most powerful computing resources. They trained a scalable transformer, in this case a diffusion transformer, on vast amounts of internet-scale video data.
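At its core, diffusion training teaches a model to undo noise that was added to its training data. Below is a minimal sketch of that objective in PyTorch; the tiny MLP denoiser, the tensor shapes, and the linear noising schedule are illustrative assumptions, not OpenAI's actual setup.

```python
import torch
import torch.nn as nn

# Stand-in denoiser: a small MLP over flat "video patch" latents plus a
# noise-level input. Sora's actual denoiser is a large transformer.
denoiser = nn.Sequential(nn.Linear(64 + 1, 256), nn.ReLU(), nn.Linear(256, 64))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

latents = torch.randn(8, 64)           # toy batch standing in for video latents
t = torch.rand(8, 1)                   # random noise levels in [0, 1]
noise = torch.randn_like(latents)      # Gaussian noise to be predicted
noisy = (1 - t) * latents + t * noise  # simple linear noising schedule

pred = denoiser(torch.cat([noisy, t], dim=1))  # predict the injected noise
loss = nn.functional.mse_loss(pred, noise)     # standard denoising MSE loss
loss.backward()
optimizer.step()
```

Repeated over billions of video patches, this simple denoising objective is what lets the model later generate coherent video from pure noise.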

Sora vs. ChatGPT

One common misconception about ChatGPT was that it possessed an understanding of the world, often referred to as a "world model." However, as prominent researchers have noted, this was not the case. ChatGPT was trained solely on text, relying on next-word prediction. It is akin to a child born without sight or touch, learning solely by listening. This is not how humans naturally learn.
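To make "next-word prediction" concrete, here is a minimal sketch of that training objective; the toy vocabulary and the embedding-plus-linear model are illustrative assumptions, far simpler than the transformer ChatGPT actually uses.

```python
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
# Toy "language model": embedding + linear head. A real model runs a
# transformer over the whole prefix; this sketch only shows the loss.
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))

tokens = torch.randint(0, vocab_size, (4, 16))  # toy batch of token ids
logits = model(tokens[:, :-1])                  # score each position's next token
loss = nn.functional.cross_entropy(             # penalize wrong next-token guesses
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
loss.backward()
```

Everything the model "knows" has to be squeezed out of this one signal: which token comes next in text.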

How humans primitively learn:

Sight + Time = Video

Previous discussions about achieving Artificial General Intelligence (AGI) were hindered by this limitation. A fundamental understanding of the world, including basic physics and visual context, is crucial for AGI development.

Sora changes this paradigm. By generating videos that closely mimic reality and adhere to the laws of physics, the model demonstrates a form of fundamental understanding of the world and its physical principles. This represents a significant technological milestone.

Some may question the connection between video generation and understanding physics. However, to produce realistic videos, such as waves crashing on a cliff or a ball bouncing on the ground, Sora must implicitly learn underlying physical rules.
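A toy simulation makes the point concrete: to predict the next frames of a bouncing ball, a model must implicitly capture the same rules this explicit simulator spells out, namely gravity, velocity, and collisions. The constants below are arbitrary illustrative values.

```python
# Explicit physics a video model would have to learn implicitly.
y, v, g, dt = 10.0, 0.0, -9.8, 0.05   # height, velocity, gravity, timestep

for frame in range(10):
    v += g * dt                        # gravity accelerates the ball
    y += v * dt                        # position follows velocity
    if y <= 0.0:                       # collision with the ground
        y, v = 0.0, -v * 0.8           # bounce back with some energy loss
    print(f"frame {frame}: height={y:.2f}")
```

A video model gets no such equations; to render each next frame plausibly, it must recover their effect from data alone.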

The Primary Use Case: Video Understanding

While the text-to-video application showcases Sora's capabilities, its most powerful use case lies in video understanding, an aspect that has not yet been fully explored. This parallels the evolution of large language models: initially trained on next-word prediction, they later found far more diverse applications through instruction fine-tuning.

If a model can generate video, it must have some understanding of video, and that is where the most useful applications begin.

Consider the potential impact: replacing the complex scene-understanding systems in robotics, which rely on specialized sensors and algorithms, with a single Sora-like model. Such a model could enable robotic actuators to operate purely from video, inferring both the surroundings and the robot's own position.
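As a thought experiment, here is a hypothetical sketch of that pipeline. `VideoUnderstandingModel`, `SceneState`, and every method below are invented names for illustration; no such API exists today.

```python
from dataclasses import dataclass, field

@dataclass
class SceneState:
    obstacles: list = field(default_factory=list)  # obstacles seen in the video
    robot_pose: tuple = (0.0, 0.0, 0.0)            # inferred (x, y, heading)

class VideoUnderstandingModel:
    """Stand-in for a future Sora-like model used for perception."""
    def infer_state(self, frames: list) -> SceneState:
        # A real model would infer geometry, motion, and the robot's own
        # position from raw video; this sketch returns a fixed state.
        return SceneState()

def control_step(model: VideoUnderstandingModel, frames: list) -> str:
    state = model.infer_state(frames)  # perception purely from video frames
    return "stop" if state.obstacles else "forward"  # naive avoidance policy

print(control_step(VideoUnderstandingModel(), frames=[]))  # -> "forward"
```

The point is the shape of the system: one video-understanding model replacing the sensor fusion, mapping, and localization stacks that robots rely on today.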

This breakthrough opens the door to new forms of reasoning:

  • Understanding movement
  • Grasping temporal relationships
  • Analyzing object interactions

New and exciting use cases of this technology are soon to arrive. Thank you for reading.