Google DeepMind Genie 3

Basic Information

| Attribute | Value |
| --- | --- |
| Organization | Google DeepMind |
| Type | Research (not commercially available) |
| Launch Date | August 2025 |
| Current Version | Genie 3 |
| URL | deepmind.google/discover/blog/ |
| Availability | Research only |
| SRL Score | SRL-4 (Pilot) |

Capability Profile

| Capability | Support | Notes |
| --- | --- | --- |
| Single Image to World | Full | Primary input |
| Real-time Interactivity | Full | <100 ms latency |
| Promptable World Events | Full | Text commands mid-session |
| 3D Export | None | Video stream only |
| Action Space | Partial | Navigation only; no object manipulation |
| Session Length | Partial | 3-5 minute persistence |

Standards Compliance

| Standard | Compliance | Notes |
| --- | --- | --- |
| glTF 2.0 | None | No 3D export |
| OpenUSD | None | No 3D export |
| Any 3D format | None | Outputs video frames only |

Key Insight: Genie 3 represents the video-based paradigm that cannot be addressed by traditional 3D standards. Its output is interactive pixel streams, not geometry.


Maturity Assessment

SRL Score: SRL-4 (Pilot)

Evidence:

  • Research demonstration with published papers
  • Not commercially available
  • Designed for AI agent training (SIMA), not consumer products

Interoperability

| Aspect | Details |
| --- | --- |
| Export Formats | Interactive video stream only |
| Import Formats | Single image, text prompts |
| API Availability | No (research only) |
| Known Integrations | DeepMind SIMA agent |

Strengths

  1. Real-time interactivity with emergent physics - Worlds respond to user actions with plausible physics simulation

  2. Infinite variation - Memory footprint independent of world size; can generate arbitrarily large environments

  3. Promptable mid-session events - Dynamic world modification through text commands during play


Limitations

  1. No exportable 3D geometry - Outputs locked to video stream; cannot be used in game engines

  2. Navigation-only action space - No picking, crafting, throwing, or object manipulation

  3. Non-persistent worlds - Worlds are reimagined each session; no save/load

  4. Short memory horizon - ~1 minute before “forgetting” off-screen elements


Architecture Diagram

The system uses next-token prediction to generate video frames:

  1. Input image processed by vision encoder
  2. Current state encoded as tokens
  3. Next frame predicted based on action input
  4. Frame rendered and displayed
  5. Loop continues in real-time
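The loop above can be sketched in code. This is a toy illustration only: `encode_image`, `predict_next_frame`, and the context window are hypothetical stand-ins for Genie 3's unpublished model components, not real DeepMind APIs. The bounded context (`CONTEXT_LIMIT`) mimics the short memory horizon noted under Limitations: state tokens that fall out of the window are "forgotten".

```python
from collections import deque

CONTEXT_LIMIT = 8  # toy stand-in for the ~1-minute memory horizon


def encode_image(image):
    """Stub vision encoder: turns the single input image into initial state tokens."""
    return [("img", pixel) for pixel in image]


def predict_next_frame(context, action):
    """Stub next-token predictor: the next frame depends on recent context + action."""
    return ("frame", len(context), action)


def run_session(image, actions):
    """Autoregressive loop: encode once, then predict/render one frame per action."""
    context = deque(encode_image(image), maxlen=CONTEXT_LIMIT)
    frames = []
    for action in actions:
        frame = predict_next_frame(list(context), action)
        frames.append(frame)   # step 4: frame rendered and displayed
        context.append(frame)  # step 5: prediction fed back as new state tokens
    return frames


frames = run_session(image=[0, 1], actions=["forward", "left", "forward"])
```

The structural point is that each frame is conditioned only on a rolling window of prior tokens plus the latest action, which is why interactivity comes cheap (one forward pass per frame) while persistent, exportable geometry never exists anywhere in the pipeline.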

MSF Relevance

Because its output is an interactive pixel stream rather than geometry, Genie 3 exemplifies a video-based paradigm that traditional 3D standards cannot address.

Implications for MSF:

  • Any MSF work must acknowledge this paradigm exists
  • Focus interoperability efforts on geometry-based systems
  • Monitor video-to-3D reconstruction research as potential bridge
  • Do not attempt to standardize video-based outputs directly

Research Value

While not directly standardizable, Genie 3’s architecture informs understanding of:

  • Real-time world generation requirements
  • Agent training environment needs
  • The fundamental tradeoff between interactivity and exportability