How a Moonshot Led to Google DeepMind's Veo 3
Table of Contents
- 🔭 Intro: From a Moonshot to a Moment — Why I’m Writing This
- 🧭 The Moonshot That Started It All
- 🤖 Video Prediction, Robotics, and the Original Hypothesis
- 🧪 Early Progress and an Ongoing Evaluation Puzzle
- 🔬 Physics, Benchmarks, and the Limits of Rigidity
- 🗓️ The Timeline: From Veo 1 to Veo 3
- 🔊 Why Native Audio Made Veo 3 Pop
- 🎭 The Viral Moment: Why People Shared Yetis and Alien Interviews
- 🧩 User Trends and the Roadmap: Control, Iteration, and the Desire for More
- 🖼️ Image-to-Video vs Text-to-Video: The Hidden Complexity
- ✍️ New Prompting Methods: Drawing, JSON Prompts, and Accidental Magic
- ⌛ Coherence and the Long Generation Problem
- 🌐 Genie 3, World Models, and the Pixel vs Concept Debate
- 🛠️ Steerability: How Much Control Should Users Get?
- 🔁 Capability Transfer: How Video Understanding Helps Generation
- 🧱 Why Image Data Helps Video Models
- 🧭 Unresolved Research Questions That Keep Me Up
- 🔎 How We Listen to Users: Thumbs, Tweets, and Internal Feedback Loops
- 🔮 What I Think Is Next for Veo and Generative Video
- 🧾 Final Thoughts: A Moonshot That Pays Back
- ❓ Frequently Asked Questions
- 📣 Closing: Where to Watch Next
🔭 Intro: From a Moonshot to a Moment — Why I’m Writing This
I’m Logan Kilpatrick, and I hosted the conversation that dug into the origin story and technology behind Veo 3 with Veo’s co-lead, Dumitru (Dumi) Erhan. In this piece I’ll walk you through the whole arc — how a small exploratory program inside Google Brain in 2018 grew into the Veo family of models, the design choices we made, the technical problems we hit (and still haven’t fully solved), and why the native audio in Veo 3 created the viral moment it did.
This is a news-style deep-dive written from my point of view. I want to make three things clear up front. First, this is not a dry product sheet; it’s the story of how research becomes something people actually enjoy and use. Second, I’ll be blunt about the trade-offs, the uncertainties and the unresolved research questions we still face. Third, I’ll highlight concrete design decisions that shaped the models, because they’re often the kind of choices that determine whether something works at scale or remains a lab curiosity.
🧭 The Moonshot That Started It All
Back in 2018, a group of us inside Google Brain started exploring video generation. It began as curiosity more than a clear roadmap to any product. The team was playful and experimental — the sort of place that welcomes risky ideas. We called the effort Brain Video and framed it as a moonshot: push boundaries, try things that felt hard and see what sticks.
At that time, no one really expected generative video to have a transformative moment within a few years. The excitement around transformers and the advances in language and image models made us wonder: could we do something similar for video? We had access to data and compute, and senior researchers including Jeff Dean were willing to back us. So we started to explore — and we meandered.
- We tried many approaches rather than settling on a narrow path early.
- We experimented with concrete framings such as video prediction (what happens next?) and applications in robotics.
- We learned that some approaches would only show their value when scaled to the compute regimes we later had.
To quote the project’s own early intuition: “we formed what was then called a moonshot program.” That kind of institutional backing is what lets teams take long shots and iterate until a principled, scalable method emerges.
🤖 Video Prediction, Robotics, and the Original Hypothesis
One of the original mental models we had was to treat generative video as a prediction problem. That is, if you give the model some frames of context, can it predict what happens next? That framing has practical research motivations and a plausible application path to robotics. A robot that can visualize the consequences of its actions — to simulate a short future — could be safer and more capable.
Prediction as a setup is also a better-posed learning problem than “text-to-video” from scratch. If you have several frames of context, predicting the next frame(s) constrains the problem — geometry, object identity, lighting, and motion all give the model a structure to latch onto. In contrast, generating a video entirely from a short text prompt asks the model to invent all of that structure out of thin air: what cat, which camera angle, which lighting?
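To make the “better-posed” point concrete, here’s a minimal sketch of the prediction setup in PyTorch. It uses a toy model and random tensors in place of real video, and it does not reflect Veo’s actual architecture; the point is only that the target (the true next frame) is fully observed, so the loss is well defined without the model having to invent scene content.

```python
import torch
import torch.nn as nn

# Toy setup: predict the next frame from K context frames.
# Shapes: (batch, K, C, H, W) for context, (batch, C, H, W) for the target frame.
K, C, H, W = 4, 3, 64, 64

class NextFramePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # Stack the K context frames along the channel axis and map them to one frame.
        self.net = nn.Sequential(
            nn.Conv2d(K * C, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, C, kernel_size=3, padding=1),
        )

    def forward(self, context):            # context: (B, K, C, H, W)
        x = context.flatten(1, 2)          # -> (B, K*C, H, W)
        return self.net(x)                 # -> (B, C, H, W)

model = NextFramePredictor()
context = torch.rand(8, K, C, H, W)        # stand-in for real context frames
target = torch.rand(8, C, H, W)            # stand-in for the true next frame

# The supervision signal comes for free: it is simply the next frame of the video.
loss = nn.functional.mse_loss(model(context), target)
loss.backward()
print(loss.item())
```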
So while early work leaned toward future-frame prediction and robotics, the research team kept testing other directions. Over time, we realized that different components of the problem would benefit from scale in different ways. That meant some early ideas that looked good at small scale might explode in quality when cranked up to hundreds or thousands of TPUs — but we only discovered which ideas those were by trying them.
🧪 Early Progress and an Ongoing Evaluation Puzzle
The most striking change between 2018 and now is quality. If you look back at the videos we generated as an early team, they’re in a different galaxy compared to what Veo 3 can do. But quieter, fundamental problems didn’t just vanish: how to evaluate video generation is still messy, and there’s no universally accepted metric that maps to human preference reliably.
We use automated metrics to detect garbage models and to check for catastrophic failures. Those metrics are great as filters — they tell us if a model simply doesn't work. But they are weak guides for incremental improvements in real-world quality. A model that “cheats” a particular metric, say by producing prettier or higher-contrast outputs, can score better on those tests but look worse to humans.
Human preference evaluations are essential, but they’re costly and noisy. I’ll be honest: people’s subjective biases show up everywhere. Simple tactics like increasing saturation or contrast can fool human evaluators into preferring a model’s outputs even if those changes don’t represent a genuinely better understanding of content or motion. That’s a problem we continue to wrestle with.
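As a toy illustration of how side-by-side preference data gets read, the sketch below computes a win rate from hypothetical pairwise judgments and then checks whether wins correlate with a superficial property like saturation. The numbers are synthetic and the check is deliberately crude; it is only meant to show why a raw win rate alone can mislead.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical pairwise comparison log: for each prompt, a human picked model A or model B.
# wins_a[i] = 1 if model A's clip was preferred on prompt i, else 0.
wins_a = rng.integers(0, 2, size=n)

# Mean saturation of each model's clip on each prompt (stand-in values, not real measurements).
sat_a = rng.normal(0.55, 0.05, size=n)
sat_b = rng.normal(0.50, 0.05, size=n)

win_rate_a = wins_a.mean()
print(f"Model A win rate: {win_rate_a:.2f}")

# Crude sanity check: does preferring A track the saturation gap rather than content quality?
sat_gap = sat_a - sat_b
corr = np.corrcoef(sat_gap, wins_a)[0, 1]
print(f"Correlation between saturation gap and A-wins: {corr:.2f}")
# A strong positive correlation here would be a red flag that the eval is being
# gamed by a superficial property instead of genuinely better video.
```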
🔬 Physics, Benchmarks, and the Limits of Rigidity
One tempting evaluation idea is to measure physical plausibility. For example, can the model simulate accurate object dynamics, collisions, or conservation of momentum? That’s clearly useful for robotics and for scientific simulation.
But in practice, the majority of use-cases for generative video aren’t trying to mimic reality. A lot of the outputs people enjoy are fantastical: ASMR-style videos of objects being cut under impossibly dramatic lighting, alien interviews on a street corner, yeti documentaries, or tiny worlds that break physical laws. Those outputs are enjoyable precisely because they’re not realistic.
So a physics-based benchmark will help in certain domains — particularly robotic learning — but it won’t measure whether a model is delighting its users. We found that a model optimized only for physical plausibility can miss the kinds of creativity and humor that make generative video viral.
🗓️ The Timeline: From Veo 1 to Veo 3
Let me walk you through the product timeline I’ve been close to:
- 2018: Brain Video moonshot begins — lots of exploratory research and multiple approaches tested.
- May 2024 (I/O): Veo 1 launched as a research flagship — high quality but not yet optimized for broad user access.
- December 2024: Veo 2 released — reached a level of quality and inference efficiency that let us put it into many users’ hands.
- May 2025 (I/O): Veo 3 introduced — the big differentiator: native, interleaved audio generation with much-improved lip sync.
It’s worth stressing that Veo 1 was a proof point. Veo 2 was the moment we made the model accessible and practical. And Veo 3 amplified a capability that we had been incubating for some time — joint audio-video generation. That final piece turned out to be a game changer for user perception.
🔊 Why Native Audio Made Veo 3 Pop
We spent a long time thinking about audio. There were several lines of research and prior work, like the VideoPoet paper that explored mappings between audio and video modalities. But for Veo 2 the audio quality just didn’t meet the bar. When we launched Veo 2 in December 2024, adding audio would have meant shipping a subpar experience.
So we waited, invested more engineering and modeling effort, and built audio into the joint generation pipeline for Veo 3. The payoff was the lip-sync and interleaved audio-video behavior that made people stop and say: “How did we do without this?” The interleaving — generating audio and frames together so speech aligns with the mouth motion — made content feel real in a way silent or post-dubbed video never does.
“Once we ship them, they'll be like, how did we live without this until now?”
That line from the team captures the feeling well. It’s one thing to generate a cool-looking clip. It’s another to produce a short scene that speaks — literally — to the audience and matches mouth movement with the spoken audio. That sync was the viral spark for many early Veo 3 demos.
🎭 The Viral Moment: Why People Shared Yetis and Alien Interviews
We tested many things internally and thought rap video-style clips would be the breakout. We were wrong. People latched onto imaginative, humorous, and slightly absurd content. The yeti interviews and fantastical ASMR clips exploded because they combined novelty, humor and believable audio-visual sync.
What this told us is important: the use-cases that go viral aren’t necessarily the most technically impressive. They’re the ones that people find emotionally engaging and easy to iterate on. A duck that looks photorealistic but has no voice rarely travels as far as a silly character that seems to speak to you from the screen.
🧩 User Trends and the Roadmap: Control, Iteration, and the Desire for More
One clear user pattern is this: people like to be in control. They want to iterate. They want to tweak a specific detail — change a line of dialogue, move an object, alter lighting — without regenerating the whole scene and hoping for a happy accident.
That desire drives design and product trade-offs. You can imagine two approaches:
- Expose many sliders and parameters for advanced control (a product-side decision).
- Make the model more naturally steerable through richer prompt interfaces, reference images and iterative editing flows (a modeling-and-product hybrid).
We’re trying both, but my team and I prefer solutions that are intuitive and low-friction. Drawing on an image to indicate motion, or giving a “reference to video” where the user wants to see themselves in a different scene, feels natural. It avoids a hundred dropdowns while giving real control.
🖼️ Image-to-Video vs Text-to-Video: The Hidden Complexity
On the surface, people say “a video is just a stack of images.” That description hides a lot. While adding image data to video training helps a great deal, naive image-to-video isn't always what users mean. When someone provides an image and asks “animate this — make me swim with a mermaid,” they don’t expect an animation that simply extends the original frame. They expect transformation.
That introduces a learning mismatch. For an image-to-video workflow to do what users want, the model must understand the gap between the starting frame and the desired result and then produce plausible transitions, changes in pose, new objects, different backgrounds, and more. Often, users aren’t asking to animate the literal picture; they want a “reference to video” — keep the identity or style, but place it in a new context.
In practice, this challenge shows up in how we annotate training data and how we teach models to represent objects, people and scenes. It's not just a matter of copying pixels forward — it’s about inventing plausible trajectories, preserving identity, and respecting constraints like camera motion and occlusions.
✍️ New Prompting Methods: Drawing, JSON Prompts, and Accidental Magic
Another fascinating trend has been the emergence of new prompting techniques. People discovered that drawing on an image — sketching where an object should move, or indicating a camera path — is a powerful control mechanism. It’s intuitive because it maps directly to the user’s mental model: “put that there, move this over here.”
We also noticed a curious behavior: some users have had success with structured prompts written as JSON. That wasn’t intentional — the models weren’t trained to parse JSON. When it works for certain prompts, it’s an accidental synergy between how the model learned language patterns and how users structure requests. We’re considering offering guidance to users, because when these unexpected techniques work, they unlock interesting workflows.
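Purely as an illustration, here’s the flavor of structured prompt people have experimented with. The field names are the user’s invention, not a schema the model was trained on or that we officially support; the serialized JSON just gets pasted into the ordinary text prompt field.

```python
import json

# Hypothetical structured prompt; the keys are the user's own convention, not an official schema.
prompt = {
    "scene": "street interview with a yeti outside a coffee shop",
    "camera": {"shot": "medium close-up", "movement": "slow push-in"},
    "lighting": "overcast morning, soft shadows",
    "dialogue": [
        {"speaker": "reporter", "line": "So, how do you take your coffee?"},
        {"speaker": "yeti", "line": "Cold. Very cold."},
    ],
    "style": "handheld documentary, 8 seconds",
}

# Users paste the serialized JSON into the text prompt field as-is.
print(json.dumps(prompt, indent=2))
```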
To be clear: the drawing control we experimented with in our internal tests came from someone inside the team discovering that it works well. It’s a small example of how product intuition and creative discovery by users or engineers can influence the model roadmap.
⌛ Coherence and the Long Generation Problem
Length is a complicated trade-off. Veo 3 focuses on short, dense, high-quality clips (eight seconds by default in many demos), but people ask for longer sequences. The challenge is not only compute — longer durations increase inference cost and training complexity — but also coherence.
When you ask a model to produce an hour of content from a single short prompt, it typically degenerates. Without a storyline, characters and long-term constraints, generation collapses into filler: repeated motifs, drifting visuals, inconsistent identities. A great example we discussed internally is producing a looped mountain biking scene: the last frame should align with the first if it’s a loop. That kind of global constraint is hard for models to enforce when context windows are limited.
We have a few approaches on the table:
- Auto-regressive extension: stitch shorter clips together and ensure overlap consistency.
- Hierarchical generation: plan coarse structure (plot, camera moves, characters), then fill in frames.
- Concept-level planning: use non-pixel representations (concepts or latent states) for long-range planning, then render pixels locally.
Each approach has trade-offs. Stitching can introduce seam artifacts. Hierarchical methods need reliable planning modules. Conceptual representations are promising — they help reduce redundancy — but they lack ground truth: we observe images and videos, not the latent concept sequences we invent.
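To make the stitching idea concrete, here’s a minimal sketch of auto-regressive extension with a cross-faded overlap. The generate_clip function is a hypothetical placeholder, and a real system would condition the model on the overlapping frames rather than blending after the fact, so treat this as an illustration of the seam problem rather than a description of our method.

```python
import numpy as np

def generate_clip(context_frames, n_frames):
    """Hypothetical stand-in for a video model call: returns (n_frames, H, W, 3) in [0, 1]."""
    return np.random.rand(n_frames, 64, 64, 3)

def extend_video(n_clips, clip_len=48, overlap=8):
    video = generate_clip(None, clip_len)
    for _ in range(n_clips - 1):
        # Condition the next clip on the tail of what we have so far.
        context = video[-overlap:]
        new_clip = generate_clip(context, clip_len)

        # Cross-fade over the overlapping frames to hide the seam.
        alphas = np.linspace(0.0, 1.0, overlap).reshape(-1, 1, 1, 1)
        blended = (1 - alphas) * video[-overlap:] + alphas * new_clip[:overlap]

        video = np.concatenate([video[:-overlap], blended, new_clip[overlap:]], axis=0)
    return video

# 4 clips of 48 frames, minus 3 overlaps of 8 frames -> (168, 64, 64, 3)
print(extend_video(n_clips=4).shape)
```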
🌐 Genie 3, World Models, and the Pixel vs Concept Debate
Genie 3 is a contemporary example of a model optimized to generate long-form, consistent visual worlds in real-time. It raises a deeper architectural question: when we build world models for agents (real or virtual), should those models operate in pixels or in higher-level concepts?
Pixels are concrete: you can render them, measure differences, and train a model to make them look right. But pixels are also expensive to generate and highly redundant. Concepts (representations of objects, their relationships, intentions, or coordinates) are compact, potentially much more efficient, and easier to use for planning.
The problem is that concept spaces lack objective ground truth. There’s no single canonical representation of “what the world means.” That makes it harder to train, evaluate, and debug. If you want to train a robot by letting it imagine outcomes in a photorealistic virtual world, you want accurate physics and realistic visuals. But if the goal is higher-level planning, rich concepts might be more efficient.
So the debate continues: pixels for fidelity and direct rendering; concepts for efficiency and planning. It’s an unresolved research direction, and my strong sense is that both will play a role: pixels when photorealistic fidelity or physics matter, concepts when long-term planning and sample efficiency matter.
🛠️ Steerability: How Much Control Should Users Get?
Steerability sits at the intersection of product and modeling. Users want control but don’t want complexity. They want to iterate fast and cheaply. We experimented with a few approaches to improve this workflow:
- Low-fidelity drafts: allow quick “rough” renders that cost far less than full-quality outputs, then let users select and refine.
- Selective re-rendering: change part of a scene (a speaker’s line or an object) without regenerating everything.
- Image-based local edits: allow drawing and bounding-box edits to move objects or adjust camera framing.
There are financial and technical limits. Creating preview drafts is easy in principle, but it can be misleading if short previews don’t capture the dynamics users expect. Selective re-rendering is appealing, but implementing it so edits appear natural and consistent is non-trivial.
We’re prioritizing approaches that add control without a heavy interface burden. Drawing-based controls, reference images and natural language edits (coupled with small local regeneration) feel promising because they match how people think about the creative task.
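Here’s a minimal sketch of the selective re-rendering idea: regenerate only a masked region and composite it back into the original clip, so everything outside the mask stays untouched. The regenerate_region call is a hypothetical placeholder; a production system would condition the model on the surrounding pixels so the edit blends semantically as well as visually.

```python
import numpy as np

def regenerate_region(clip, mask, edit_prompt):
    """Hypothetical stand-in for a model call that re-renders only the masked region."""
    return np.random.rand(*clip.shape)

def local_edit(clip, mask, edit_prompt):
    """clip: (T, H, W, 3) in [0, 1]; mask: (H, W) in [0, 1], 1 where the edit applies."""
    edited = regenerate_region(clip, mask, edit_prompt)
    m = mask[None, :, :, None]                 # broadcast the mask over time and channels
    # Keep original pixels outside the mask; take regenerated pixels inside it.
    return (1 - m) * clip + m * edited

clip = np.random.rand(48, 64, 64, 3)
mask = np.zeros((64, 64))
mask[20:40, 20:40] = 1.0                       # edit only this square region
print(local_edit(clip, mask, "make the sign glow").shape)
```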
🔁 Capability Transfer: How Video Understanding Helps Generation
One practical success story: Gemini’s video understanding capabilities help Veo in a concrete way. We use Gemini to annotate video data at scale — captions, scene semantics, object relationships and action descriptions — which we then use to train Veo as the inverse mapping: caption-to-pixels.
That auto-labeling loop is powerful. Humans can annotate richly, but it’s expensive. Gemini provides verbose, precise captions at scale, and those captions become the textual conditioning data for Veo. The quality of those descriptions matters a lot: more detailed annotations help the model learn fine-grained concepts faster.
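Conceptually, the auto-labeling loop looks like the sketch below. The caption_clip function is a hypothetical stand-in for a video-understanding call, not an actual API; the point is that the captioner’s output becomes the text conditioning paired with the very pixels it described.

```python
from pathlib import Path

def caption_clip(path):
    """Hypothetical stand-in for a video-understanding model producing a rich caption."""
    return ("A ginger cat leaps from a kitchen counter onto a wooden stool, "
            "handheld camera, warm afternoon light, shallow depth of field.")

def build_training_pairs(video_dir):
    pairs = []
    for path in sorted(Path(video_dir).glob("*.mp4")):
        caption = caption_clip(path)                          # understanding direction: pixels -> text
        pairs.append({"text": caption, "video": str(path)})   # generation trains the inverse: text -> pixels
    return pairs

pairs = build_training_pairs("raw_videos")   # hypothetical directory of source clips
print(f"{len(pairs)} caption/video pairs ready for text-to-video training")
```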
That’s one reason we mix image and video training data. Images bring diversity of concepts — specific shoes, car makes, logos — that videos alone may not contain in sufficient density. Images teach the model a wide set of static concepts, and videos teach dynamics. Together they make the model stronger.
🧱 Why Image Data Helps Video Models
It’s tempting to think images are redundant when you have video. But images are surprisingly valuable because they expose many more unique concepts. A random video slice is likely to contain mundane views. Images can be curated and cover niche items in high detail. The model benefits by learning more object varieties, textures and compositions from images, improving the diversity and specificity of generated visuals.
So the training mix matters. We’ve published research showing that combining image datasets with video datasets yields better generation quality than training on video alone. The intuition is simple: images bring concept density and diversity; videos bring temporal coherence and motion priors.
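One common way to realize that mix is to treat images as single-frame clips tiled across time and sample each batch element from the image or video pool according to a mixture ratio. That framing is an assumption for illustration, not a description of Veo’s actual training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 16, 64, 64            # clip length and resolution for the toy example

def sample_video():
    return rng.random((T, H, W, 3))            # stand-in for a real video clip

def sample_image_as_clip():
    image = rng.random((H, W, 3))              # stand-in for a curated image
    return np.repeat(image[None], T, axis=0)   # tile the still across time: a motion-free clip

def sample_batch(batch_size, image_fraction=0.3):
    # image_fraction is a made-up mixture ratio; the real ratio is a tuning choice.
    batch = [
        sample_image_as_clip() if rng.random() < image_fraction else sample_video()
        for _ in range(batch_size)
    ]
    return np.stack(batch)                     # (batch, T, H, W, 3)

print(sample_batch(8).shape)
```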
🧭 Unresolved Research Questions That Keep Me Up
There are several unresolved issues we wrestle with every week. I’ll list the ones that I think are most consequential:
- Evaluation metrics: Can we design automated measures that align with human judgment without being gamed by superficial tricks?
- Long-range coherence: How do we generate long-form content with consistent character identities, plot and camera behavior without exploding compute requirements?
- Representational choice: When does it make sense to operate in pixels versus latent concepts? How do we create objective ground truth for representations?
- Steerability vs simplicity: How do we provide powerful editing tools without burying users in options?
- Efficient iteration: How do we give users cheap, informative previews that predict what full renders will look like?
These are not small problems. They will shape the next several generations of models and products. And honestly, I love that they’re open; that means there’s room for creative approaches and unexpected breakthroughs.
🔎 How We Listen to Users: Thumbs, Tweets, and Internal Feedback Loops
We read user feedback carefully. If you’re using the Gemini app or a Veo product you can tap thumbs up or thumbs down; we do a pass through that feedback religiously. At the same time, public reactions — memes, viral clips and Twitter threads — are invaluable. They tell us what people find delightful, and where expectations have shifted.
One fun observation: once people saw Veo 3 outputs with audio, their expectations for other models changed. Suddenly silent video feels incomplete. That kind of market effect reshapes what “good” means for everyone building in this space.
🔮 What I Think Is Next for Veo and Generative Video
At a high level, I expect the future to look like three parallel tracks:
- Incremental quality improvements: better rendering, improved lip-sync, more precise object identity.
- Product-level workflows: faster iteration loops, localized edits, reference-driven editing and simple drawing tools for control.
- Research breakthroughs: efficient long-range context handling, hybrid pixel-concept world models, and better automated evaluation.
We’re already working on many of these. Using Gemini to generate training annotations is one example of cross-model synergy that speeds research. Another is experimenting with hierarchical generation approaches that separate planning from rendering.
My favorite kind of feature is the one users didn’t know they wanted. When we ship features that make creation easier in obvious and delightful ways, they tend to stick — users begin to rely on them and expect them. Audio in Veo 3 is a great example; we thought it was important, but the viral reaction exceeded our internal predictions.
🧾 Final Thoughts: A Moonshot That Pays Back
The journey from a moonshot idea inside Google Brain to a widely used generative model isn’t linear. It’s full of iterations, bad ideas that teach great lessons, and lucky discoveries. The core engineering breakthroughs — the right inductive biases, pragmatic uses of compute, and good training data mixtures — matter. But product design and user intuition matter as much. If a capability doesn’t feel useful or intuitive, users won’t adopt it regardless of the benchmark scores.
At the end of the day, I’m excited. Veo 3 shows what happens when you pair long-term research patience with crisp product thinking: you get features that surprise and delight. The future will bring harder problems — longer coherence, richer interactivity, better evaluation — and we’re already rolling up our sleeves to tackle them.
❓ Frequently Asked Questions
How did Veo start and where did the idea originate?
Veo began as an exploratory moonshot inside Google Brain around 2018. The team wanted to push the boundaries of generative video, starting with video-prediction ideas and testing many architectures. Over several years, as compute and techniques scaled, we evolved the project into the Veo series of models that culminated in Veo 3.
Why did you focus on video prediction early on?
Video prediction is a well-posed learning problem: given context frames, predict what happens next. It’s grounded in observable dynamics and was initially attractive for robotics applications where simulating plausible short-term futures is valuable. Prediction gives strong structural constraints compared to generating video from a short text prompt.
When were the Veo models released?
Veo 1 was introduced as a flagship research model around Google I/O 2024. Veo 2 followed in December 2024 with a focus on making the model accessible to users through more efficient inference. Veo 3, which added native joint audio-video generation and significant improvements in lip-sync, was showcased at I/O the following year.
Why is native audio important for generative video?
Native audio interleaved with frames solves the lip-sync problem and dramatically increases perceived realism and engagement. People find videos with matched audio far more compelling; once users experienced Veo 3’s audio, expectations for generative video shifted broadly.
How do you evaluate generative video models?
We use a combination of automated metrics to filter out clearly failing models and human preference studies for nuanced quality measurements. Automated metrics are useful for catching catastrophic problems but aren’t sufficient for optimization. Human evaluations are costly and subjective, and we’re working to design better, less gameable automated proxies.
Can physics-based benchmarks evaluate video models?
Physics benchmarks are valuable for domains where realism and physical plausibility matter (e.g., robotics). But many generative video use-cases are intentionally fantastical. Therefore, physics-based evaluation is useful for specific purposes but not a general proxy for creative or entertaining video quality.
What’s the trade-off between generating short vs. long videos?
Longer videos cost more to train and run, and maintaining coherence over long durations is hard because context windows and long-range dependencies become an issue. Short, high-quality clips are often more useful and cheaper. For long videos, hierarchical or concept-based planning approaches are promising but still under active research.
How does image data help video generation?
Images provide a wider variety of static concepts and specific object instances that aren’t present in sufficient diversity in video datasets. Mixing image data into training improves concept coverage while videos teach motion and temporal consistency. The combination yields better, more diverse generation.
What is the difference between image-to-video and reference-to-video?
Image-to-video often implies animating the literal input frame. Reference-to-video means preserving identity or style but placing that reference into a new context or scene. Users generally expect the latter (transformative reuse of a reference) rather than literal temporal extension of the source image.
How do you plan to improve steerability and user iteration?
We’re exploring several approaches: cheap rough drafts for previews, selective re-rendering to edit portions of a scene, drawing-based controls to indicate motion, and reference-based edits. The aim is to offer intuitive controls that avoid overwhelming users with countless parameters while enabling meaningful iteration.
How do you use Gemini in Veo’s pipeline?
We use Gemini’s video understanding to auto-annotate large video datasets. Those captions and semantic descriptions function as conditioning data for training Veo (the inverse mapping: text-to-video). Gemini helps provide richer, more detailed annotations at scale, which improves training efficiency and output quality.
Will Veo models ever be good enough to generate an entire movie from a single prompt?
Not realistically from a single short prompt. Generating a compelling long-form movie requires storyline, plot consistency, character development, and long-range constraints. Current models are better suited to short, high-quality scenes. Future approaches combining planning, hierarchical generation, and concept representations may move the needle, but that remains a challenging research problem.
How should people provide feedback or report issues?
If you’re using Veo inside Google products, use the built-in feedback controls (thumbs up/down). Public reaction, social posts and developer feedback channels are also very useful. We regularly review feedback and it directly informs product and research priorities.
📣 Closing: Where to Watch Next
It’s been a privilege to take a moonshot through research and product iterations and see the result in people’s hands. Veo 3’s joint audio-video capabilities are one milestone among many. The deeper innovations — better evaluation, long-term coherence, efficient editing workflows — are the next frontiers.
If you want to follow along, keep experimenting, share the outputs that delight or disappoint you, and use the feedback mechanisms in the apps you try. Those reactions shape what the technology becomes.
Thanks for reading. I’m excited to see what you create with Veo and to keep pushing these problems forward with the team.