Why It Accidentally Got Called Nano Banana 🍌 | The Rise of Gemini 2.5 Flash Image

📣 Executive summary

I am a product manager on the Gemini app team, and I led product work on what the world now knows affectionately as NanoBanana. Officially the model is Gemini 2.5 Flash Image, but a late-night placeholder name stuck and helped launch one of the fastest-spreading viral creative tools I have ever seen. In a matter of weeks people generated billions of images, new trends spread across countries, and entire emotional rituals formed around the ability to edit and recreate people and memories with a level of realism and consistency we had not achieved before.

This report-style article explains how we arrived at NanoBanana, the technical breakthroughs that made it possible, the role of internal teams who pushed the model's creative boundaries, how the model first went viral in an anonymous evaluation, the surprising origin of the NanoBanana name, the safeguards we built to address misuse, and where we are going next for image and video generation. I will tell this story in first person, mixing an inside look at product decisions with examples from real users and the creative communities that adopted the tool.

🛣️ My path to the Gemini app and why this felt different

I joined Google intending to stay for one year. That was more than a decade ago and, as sometimes happens, a one-year plan became a career. I started at YouTube working on user experience innovation and later moved into generative AI work at YouTube, where I built chatbots and explored how AI could enhance viewer experiences.

As generative models evolved, I felt the pull to go deeper. I wanted to be part of the teams shaping how people create and interact with media using AI. When the opportunity came to join the Gemini app team and focus on image and video creation and editing, I applied and joined the group. From the start I believed we were at an inflection point: advances in generative models were moving from novelty to capability, and the ability to create realistic images and videos would change how people express themselves, remember their past, and entertain each other.

👀 The first "wow" moment: seeing myself in an image

The very first time I tried this new model I uploaded a photo of myself and asked the system to put me in space. Instead of producing a blurry or slightly off rendering of my face (what I call the "AI distant cousin"), what came back looked unmistakably like me. It was the first time I felt the model had crossed from "looks like an AI image" to "looks like a real person I know." That moment made me realize we were on to something big, even if I did not appreciate how big it would get.

The problem we were solving is deceptively difficult. Humans are extremely sensitive to tiny deviations in faces. We notice when a nose is just a bit off, when a wrinkle is misplaced, or when a smile doesn't match the eyes. Those tiny errors break the illusion. Achieving facial and character consistency at the level where people recognize themselves and loved ones across different contexts required a new level of precision in generation and editing.

🔬 Technical shifts that unlocked character consistency

There are several layers to the technical progress we had to make in order to produce lifelike and consistent images:

  • Reference fidelity. The model needed to understand and preserve critical facial geometry, skin tone subtleties, hairline, and other identifiers from a source image. This required training and inference strategies that prioritize identity-preserving transformations.
  • Contextual rendering. Putting a person "in space" is not just pasting a head onto a new background. Lighting, perspective, and facial shading must match the context. We improved the model's ability to reason about scene lighting and apply consistent lighting changes to the subject.
  • Small feature precision. Tiny features like teeth alignment, the creases around the eyes, and expression-specific micro-geometry needed to be stable across edits. We invested in loss functions and datasets that penalized identity drift.
  • Prompting and templates. The Gemini app abstracts complex parameterization into user-friendly templates. That means we provide design affordances that help everyday users create consistent images without mastering technical prompts.

These improvements made it possible for people to look at a generated image and immediately identify themselves, family members, or pets rather than feeling they are looking at an AI approximation. That perceptual leap explains much of NanoBanana's early momentum.
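To make "penalizing identity drift" concrete, here is a minimal sketch of what such a training term can look like. This is an illustration under assumed choices (the frozen face embedder, the margin, the loss weight), not our actual training recipe, which is not public.

```python
import torch
import torch.nn.functional as F

def identity_preservation_loss(face_embedder, source_img, edited_img,
                               margin: float = 0.0):
    """Penalize identity drift between a source photo and an edited render.

    `face_embedder` stands in for any frozen network that maps a face
    crop to an identity embedding (a hypothetical component here).
    Cosine similarity between the two embeddings serves as a proxy for
    "does this still look like the same person?".
    """
    with torch.no_grad():
        src_emb = F.normalize(face_embedder(source_img), dim=-1)
    out_emb = F.normalize(face_embedder(edited_img), dim=-1)
    cosine_sim = (src_emb * out_emb).sum(dim=-1)     # 1.0 = identical identity
    return F.relu(1.0 - cosine_sim - margin).mean()  # grows as identity drifts

def total_loss(recon_loss, id_loss, lambda_id: float = 0.5):
    # Combined objective: image quality plus identity preservation.
    # lambda_id is a tunable weight, assumed purely for illustration.
    return recon_loss + lambda_id * id_loss
```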

🧠 The "Greenfield" team and model whisperers who unlocked creative use cases

Within the team we have a small group of expert prompters and researchers I often refer to as the Greenfield team or "model whisperers." When a new model arrives, we turn it over to them and give them permission to experiment wildly. Their job is to push the model to its limits, discover new use cases, and find the prompts and combinations that produce surprising outputs.

Some of the outputs they found were whimsical. They combined items and concepts in ways that follow no real-world logic but that make for entertaining and viral content. Examples include:

  • A couch made out of a potato: "couch potato"
  • A toilet made from seashells
  • A face created from Froot Loops cereal

These playful compositions showcased the model's ability to understand materials, texture blending, and shape transfer across objects. They also illustrated a broader point: an image model that both preserves identity and flexibly composes novel objects is a rich canvas for creativity.

🏁 LM Arena: anonymous testing and the masked moment

Before we had a mainstream launch, we submitted the model anonymously to LM Arena. For those unfamiliar, LM Arena is a public benchmark where users compare outputs from different models and vote on which image they prefer. It is often used by the research community and technically advanced users to evaluate model performance.

When we uploaded the model as "NanoBanana"—a placeholder name chosen late at night by a PM on our team—the model quickly rose to the top of LM Arena's rankings. That gave us empirical evidence that the model was producing outstanding results. But an even more interesting signal arrived from outside LM Arena: the model started trending on social platforms. People on X began to discuss NanoBanana and share outputs. Our internal metric for traffic, queries per second or QPS, spiked far beyond what we expected. The rise happened rapidly; the tool went viral around the world.

The LM Arena experiment was crucial for two reasons. First, it gave us a neutral, comparative metric that validated our improvements against other models. Second, the anonymity allowed the model to be judged on merit rather than brand association. That viral, anonymous validation created a moment where the technology spoke for itself.

📈 Viral growth and scale: billions of images and global trends

Following the anonymous phase and the official launch, usage exploded. In a short period people created well over 5 billion images with NanoBanana in the Gemini app. The volume came from a range of behaviors: individual users playing and experimenting, creators building trends, and localized viral phenomena in places like Thailand, the Philippines, Indonesia, Mexico, and Brazil.

Traffic metrics like QPS gave us a real-time view of the phenomenon. We saw steady adoption at first, then sudden surges when social influencers and celebrities began to share their NanoBanana outputs. Notably, a figurine trend started in Thailand where users crafted miniature statue-like images of themselves. This trend reached celebrities such as Kim Kardashian and Gordon Ramsay, which amplified it globally.

Beyond scale, what I found most meaningful were the sincere, emotional uses people found for the tool. The ability to restore old photographs, create Polaroid-style keepsakes that included people no longer with us, and compose images to honor family members at celebrations moved us. One family in Mexico used the tool to include a deceased grandmother in a baby shower Polaroid. Another user in Brazil included a deceased father in a graduation image. Those stories underscored how this technology is not just about novelty; it's a medium to remember and honor people and moments.

🎨 The cultural dynamics: how prompts and local creativity shaped trends

Different countries and communities developed their own styles and prompting culture. The "90-word prompt" phenomenon in Thailand is a great example. Users there developed elaborate prompts that included many stylistic modifiers and details, producing a distinctive aesthetic that spread to neighboring countries and eventually to Latin America.

Other trends were simpler and more universally accessible, such as Polaroid looks and restored family photos. The variety of trends demonstrates two important things: first, that people of all backgrounds will appropriate a technology based on what resonates culturally; second, that easy-to-use templates help accelerate adoption across demographics that might not write long prompts.

🍌 The accidental name: how NanoBanana stuck

We did not mean for "NanoBanana" to be the public name. The model's official title is Gemini 2.5 Flash Image. But when you submit a model anonymously to LM Arena, you have to give it a placeholder name. A PM on our team named Naina, working late at 2:30 or 3:30 a.m., came up with NanoBanana as a whimsical placeholder. We had no expectation the model would go viral under that name—but it did.

Once NanoBanana began trending, people continued to use that name even after our official launch. It was catchy and had personality, so we decided to embrace it. Today the Gemini app uses a banana emoji and other playful design signals to indicate where the model is available. The accidental name turned into a branding moment that helped the model feel approachable and human.

⚖️ Being bold and responsible: how we framed our approach

When you build a powerful image generation tool, ethical considerations are unavoidable. We framed our approach around two priorities: be bold and be responsible.

Being bold means honoring user requests as much as possible. Over time you will notice fewer blanket "I can't do that" responses in the app because we want to provide users with the creative freedom to express themselves. That said, there are well-understood limits to what a responsible platform should allow, and we maintain guardrails to prevent harmful misuse.

Being responsible means tooling and policy that help people understand what they are seeing and allow downstream verification. For images from the Gemini app we deployed a multi-layer provenance approach:

  • Visible watermark. Every image generated by the Gemini app carries a visible watermark on the bottom right. The watermark gives viewers an immediate clue that the image was AI generated.
  • Invisible watermark (SynthID). The visible watermark is easy to crop or edit out, so we paired it with SynthID—an invisible, robust watermark embedded into pixels in a way that survives cropping, recompression, and many edits. It is detectable by systems that look for it even if the image has been altered.
  • Detection and provenance services. We have internal capabilities to detect whether an image was generated by our models at the scale required for high-profile cases. We are working with trusted researchers and partners to make detection capabilities more broadly available so third parties can verify provenance without relying solely on Google to respond every time.

Provenance is subtle. Answering the question "Is this real?" requires more nuance than a simple yes or no. Changing color grading on a photograph can make it technically AI-generated if it was created or altered through an AI pipeline, but the underlying event—and the message—may still be authentic. SynthID and our provenance tools are designed to provide that nuance so people can distinguish between an entirely synthetic scene and a photo that has been enhanced.
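To show how the layers combine, here is a small sketch of how a downstream verifier might turn the two watermark signals into a verdict. The `ProvenanceResult` type and the detector behind it are hypothetical; SynthID detection is not a general public API today.

```python
from dataclasses import dataclass

@dataclass
class ProvenanceResult:
    visible_watermark: bool  # bottom-right visible mark spotted?
    synthid_detected: bool   # invisible pixel-level watermark found?

def provenance_verdict(result: ProvenanceResult) -> str:
    """Turn the two watermark layers into a human-readable verdict.

    Hypothetical interface: SynthID detection is currently limited to
    trusted testers, so this only illustrates the layered reasoning.
    """
    if result.synthid_detected:
        return "AI-generated or AI-edited (SynthID present)"
    if result.visible_watermark:
        return "Likely AI-generated (visible mark only; SynthID inconclusive)"
    return "No provenance signal (which is not proof the image is real)"
```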

🔍 How SynthID works and why it matters

SynthID is our invisible watermarking approach. Without getting lost in technical minutiae, the key design goals are these:

  • Persistence. The watermark should persist through common editing workflows: resizing, cropping, recompression, and some color grading.
  • Non-destructive. It must not meaningfully degrade image quality or visual fidelity.
  • Detectable. It must be detectable at scale by trusted tools and partners so provenance can be established quickly and reliably.
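For intuition about how an invisible watermark can satisfy all three goals, here is a toy frequency-domain example in Python. This is a classic textbook technique, not SynthID's actual algorithm (which is unpublished); it only illustrates why mid-frequency embedding tends to be both subtle and persistent.

```python
import numpy as np
from scipy.fft import dctn, idctn

def _pattern():
    # Pseudo-random pattern derived from a shared secret (the seed).
    return np.random.default_rng(seed=42).standard_normal((32, 32))

def embed_toy_watermark(gray: np.ndarray, strength: float = 2.0) -> np.ndarray:
    """Hide a pattern in mid-frequency DCT coefficients of a grayscale image.

    Mid-band coefficients are subtle enough to be invisible yet stable
    under resizing and recompression: the persistence/fidelity tradeoff
    described above. Again, this is NOT how SynthID works internally.
    """
    coeffs = dctn(gray.astype(float), norm="ortho")
    coeffs[32:64, 32:64] += strength * _pattern()
    return idctn(coeffs, norm="ortho")

def detect_toy_watermark(gray: np.ndarray, threshold: float = 0.05):
    """Correlate the mid-band against the secret pattern."""
    band = dctn(gray.astype(float), norm="ortho")[32:64, 32:64]
    corr = np.corrcoef(band.ravel(), _pattern().ravel())[0, 1]
    return bool(corr > threshold), corr
```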

We currently provide detection capabilities to trusted testers—academics, researchers, and vetted partners—so we can verify performance and gather feedback on how to make provenance information accessible to more people. Our ultimate aim is to democratize detection so anyone who encounters an image can query provenance data without having to route every request through our team.

🎥 Beyond still images: the progress of video generation

While much of the public attention has focused on static images, video generation is an equally exciting frontier. We recently shipped an update in the Veo family: Veo 3.1. This is an incremental model improvement that raises quality across the board for photo-to-video scenarios and improves reference-to-video generation.

Photo-to-video allows you to upload a still image and convert it into the opening frame of a generated video. This creates continuity between a well-composed photograph and a motion sequence, which is powerful for storytelling and social content. Reference-to-video allows you to take a likeness or object from a photo and place it into a new video context, preserving identity and object characteristics while animating them in plausible ways.
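For developers, a photo-to-video request looks roughly like the following sketch using the google-genai Python SDK. Treat it as a sketch under assumptions: the model id and some response field names may differ from the current SDK, so check the official documentation before relying on it.

```python
import time
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",  # assumed model id; verify in docs
    prompt="The subject slowly turns and smiles, cinematic lighting",
    image=types.Image(                 # the still that becomes frame one
        image_bytes=open("portrait.png", "rb").read(),
        mime_type="image/png",
    ),
)

# Video generation is long-running: poll the operation until it finishes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

# Response field names assumed from the public SDK examples.
video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("portrait_animated.mp4")
```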

We are also pleased to have expanded availability of photo-to-video to the European Union and the United Kingdom for the first time. That involved not just technical work but also careful consideration of localization, regulatory expectations, and alignment with our safety and provenance frameworks.

🧭 How we iterate: feedback, telemetry, and community signals

Product development for generative models is highly iterative. After a launch we do a few specific things:

  1. Observe telemetry. We watch what templates users choose, where they click thumbs up or thumbs down, what prompts fail, and where quality issues arise.
  2. Aggregate feedback. We read social posts, forum threads, and direct reports to spot emergent trends or concerns.
  3. Prioritize improvements. Signals inform our roadmap: whether to tune identity consistency, expand template design, or improve edge-case safety filters.
  4. Deploy and repeat. We ship improvements as they become robust and monitor for downstream effects.

Your interaction with the app matters more than you might think. If you click the thumbs down on a generated result, that signal helps the model and the product team know what to prioritize. If you share a successful image with a public caption, that helps us see how people are using the tool in the world. We ask users to help guide the product through feedback because this is "day one" in terms of discovering how people will use this capability.
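As a toy illustration of step 1, here is how thumbs-up and thumbs-down events might be rolled up into a per-template priority signal. The event schema here is hypothetical; our real telemetry pipeline is internal and considerably more involved.

```python
from collections import defaultdict

# Hypothetical feedback events: (template, rating) pairs.
events = [
    ("figurine", "up"), ("figurine", "up"), ("polaroid", "down"),
    ("restore", "up"), ("polaroid", "down"), ("figurine", "down"),
]

counts = defaultdict(lambda: {"up": 0, "down": 0})
for template, rating in events:
    counts[template][rating] += 1

# Rank templates by dissatisfaction rate to decide what to tune first.
def down_rate(c):
    return c["down"] / (c["up"] + c["down"])

for template, c in sorted(counts.items(), key=lambda kv: down_rate(kv[1]),
                          reverse=True):
    total = c["up"] + c["down"]
    print(f"{template}: {down_rate(c):.0%} thumbs-down ({total} votes)")
```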

🧩 Real-world examples and surprising uses

I want to highlight a handful of specific use cases that stood out to me because they show the breadth of what people did with NanoBanana:

  • Figurine trend. Users created small statue-like images of themselves, often with stylized lighting and ceramics-like textures. This was an early viral trend that spread from Thailand to celebrity audiences internationally.
  • Polaroid memories. People used templates to create Polaroid-style images that included themselves and loved ones. For users whose relatives had passed away, these were powerful ways to include those family members in present-day milestones—weddings, graduations, and baby showers.
  • Photo restoration. Users took decades-old photos with stains, tears, and discoloration and restored them with modern color, clarity, and sensor characteristics one might expect from a 2025 camera. The emotional impact of seeing a recognizable scene restored can be profound.
  • Playful compositions. Beyond emotional uses, many people made humorous edits for family groups or social sharing: profession swaps, exaggerated fantasy scenes, and surreal object combinations like couch potatoes.
  • Cross-modal creation. Creators used a NanoBanana-generated image as a reference frame for video generation workflows, combining static outputs with dynamic sequences to make short social videos.

The diversity of use cases reminded me that technology rarely defines culture; people do. A model gives capabilities, and communities decide how to use them. Sometimes those uses are touching, sometimes silly, and sometimes culturally innovative.

📝 Prompting patterns and templates: what works best

Many users ask what makes a good prompt for Gemini and NanoBanana. The answer depends on your goal. For everyday users, templates are the easiest entry point: they encapsulate the complex prompting into a single tap. For creators who want full control, here are some principles that have emerged from both our Greenfield team and the community:

  • Start with identity preservation. If you want the model to keep a person recognizable, include a short phrase that emphasizes the likeness: "preserve facial features and proportions" or "maintain identity with realistic lighting."
  • Specify context and style separately. Define who the subject is and then describe the environment: "portrait of [the person] in a lunar landscape, cinematic lighting, ultra-detailed."
  • Use reference tokens. If you upload a photo, reference it explicitly in your prompt: "use uploaded photo as reference for identity and pose."
  • Be concise but specific. Long prompts can add nuance, but clarity is more important than length. If a community trend uses a 90-word prompt and it works, that is because it codified many stylistic modifiers. Templates can replicate those bundled modifiers for general users.
  • Iterate. Use the thumbs up and thumbs down to guide the model and employ small changes across iterations to refine outputs.

For beginners, my personal recommendation is to try a template—figurine or Polaroid—to see how the system preserves identity in different styles. Once you feel comfortable, experiment with style modifiers like "hyperreal," "cinematic lighting," "analog film grain," or "hand-painted porcelain" to achieve distinct looks.
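Putting those principles together, here is a hedged sketch of an identity-preserving edit request via the google-genai Python SDK. The model id is an assumption based on the public naming, and the response-parsing details may vary by SDK version.

```python
from google import genai
from PIL import Image

client = genai.Client()  # assumes GOOGLE_API_KEY is set in the environment

# Identity first, then context and style, with an explicit reference.
prompt = (
    "Use the uploaded photo as the reference for identity and pose. "
    "Preserve facial features and proportions. "
    "Portrait of the person in a lunar landscape, cinematic lighting, "
    "ultra-detailed."
)

response = client.models.generate_content(
    model="gemini-2.5-flash-image",           # assumed public model id
    contents=[prompt, Image.open("me.jpg")],  # text plus reference photo
)

# Save the first image part of the reply.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("lunar_portrait.png", "wb") as f:
            f.write(part.inline_data.data)
        break
```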

🧰 Tips for first-time NanoBanana users

If you are reading this and want to try NanoBanana for the first time in the Gemini app, here is a practical step-by-step approach I suggest:

  1. Open the Gemini app and look for the templates on the zero-state page. Pick an approachable template like figurine or Polaroid.
  2. Upload a clear, well-lit photo. The model performs better when the source photo has good lighting and your face is unobstructed.
  3. Choose the template style and review the suggested prompt. You can keep it as-is for your first run so you see the baseline output.
  4. Generate, then evaluate. Use the thumbs up or thumbs down to give feedback. If you want to refine the image, adjust the prompt with specific style cues or re-upload a different reference photo.
  5. If you want to create something more ambitious, try combining a NanoBanana image with the photo-to-video workflow to make a short animated storyline.

Remember, templates were designed to lower the barrier and produce pleasing results quickly. They are also a safe way to explore identity-preserving features without mastering deep prompting techniques.

🔐 Safety, privacy, and user control

Image generation raises legitimate privacy questions. We take these seriously in both product design and policy. A few important commitments we have made include:

  • Guardrails for people and minors. We restrict certain sensitive or exploitative prompts and have policies around generating images of minors or altering images in ways that could promote harm.
  • Transparency. Watermarks and invisible provenance markers are part of making generated images self-describing so viewers can quickly understand whether an image is synthetic.
  • Detection collaboration. We are working with researchers and trusted partners to refine detection and provide tools that scale.
  • Iterative policy. This is day one for many of these features. We will adapt our approach as we learn from real-world usage, stakeholder feedback, and advances in detection science.

Our aim is to enable creative expression while minimizing harm. That balance is dynamic; we listen to users and the research community as we update policy and tooling.

🌍 Global and cultural considerations

One of the lessons from the NanoBanana rollout was the importance of cultural sensitivity and localization. Different regions adopted different templates and prompting styles, and we watched how local creators repurposed features in unexpected ways. For instance, the figurine meme had a particularly organic origin in Thailand and then propagated outward. That pattern reminded us to design with cultural flexibility in mind: templates should be granular enough to allow local creative flavors while preserving safety and provenance controls.

When expanding features like photo-to-video to new geographies, we also consider regulatory environments. The EU and the UK were an early focus because we wanted to bring photo-to-video there only after aligning with local expectations for safety, privacy, and provenance. Those conversations are ongoing worldwide and are part of how we prioritize new releases.

📣 Community, creators, and the responsibility to listen

Creators played a central role in shaping what NanoBanana became. Influencers, hobbyists, and everyday users acted as early evangelists and provided invaluable feedback. We treated that feedback as raw material for product improvement. The thumbs up/thumbs down telemetry, direct reports, and social signals all shaped subsequent model updates and template refinements.

Because of the scale of adoption and diversity of use cases, we rely on a continuous dialogue with the community. That means publishing clear policies, enabling accessible feedback pathways, and working with academic and civil society partners to understand societal impacts. If you have ideas about how we should change direction, tell us. Product teams are listening, and the feedback loop matters.

🔭 What comes next: iteration and expansion

Where do we go from here? A few priorities guide our roadmap:

  • Improve identity fidelity and robustness. We will continue to refine the model so identity consistency is even stronger across more extreme edits and lighting conditions.
  • Expand video capabilities. Video is the next frontier. We will improve temporal coherence, frame quality, and reference-to-video fidelity so that likenesses and objects remain consistent throughout a clip.
  • Broaden provenance tools. We aim to make detection and attribution more widely available so third parties and users can establish provenance without delay.
  • Enhance templates and creative workflows. We will add more templates and make it easier to chain image generation into other creative flows like video, sharing, and physical prints.
  • Improve accessibility and localization. More languages, cultural templates, and regionally tailored features are on the roadmap so the product serves diverse global communities.

Iteration will come from both telemetry and community input. Every thumbs down helps prioritize a bug fix or a model tuning. Every viral trend helps us understand what creative affordances people truly value.

🧾 Final reflections and practical advice

Working on this product and watching NanoBanana travel from a late-night placeholder name to a global phenomenon has been humbling. The technology unlocked both playful creativity and deeply personal uses that surprised me. Seeing a restored photograph bring tears to a grandmother's eyes, or watching a user include a family member who had passed away in a graduation image, reminded me that these tools can be both powerful and profoundly human.

If you want to try NanoBanana, start with a template, upload a well-lit photo, and explore. Use the visible watermark and SynthID to understand the provenance of images you encounter, and share feedback so we can improve. If you are a researcher or an organization interested in detection tools, reach out to participate in trusted testing programs so we can expand access responsibly.

"This is still day one, and we're learning from all of the ecosystem ... Please let us know when you think we should take a different direction." — a guiding principle for how we will continue to develop generative tools responsibly and iteratively.

I am excited about where we will go next. The combination of identity-preserving image editing, robust provenance, and improved photo-to-video generation will open new creative possibilities. And yes, whenever the next product gets an off-the-cuff placeholder name, I will be both nervous and amused. Congratulations to Naina for the name that stuck and to all the users who have shown us new ways to use these tools.

🧭 How to get involved and share feedback

If you are using the Gemini app right now, please take a moment when an image feels particularly good or particularly off and click thumbs up or thumbs down. If you see something worrying or a use case you think we should address differently, send feedback through the app or engage with our research partners. The product evolves faster and more responsibly when we have clear signals from real usage.

Finally, remember that AI-generated content is a collaboration between the tool and the human who guides it. Use it to tell stories, restore memories, and play. And keep asking questions about responsibility and provenance—we will keep answering them as we build.

Key takeaways

  • NanoBanana is the public nickname for Gemini 2.5 Flash Image, an image generation and editing model that rapidly achieved identity-preserving results.
  • Technical progress in facial consistency, lighting fidelity, and small-feature precision made the model feel like it captured real people rather than AI approximations.
  • The Greenfield team discovered many creative and playful use cases that demonstrated the model's compositional abilities.
  • Anonymous testing on LM Arena validated the model's quality and contributed to an early viral moment.
  • Adoption scaled quickly to billions of images, with global cultural trends like figurine and Polaroid effects and emotionally significant restorations.
  • We implemented visible watermarks and SynthID invisible watermarking for provenance and are working with trusted partners to broaden detection.
  • Video generation improvements (Veo 3.1) advance photo-to-video and reference-to-video quality, and we expanded availability to additional regions.
  • User feedback and community signals remain central to our iteration strategy; your thumbs up or down matters.

Thank you for reading. If you try NanoBanana, please share what you made and what surprised you. I look forward to seeing where people take this technology next.
