Vision Banana: Generative Understanding Finally Comes to Vision

Published Jun 8, 2026

Snap a photo on the street, feed it into a model, and it tells you the tree in the picture is 13.71 meters away. Check Google Maps — actual distance: 12.87 meters. An error of 0.84 meters, under 7%.

The model behind this is called Vision Banana. It was never trained on distance. Never trained on geometry. Never trained on photogrammetry. It was trained on exactly one thing: drawing images.

Before getting into where this model came from, think about what you already do. You’ve probably used Gemini or GPT: snapped a photo and asked what’s in it. It can answer. Cat or dog. What the sign says. How many chairs are in the office. All fine. But suppose what you want is not a one-line answer but a per-pixel segmentation mask, or a depth map showing exactly how far each pixel is from the camera. Existing multimodal models can’t deliver that. They can see, but they can’t produce this kind of fine-grained structured output. In computer vision today, tasks like segmentation, depth estimation, and surface normal estimation each still depend on their own dedicated model architectures and training pipelines.

Vision Banana says: not necessary. It has one input, one output. The input is an image plus a text instruction, and the output is also an image. Every vision task gets reformulated as instruction-following image generation. One model, one set of weights, switching between tasks through prompts.

This story has already played out once in NLP. Before LLMs, NLP was a patchwork of specialized tasks. Sentiment analysis required training a sentiment classifier. Named entity recognition required training an NER model. Translation required training a translation system. Each task had its own dataset, its own output head, its own training pipeline. Then generative pretraining arrived. Predicting the next token — a task that looked like it could only do autocomplete — turned out, at scale, to subsume translation, summarization, question answering, reasoning, and code into a single model. Generation and understanding turned out to be two sides of the same coin.

But in vision, this story has been missing a native version. Sure, you can throw an image into Gemini or GPT-4V and ask what it is, and get an answer. But that is fundamentally piping visual signals into a language model; the understanding still comes from the LLM side, not from a truly vision-native system. Could a model that starts purely from image generation, the way an LLM starts from next-token prediction, naturally evolve an understanding of the visual world through its generative capability alone? This hypothesis has been floating around for years. Nobody had tested it at scale.

Vision Banana is the first model to validate it.

Last month, Google DeepMind published a paper titled Image Generators are Generalist Vision Learners. Kaiming He and Saining Xie appear in the author list as leadership sponsors. He is the inventor of ResNet, first author of the most-cited paper of the 21st century. Xie is a co-author of the DiT framework, the generative architecture underlying Sora. But this is not a paper carried by names alone. Its core finding: a pure image generation model, with zero architectural changes, beats SAM 3, in which Meta has invested heavily, on segmentation, surpasses the specialist model Depth Anything 3 on metric depth estimation, and outperforms Lotus-2 on surface normal estimation. Three tasks, no specialized architectures, no specialized loss functions: just the weights of a single generative model.

What this paper really rewrites is how vision tasks are defined.

From Outputting Labels to Outputting an Image

In traditional computer vision, an input image is followed by a “head.” Segmentation needs a mask head. Depth estimation needs a depth head. Surface normals need a normal head. Each head is a specialized network layer producing a particular kind of output, and each task has its own training pipeline.

Vision Banana flips this logic around. Since a generative model can only output RGB images, let every vision task output an RGB image.

Take semantic segmentation. Normally, you need a purpose-trained model to classify every pixel. Vision Banana only needs a prompt like: paint the cat’s ears orange, the exit sign blue, the background purple. The model generates a colored image. You cluster by color, and the class of every pixel falls out. How the colors are specified in the prompt doesn’t matter: natural language works, hex codes work, JSON works.
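The decoding step can be sketched in a few lines. This is a minimal illustration, not the paper's code: it swaps general clustering for a simpler nearest-palette-color assignment, and the `PALETTE` values, the `decode_segmentation` helper, and the toy image are all hypothetical.

```python
import numpy as np

# Hypothetical palette: class name -> RGB color requested in the prompt.
PALETTE = {
    "cat ears": (255, 165, 0),    # orange
    "exit sign": (0, 0, 255),     # blue
    "background": (128, 0, 128),  # purple
}

def decode_segmentation(image, palette):
    """Assign each pixel to the class whose prompt color is nearest in RGB space."""
    names = list(palette)
    colors = np.array([palette[n] for n in names], dtype=np.float32)  # (K, 3)
    pixels = image.astype(np.float32)[:, :, None, :]                  # (H, W, 1, 3)
    dist = np.sum((pixels - colors) ** 2, axis=-1)                    # (H, W, K)
    return dist.argmin(axis=-1), names

# Toy 2x2 "generated" image whose colors are slightly off the requested ones.
generated = np.array(
    [[[250, 160, 10], [5, 3, 240]],
     [[130, 4, 125], [120, 10, 130]]],
    dtype=np.uint8,
)
labels, names = decode_segmentation(generated, PALETTE)
print(labels)  # class index per pixel, indexing into `names`
```

Nearest-color assignment tolerates small color drift in the generated image, which is why the slightly noisy pixels above still land in the intended classes.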

Now take metric depth estimation. Depth values are in meters. The model can’t output numbers; it can only output colors. The paper’s approach is to map depth values to a color encoding: nearby objects get rapidly changing colors, distant ones stay roughly the same hue. This concentrates the model’s attention on the things closest to you — the ones you care about most. Once you have the generated color image, you reverse-lookup using the same encoding to recover the exact distance for every pixel. The entire mapping is invertible, leaving no ambiguity.
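To make the idea concrete, here is a rough sketch of one invertible depth-to-color scheme. It is not the paper's encoding. The squashing function `d / (d + SCALE_M)`, the value of `SCALE_M`, the path through the corners of the RGB cube, and the lookup-table decoder are all assumptions chosen for illustration.

```python
import numpy as np

# Assumed color path: corners of the RGB cube visited so that consecutive
# corners differ in one channel, which keeps the path from crossing itself.
PATH = np.array(
    [[0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
     [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1]],
    dtype=np.float64,
)
SCALE_M = 10.0  # assumed: depth (meters) that lands halfway along the path

def depth_to_color(depth_m):
    """Map depth in meters to RGB in [0, 1]; near depths move along the path fastest."""
    depth_m = np.asarray(depth_m, dtype=np.float64)
    t = depth_m / (depth_m + SCALE_M)             # [0, inf) -> [0, 1)
    s = t * (len(PATH) - 1)                       # position along the path
    i = np.minimum(s.astype(int), len(PATH) - 2)  # segment index
    frac = (s - i)[..., None]
    return (1 - frac) * PATH[i] + frac * PATH[i + 1]

def color_to_depth(rgb, samples=4096):
    """Invert the encoding by nearest-neighbor lookup against a dense table."""
    t_grid = np.linspace(0.0, 1.0, samples, endpoint=False)
    d_grid = SCALE_M * t_grid / (1.0 - t_grid)
    lut = depth_to_color(d_grid)                              # (samples, 3)
    dist = np.sum((rgb[..., None, :] - lut) ** 2, axis=-1)    # (..., samples)
    return d_grid[dist.argmin(axis=-1)]

depths = np.array([0.5, 2.0, 12.87, 50.0])
recovered = color_to_depth(depth_to_color(depths))
print(np.round(recovered, 2))
```

With this choice, one meter of depth change near the camera moves much farther along the color path than one meter at 40 meters, which is the property the paragraph above describes. The lookup decoder recovers depth only up to the table's resolution; a closed-form inverse is possible but longer.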

Surface normals are the simplest. The three components of a normal vector naturally live in the range [-1, 1]. A linear transformation maps them directly onto the three RGB channels, and the reverse mapping is equally straightforward.
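A minimal sketch of that linear mapping, assuming 8-bit RGB and an x, y, z to R, G, B channel order (the axis convention is an assumption; the function names are hypothetical):

```python
import numpy as np

def normals_to_rgb(normals):
    """Map normal components in [-1, 1] linearly onto 8-bit RGB values."""
    return np.round((normals + 1.0) * 0.5 * 255.0).astype(np.uint8)

def rgb_to_normals(rgb):
    """Invert the mapping, then renormalize to undo 8-bit rounding error."""
    n = rgb.astype(np.float64) / 255.0 * 2.0 - 1.0
    length = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(length, 1e-8, None)

normals = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.6, 0.0, 0.8]])
rgb = normals_to_rgb(normals)
print(rgb)
print(np.round(rgb_to_normals(rgb), 3))
```

The round trip is exact only up to quantization: each channel has 256 levels, so recovered components can differ from the originals by a few thousandths.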

All these tasks share the same design principle: disguise the answer as an image and teach the model to draw it. No new network structures are needed.

How Much Training Data It Used

A model beating three domain specialists simultaneously — intuition says it must have devoured enormous amounts of data. It didn’t.

The paper’s approach is to mix a very small proportion of vision task data into Nano Banana Pro’s existing image generation training stream for instruction tuning. In each batch, most samples are still the original generation data; only a tiny fraction are vision task samples.

The goal is to prevent the model from forgetting how to draw. The authors call this lightweight instruction tuning. The objective is to teach the model to render its understanding in a prescribed format on the canvas — not to teach it segmentation or depth estimation from scratch. They believe the understanding was already there, built during generative pretraining; instruction tuning just adds a layer of output-format alignment.
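The batch-mixing idea can be illustrated with a toy sampler. The article does not give the actual mixing ratio or data pipeline, so `vision_fraction=0.05`, the batch size, and the `mixed_batches` helper are placeholders for illustration only.

```python
import random
from itertools import islice

def mixed_batches(generation_data, vision_data, batch_size, vision_fraction, seed=0):
    """Yield batches where each slot is a vision-task sample with probability
    `vision_fraction` and an ordinary generation sample otherwise."""
    rng = random.Random(seed)
    while True:
        batch = []
        for _ in range(batch_size):
            source = vision_data if rng.random() < vision_fraction else generation_data
            batch.append(rng.choice(source))
        yield batch

# Toy stand-ins for the two data streams.
generation_data = [("generation", i) for i in range(1000)]
vision_data = [("vision", i) for i in range(50)]

batches = list(islice(
    mixed_batches(generation_data, vision_data, batch_size=32, vision_fraction=0.05),
    200,
))
vision_count = sum(1 for batch in batches for kind, _ in batch if kind == "vision")
print(f"vision share: {vision_count / (200 * 32):.3f}")
```

Because most slots still come from the generation stream, the original objective keeps dominating the gradient signal, which is the intuition behind the claim that the model does not forget how to draw.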

The paper compares the image generation quality of Vision Banana and the original Nano Banana Pro. The two are statistically tied: a 53.5% win rate on text-to-image and 47.8% on image editing, both close to the 50% that would indicate parity. After fine-tuning on vision tasks, the drawing capability barely budged. This is fundamentally different from the catastrophic forgetting familiar from traditional fine-tuning, where learning a new task can badly degrade previously learned abilities.

Experimental Results

The paper evaluates zero-shot transfer in five evaluations. None of the training data overlaps with any of the benchmark samples, so everything comes from capabilities generalized during pretraining.

Semantic segmentation, Cityscapes: mIoU 69.9, beating SAM 3’s 65.2. SAM 3 is a purpose-built segmentation architecture.

Referring expression segmentation, RefCOCOg: cIoU 73.8, slightly above the SAM 3 plus Gemini 2.5 Pro combined pipeline at 73.4. Referring expression segmentation is considerably harder than semantic segmentation. The model must first understand a natural language description, then locate the corresponding object — for example, “the man wearing the pink T-shirt” or “the cat that is stretching.”

ReasonSeg: gIoU 79.3, above SAM 3 plus Gemini 2.5 Pro at 77.0. This benchmark tests reasoning. An example: “the toaster being used as a game controller.” The model has to know that a toaster is not normally a controller, then find the one in the image that is being used as one.
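The three segmentation scores above use different IoU variants. The sketch below follows the commonly used definitions, and the paper's exact evaluation protocol may differ in details. mIoU averages per-class IoU, gIoU averages per-image IoU, and cIoU divides the cumulative intersection by the cumulative union. The function names and the empty-union convention are assumptions.

```python
import numpy as np

def iou(pred, gt):
    """IoU of two boolean masks; an empty union counts as a perfect match (assumed)."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def giou(preds, gts):
    """Mean of per-image IoU, so every image counts equally."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

def ciou(preds, gts):
    """Cumulative intersection over cumulative union, so large objects weigh more."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return float(inter / union)

def miou(pred_labels, gt_labels, num_classes):
    """Mean over classes of IoU, skipping classes absent from both maps."""
    scores = []
    for c in range(num_classes):
        p, g = pred_labels == c, gt_labels == c
        if np.logical_or(p, g).any():
            scores.append(iou(p, g))
    return float(np.mean(scores))

# Two toy "images" of four pixels each.
preds = [np.array([1, 1, 0, 0], dtype=bool), np.array([1, 1, 1, 1], dtype=bool)]
gts = [np.array([1, 0, 0, 0], dtype=bool), np.array([1, 1, 1, 1], dtype=bool)]
print(giou(preds, gts), ciou(preds, gts))
```

On the toy data the small object is half wrong and the large one is perfect, so gIoU (0.75) is lower than cIoU (about 0.83). This is why the two numbers are not directly comparable across benchmarks.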

Metric depth estimation: across the four datasets used to evaluate Depth Anything 3, Vision Banana achieves an average δ1 of 0.929, above Depth Anything 3’s 0.918. δ1 is the fraction of pixels whose predicted depth is within a factor of 1.25 of the true depth, so 92.9% of pixels meet that threshold. Moreover, Vision Banana requires no camera intrinsics at all, while most top traditional depth estimation models depend on this information.
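For reference, δ1 is conventionally computed from the ratio between prediction and ground truth, taken in whichever direction is larger. A small sketch with made-up depth values (the `delta1` helper and the numbers are illustrative, not from the paper):

```python
import numpy as np

def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels where max(pred/gt, gt/pred) < threshold.
    Assumes predictions are positive; pixels with gt <= 0 are ignored."""
    valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < threshold))

gt = np.array([4.0, 2.0, 5.0, 10.0])     # true depths in meters
pred = np.array([4.2, 2.6, 5.5, 7.9])    # predicted depths in meters
print(delta1(pred, gt))
```

The last pixel shows why the ratio form matters: 7.9 m is only 21% below 10 m, yet 10 / 7.9 is about 1.27, so it fails the 1.25 threshold.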

Surface normal estimation: across three indoor datasets, mean angular error 15.549 degrees, lower than Lotus-2’s 16.558 degrees.
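Mean angular error is the average angle between predicted and true normal vectors. A minimal sketch with toy vectors (the helper name is hypothetical):

```python
import numpy as np

def mean_angular_error_deg(pred, gt):
    """Average angle in degrees between corresponding normal vectors."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)  # guard arccos domain
    return float(np.degrees(np.arccos(cos)).mean())

gt_normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
pred_normals = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
print(mean_angular_error_deg(pred_normals, gt_normals))  # one exact, one 45 degrees off
```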

The most intuitive result comes from an informal test. A paper author took a phone photo at Kinkaku-ji in Kyoto. Vision Banana predicted the depth of a point in the scene as 13.71 meters. The author then measured the actual distance on Google Maps: 12.87 meters. This test isn’t part of any standardized benchmark, but it says something no table can: the model isn’t coasting on memorized training data. It really extracts absolute-scale information from a single image.
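The arithmetic behind this anecdote is easy to check, using only the two distances quoted above:

```python
predicted_m = 13.71   # Vision Banana's estimate
reference_m = 12.87   # distance measured on Google Maps

abs_error_m = abs(predicted_m - reference_m)
rel_error = abs_error_m / reference_m
ratio = max(predicted_m / reference_m, reference_m / predicted_m)

print(f"absolute error: {abs_error_m:.2f} m")              # 0.84 m
print(f"relative error: {rel_error:.1%}")                  # 6.5%
print(f"within the 1.25 ratio threshold: {ratio < 1.25}")  # True
```

A single point would therefore count as correct under the δ1 criterion, though one measurement says nothing about the distribution of errors across a whole image.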

What This Means

These numbers point to a larger shift.

Let’s look back at the path NLP took. Before LLMs, sentiment analysis required a sentiment classifier, entity recognition required an NER model, translation required a translation system. Each task was an independent problem with its own dataset, its own output head, its own loss function. Then people discovered that predicting the next token — a single task that looked like it could only do autocompletion — at scale, caused the model to spontaneously learn grammar, facts, reasoning, intent following, and long-range dependencies. These capabilities were not trained separately. They emerged internally, forced by the single objective of generation.

A sufficiently powerful next-token predictor cannot remain a shallow memorizer. The task itself compels it to build an understanding of language.

That was the lesson LLMs taught the entire NLP field: generation is a more fundamental way to learn than discrimination.

Here’s the strange part: this lesson has never been demonstrated in a native way in vision. In recent years, a lot of work has poured images into LLMs for understanding — multimodal models that answer questions about images, reason about images. The underlying logic is to borrow the LLM’s understanding capability to cover vision. The visual signal itself has not produced an independent, generation-driven capacity for understanding.

Vision Banana fills in this missing chapter. It doesn’t wrap a language model around images. It is the image generation model itself. Generation is its native language. Lightweight instruction tuning merely teaches it to render answers as images rather than words. With that single step, its visual understanding surpasses architectures purpose-built for these tasks across multiple benchmarks. This closely mirrors the LLM trajectory: in NLP, generative pretraining taught models to understand language; in vision, the same logic now has experimental evidence.

Whether this is the “GPT moment” for vision is too early to say. But the evidence this paper provides is substantial: a single model, changing only the prompt and nothing in the architecture, simultaneously reaches or exceeds domain specialists across five different benchmarks. That doesn’t look like coincidence.

For Most People, One Takeaway

Vision Banana is not another benchmark-sweeping story.

Over the years we’ve grown accustomed to the idea that visual understanding and image generation are two separate things. Segmentation models and generative models evolved independently — different teams, different frameworks, different datasets — their only intersection being the occasional glimpse of each other on a leaderboard. This paper suggests that the division itself may have been artificial. A person who can picture an object in their mind already possesses some degree of understanding of it: its shape, its material, its position in space. Generation and understanding may have always been one thing.

Vision Banana’s experiments simply put a set of measurable numbers behind that intuition.

Paper: Image Generators are Generalist Vision Learners
Project page: vision-banana.github.io