Generation Is Pretraining, in Vision Too
Google DeepMind's Vision Banana paper shows that training a model purely to generate images yields transferable visual representations strong enough, after light instruction tuning, to beat specialized discriminative models on segmentation and metric depth estimation. The finding is the visual analog of how LLM pretraining generalizes across language tasks.
Read more →
