Research Perspective — BAIR, UC Berkeley

Vibe-Figma: Why Designers in 2026 Need Multimodal AI Agents to Keep Up

By the img2figma Team — AI Researchers, BAIR, UC Berkeley · 12 min read · Berkeley, California

We're going to coin two terms in this post, and we genuinely believe they'll stick: Vibe-Figma and Vibe Designing.

Here at BAIR (Berkeley AI Research), we've been watching two parallel revolutions unfold. On one side, software engineers are iterating at ridiculous speed thanks to vibe coding — natural language in, working code out. On the other side, designers are still stuck with largely manual workflows. Yes, the tools got prettier. But the fundamental loop — create, tweak, export, repeat — hasn't changed much in years.

That imbalance is becoming a real problem. In 2026, engineering teams can ship features in hours. Design teams are still taking days. The bottleneck has shifted, and it's shifted hard. Designers need their own version of vibe coding. They need Vibe Designing — a workflow where you describe the design intent, and AI handles the grunt work. And the tool that enables this inside Figma? That's what we're calling Vibe-Figma.

But to understand why this is possible now and wasn't even two years ago, we need to talk about the history. Because the story of multimodal AI is what makes all of this work.

The Multimodality Problem: A Brief History

The core challenge in AI graphic design has always been multimodality — the ability to work across different types of data simultaneously. Images are one modality. Text is another. Code is a third. Layout structure is a fourth. A real design tool needs to handle all of them, and the connections between them, at the same time.

This wasn't always possible. Let's trace the path.

The Road to Multimodal AI Design

2014

GANs (Generative Adversarial Networks)

Goodfellow et al. introduced GANs — two neural networks fighting each other to generate images. Revolutionary, but unstable to train. Images were blurry, small, and hard to control. The first proof that neural networks could generate visuals from scratch.

2019

StyleGAN & High-Resolution GANs

NVIDIA's StyleGAN showed GANs could produce photorealistic faces. But they were still limited to specific domains. You could generate a face, not a dashboard. No text understanding, no layout control, no multimodal capability.

2021

CLIP (OpenAI) — The Multimodal Breakthrough

This is where everything changed. CLIP learned to connect images and text in a shared embedding space. For the first time, an AI could understand that a picture of a cat and the words "a cat" refer to the same concept. This was the foundation that made text-to-image generation possible. Without CLIP, none of what followed would exist.
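The shared-embedding idea can be illustrated with a toy numpy sketch. These three-dimensional vectors are stand-ins for CLIP's real encoders, which map images and text into the same high-dimensional space (512-dim for ViT-B/32); the only point is that a matching image–text pair scores higher than a mismatched one:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors, after L2 normalisation.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

# Toy stand-ins for CLIP's two encoders. In the real model, an image
# encoder and a text encoder each project into the SAME shared space,
# trained contrastively so matching pairs land close together.
image_embedding_cat  = np.array([0.9, 0.1, 0.0])    # encoder(photo of a cat)
text_embedding_cat   = np.array([0.85, 0.15, 0.0])  # encoder("a cat")
text_embedding_plane = np.array([0.0, 0.2, 0.95])   # encoder("an airplane")

# The caption "a cat" scores higher against the cat image than "an airplane":
assert cosine_sim(image_embedding_cat, text_embedding_cat) > \
       cosine_sim(image_embedding_cat, text_embedding_plane)
```

CLIP's training objective pushes exactly this ordering to hold across hundreds of millions of image–caption pairs, which is what makes text a usable control signal for image generation.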

2022

DALL-E 2, Stable Diffusion — Diffusion Models Take Over

Diffusion models replaced GANs as the dominant image generation paradigm. Instead of two networks fighting, diffusion models learned to gradually denoise a random image into a coherent picture. Combined with CLIP embeddings, you could type a description and get a matching image. Stable Diffusion made this open-source, and the floodgates opened.

2023

Midjourney, SDXL — Quality Explosion

Image generation quality skyrocketed. Midjourney produced images that won art competitions. SDXL pushed resolution and detail. People started generating UI mockups — not great ones, but the potential was obvious.

2024

Flow Matching — The Next Generation

Flow matching models emerged as a more elegant successor to diffusion. Instead of learning to denoise, they learn direct probability flows from noise to image. Faster training, faster inference, better quality. Models like Stable Diffusion 3 and Flux adopted this architecture. Image generation became faster and more controllable.
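For readers who want the math: one common formulation (the rectified-flow variant used by models in the Stable Diffusion 3 family) interpolates linearly between a noise sample and a data sample and regresses a velocity field onto the constant displacement between them. This is a sketch of that objective, not the exact loss any particular model ships with:

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1]

\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\!\left[\,\big\lVert v_\theta(x_t, t) - (x_1 - x_0) \big\rVert^2\,\right]
```

Here $x_0$ is noise, $x_1$ is a real image, and $v_\theta$ is the learned velocity field. At inference, integrating $\mathrm{d}x/\mathrm{d}t = v_\theta(x, t)$ from $t = 0$ to $t = 1$ transports noise to an image, typically in far fewer steps than classic diffusion sampling.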

2026

Ideogram, NanoBanana — UI-Quality Generation

Current-gen models can produce pixel-perfect UI designs. Text rendering is accurate, layouts are coherent, and the visual quality rivals hand-crafted design. The "generation" problem is effectively solved for UI. But the "editability" problem remained wide open — until now.

That's over a decade of progress in generative AI, compressed into seven major milestones. Each one built on the last. GANs proved generation was possible. CLIP proved multimodality was possible. Diffusion models scaled it. Flow matching made it practical. And now, in 2026, we have image generators that produce design-quality UI.

But here's what the timeline misses: none of these milestones solved the editability problem. Every single model on this list outputs flat images. Beautiful, yes. Editable? No.

The Missing Link: Where Image Generation Meets Code

There's a parallel timeline that's equally important but gets far less attention in the design world: the rise of large language models and their ability to write code.

While image generation was evolving through GANs → diffusion → flow matching, text models were evolving through transformers → GPT → instruction-tuned LLMs → coding-specialized LLMs. By 2024, models could write production-quality code. By 2025, they could understand visual context and write code that renders specific layouts.

Two Timelines Converging

Visual AI Timeline

GANs (2014)

CLIP (2021)

Diffusion Models (2022)

Flow Matching (2024)

UI-Quality Gen (2026)

Code AI Timeline

Transformers (2017)

GPT / LLMs (2020)

Coding LLMs (2023)

Multimodal Coding (2025)

Design-Aware Code Gen (2026)

Visual AI + Code AI = Vibe-Figma

This is the key insight: the visual AI timeline and the code AI timeline have converged. For the first time, we have models that can both see (understand images at a structural level) and write (generate structured code that renders designs). Put them together with a detection layer and an inpainting layer, and you get a multimodal AI agent that can take a flat image and produce an editable design file.

That's not a theoretical possibility. That's what img2figma does today.

Vibe Designing: The Designer's Answer to Vibe Coding

Let's talk about the speed gap.

A software engineer in 2026, armed with a good coding AI, can go from idea to working prototype in an afternoon. Describe the feature, generate the code, test it, iterate. The feedback loop is measured in minutes.

A designer in 2026? Still opening Figma, creating frames, placing rectangles, typing text, adjusting colors, organizing layers, exporting assets. It's faster than it was five years ago, sure. But the gap between design iteration speed and code iteration speed has never been wider.

The Iteration Speed Gap (2026)

Engineer + Vibe Coding: Idea → Prototype in ~30 min
Designer + Traditional Tools: Idea → Mockup in ~4 hours
Designer + Vibe-Figma: Idea → Editable Design in ~5 min

Vibe Designing is the paradigm that closes this gap. Instead of manually constructing every element, the designer describes a concept (or feeds a reference image), and AI generates a fully structured, editable design. The designer's role shifts from construction to curation — guiding, refining, and making creative decisions rather than pushing pixels.

And Vibe-Figma is the specific embodiment of this paradigm inside the tool designers already use. It's what happens when multimodal AI agents become native to Figma. You feed in inspiration, and you get back editable layers. That's the workflow we're building with img2figma.

Why Multimodality Is THE Hard Problem

Since CLIP came out in 2021, the AI research community has been obsessed with multimodality. And for good reason — it's genuinely one of the hardest problems in the field.

The difficulty isn't in any single modality. We can generate great images. We can generate great code. We can detect objects. We can inpaint images. Each capability exists in isolation. The hard part is making them work together coherently on the same problem.

Think about what converting an image to a Figma design actually requires:

Modalities Required to Convert Image → Figma

📸 Visual Understanding

Parse the image, identify element boundaries, understand visual hierarchy and spatial relationships

📝 Text Recognition

Read all text content, detect font properties (family, size, weight, color), preserve exact strings

🎨 Image Manipulation

Remove foreground elements cleanly, reconstruct background textures, handle transparency

💻 Structured Code Output

Write valid Figma JSON with correct nesting, proper frame/group hierarchy, pixel-accurate positioning

🎨 Design Semantics

Understand that a group of elements is a "card," that repeated items form a "list," that some elements are interactive "buttons"

🆘 SVG / Icon Handling

Detect icons, distinguish them from text and images, extract or reconstruct as vector assets

No single AI model handles all of this. That's why the solution is a pipeline — a multimodal AI agent composed of specialized models working in concert. An object detector for understanding structure. An inpainting model for separating layers. A coding LLM for writing the design file. Each model handles its modality, and the agent orchestrates the handoffs.

This is what makes the img2figma approach fundamentally different from trying to build one monolithic model that does everything. The pipeline approach leverages the best model for each sub-task and composes them into something greater than the sum of its parts.
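To make the "structured code output" modality concrete, here is a simplified, hypothetical approximation of a Figma-style node tree — not the actual Figma file format. The key property is hierarchy: a "card" frame owns its children, and child coordinates are relative to the parent frame.

```python
# Simplified, hypothetical approximation of a Figma-style node tree.
# Node types and property names loosely echo Figma's conventions but
# are illustrative only. The "card" frame owns its children, and their
# x/y coordinates are relative to the parent frame's origin.
card = {
    "type": "FRAME",
    "name": "Card",
    "x": 40, "y": 120, "width": 320, "height": 180,
    "children": [
        {"type": "TEXT", "name": "Title", "x": 16, "y": 16,
         "characters": "Pro Plan", "fontSize": 20, "fontWeight": 700},
        {"type": "RECTANGLE", "name": "Thumbnail", "x": 16, "y": 56,
         "width": 288, "height": 96},
        {"type": "TEXT", "name": "CTA", "x": 16, "y": 148,
         "characters": "Upgrade", "fontSize": 14, "fontWeight": 600},
    ],
}

# A flat PNG has none of this structure. The conversion problem is
# recovering exactly this tree, with correct nesting, from pixels.
assert all(child["x"] < card["width"] for child in card["children"])
```

Every modality in the list above feeds into one slot of this tree: detection fills the boxes, text recognition fills `characters` and font properties, and design semantics decide which elements group into which frame.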

The Vibe-Figma Pipeline: From Research to Reality

Here's how the full Vibe-Figma pipeline works, from the research perspective:

Vibe-Figma Pipeline (Research View)

1

Image Input

Any UI image: AI-generated, screenshot, mockup, photo

2

UI Object Detection (trained on millions of UI samples)

Outputs: bounding boxes, element types, text content, confidence scores

3

AI Inpainting (element erasure + background reconstruction)

Outputs: clean background layer with all foreground elements removed

4

Multimodal Coding LLM (vision + detections → Figma JSON)

Outputs: structured Figma file with frames, text nodes, images, correct hierarchy

5

Figma Canvas Rendering

Native Figma nodes placed on canvas — fully editable by the designer

Each stage is informed by years of AI research. The detector builds on advances in object detection that started with R-CNN and evolved through YOLO, DETR, and specialized UI-detection architectures. The inpainting model leverages the diffusion/flow-matching revolution we traced above. The coding LLM is a product of the transformer revolution that started in 2017 and has been compounding ever since.
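The four AI stages above reduce to an orchestration loop, which can be sketched as follows. Every name here (`Detection`, `detect_ui_elements`, and so on) is illustrative rather than img2figma's real API, and the model calls are stubbed with toy outputs so the control flow is visible:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple        # (x, y, width, height) in pixels
    kind: str         # "text", "image", "icon", "button", ...
    text: str = ""    # recognised string, for text elements

def detect_ui_elements(image):
    # Stage 2 stub: a real UI detector returns boxes, element types,
    # text content, and confidence scores.
    return [Detection((24, 24, 200, 32), "text", "Sign up")]

def inpaint_background(image, detections):
    # Stage 3 stub: a real inpainter erases every detected foreground
    # element and reconstructs the background pixels underneath.
    return image

def generate_figma_json(background, detections):
    # Stage 4 stub: a real multimodal coding LLM writes the full node
    # tree; here we just place one text node inside a root frame.
    children = [
        {"type": "TEXT", "characters": d.text, "x": d.box[0], "y": d.box[1]}
        for d in detections if d.kind == "text"
    ]
    return {"type": "FRAME", "background": background, "children": children}

def image_to_figma(image):
    # The agent's job is orchestrating the handoffs between the
    # specialised models, one modality per stage.
    detections = detect_ui_elements(image)
    background = inpaint_background(image, detections)
    return generate_figma_json(background, detections)

doc = image_to_figma("landing-page.png")
assert doc["children"][0]["characters"] == "Sign up"
```

The pipeline shape is the point: each stage consumes the previous stage's output, so a better detector or a better inpainter can be swapped in without touching the rest.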

Why This Matters for Designers (Not Just Researchers)

OK, enough research context. Here's why this matters if you're a working designer in 2026.

The teams you work with are moving faster than ever. Product managers expect tighter iteration cycles. Engineers ship features before the design is even finalized. The pressure to keep up is real, and it's not going away.

Vibe Designing isn't about replacing designers. Not even close. It's about giving designers the same leverage that vibe coding gives engineers. You still make the creative decisions. You still guide the aesthetic. You still know what good design looks like and what the user needs. But instead of spending hours manually constructing mockups, you spend minutes iterating on AI-generated starting points.

That's the Vibe-Figma workflow:

  1. Generate — Use an AI image generator to visualize your concept. Iterate on the prompt until the output matches your vision.
  2. Convert — Use img2figma to convert that image into editable Figma layers. 60 seconds, fully automated.
  3. Curate — Edit, refine, polish. Change the heading. Swap the color palette. Adjust the spacing. This is where your design expertise shines.
  4. Iterate — Don't like the direction? Generate a new concept. Convert again. The cost of exploration drops to near zero.

The designer becomes a creative director rather than a pixel laborer. And the speed? Comparable to what engineers get with vibe coding. That's the balance we need.

From Berkeley to Product

Our team sits at the intersection of AI research and product design. We spent years in labs — here at BAIR and in other research groups — working on the individual components: object detection, generative models, vision-language systems, structured code generation. We published papers. We trained models. We pushed state-of-the-art numbers on benchmarks.

But at some point, benchmarks stop being interesting. The question shifts from "can we improve accuracy by 0.3%?" to "can we actually ship something that solves a real problem?" And the problem we saw — designers stuck in the manual loop while engineers accelerate with AI — was too real to ignore.

img2figma is the result. It's a research project that became a product. The pipeline draws directly from the multimodal AI research timeline — CLIP-style understanding, diffusion-era inpainting, transformer-based code generation — and packages it into a Figma plugin that any designer can use without knowing anything about the models underneath.

That's what we mean by Vibe-Figma. Not a buzzword. A real tool, built on real research, solving a real problem. Designers deserve to iterate as fast as coders. Multimodal AI agents are how we get there.

Try Vibe-Figma Today

Experience the Vibe Designing workflow. 4 free credits, no card required.