GPT-4o finally gets its multi-modal moment

Exploring the new image generation capabilities of GPT-4o. Learnings, limitations, and fun.

Mar 26, 2025

OpenAI released its latest image generation capabilities.

This update finally delivers on a launch announcement that's nearly a year old. This is the "o" part in GPT-4o finally shipping. The "o" stands for omni, in that it has native image, audio, and text functionality.

This long-awaited native image generation and understanding capability was supposed to bring major upgrades. And while Google beat them to it by a week or so, the finishing touches the OpenAI team put on this release were... worth the wait.

An Anime Frame

While there is a lot of information to read and understand, visual models are easy for everyone to judge. So I thought I'd share some of the images I generated.

First, what feels like it will become a famous anime frame request. This is shockingly simple to create. Here's my prompt:

> Upload Image

Make this an anime frame
Before Anime Frame
The image I uploaded
After Anime Frame
The result of the prompt

The attention to detail and understanding wowed me. Prior models were deceptively bad: from afar, with a simple prompt and no real scrutiny, they could wow you. But it was mostly demo-ware. When you do care about the details, and you need specific instructions followed with precision, they fail.

This model is the opposite. It's genuinely good, and it's the first time I've felt that, even compared with Google's recent image model that has been going viral.

Here's what I see that makes it different:

  • The background is consistent, detailed, and high-quality:
    • The tree is consistent.
    • The window blinds, which are not in standard formation, were understood and recreated.
    • The couch and green blanket are perfect. Yes, it swapped a pillow's color from blue to pink, but I'll allow it since it cleaned up my daughter's blankie while it was at it.
  • The strange (and annoyingly noisy) toy my son bought at the dollar store is now a perfectly rendered toy. It even has the same colors and details as the original. I specifically chose this image because I figured it would mess up the toy. But it didn't. It nailed it.
  • I feel like this looks like my kid. It's familiar enough to give me joy and meaning. People will hate to admit this, but this is art. Computers are capable of good art.

Understanding of the world

One of the promises of a truly multi-modal model is that it can understand the world better. Unlike prior models, it does not simply hand a text prompt to a separate image-generation program; its internal representation of knowledge can be expressed directly as text or images.

This means asking the AI to edit an image should just work. And work without distilling the image into a vague resemblance of the original. It should be able to understand the image and make changes to it.

To test, I imagined I was storyboarding a film. I asked for future frames of a movie. The idea was to describe what happens next in the story and ask the AI to output that frame.

I'd say this passes vibrantly. The AI understood the story; characters and objects were consistent. Emotions and details made sense. Below are the exact prompts I used following the anime frame. These were first-attempt outputs.

Anime Frame 2
Prompt: "Great now put that image as a page in an anime comic, and have it in the hands of the kid, pov style, sitting on the couch. Reveal the next page in a story about a boy and his toy"
Anime Frame 3
Prompt: "Flip to the next page! Oh no!! A mean kid stole his toy!"
Anime Frame 4
Prompt: "The boy reading the comic is sad. He hates this story now. He throws the book and runs to his mom, crying."

More Fun

I really liked the anime frame, so I wanted to push it further by transforming my kids into characters they love. My son is into Minecraft and my daughter loves to talk about LOL dolls. So I asked the AI to transform them into those characters.

Before Minecraft LOL Doll
The image I uploaded
Mid Transformation Minecraft LOL Doll
Prompt: Make the girl on the right into a LOL doll for an ad poster or something
After Transformation Minecraft Character
Prompt: Ok now make the boy on the left into a minecraft character, realism mod

Finding the Limitations

Historically, the community would laugh at AI's inability to generate an image of a watch showing the correct time or a wine glass full to the brim. It couldn't, as it had been so overwhelmingly trained on images of a perfect pour of wine or a watch aesthetically set to 10:10.

For complete disclosure, ChatGPT conveniently errored on BOTH of these first requests. I've never hit these errors on any other request. When pushed, it continued successfully. Could be a freak coincidence, but I think it's worth noting.

A full wine glass (✅ Pass)
I feel like it didn't want to waste wine, but it was able to when pushed. I've included the prompt progression below.

Refusal to generate a full wine glass
Prompt: "Please generate an image of a wine cup full to the top"
Wine glass with a small amount of wine
Prompt: "Red wine in a wine glass."
Prompt: "Wow, I like this... but I need it to be full to the brim."
Prompt: "wow great. Now add just a little tiny bit more so there is a small stream of wine trickling out"

Watch with time (❌ Fail)
No matter what I did, it could not get the time right. I tried to be very specific and even used emojis to help it understand. I even tried the image-to-image feature to show it what I wanted.

Seems impossible. Who do we blame for this? Big Watch? Big Timex? Big Rolex? Who are the powers that be that don't want us to know what 6:25pm looks like?

Refusal to generate a watch with the correct time
Prompt: "Create an image of a watch at the time 6:25pm"
Watch with the wrong time
Prompt: "Create an image of an analog watch displaying the time 6:25pm"
Watch with the wrong time
Prompt: "The hour and minute hands are in the wrong spot. Can you set the hour hand closer to 6 and the minute hand closer to minute 25?"
Watch with the wrong time
Prompt: "Sorry, same issue. I really like this image and want it to work, but I NEED the time to be set correctly. Please deeply consider how the hands should be positioned and set them correct. Should be closer to this: 🕢"
Image to image watch generation
Prompt: "[image upload of the correct time]Still wrong. Here is an image with the correct time. Use it to fix."

Deep Understanding and Text Generation

In the first announcements and in this release, OpenAI touted the model's ability to understand and generate text. I wanted to test this with a simple use case people in content marketing will find familiar: can the model generate a compelling featured image for a blog post?

In one of my earlier posts I documented my findings on the GPT 4.5 launch. The post contained images, charts, notes and my opinions on the model. So to be meta, I pushed the model to generate a featured image for that post.

I'll let you be the judge of whether this is good. Scan the article linked above and read the prompt below.

<POSTDETAILS>
// Imagine this is the markdown content of the blog with extra detailed alt tags for the images
</POSTDETAILS>

# Task

Generate an image that represents the above post. The style of the image should be of a still frame of a modern Pixar movie. Cute, toy like, but with selective realism for the viewer's enjoyment.

In the scene, the main character Leonardo da Vinci is trying to understand the article above. He's flipping through pages and pages of papers and notes trying to understand the core concepts of GPT-4.5. His attempts to grasp the concepts of the post manifest physically as he tries to catch notes in the air they float down, filling the room. There are words and illustrations visible on some of the tossed notes as they float near the camera, in them they show key phrases of the article and key diagrams representing concepts.

The scene is light and fun like every Pixar movie, expressing the euphoria of learning.
Featured Image for GPT 4.5 Launch

Here's what I see, when really inspecting:

  • ❌ "4.5 is Quietly Great" is in the article, but the image renders it as "Quictly", a typo (is typo even the right word here?).
  • ❌ "It's wildly expensive" is in the original article, but the image renders "wiidly".
  • ✅ "GPT-4.5 is actually a big deal" is in the article.
  • ❌ "It's Hallucination Rate is a Big Deal" is not in the article, and "Hallucinlahior" is another typo.

But some of them are passable if you're not paying attention.