GPT-4o finally gets its multi-modal moment
Exploring the new image generation capabilities of GPT-4o. Learnings, limitations, and fun.
Mar 26, 2025
OpenAI released its latest image generation capabilities.
This update finally delivers on a launch announcement that's nearly a year old. This is the "o" part in GPT-4o finally shipping. The "o" stands for omni, in that it has native image, audio, and text functionality.
This long-awaited native image generation and understanding capability was supposed to bring major upgrades. And while Google beat them to it by a week or so, the finishing touches the OpenAI team put on this release were... worth the wait.
An Anime Frame
While there is a lot of information to read and understand, visual models are easy for everyone to judge. So I thought I'd share some of the images I generated with it.
First, what feels like it will become a famous anime frame request. This is shockingly simple to create. Here's my prompt:


The attention to detail and understanding wowed me. Prior models were deceptively bad. From afar, with a simple prompt and a personal lack of care, they could wow you. But it was mostly demo-ware. When you do care about the details, and you need specific instructions to be followed with precision, they fail.
This model is the opposite. This is good, and it's the first time I've felt that, including about Google's recent image model that has been going viral.
Here's what I see that makes it different:
- The background is consistent, detailed, and high-quality:
- The tree is consistent.
- The window blinds, which are not in standard formation, were understood and recreated.
- The couch and green blanket are perfect. Yes, it swapped a pillow's color from blue to pink -- but I'll allow it, as it cleaned up my daughter's blankie while it was at it.
- The strange (and annoyingly noisy) toy my son bought at the dollar store is now a perfectly rendered toy. It even has the same colors and details as the original. I specifically chose this image because I figured it would mess up the toy. But it didn't. It nailed it.
- I feel like this looks like my kid. It's familiar enough to give me joy and meaning. People will hate to admit this, but this is art. Computers are capable of good art.
Understanding of the world
One of the promises of a truly multi-modal model is that it can understand the world better. Unlike prior models, it does not simply ask a separate program to make an image based on text; its internal representation of knowledge can be expressed directly as text or images.
This means asking the AI to edit an image should just work. And work without distilling the image into a vague resemblance of the original. It should be able to understand the image and make changes to it.
To test, I imagined I was storyboarding a film. I asked for future frames of a movie. The idea was to describe what happens next in the story and ask the AI to output that frame.
I'd say this passes with flying colors. The AI understood the story; characters and objects were consistent. Emotions and details made sense. Below are the exact prompts I used following the anime frame. These were first-attempt outputs.



More Fun
I really liked the anime frame, so I wanted to push it further by transforming my kids into the characters they love. My son is into Minecraft and my daughter loves to talk about LOL dolls. So I asked the AI to transform them into those characters.



Finding the Limitations
Historically, the community would laugh at AI's inability to generate images of watches showing the correct time or wine glasses full to the brim with liquid. It couldn't, as it had been so overwhelmingly trained on images of a perfect pour of wine or a watch aesthetically set to 10:10.
For complete disclosure, ChatGPT conveniently errored on BOTH of these requests on the first attempt. I've never had these errors on any other request. When pushed, it continued successfully. It could be a freak coincidence, but I think it's worth noting.
A full wine glass (✅ Pass)
I feel like it didn't want to waste wine, but it was able to when pushed. I've included the prompt progression below.




Watch with time (❌ Fail)
No matter what I did, it could not get the time right. I tried to be very specific and even used emojis to help it understand. I even tried the image-to-image feature to show it what I wanted.
It seems impossible. Who do we blame for this? Big Watch? Big Timex? Big Rolex? Who are the powers that be that don't want us to know what 6:25pm looks like?





Deep Understanding and Text Generation
In the first announcements and this release, OpenAI touted the model's ability to understand and generate text. I wanted to test this with a simple use case people in content marketing will find familiar: can the model generate a compelling featured image for a blog post?
In one of my earlier posts I documented my findings on the GPT-4.5 launch. The post contained images, charts, notes, and my opinions on the model. So, to be meta, I pushed the model to generate a featured image for that post.
I'll let you be the judge if this is good. Scan the article linked above and read the prompt below.

Here's what I see when really inspecting:
- ❌ "4.5 is Quietly Great" is in the article, but "Quictly" is a typo.
- ❌ "It's wildly expensive" is in the original article, but "wiidly" is a typo.
- ✅ "GPT-4.5 is actually a big deal" is in the article.
- ❌ "It's Hallucination Rate is a Big Deal" is not in the article, but "Hallucinlahior" is a typo.
But some of them are passable if you're not paying attention.