A deep dive into the architecture and tech stack of my content promotion agent
Apr 3, 2025
In 2025, everyone's talking about agents. But not a lot of people are shipping them.
This is the first post in a two-part series where I document a real-world agent I've built and now use to help promote my blog posts. This post is for engineers. It covers the architecture, tech stack, prompting strategies, and lessons I've learned. A future post will walk through the "why" and how it works in practice for content and distribution.
The way it works is simple: it reads a URL, writes a Twitter thread, makes a video, and posts both to X and TikTok, automatically.
For example, to promote my last blog post I needed to give it this:
In order to get this:
GPT-4o just launched native image generation, finally delivering on its "omni" promise. I've tested it extensively. Here are the most impressive (and surprising) results.
When I trigger the agent (via text message or a web UI), I give it a URL and a reason I want to share it. From there, it:
What it really does? It gives me an excuse to play with a bunch of tools I'd been itching to use. This was just a way to smash them all together into something real and useful.
AI models: OpenAI (gpt-4o, gpt-4.5-preview, o3-mini) and Google Gemini (gemini-2.0-flash), used via ai-sdk
Our first task is to understand what the user is asking us to talk about. The goal here is to extract the most relevant information and condense it, since we don't have unlimited context length in the models used later.
AI makes this insanely simple. Gemini 2's massive context window means you can hand it any website's HTML and get your JSON back in one API call.
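As a minimal sketch of that one call, using ai-sdk with zod (the schema fields here are illustrative, not the full set my tasks extract):

```ts
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

// Illustrative schema -- the real fields depend on what the later steps need.
const webpageSchema = z.object({
  title: z.string(),
  author: z.string().nullable(),
  summary: z.string().describe("A few paragraphs condensing the post"),
  keyPoints: z.array(z.string()),
  imageUrls: z.array(z.string()),
});

export async function extractWebpage(url: string) {
  // Naive fetch of the raw HTML; see the scraping note below.
  const html = await fetch(url).then((res) => res.text());

  // One call: hand Gemini the whole page, get structured JSON back.
  const { object } = await generateObject({
    model: google("gemini-2.0-flash"),
    schema: webpageSchema,
    prompt: `Extract the title, author, a condensed summary, key points and image URLs from this HTML:\n\n${html}`,
  });

  return object;
}
```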
Note: If you intend to scale this, you'd need to use a service like firecrawl.dev or Browserbase. Direct scraping like this is against Trigger.dev's terms of service.
While that works, the first challenge is limited contextual understanding. The resulting data has no understanding of the images on the page, and it's missing key information about the people and companies mentioned in the post. It's obvious to us, but proper @mentions and the right visuals are key to creating engaging social content.
We solve this with two parallel tasks:
Here's some simplified code for how I implement both in a new task whose purpose is gaining a full understanding of the webpage.
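Stripped of the Trigger.dev wrapper, and with illustrative prompts and schema fields, the two pieces look roughly like this:

```ts
import { generateText, generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Describe each image with a vision-capable model so the later agents can pick visuals.
async function describeImages(imageUrls: string[]) {
  return Promise.all(
    imageUrls.map(async (url) => {
      const { text } = await generateText({
        model: openai("gpt-4o"),
        messages: [
          {
            role: "user",
            content: [
              { type: "text", text: "Describe this image in one or two sentences for a social media editor." },
              { type: "image", image: new URL(url) },
            ],
          },
        ],
      });
      return { url, description: text };
    }),
  );
}

// Pull out the people/companies mentioned so they can be properly @mentioned later.
async function enrichEntities(summary: string) {
  const { object } = await generateObject({
    model: openai("gpt-4o"),
    schema: z.object({
      entities: z.array(
        z.object({
          name: z.string(),
          kind: z.enum(["person", "company", "product"]),
          xHandle: z.string().nullable().describe("Best guess at the @handle, to be verified"),
        }),
      ),
    }),
    prompt: `List the people, companies and products mentioned here:\n\n${summary}`,
  });
  return object.entities;
}

// Run both in parallel and merge the results into one shared context object.
export async function understandWebpage(summary: string, imageUrls: string[]) {
  const [images, entities] = await Promise.all([
    describeImages(imageUrls),
    enrichEntities(summary),
  ]);
  return { summary, images, entities };
}
```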
All of this gets re-merged into a resulting understandWebpageOutput object, which becomes the shared context for both the thread and video generation agents.
Below is how I visualize this object in the Next.js app. You'll see in the task log that the AI has found and enriched two entities (OpenAI and Google) and a lot of images. Next to that, I show a render of the webpage content with image descriptions overlaid on top. This is a simple way to visualize the data and make sure the AI is doing a good job.
We're finally in a good spot to start creating content about this page.
Right before we move to the next step, I integrated my brain to ensure my opinions would be carried into any content created.
In the prior step we were "using AI", but this is the first "agentic" step. We're going to implement an evaluator-optimizer loop, where two agents are pitted against each other until one of them wins.
We're going to pit OpenAI's best creative writing model, gpt-4.5, against its smartest, most logical model, o3-mini. GPT-4.5 will write our thread and o3-mini will evaluate it. If the thread becomes good enough, it will be posted to X. If not, the thread will be sent back to the writer for more iterations.
Because the concept of the task is simple (it's a loop), I figure the more interesting part is the prompts.
Here's how it works:
The writer uses gpt-4.5-preview, which is more of a creative writing model. Most notably, given its release date, it is not a reasoning model, meaning it doesn't think before responding. So we use the now-old trick of getting it to document a chain of thought before answering, to buy it an ounce more intelligence.
Here's an example of the writer's first attempt.
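For illustration, the shape of that writer prompt is something like this; the real one is longer and far more opinionated, and the <thinking> tag is just one way to force the chain of thought out:

```ts
// Illustrative writer system prompt (heavily trimmed).
const writerSystemPrompt = `
You are writing a thread for X about the provided webpage.

Before writing, document your chain of thought:
1. Inside <thinking> tags, note the hook, the key points to cover, and which
   entities deserve an @mention.
2. Then write the thread: one numbered tweet per line, each under 280 characters.
`;
```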
The evaluator uses o3-mini, which is a reasoning model. These models are quite good at logical reasoning, instruction adherence, and double checking intuitions. It does a great job of holding GPT-4.5's more creative mind to the instructions. As a baseless aside, I think it's very important that the evaluator is a different model than the writer. Models tend to like their own work, so the more different the models the better.
The evaluator always writes positive and negative feedback followed by a numeric score. If that score is high enough we continue; if not, we loop back to the writer with the feedback added to the conversation log and keep working.
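Stripped of the real prompts and the Trigger.dev plumbing, the loop itself is roughly this (the 0.8 threshold and the feedback-as-user-message trick are as described above; everything else is simplified):

```ts
import { generateText, generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
import type { CoreMessage } from "ai";

const evaluationSchema = z.object({
  positives: z.array(z.string()),
  negatives: z.array(z.string()),
  score: z.number().min(0).max(1),
});

export async function writeThread(brief: string, maxIterations = 5) {
  // The conversation the writer sees. Feedback gets appended as user messages.
  const messages: CoreMessage[] = [{ role: "user", content: brief }];

  for (let i = 0; i < maxIterations; i++) {
    // 1. The writer drafts (or revises) the thread.
    const { text: draft } = await generateText({
      model: openai("gpt-4.5-preview"),
      system: writerSystemPrompt, // the kind of prompt sketched earlier
      messages,
    });
    messages.push({ role: "assistant", content: draft });

    // 2. The evaluator scores the draft against the brief.
    const { object: evaluation } = await generateObject({
      model: openai("o3-mini"),
      schema: evaluationSchema,
      prompt: `Evaluate this thread against the brief.\n\nBrief:\n${brief}\n\nThread:\n${draft}`,
    });

    // 3. Good enough? Ship it (after stripping any <thinking> block).
    if (evaluation.score >= 0.8) return draft;

    // 4. Otherwise, feed the critique back as if the user wrote it.
    messages.push({
      role: "user",
      content: `Feedback on your last draft (score ${evaluation.score}):\n- ${evaluation.negatives.join("\n- ")}\nPlease revise the thread.`,
    });
  }

  throw new Error("Thread never reached the quality bar");
}
```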
Here's an example of the evaluator's output:
You'll see how the score steadily improved from 0.2 to 0.8 over three iterations. If the score is below 0.8, I grab the JSON and add it back to the conversation history AS IF the user wrote it, urging the writer assistant to make changes for me.
In total this took 6 minutes and 52 seconds, but I'm not fussed about the time. Once complete, the thread is posted to X (see Step 4). It's also handed to the video step, so that compute doesn't go to waste.
Now it's time to create a video. We'll use the thread as a basic outline for the video, because we need to ask the AI to do a lot in this next step. Not only do we want it to write a script, we need it to create a video composition object that we can use with Remotion to generate a video.
We're going to use another evaluator-optimizer loop. Again, the loop part is easy, so here are the system prompts and a template of the script that's reused between the writer and evaluator. That shared prompt contains the specific instructions we need to follow if we want the video to work once this loop is complete; in a way, they describe the options available in the Remotion project.
To kick off the loop we'll send in information on the webpage understanding, the thread, and any brain data.
Again, we'll use gpt-4.5 as the writer. It's a lot for this model to digest, but it's basically the only model that can do it while remaining creative.
There are definite improvements to be made here by breaking this into a bunch of smaller steps. For example, we could iterate on an outline. Then once we pass that, we could iterate on a video composition object. I'd put a lot of money on getting significantly better results if we did that. But I didn't get there, yet. Done is better than perfect, and I wanted to see the whole thing working first before any optimizing.
Instead, I just asked it to break out its chain of thought in hopes of giving it a bit more test-time compute.
Here's an example of what the script writer can do on its first try.
I won't belabour the point here, as we follow the same idea as before: o3-mini acts as the evaluator, giving the models the ability to iterate toward something good, accurate, and complete.
What I'll note is that I've tried to increase the harshness of the evaluator, and I further raised the threshold for the final score to 0.9, as I've noticed the evaluator gets increasingly lenient as the volume of data increases. You can see we start at 0.8.
In total, this took 8 minutes and 30 seconds to run. I'll parse the resulting script into JSON to move on to rendering. We do this by taking that last XML message and using GPT-4o to parse it into a JSON object. The JSON schema for this exactly matches the video schema of our Remotion comp.
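As a sketch, with an illustrative schema standing in for the real composition schema:

```ts
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Illustrative -- the real schema mirrors exactly what the Remotion comp expects.
const videoCompositionSchema = z.object({
  title: z.string(),
  sections: z.array(
    z.object({
      id: z.string(),
      script: z.string().describe("What the avatar says in this section"),
      visual: z.enum(["avatar", "screenshot", "broll"]),
      imageUrl: z.string().nullable(),
    }),
  ),
});

export async function parseScript(finalXmlMessage: string) {
  const { object } = await generateObject({
    model: openai("gpt-4o"),
    schema: videoCompositionSchema,
    prompt: `Convert this video script into the structured format without changing any wording:\n\n${finalXmlMessage}`,
  });
  return object;
}
```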
With the final script in JSON, we can simply loop over the sections, grab the script text, and send it to HeyGen to make "me" speak it. In theory we could do this in one clip, and perhaps I'd save money on API credits and get a more authentic look to the "avatar", but I think that would complicate the process of stitching the video back together.
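A rough sketch of that loop is below; the payload is simplified, so check the exact field names against HeyGen's v2 API docs:

```ts
// Ask HeyGen to generate one avatar clip for a section of the script.
async function generateAvatarClip(scriptText: string) {
  const res = await fetch("https://api.heygen.com/v2/video/generate", {
    method: "POST",
    headers: {
      "X-Api-Key": process.env.HEYGEN_API_KEY!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      video_inputs: [
        {
          character: { type: "avatar", avatar_id: process.env.HEYGEN_AVATAR_ID },
          voice: { type: "text", input_text: scriptText, voice_id: process.env.HEYGEN_VOICE_ID },
        },
      ],
      dimension: { width: 1080, height: 1920 }, // vertical, for TikTok
    }),
  });

  const { data } = await res.json();
  return data.video_id as string; // poll HeyGen's status endpoint until the clip is rendered
}

// One clip per section of the parsed script.
export async function generateClips(sections: { script: string }[]) {
  return Promise.all(sections.map((s) => generateAvatarClip(s.script)));
}
```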
Below you can see how I visualize the clips, composition, and the final render.
There are a few things I do in the code that might be helpful. I caption the videos as they come in, which conveniently gets me the duration of each clip. When I set up the composition in Remotion, the duration of the composition is set to the sum of all the clips.
And once it's all combined with the code below, I render it all on AWS Lambda, mainly because Remotion made it easy.
Below is the basic code of the composition. The code is gnarly, but it's sort of a proof of concept that worked well enough to get me a video.
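A minimal sketch of that idea, assuming each clip is transcribed with OpenAI's whisper-1 (verbose_json returns word-level timestamps for the captions and the clip's duration in one call):

```ts
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

export async function captionClip(filePath: string) {
  const transcription = await client.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-1",
    response_format: "verbose_json",
    timestamp_granularities: ["word"],
  });

  return {
    durationInSeconds: transcription.duration, // used to size the Remotion composition
    words: transcription.words ?? [],          // word-level timings for on-screen captions
  };
}
```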
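In much-simplified form, its shape is roughly this:

```tsx
import React from "react";
import { AbsoluteFill, Composition, OffthreadVideo, Series, useVideoConfig } from "remotion";

type Clip = { src: string; durationInSeconds: number; caption: string };
type VideoProps = { clips: Clip[] };

const FPS = 30;

// Lay the HeyGen clips end to end, with a simple caption overlay on each.
const PromoVideo: React.FC<VideoProps> = ({ clips }) => {
  const { fps } = useVideoConfig();
  return (
    <AbsoluteFill style={{ backgroundColor: "black" }}>
      <Series>
        {clips.map((clip) => (
          <Series.Sequence
            key={clip.src}
            durationInFrames={Math.round(clip.durationInSeconds * fps)}
          >
            <OffthreadVideo src={clip.src} />
            <AbsoluteFill style={{ justifyContent: "flex-end", padding: 80 }}>
              <h1 style={{ color: "white" }}>{clip.caption}</h1>
            </AbsoluteFill>
          </Series.Sequence>
        ))}
      </Series>
    </AbsoluteFill>
  );
};

// The composition's duration is the sum of the clip durations.
export const RemotionRoot: React.FC = () => (
  <Composition
    id="PromoVideo"
    component={PromoVideo}
    width={1080}
    height={1920}
    fps={FPS}
    defaultProps={{ clips: [] as Clip[] }}
    calculateMetadata={({ props }) => ({
      durationInFrames: Math.max(
        1,
        Math.round(props.clips.reduce((sum, c) => sum + c.durationInSeconds, 0) * FPS),
      ),
    })}
  />
);
```

From there, the Lambda render is essentially a single renderMediaOnLambda() call from @remotion/lambda/client pointed at this composition.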
Finally, we have the render. It's time to post. I'm using Blotato to post to TikTok and X. It has a lot of features I've never really explored, but it was mainly a way to get API access for direct posting; it's sort of a workaround for the limitations the platforms set to stop projects like this one.
It works! But does it generate views? Likes? Shares? Stay tuned for part two.