How I Built a Fully-Automated Agent to Promote My Content on 𝕏 and TikTok
A deep dive into the architecture and tech stack of my content promotion agent
Apr 3, 2025
In 2025, everyone's talking about agents. But not a lot of people are shipping them.
This is the first post in a two-part series where I document a real-world agent I've built and now use to help promote my blog posts. This post is for engineers. It covers the architecture, tech stack, prompting strategies, and lessons I've learned. A future post will walk through the “why” and how it works in practice for content and distribution.
The way it works is simple: It reads a URL, writes a Twitter thread, makes a video, and posts both to 𝕏 and TikTok — automatically.
For example, to promote my last blog post I need to give it this:
In order to get this:
GPT-4o just launched native image generation—finally delivering on its "omni" promise 🎉 I've tested it extensively. Here are the most impressive (and surprising) results 👇
How this Agent works (and a Table of Contents)

When I trigger the agent (via text message or a web UI), I give it a URL and a reason I want to share it. From there, it:
- Reads the content on the page
- Creates a Twitter thread
- Creates a TikTok video
- Posts the thread and video
What it really does? It gives me an excuse to play with a bunch of tools I'd been itching to use. This was just a way to smash them all together into something real and useful.
- Runtime: Vercel/Next.js + Trigger.dev (for long-running workflows)
- AI APIs: OpenAI (gpt-4o, gpt-4.5-preview, o3-mini) and Google Gemini (gemini-2.0-flash) using ai-sdk
- AI clone generation: Heygen
- Captions / Speech-to-text: OpenAI Whisper
- Video rendering: Remotion.dev
- Posting to Social: Blotato (for easy direct posting to 𝕏 and TikTok)
- Knowledge base: My personal project "Brain" — a RAG system storing notes, blog posts, and entities
- Database: Supabase
Step 1: Understanding the Webpage

Our first task is to understand what the user is asking us to talk about. The goal is to extract the most relevant information and condense it, since we don't have unlimited context length in the models we use later.
Parsing HTML into an article object
AI makes this insanely simple. Gemini 2's massive context window means you can give it any website's HTML and get your JSON out in one API call.
Note: If you intend to scale this, you'd need to use a service like firecrawl.dev or Browserbase. Direct scraping like this is against the TOS of Trigger.dev.
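Simplified, the extraction step looks something like this with ai-sdk. The schema here is a placeholder with made-up field names; the real one carries more detail.

```ts
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

// Placeholder article schema — the real object has more fields.
const articleSchema = z.object({
  title: z.string(),
  summary: z.string(),
  keyPoints: z.array(z.string()),
  imageUrls: z.array(z.string()),
  mentionedEntities: z.array(z.string()),
});

export async function parseWebpage(url: string) {
  // Naive fetch of the raw HTML (see the note above about scraping services).
  const html = await fetch(url).then((res) => res.text());

  // Gemini's huge context window lets us hand it the whole page in one call.
  const { object: article } = await generateObject({
    model: google("gemini-2.0-flash"),
    schema: articleSchema,
    prompt: `Extract the article from this HTML. Keep the summary and key points faithful to the source.\n\n${html}`,
  });

  return article;
}
```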
While that works, the first challenge is limited contextual understanding. The resulting data has no understanding of any images it found, and it's missing key information on the people and companies mentioned in the post. It's obvious to us, but proper @mentions and the right visuals are key to creating engaging social content.
We solve this with two parallel tasks:
- Run any found images through GPT-4o with some context to get rich descriptions
- Extract people/companies and enrich with Clearbit
Describing Images and Entities found
Here's some simplified code for how I implement both in a new task whose purpose is gaining a full understanding of the webpage.
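In sketch form, both passes run in parallel: describeImages sends each image to GPT-4o with context from the article, and enrichWithClearbit stands in for the Clearbit lookup. The Article type and helper names are placeholders for this sketch.

```ts
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

type Article = { summary: string; imageUrls: string[]; mentionedEntities: string[] };

// Placeholder: wraps the Clearbit enrichment lookup to get handles/logos
// for each person or company mentioned in the post.
declare function enrichWithClearbit(
  name: string
): Promise<{ name: string; handle?: string; logoUrl?: string }>;

// Run both enrichment passes in parallel and attach the results to the article.
export async function enrichArticle(article: Article) {
  const [imageDescriptions, entities] = await Promise.all([
    describeImages(article.imageUrls, article.summary),
    Promise.all(article.mentionedEntities.map(enrichWithClearbit)),
  ]);
  return { ...article, imageDescriptions, entities };
}

async function describeImages(urls: string[], context: string) {
  return Promise.all(
    urls.map(async (url) => {
      const { text } = await generateText({
        model: openai("gpt-4o"),
        messages: [
          {
            role: "user",
            content: [
              {
                type: "text",
                text: `Describe this image in detail. It appears in an article about: ${context}`,
              },
              { type: "image", image: new URL(url) },
            ],
          },
        ],
      });
      return { url, description: text };
    })
  );
}
```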
Merge the data back to a single JSON object
All of this gets merged back into a single understandWebpageOutput object, which becomes the shared context for both the thread and video generation agents.
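Roughly, the shape is something like this (field names simplified for the sketch):

```ts
// Rough shape of the merged webpage-understanding object.
type UnderstandWebpageOutput = {
  url: string;
  title: string;
  summary: string;
  keyPoints: string[];
  images: { url: string; description: string }[];
  entities: { name: string; handle?: string; logoUrl?: string }[];
};
```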
Below is how I visualize this object in the Next.js app. You'll see in the task log that the AI has found and enriched two entities (OpenAI and Google) and a lot of images. Beside that, I show a render of the webpage content with the image descriptions overlaid on top. This is a simple way to visualize the data and make sure the AI is doing a good job.

We're finally in a good spot to start creating content about this page.
Right before we move to the next step, I integrate my Brain to ensure my opinions are carried into any content created.
Step 2: Creating the Twitter Thread

In the prior step we were "using AI", but this is the first "agentic" step. We're going to be implementing an evaluator–optimizer loop, where two agents are pitted against each other until one of them wins.
We're going to be pitting OpenAI's best creative/writing model, gpt-4.5, against OpenAI's smartest/most logical model, o3-mini. GPT-4.5 will write our thread and o3-mini will evaluate it. If the thread becomes good enough, it will be posted to 𝕏. If not, it will be sent back to the writer for more iterations.
Because the concept of the task is simple (it's a loop), I figure the more interesting part is the prompts.
Here's how it works:
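Stripped down, the loop itself is just this. writeThread and evaluateThread are placeholders for the real tasks described below.

```ts
// Simplified evaluator–optimizer loop. The helpers are placeholders:
// writeThread wraps the gpt-4.5-preview writer, evaluateThread wraps the o3-mini evaluator.
type ChatMessage = { role: "user" | "assistant"; content: string };
declare function writeThread(conversation: ChatMessage[]): Promise<string>;
declare function evaluateThread(draft: string): Promise<{ score: number; feedback: string }>;

const MAX_ITERATIONS = 5;
const SCORE_THRESHOLD = 0.8;

export async function generateThread(initialPrompt: string): Promise<string> {
  const conversation: ChatMessage[] = [{ role: "user", content: initialPrompt }];

  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const draft = await writeThread(conversation); // gpt-4.5-preview writes
    const review = await evaluateThread(draft);    // o3-mini evaluates

    if (review.score >= SCORE_THRESHOLD) return draft; // good enough: post it

    // Otherwise, add the feedback to the conversation as if the user wrote it
    // and loop back to the writer.
    conversation.push({ role: "assistant", content: draft });
    conversation.push({ role: "user", content: review.feedback });
  }

  throw new Error("Thread never hit the score threshold");
}
```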
The writer
The writer uses gpt-4.5-preview, which is more of a creative writing model. Most notably, given its release date, it is not a reasoning model, meaning it doesn't think before responding. So we use the now-old trick of giving it an ounce more intelligence by getting it to document a chain of thought before it responds.
Here's an example of the writer's first attempt.
The Evaluator
The evaluator uses o3-mini, which is a reasoning model. These models are quite good at logical reasoning, instruction adherence, and double-checking intuitions, and it does a great job of holding GPT-4.5's more creative mind to the instructions. As a baseless aside, I think it's very important that the evaluator is a different model than the writer. Models tend to like their own work, so the more different the models, the better.
The evaluator always writes positive and negative feedback followed by a numeric score. If that score is high enough we continue; if not, we loop back to the writer with the feedback added to the conversation log and keep working.
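In code, that output is constrained to a small structured shape, something like this (the field names are illustrative):

```ts
import { z } from "zod";

// The evaluator must always return its feedback and a 0–1 score.
const evaluationSchema = z.object({
  positiveFeedback: z.array(z.string()),
  negativeFeedback: z.array(z.string()),
  score: z.number().min(0).max(1),
});
```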
Here's an example of the evaluator's output:
You'll see how the score steadily improved from 0.2 to 0.8 over three iterations. If the score is below 0.8, I grab the JSON and add it back to the conversation history AS IF the user wrote it, urging the writer assistant to make the changes for me.
In total this took 6 minutes and 52 seconds, but I'm not fussed about the time. Once complete, the thread is posted to 𝕏 (see Step 4). It's also passed along to Step 3, so we don't waste the compute.
Step 3: Creating the Video

Now it's time to create a video. We'll use the thread as a basic outline for the video, as we need to ask the AI to do a lot in this next step. Not only do we want it to write a script, we need it to create a video composition object that we can use with Remotion to generate a video.
We're going to use another evaluator–optimizer loop. Again, the loop part is easy, so here are the system prompts and a template of the script that's reused between the writer and evaluator. That shared prompt contains the specific instructions we need to follow if we want the video to work once this loop is complete -- in a way, they describe the options available in the Remotion project.
To kick off the loop we'll send in information on the webpage understanding, the thread, and any brain data.
The Video Script Writer
Again, we'll use the gpt-4.5 model as the writer. It's a lot for this model to digest, but it's basically the only model that can do it while remaining creative.
There are definite improvements to be made here by breaking this into a bunch of smaller steps. For example, we could iterate on an outline. Then once we pass that, we could iterate on a video composition object. I'd put a lot of money on getting significantly better results if we did that. But I didn't get there, yet. Done is better than perfect, and I wanted to see the whole thing working first before any optimizing.
Instead, I just asked it to break out its chain of thought in hopes of giving it a bit more test-time compute.
Here's an example of what the script writer can do on its first try.
The Video Script Evaluator
I won't belabour the point here, as we follow the same idea as before: o3-mini as the evaluator, giving the models the ability to iterate toward something good, accurate, and complete.
What I'll note is that I've tried to increase the harshness of the evaluator, and further raised the threshold for the final score to 0.9, as I've noticed the evaluator gets increasingly lenient as the volume of data increases. You can see we start at 0.8.
In total, this took 8 minutes and 30 seconds to run. To move on to rendering, I parse the resulting script into JSON by taking that last XML message and using GPT-4o to parse it into a JSON object. The JSON schema for this exactly matches the video schema of our Remotion comp.
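That parsing step is one more structured-output call, against a schema that mirrors the composition's props. The schema below is simplified; the section fields and layout names are illustrative.

```ts
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Simplified — the real schema mirrors the props of the Remotion composition.
const videoSchema = z.object({
  title: z.string(),
  sections: z.array(
    z.object({
      script: z.string(),              // what the avatar says
      imageUrl: z.string().optional(), // screenshot / b-roll to show
      layout: z.enum(["avatar", "avatar-with-image", "image-only"]), // illustrative layouts
    })
  ),
});

export async function parseScriptToJson(finalScriptXml: string) {
  const { object } = await generateObject({
    model: openai("gpt-4o"),
    schema: videoSchema,
    prompt: `Convert this video script into the composition JSON, preserving every section:\n\n${finalScriptXml}`,
  });
  return object;
}
```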
Parsing to JSON and Rendering
With the final script in JSON, we can simply loop over the sections, grab the script text, and send it to Heygen to make "me" speak it. In theory we could do this in one clip, and perhaps I'd save money on API credits and get a more authentic look to the "avatar", but I think that would complicate the process of rejoining the video back together.
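In sketch form it's just a loop. createAvatarClip here is a placeholder wrapper around Heygen's video generation API (submit the script text for my avatar and voice, poll until the clip is ready, return the video URL).

```ts
// Placeholder wrapper around Heygen: submit text, poll, return the clip URL.
declare function createAvatarClip(input: {
  text: string;
  avatarId: string;
  voiceId: string;
}): Promise<string>;

type Section = { script: string };

// For each section of the script, generate an avatar clip of "me" reading it.
export async function generateClips(sections: Section[]) {
  const clips: { script: string; clipUrl: string }[] = [];
  for (const section of sections) {
    const clipUrl = await createAvatarClip({
      text: section.script,
      avatarId: process.env.HEYGEN_AVATAR_ID!,
      voiceId: process.env.HEYGEN_VOICE_ID!,
    });
    clips.push({ script: section.script, clipUrl });
  }
  return clips;
}
```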
Below you can see how I visualize the clips, composition, and the final render.

There are a few things I do in the code that might be helpful. I'm captioning the videos as they come in, and that conveniently gets me the duration of each video. When I set up the composition in Remotion, I set the duration of the composition to the sum of all the clips.
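Here's the rough idea, assuming the OpenAI Node SDK: Whisper's verbose_json response gives us segment timings for the captions and the clip duration in one call, and the composition duration is just the sum.

```ts
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();
const FPS = 30;

// Caption a clip with Whisper. verbose_json returns segment timings for the
// captions and, conveniently, the clip's duration in seconds.
export async function captionClip(clipPath: string) {
  const transcription = await client.audio.transcriptions.create({
    file: fs.createReadStream(clipPath),
    model: "whisper-1",
    response_format: "verbose_json",
  });

  return {
    segments: transcription.segments ?? [],
    durationInFrames: Math.round(Number(transcription.duration) * FPS),
  };
}

// The Remotion composition's total duration is the sum of the clips.
export const totalDurationInFrames = (clips: { durationInFrames: number }[]) =>
  clips.reduce((sum, clip) => sum + clip.durationInFrames, 0);
```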
And once it's all combined with the code below, we render it all on AWS Lambda. This is mainly because Remotion made it easy.
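Kicking off the render is a single call to Remotion Lambda. The function name, serve URL, and composition ID below are placeholders for this sketch.

```ts
import { renderMediaOnLambda } from "@remotion/lambda/client";

// Kick off the render on Remotion Lambda. serveUrl points at the deployed
// Remotion bundle; inputProps carries the captioned clips from above.
export async function renderOnLambda(
  clips: { clipUrl: string; durationInFrames: number }[]
) {
  const { renderId, bucketName } = await renderMediaOnLambda({
    region: "us-east-1",
    functionName: process.env.REMOTION_LAMBDA_FUNCTION!,
    serveUrl: process.env.REMOTION_SERVE_URL!,
    composition: "PromoVideo", // placeholder composition ID
    inputProps: { clips },
    codec: "h264",
  });
  return { renderId, bucketName };
}
```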
Below is the basic code of the composition. The code is gnarly but it's sort of a proof-of-concept that worked well enough to get me a video.
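Stripped way down (no captions, no per-section layouts), the idea is a Series that plays each clip back to back:

```tsx
import React from "react";
import { AbsoluteFill, OffthreadVideo, Series } from "remotion";

type Clip = { clipUrl: string; durationInFrames: number };

// Proof-of-concept composition: play the Heygen clips back to back.
// Caption overlays and section layouts are omitted here.
export const PromoVideo: React.FC<{ clips: Clip[] }> = ({ clips }) => (
  <AbsoluteFill style={{ backgroundColor: "black" }}>
    <Series>
      {clips.map((clip, i) => (
        <Series.Sequence key={i} durationInFrames={clip.durationInFrames}>
          <OffthreadVideo src={clip.clipUrl} />
        </Series.Sequence>
      ))}
    </Series>
  </AbsoluteFill>
);
```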
Step 4: Rendering & Posting the Video

Finally, we have the render, and it's time to post. I'm using Blotato to post to TikTok and 𝕏. It has a lot of features I've never really checked out, but it was mainly a way to get API access for direct posting. It's sort of a workaround for the limitations the platforms set to stop projects like this one.
Learnings
- Evaluator-Optimizer agents can be really effective at increasing quality of work and adherence to instructions.
- The more I've played with these ideas, the more I've learned to just let the AI do its thing. At first I tried to overly constrain the video composition to make sure it "would work". But I found just creating a basic format that could be used and reused let the AI do what I ultimately wanted, and for less work on my end.
- Remotion is the real hero here.
- AI avatars and voice cloning aren't here yet. They're close, but they didn't fool my 6-year-old. ByteDance's OmniHuman-1 feels like one of the missing pieces. I'm sure it's less than 6 months before there is API access to that.
- AI has a really limited ability to understand media outside of its domain. LLMs work great for text, but crossing the barrier into video/image is tough -- they're mostly flying blind. Things are changing, but it'll feel like a long time before we get visual logic.
- This is a really digestible way to consume content. I've found myself testing it on articles I want to read later, but I know I won't get around to. It's a much more fun way to consume content.
- It's hard to describe, but the zeitgeist of the medium a viewer is using matters just as much as the content. The models are 100% unaware of the state of today's TikTok/𝕏. They lack understanding of the memes of today, the trends that are charting, and the nuance/meta that we're all actually interested in.
- To improve: I'd want to really burn compute on a) coming up with an angle for the video, b) generating a more robust script/composition, and c) once the videos come back, adjusting the timing of the composition to match the pace of the speaker. These are all very simple things to do.
- It's a future of media. Probably not the future.
Results?
It works! But does it generate views? Likes? Shares? Stay tuned for part two.