RAG Ingestion, Rebuilt (and Simpler Than It Sounds)
A 2026 update on how my Brain ingests content into Neon/Postgres with pgvector, Drizzle, Mastra, and OpenAI embeddings.
Jan 22, 2026
Back in Jan 2025 I shipped my first write‑up on Building The Brain. It was a good v1. This is the v2 update: one monorepo, cleaner ingestion, and hybrid search (vector + filters + text constraints). It sounds fancy, but the programming is simple — I want to show the exact moving parts so you can copy them.
The old system had two issues: speed (blog → Brain API → DB) and ownership (Supabase felt opaque + pricey for a hobby project). So I pulled everything into a single Turborepo, switched to Neon for scale‑to‑zero Postgres, and built a shared tRPC router so the blog can call the Brain directly. Same logic, one hop, more control.
The Brain is a shared tRPC router (@acme/api). I use the protected procedure for raw RAG results and keep a redacted public route. The interesting part is the input contract, so I inline it below.
export const brainRouter = createTRPCRouter({
  ragSearchRaw: protectedProcedure
    .input(
      // inlined `ragSearchInputZod` for demonstration
      z.object({
        query: z.string().describe('User query text'),
        topK: z.number().int().positive().optional(),
        minScore: z.number().optional(),
        minContentLength: z.number().int().nonnegative().optional(),
        filter: z
          .object({
            sources: z.array(ragSearchSourceZod).optional(),
            publishedAt: dateRangeZod.optional(),
            updatedAt: dateRangeZod.optional(),
            createdAt: dateRangeZod.optional(),
            tags: z
              .object({
                any: z.array(z.string()).optional(),
                all: z.array(z.string()).optional(),
              })
              .optional(),
          })
          .optional(),
        context: z
          .object({
            before: z.number().int().nonnegative().optional(),
            after: z.number().int().nonnegative().optional(),
          })
          .optional(),
        rewrite: z
          .object({
            enabled: z.boolean().optional(),
            promptContext: z.string().optional(),
            strategy: z.enum(['original+rewrite', 'rewrite-only']).optional(),
          })
          .optional(),
        rerank: z
          .object({
            enabled: z.boolean().optional(),
            promptContext: z.string().optional(),
            candidates: z.number().int().positive().optional(),
          })
          .optional(),
      }),
    )
    .output(ragSearchResponseZod)
    .query(({ ctx, input }) => queryRagSearch({ db: ctx.db, input })),
});
Calling it from the blog side looks like this:
const results = await brainCaller.ragSearchRaw({
  query: 'What changed in the Brain ingestion pipeline?',
  topK: 12, // widen recall
  minScore: 0.15, // drop weak matches
  minContentLength: 120, // ignore tiny chunks
  filter: {
    sources: ['posts'],
    publishedAt: { from: '2025-01-01' },
    tags: { any: ['RAG', 'Postgres'] },
  },
  context: { before: 1, after: 1 },
  rewrite: { enabled: true, strategy: 'original+rewrite' },
  rerank: { enabled: true, candidates: 40 },
});
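The redacted public route isn't shown in the post. As a rough sketch of its shape (publicProcedure, ragSearchPublicInputZod, and redactRagResults are placeholder names here, not the repo's):
// Sketch only: lives next to ragSearchRaw inside createTRPCRouter({ ... }).
ragSearchPublic: publicProcedure
  .input(ragSearchPublicInputZod)
  .query(async ({ ctx, input }) => {
    const raw = await queryRagSearch({ db: ctx.db, input });
    // Strip scores, internal metadata, and anything else not meant for anonymous callers.
    return redactRagResults(raw);
  }),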
On the storage side: Neon for scale-to-zero Postgres, and Drizzle so the schema lives in code with types (including the JSONB metadata). The layout borrows Mastra's docs/chunk model: battle-tested structure instead of inventing my own.
export const RagDocument = pgTable('rag_document', (t) => ({
  path: t.text().notNull().primaryKey(),
  checksum: t.text().notNull(),
  metadata: t.jsonb().$type<RagDocumentMetadata>(),
}));

export const RagChunk = pgTable(
  'rag_chunk',
  (t) => ({
    id: t.text().notNull().primaryKey(),
    documentPath: t
      .text()
      .notNull()
      .references(() => RagDocument.path, { onDelete: 'cascade' }),
    chunkIndex: t.integer().notNull(),
    text: t.text().notNull(),
    embedding: vector('embedding', { dimensions: 3072 }).notNull(),
    metadata: t.jsonb().$type<RagChunkMetadata>(),
  }),
  (table) => ({
    documentPathIdx: index('rag_chunk_document_path_idx').on(table.documentPath),
    embeddingIdx: index('rag_chunk_embedding_ivfflat_idx').using(
      'ivfflat',
      sql`(embedding::halfvec(3072)) halfvec_ip_ops`,
    ),
  }),
);
The JSONB metadata is Zod‑first, and input validators are shared so nothing can drift:
export const ragDocumentMetadataZod = z.object({
  version: z.literal(1),
  title: z.string().optional(),
  frontmatter: z.union([postFrontmatterZod, noteFrontmatterZod]).nullable().optional(),
  createdAt: z.string(),
  updatedAt: z.string(),
});

export const ragChunkMetadataZod = z.object({
  version: z.literal(1),
  tokenCount: z.number().optional(),
  sectionTitle: z.string().optional(),
  sectionDepth: z.number().optional(),
  summary: z.string().optional(),
  keywords: z.array(z.string()).optional(),
  questions: z.array(z.string()).optional(),
});
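The RagDocumentMetadata and RagChunkMetadata types used in the Drizzle columns are presumably just inferred from these validators, along the lines of:
// Inferred types keep the JSONB columns and the Zod validators in lockstep.
export type RagDocumentMetadata = z.infer<typeof ragDocumentMetadataZod>;
export type RagChunkMetadata = z.infer<typeof ragChunkMetadataZod>;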
Notes on vectors: 3072 dims matches OpenAI's text-embedding-3-large. The half-precision (halfvec) cast is also what makes the ivfflat index possible at that width, since pgvector caps full-precision vector indexes at 2,000 dimensions, and in practice it's faster without meaningfully harming semantic recall.
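The embedding call itself isn't shown in the post. A minimal sketch of that step, assuming the Vercel AI SDK (the rerank code below already uses it) and its embedMany helper; chunks stands in for whatever the chunker produced:
import { openai } from '@ai-sdk/openai';
import { embedMany } from 'ai';

// Embed all chunk texts for a document in one batch.
const { embeddings } = await embedMany({
  model: openai.embedding('text-embedding-3-large'), // 3072-dimensional vectors
  values: chunks.map((chunk) => chunk.text),
});
// embeddings[i] lines up with chunks[i] and becomes RagChunk.embedding.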
I used to run GitHub Actions + custom scripts to chunk markdown. It worked, but it put all the cleverness on me. I’d rather use battle‑tested cleverness, so I adopted Mastra for chunking + extraction. I still run ingestion in GitHub Actions; it only runs when content changes:
# (Checkout + setup steps omitted for brevity.)
- name: Detect content changes
  id: changes
  shell: bash
  run: |
    # For push events, use github.event.before as the base commit SHA.
    before="${{ github.event.before }}"
    if [ -z "$before" ] || [ "$before" = "0000000000000000000000000000000000000000" ]; then
      before="$(git rev-parse HEAD~1 || git rev-parse HEAD)"
    fi

    files="$(git diff --name-only "$before" "${{ github.sha }}")"
    printf "%s\n" "$files" > /tmp/changed_files.txt

    brain_changed=false

    if grep -Eq '^packages/content/(posts|notes|knowledge)/' /tmp/changed_files.txt; then
      brain_changed=true
    fi

    echo "brain_changed=$brain_changed" >> "$GITHUB_OUTPUT"
- name: RAG ingest (full)
  if: steps.changes.outputs.brain_changed == 'true'
  # (Add DATABASE_URL + OPENAI_API_KEY env vars in your workflow.)
  run: pnpm -F @acme/rag ingest
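Inside that ingest script, each markdown file gets wrapped in a Mastra MDocument before chunking (the same constructor the token fallback below uses). A minimal sketch; the metadata keys are illustrative, not copied from the repo:
import { MDocument } from '@mastra/rag';

// Wrap the raw markdown plus whatever metadata should travel with its chunks.
const doc = MDocument.fromMarkdown(file.content, {
  path: file.relativePath,
  source: file.source, // e.g. 'posts' | 'notes' | 'knowledge'
});
The chunk call then does the interesting work: semantic-markdown splitting plus LLM extraction of a summary, keywords, and questions for every chunk.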
const chunkResult = await doc.chunk({
  strategy: 'semantic-markdown',
  joinThreshold: CHUNK_JOIN_THRESHOLD,
  maxSize: MAX_CHUNK_TOKENS,
  extract: {
    summary: {
      llm: EXTRACT_LLM,
      promptTemplate:
        "Write a summary of the following passage. Use only the information provided within passage, and try to include as many key details as possible while staying within a character limit.\n\n<passage>{context}</passage>\n\nRespond only with the summary. Do not exceed your limit. The first character of your response must be the start of the summary. End your turn immediately once you're finished. Do not add anything other than the summary.\n\nYour character limit is `clamp(140, round(passage.length * 0.2), 420)`, or put another way, approximately 20% of the passage length, with a minimum of 140 characters and a maximum of 420 characters. NEVER exceed this limit.",
    },
    keywords: {
      llm: EXTRACT_LLM,
      keywords: 6,
      promptTemplate:
        'Given the PASSAGE below, extract up to {maxKeywords} concise keywords or noun phrases from the passage. Avoid stopwords.\nEach keyword must appear verbatim in the passage.\nIgnore metadata lines if present (for example, key-value labels like "path:", "source:", or "tokenCount:") and focus only on human-readable content.\nExclude file paths, URLs, markdown/HTML/XML tags, and labels unless they appear verbatim in the passage.\n\n<passage>{context}</passage>\n\nThese keywords will be used for database metadata filtering and sorting. Only include keywords that are useful for search. If the passage has no unique words worth indexing, respond with "KEYWORDS:" and nothing else.\nProvide keywords in the following comma-separated format: \'KEYWORDS: <keywords>\'',
    },
    questions: {
      llm: EXTRACT_LLM,
      questions: 2,
      promptTemplate:
        'Given the PASSAGE below, generate {numQuestions} questions this passage can answer.\nIgnore metadata lines if present (for example, key-value labels like "path:", "source:", or "tokenCount:") and focus only on human-readable content.\nDo not ask about file paths, metadata, or formatting/markup unless they appear verbatim in the passage.\n\n<passage>{context}</passage>\n\nThese questions will be used to refine embeddings for a vector search engine. Write questions a user might ask that can be answered directly from the passage. If the passage has no useful questions, respond with "QUESTIONS:" and nothing else.\nProvide questions in the following format: \'QUESTIONS: <questions>\'',
    },
  },
});
Some semantic chunks still come out larger than the embedding model can take, so there's a token-strategy fallback:
if (tokenCount > maxTokens) {
  // fallback: re-chunk oversized semantic chunks with a token strategy
  const doc = MDocument.fromMarkdown(chunk.text, passthrough);
  const chunkResult = await doc.chunk({
    strategy: 'token',
    maxSize: maxTokens,
    addStartIndex: false,
    modelName: EMBEDDING_MODEL,
  });
  // ...emit smaller subchunks so we never exceed embedding limits
}
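Per-document bookkeeping is checksum-based, so unchanged files cost nothing to re-run. The checksum computation isn't shown in the post; a typical approach (and an assumption on my part) is hashing the raw markdown:
import { createHash } from 'node:crypto';

// Assumption: any stable content hash works; sha-256 of the raw file is typical.
const checksum = createHash('sha256').update(file.content).digest('hex');
With that, ingestion skips documents whose checksum and chunks are already in place, and otherwise replaces the document's chunks wholesale: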
if (!force && existingEntry?.checksum === checksum && hasChunks) {
  skippedDocuments += 1;
  return;
}

await db.delete(RagChunk).where(eq(RagChunk.documentPath, file.relativePath));
await db.insert(RagChunk).values(batch);
In v1, everything was vector‑only. In v2, I can filter by source, tags, and dates, enforce minimum content length, and optionally pull adjacent chunks.
const conditions = [sql`char_length(${RagChunk.text}) >= ${minContentLength}`];
if (filter?.publishedAt?.from) {
  conditions.push(
    sql`(${RagDocument.metadata}->'frontmatter'->>'date')::date >= ${filter.publishedAt.from}::date`,
  );
}
if (filter?.tags?.any?.length) {
  conditions.push(sql`(${RagDocument.metadata}->'frontmatter'->'tags') ?| ${filter.tags.any}`);
}
Ordering is a half-precision inner-product distance (<#>) against the embedded query:
const rows = (await filteredQuery.orderBy(
  sql`${RagChunk.embedding}::halfvec(${EMBEDDING_DIMENSIONS_SQL}) <#> ${queryHalf}`,
)) as RagQueryRow[];
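The adjacent-chunk expansion (context.before / context.after) isn't shown above. Given the schema, a sketch of one way to pull neighbors by chunkIndex for each hit:
import { and, asc, between, eq } from 'drizzle-orm';

// Sketch: widen each hit with its neighboring chunks from the same document.
const neighbors = await db
  .select()
  .from(RagChunk)
  .where(
    and(
      eq(RagChunk.documentPath, hit.documentPath),
      between(RagChunk.chunkIndex, hit.chunkIndex - before, hit.chunkIndex + after),
    ),
  )
  .orderBy(asc(RagChunk.chunkIndex));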
Rewrite is now optional. If you want speed, turn it off. If you want recall, turn it on. Rerank is where the one‑token trick shines (from the Zep team):
- Ask a yes/no question: “Is this passage relevant to the query?”
- Constrain output via logit bias to only true or false.
- Use the first token’s logprobs as a confidence score.
In code, that's provider options plus a constrained generation call:
const providerOptions = {
  openai: {
    logprobs: true,
    topLogprobs: 2,
    logitBias: RERANK_LOGIT_BIAS, // bias for " True" and " False" tokens
  },
};

const result = await generateText({
  model: openai.chat(RERANK_MODEL),
  system,
  prompt,
  temperature: 0,
  topP: 1,
  maxOutputTokens: RERANK_MAX_TOKENS,
  providerOptions,
});
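Turning the logprobs into p(true) is just exponentiating the logprob of the first generated token (or the true entry among its top logprobs). Where exactly the logprobs surface depends on the AI SDK and provider versions, so treat the access path in this sketch as an assumption:
// Assumption: the OpenAI provider surfaces logprobs under providerMetadata.openai,
// shaped roughly like [{ token, logprob, topLogprobs: [{ token, logprob }] }].
type TokenLogprob = { token: string; logprob: number; topLogprobs?: { token: string; logprob: number }[] };
const logprobs = (result.providerMetadata?.openai as { logprobs?: TokenLogprob[] })?.logprobs ?? [];
const first = logprobs[0];
const trueEntry = first?.topLogprobs?.find((t) => t.token.trim().toLowerCase() === 'true');
// logprob -> probability; if "true" never shows up, confidence is effectively zero.
const pTrue = trueEntry ? Math.exp(trueEntry.logprob) : 0;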
If p(true) is high, the chunk is relevant. If it’s low, it’s noise. That gives you a lightweight ranker without a heavy model.
This is still just a small set of simple parts: content → Mastra chunks → OpenAI embeddings → Neon/pgvector → Drizzle + tRPC → optional rewrite/rerank. It’s a lot of buzzwords, but the actual code is boring — and that’s the point. You can build something like this in a weekend and iterate with real usage. If you want to dig deeper, the repo is the source of truth.