In late 2023, I built Ibis Birdbrain – a “portable Python ML-powered data bot” under the Ibis project. It was an early attempt at what we’d now call agentic AI: language model tool use in a loop, applied to data work. It didn’t work. But the ideas were right.

what it was #

Ibis Birdbrain was an LLM-powered bot that could take natural language questions about your data and:

  1. convert text to SQL via Marvin (a Python LLM toolkit by Prefect)
  2. execute that SQL against any of Ibis’s 20+ backends (DuckDB, Postgres, BigQuery, etc.)
  3. if execution failed, feed the error back to the LLM and retry (up to 3 times)
  4. return the result as an Ibis table
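In today's terms, that loop is the whole agent. A minimal sketch of the four steps above, where `text_to_sql` and `execute` are hypothetical callables standing in for the Marvin-backed LLM call and Ibis execution:

```python
# Sketch of the birdbrain loop: text -> SQL -> execute -> retry on error.
# `text_to_sql` and `execute` are stand-ins, not the real birdbrain API.
def answer(question, text_to_sql, execute, max_retries=3):
    """Generate SQL, run it, and feed any error back for another attempt."""
    error = None
    for _ in range(1 + max_retries):
        sql = text_to_sql(question, error=error)
        try:
            return execute(sql)  # in birdbrain, an Ibis table came back here
        except Exception as exc:
            error = str(exc)  # the error becomes context for the next attempt
    raise RuntimeError(f"no working SQL after {max_retries} retries")
```

The key move is that the exception text goes back into the prompt, so the model sees exactly what the database rejected.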

It had a Streamlit GUI, a CLI, and could be used as a Python library. The whole thing was built on top of Ibis for data portability and Marvin for LLM calls (which at the time only supported OpenAI).

Worth noting: I started with LangChain before making any commits (so it never appears in the git history), got frustrated with the abstraction overhead, ripped it out, started calling the OpenAI API directly, then discovered Marvin and found its @marvin.fn decorator ergonomic enough to build on.

the timeline #

The repo tells a story through its PRs. Three major phases, two abandoned rewrites, a contributor who showed up late and did real work, and a final README update that reads like an epitaph.

v1: the kitchen sink (Sep-Oct 2023) #

The first version was built fast – dozens of “qs” (quick save) commits across September and October 2023. The architecture was sprawling:

  • Bots (bots/birdbrain.py, bots/tpch3000.py) – persona-specific bot configurations
  • Systems (systems/default.py, systems/tpch3000.py, systems/old.py) – system-level orchestration, three versions already
  • States (states/default.py, states/tpch3000.py) – state management per bot type
  • Tools (tools/eda.py, tools/code.py, tools/docs.py, tools/filesystem.py, tools/github.py, tools/internet.py, tools/text.py, tools/advanced.py) – eight separate tool modules
  • Functions (functions/code.py, functions/docs.py) – LLM function wrappers
  • Classifiers (classifiers/code.py) – Marvin’s logit-bias trick for routing decisions
  • Models (models/text.py) – Pydantic models for structured LLM output

Plus a full Reflex (React-based Python) GUI in app/, a Streamlit app in examples/, example data projects for IMDB and penguins, and Dagster data pipelines. It was trying to be everything at once – a chatbot, a data pipeline tool, a filesystem operator, a GitHub searcher, a documentation crawler.
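The classifiers piece deserves a note: the logit-bias trick constrains the model to answer with a single token, one per class, which turns routing into picking an index. A dependency-free sketch of the idea (not Marvin's implementation; `complete` stands in for the LLM call, and with the OpenAI API the single-token answer is enforced via `logit_bias` plus `max_tokens=1`):

```python
# Sketch of logit-bias classification: present numbered classes, force the
# model to answer with one index, map the index back to the class.
def classify(question, classes, complete):
    labeled = "\n".join(f"{i}: {c}" for i, c in enumerate(classes))
    prompt = (
        "Classify the question into exactly one class.\n"
        f"{labeled}\n"
        f"Question: {question}\n"
        "Answer with the index only:"
    )
    return classes[int(complete(prompt).strip())]
```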

PR #21 (Oct 6, 2023) added the TPC-H demo bot, which was 102 lines of custom tools just for that one benchmark dataset. Every demo required its own bot/system/state/tools stack.

One commit message from this stretch is just a link to an Arrested Development “I feel like a fucking idiot” GIF. That’s it. That’s the commit.

first refactor attempt: PR #24 (Oct 18, 2023) – closed #

PR #24 “minor refactor” was anything but minor. It deleted 2,322 lines and added 536. It tried to flatten the entire architecture: kill the bots/systems/states/tools hierarchy and consolidate into bot.py, functions.py, models.py, platforms.py, and systems.py. It also nuked the Reflex GUI, all examples, and all Dagster pipeline code.

No PR description. Closed without merging. The attempt to simplify was right, but it was too much at once.

refactor attempt 2: PR #25 (Nov 26, 2023) – merged #

This was the real rewrite. PR #25 deleted 2,353 lines and added 1,454. The description said it all:

OOP! model bot communication as messages w/ attachments. everything can be encoded/decoded as Ibis tables (for storing/querying later), with some measure of reproducibility

This is where the email metaphor was born. The PR introduced the core abstractions that would survive to the final version:

  • Messages and Attachments as the universal communication protocol
  • Bot as the top-level orchestrator with message history
  • Typed attachments: TableAttachment, DatabaseAttachment, SQLAttachment, ErrorAttachment, ChartAttachment, WebpageAttachment
  • Subsystems (eda.py, code.py, learn.py) replacing the old tools/systems split
  • Tasks as Message -> Message transforms
  • encode()/decode() stubs on everything (for future Ibis table serialization – never implemented)
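In rough dataclass form, the surviving abstractions look like this (a sketch: field names beyond those named above are assumptions, and the real classes carried to/from addresses, subjects, and timestamps):

```python
# Sketch of the PR #25 abstractions: typed attachments on messages, and
# tasks as Message -> Message transforms. Fields here are illustrative.
from dataclasses import dataclass, field

@dataclass
class Attachment:
    name: str

@dataclass
class SQLAttachment(Attachment):
    sql: str = ""
    dialect: str = "duckdb"

@dataclass
class ErrorAttachment(Attachment):
    error: str = ""

@dataclass
class Message:
    body: str
    attachments: list[Attachment] = field(default_factory=list)

def fix_sql_task(message: Message) -> Message:
    """A Task is a Message -> Message transform; this one only inspects
    attachments (no LLM call in the sketch)."""
    errors = [a for a in message.attachments if isinstance(a, ErrorAttachment)]
    return Message(body=f"{len(errors)} error(s) to fix",
                   attachments=message.attachments)
```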

It deleted the Reflex GUI, all examples, all bot personas, all old tools. Clean slate with the new abstractions. The TODO list in the PR was honest about what was missing: “finish up the call decision tree”, “implement tasks (steal from old code)”, “tuning, testing, etc.”

the system diagram: PR #29 (Nov 26, 2023) – merged #

Same day as the refactor, PR #29 added system.excalidraw. The description: “it’s not great but better than nothing, update later.” Here’s the rendered diagram from PR #29:

Ibis Birdbrain system diagram

The Excalidraw file was deleted 10 days later in a “reset” commit, but the architecture it describes — Bot dispatching to Flows which decompose into Tasks, all communicating via Messages with typed Attachments — survived into the final version.

prompt engineering: PRs #30-31 (Nov 26, 2023) – merged #

Two quick PRs on the same day, adjusting the system prompt strings. PR #30 “pre-prompt engineering” and PR #31 “a little more prompt engineering.” Small changes to strings.py. This is the part of LLM app development that felt like the whole game in 2023 – carefully wordsmithing the system prompt because that’s all you had.

subsystems and tasks: PR #32 (Dec 5, 2023) – closed #

This was supposed to be the big capability expansion. PR #32 “implement subsystems and tasks” added 420 lines and tried to close issues #19 (documentation search), #20 (write Ibis code), #17 (plot data via plotly), and #14 (GitHub search) all at once.

It added new task types: TextToSQLTask, ExecuteSQLTask (later SearchTextTask too), plus DocsTask and PythonTask. It introduced a docs.py subsystem alongside the existing eda.py and code.py. The idea was that the bot would classify the user’s intent and route to the right subsystem, which would run the right tasks.

Closed without merging. The subsystem routing never worked well enough. The classification step – “is this a data question, a docs question, or a code question?” – was brittle with 2023-era models. The PR body was just “closes #19 #20 #17 #14” – four issues, zero delivered.

evaluating frameworks: issue #36 (Dec 5, 2023) #

At this point I stepped back and asked: should I even be building this myself? Issue #36 “evaluate other OSS frameworks for messages and whatnot” listed AutoGen, AutoGPT, and others. The conclusion was to evaluate them “as options to simplify Ibis Birdbrain, allowing for a thin wrapper that provides data platform access via Ibis.” Nothing came of it. The Marvin dependency was already causing pain (issue #35: it was locked to Marvin 1.3.0 and OpenAI 0.28.1, and upgrading either broke everything).

the AGI-or-bust rewrite: PR #37 (Feb 13, 2024) – closed #

Two months of silence, then PR #37: “final attempt at ibis-birdbrain (agi or bust).”

The description is remarkably self-aware:

We are going to think step-by-step. The intention of this PR is to build up a useful, portable ML-powered data bot using Ibis and language model calls. I will take code from previous iterations, with the intent to have a (relatively) clean commit history to build up to a useful bot.

It listed three goals: text-to-SQL, documentation querying, Python code generation. It added 746 lines and deleted 0 – a clean-room rewrite on a branch. New utils/ directory for helpers. Same Message/Attachment abstractions from PR #25 but rebuilt from scratch.

Closed without merging. The branch was superseded the same day by PR #38.

v3: the one that shipped: PR #38 (Feb 13, 2024) – merged #

PR #38 “v3 or something: get a working demo of ibis-birdbrain back in place” had the exact same 746 additions as PR #37. Same files, same code. The description was pragmatic:

see untitled.ipynb for usage. all code in one giant bot.py file for now, refactor later. as of this PR, just writes SQL code based on the data and returns it as an attachment.

This was the pivot from ambition to pragmatism. Stop trying to build AGI. Get a demo working. The TODO list: “executing that SQL code, better Message/Attachment ergonomics, more robust SQL/EDA flow, a ton more.”

PR #39 “whoops” followed hours later with 855 additions – mostly filling out bot.py (218 new lines) and adding untitled.ipynb (618 lines of demo notebook). Then a flurry of commits: “get the bot running SQL code again”, “ensure english prose”, “fix slowness”, “minor updates.”

major updates: PR #41 (Mar 4, 2024) – merged #

This was the last big architectural change. PR #41 added 3,222 lines and deleted 1,131. It introduced the Flows/Tasks hierarchy that would be the final architecture:

Messages: bots/people communicating. Attachments: objects attached to messages. Flows: things the bot can do. Tasks: things a flow can do.

The key insight: a Task is Message -> Message. A Flow is Messages -> Messages. Only DataFlow was implemented, running a fixed pipeline: text-to-SQL -> execute -> retry on error. It also added the Excalidraw architecture diagram and a Quarto presentation.
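Read as plain callables, the whole hierarchy fits in a few lines (a sketch under those types; the real classes carried attachments and message history, and messages here are just strings):

```python
# Sketch of the PR #41 shape: a Task is Message -> Message, a Flow is
# Messages -> Messages, running its tasks in a fixed order.
def make_flow(tasks):
    """Apply tasks in order to the latest message, append the result."""
    def flow(messages):
        msg = messages[-1]
        for task in tasks:
            msg = task(msg)
        return messages + [msg]
    return flow
```

DataFlow was one instance of this shape, with the text-to-SQL and execute tasks (plus the fix-on-error loop) as its fixed pipeline.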

The description was honest: “need to think through configuration options and…a bunch of stuff. many TODOs and hacky code.”

Jiting’s contributions: PRs #58-60 (Mar-Apr 2024) #

After v3 stabilized, Jiting Xu contributed the most substantial feature additions:

  • PR #58 (Mar 22): transpile_sql via SQLGlot and execute_last_sql to re-run queries on different backends. Renamed “language” to “dialect” on code attachments – a small but correct conceptual fix
  • PR #59 (Mar 23): get_attachment_by_type helper for cleaner attachment retrieval
  • PR #60 (Apr 2): text-to-SQL caching via Levenshtein distance search. If a question was >75% similar to a cached question, skip the LLM call and reuse the SQL. Used an Ibis UDF wrapping the Levenshtein Python package
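The caching idea is simple enough to sketch without the Ibis UDF: compare the incoming question to every cached question and reuse the stored SQL on a close-enough match. The pure-Python ratio below stands in for the Levenshtein package:

```python
# Sketch of the PR #60 cache: skip the LLM call when a cached question is
# >75% similar (1 - edit_distance / max_length) to the new one.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def cached_sql(question: str, cache: dict[str, str], threshold: float = 0.75):
    """Return cached SQL on a similar-enough question, else None."""
    best = max(cache, key=lambda q: similarity(q, question), default=None)
    if best is not None and similarity(best, question) > threshold:
        return cache[best]  # cache hit: no LLM call needed
    return None
```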

Jiting also opened PR #61 (SQLGlot syntax checking) and PR #62 (LLM-generated natural language responses), but both were closed when the project was abandoned in September 2024.

the end: PRs #61-63 (Sep 2024) #

On September 27, 2024, PR #63 updated the README to:

Highly experimental. Effectively abandoned. Made for cool demos. If anyone’s interested in forking this or understanding the learnings, reach out!

Two days later, all remaining open issues (12 of them) and PRs (#61, #62) were batch-closed. The issues closed include: conversational mode (#45), text-to-SQL cache (#46), filesystem operations (#47), SQL formatting (#48), RAG database (#49), visualization (#50), internet search (#51), feedback flow (#52), and docs (#44). Each one represented a capability that was planned but never built.

the architecture (final state) #

graph TD
    User([User]) -->|text + objects| Bot
    Bot -->|selects| Flow
    Flow -->|runs| Tasks

    subgraph Bot
        Messages[(Messages)]
        Attachments[(Attachments)]
    end

    subgraph Flow["DataFlow"]
        T4[SearchCachedTask]
        T1[TextToSQLTask]
        T2[ExecuteSQLTask]
        T3[FixSQLTask]
    end

    T4 -->|cache hit?| T2
    T4 -->|cache miss| T1
    T1 -->|SQL| T2
    T2 -->|error| T3
    T3 -->|fixed SQL| T2
    T2 -->|result table| Bot
    Bot -->|response| User

The final codebase had these abstractions:

  • Bot: top-level orchestrator. holds a database connection (ibis.BaseBackend), message history, and flows. the __call__ method takes text + optional Python objects, routes through a flow, returns a response Message
  • Messages / Email: wrappers around text with metadata (to/from addresses, subject, timestamps). the bot, flows, and tasks all communicate through these. Email is the only Message subclass – everything is modeled as email
  • Attachments: typed Python objects attached to messages – TableAttachment (Ibis table), SQLAttachment (SQL string + dialect), DatabaseAttachment (Ibis connection), ErrorAttachment (error string), WebpageAttachment (URL + content), ChartAttachment (plotly figure, never used)
  • Flows: Messages -> Messages. only DataFlow was ever implemented. Flows.select_flow() literally returns the only flow or raises NotImplementedError
  • Tasks: Message -> Message. the real workhorses: TextToSQLTask (Marvin @fn for text-to-SQL), ExecuteSQLTask (run SQL via Ibis), FixSQLTask (Marvin @fn for error correction), SearchTextTask (Levenshtein cache lookup)

The text-to-SQL task used Marvin’s @marvin.fn decorator – write a Python function with just a docstring and the LLM implements it:

@staticmethod
@marvin.fn
def _text_to_sql(
    text: str, tables: Attachments, data_description: str, dialect: str
) -> str:
    """
    Generates correct, simple, and human-readable SQL based on the input
    `text`, `tables`, and `data_description`, returns a SQL SELECT statement
    in the `dialect`.
    """

The docstring was the prompt. Marvin handled the LLM call and structured output.
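The pattern generalizes: inspect the signature, build a prompt from the docstring plus the bound arguments, send it to the model. A dependency-free sketch of the idea (not Marvin's actual implementation; `complete` stands in for the LLM call, and real versions also parse the completion into the annotated return type):

```python
# Sketch of the "docstring is the prompt" pattern behind @marvin.fn.
import inspect

def llm_fn(complete):
    """Decorator factory: `complete` is any prompt -> text callable."""
    def decorate(fn):
        sig = inspect.signature(fn)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            # the docstring plus the bound arguments become the prompt
            prompt = (fn.__doc__ or "") + "\n" + "\n".join(
                f"{name} = {value!r}" for name, value in bound.arguments.items()
            )
            return complete(prompt)
        return wrapper
    return decorate
```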

what it could actually do #

At its best (final state, v3 + Jiting’s contributions), the bot could:

  1. Text-to-SQL: take a natural language question, generate SQL for the connected database’s dialect, execute it, return an Ibis table. worked for simple queries against single tables. joins were unreliable
  2. Error recovery: if SQL execution failed, package the error + original SQL + schema and ask the LLM to fix it. up to 3 retries. worked maybe 40-60% of the time
  3. SQL transpilation: translate SQL between dialects via SQLGlot (e.g., DuckDB SQL to Postgres SQL)
  4. Cross-backend execution: run the last successful SQL on a different Ibis backend
  5. Cache lookup: skip LLM calls for questions similar to previously-seen questions (Levenshtein >0.75 threshold)

What it could not do despite being planned:

  • conversational mode (#45 – context from previous messages)
  • visualization via plotly (#50 – ChartAttachment existed but was never wired up)
  • documentation search / RAG (#19, #49)
  • Python code generation (#20)
  • filesystem operations (#47)
  • internet search (#51 – WebpageAttachment and duckduckgo-search were in deps but unused in final version)
  • GitHub search (#14)
  • feedback flow (#52 – auto-create GitHub issues from bad responses)
  • multi-flow routing (the select_flow method was NotImplementedError)
  • conversation history persistence (#33 – encode()/decode() stubs everywhere, never implemented)

what worked #

  • the email metaphor was surprisingly natural. modeling all communication as messages with typed attachments made the data flow explicit and debuggable. you could inspect any message in the chain and see exactly what was passed
  • retry loops. when SQL execution failed, the error got packaged as an ErrorAttachment and sent to FixSQLTask, which got the original SQL, the error, the schema, and the dialect. this was directionally correct and felt magical in 2023
  • Ibis portability. because everything went through Ibis, you could point it at DuckDB locally and BigQuery in production. same bot, same code
  • Marvin’s @fn decorator was an elegant API for LLM-as-function. write a type signature and a docstring, get a working function
  • typed attachments as a protocol. passing SQLAttachment and TableAttachment between tasks instead of raw strings forced type-safety into the agent communication. modern tool-use does the same thing

what didn’t work (and why) #

The issue wasn’t that I didn’t know what was needed. I knew birdbrain needed planning, multi-flow routing, and task decomposition. I tried to build all of it. It just didn’t work — the technology wasn’t there yet.

PR #32 (subsystems and tasks) was the clearest example. This was supposed to add routing between a docs subsystem, a code subsystem, and a data subsystem. Four issues targeted (#14, #17, #19, #20). The LLM couldn’t reliably classify which subsystem to route to. GPT-4 in late 2023 was okay at text-to-SQL in isolation but couldn’t do the meta-reasoning — “is this a data question or a docs question?” — reliably enough to build on. The PR was closed without merging.

PR #37 (“final attempt at ibis-birdbrain, agi or bust”) was a 746-line clean-room rewrite trying one more time to get multi-flow working. It was superseded the same day by PR #38 (v3), which gave up on multi-flow entirely and just shipped the one thing that worked: text-to-SQL with retry.

Three rewrites in four months (PR #24 closed, PR #25 merged, PR #37 closed, PR #38 merged). Each one threw away work. Not because of scope creep — because each attempt hit the same wall: the models couldn’t do the planning/routing part reliably.

The other failures were downstream of this:

  • Flows.select_flow() was NotImplementedError — not because nobody wrote it, but because every attempt to implement it produced unreliable results
  • encode/decode never shipped — the conversations-as-Ibis-tables idea was ahead of its time but depended on having multiple flows generating data worth persisting
  • Marvin dependency hell — locked to Marvin 1.3.0 + OpenAI 0.28.1. Upgrading either broke the other. By the time the ecosystem moved on, so had I
  • prompt engineering was the whole game — the system prompt was enormous and fragile (“be concise; ignore platitudes”, “DO NOT add a LIMIT unless specifically told otherwise”). Every edge case meant more instructions because the model couldn’t infer intent

what I learned #

The hard problem wasn’t text-to-SQL — it was everything around it. Deciding when to query vs. answer from context, handling ambiguous questions, recovering from errors, maintaining state. The “agentic” part. And in 2023, the models just weren’t good enough for that part.

The abstractions were actually decent. The email/message/attachment model made data flow explicit and debuggable. The Bot → Flow → Task hierarchy was directionally correct. It just needed models that could do the routing and planning reliably — which we now have.

Being locked to Marvin 1.3.0 + OpenAI 0.28.1 killed experimentation. The LLM layer should have been isolated better. Today that’s obvious; in 2023, the ecosystem was moving so fast that any abstraction you built was obsolete in months.

then vs. now #

The biggest difference between 2023 and 2026 isn’t the architecture — it’s the models and the harnesses.

Birdbrain was a custom harness built from scratch: message passing, tool dispatch, retry loops, prompt engineering. Today, Claude Code ships all of that out of the box. The harness is commoditized. You don’t build the agent loop anymore — you configure it.

The models are dramatically better too. GPT-4 in late 2023 could do text-to-SQL but couldn’t plan, route between strategies, or recover from errors reliably. The classification problem that killed PR #32 (routing between subsystems) is basically a solved problem now. I wrote about the context length pain and more context length pain at the time — 8K-32K tokens was the ceiling. Now we’re at 200K-1M. That changes everything about what an agent can keep in its context window.

Could I rebuild birdbrain today with Claude Code in an afternoon? Probably. The text-to-SQL part would be trivial. The multi-flow routing that killed the project would actually work. Months of effort, now a single install and an afternoon.

github.com/ibis-project/ibis-birdbrain