Production · running at bailder.com · single-tenant by design

The harness,
not the model.

An agent is its environment. Bailder is the production system around the LLM — tier-classified safety, decision traces, multi-agent orchestration, human-in-the-loop over WhatsApp. Built and operated by one engineer, shipping itself.

Multi-agent orchestration · Decision trace · Tier system · PendingConfirmation · Replay · A/B skill eval · Multi-LLM routing · Conversation-first · MCP tools · Audit-by-default · Architecture as code · Self-deploying
7
specialized agents
orchestrator · mission · marketing · devops · po · qa · brand attendant
60
tools in the registry
declarative + per-agent whitelist
101
agent runs logged
every decision auditable
505
structured decisions
input · candidates · chosen · reason
160
headless workers spawned
claude code -p with tier guard hook
77
destructive actions gated
human approved over whatsapp

The thesis: the model is commodity,
the harness is the product.

Every pillar below is implemented and running in production. None of them is a slide. Each one solves a real failure I had — and most have a story.

01

Tier system enforced in code.

Every action — bash command, API call, tool call — gets classified by a deterministic Ruby classifier, not by the prompt. Green is autonomous, yellow audits and continues, red pauses for human approval over WhatsApp. The LLM is never asked to judge destructiveness.

RED_BASH = /\b(rm\s+-rf|force[- ]push|git reset --hard|drop table|droplet.*destroy)\b/

Why: an LLM told to 'be careful' will still rm -rf when reasoning gets tangled. Safe-by-code, not safe-by-vibes.
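
A minimal sketch of what a deterministic classifier of this kind could look like. Only RED_BASH comes from the snippet above; the TierClassifier module name and the YELLOW_BASH pattern are illustrative assumptions, and the production rule set is presumably larger.

```ruby
# Hypothetical sketch: deterministic tier classification for bash commands.
# RED_BASH is copied from the text above; YELLOW_BASH and the module name
# are assumptions made for illustration only.
module TierClassifier
  RED_BASH = /\b(rm\s+-rf|force[- ]push|git reset --hard|drop table|droplet.*destroy)\b/
  YELLOW_BASH = /\b(git push|git commit)\b/

  # Returns :red, :yellow, or :green. Red is checked first, so a command
  # that matches both patterns is always treated as destructive.
  def self.classify(command)
    return :red if command.match?(RED_BASH)
    return :yellow if command.match?(YELLOW_BASH)

    :green
  end
end

puts TierClassifier.classify("ls -la")               # green
puts TierClassifier.classify("git push origin main") # yellow
puts TierClassifier.classify("rm -rf /opt")          # red
```

The point of the design is that the verdict depends only on the command string, so the same input always gets the same tier regardless of what the model was reasoning about.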

02

Decision Trace — structured artifacts, not logs.

Every meaningful decision — model choice, tool selection, scope change, repair activation — becomes a row: input, candidates considered, chosen, reason. Debugging agents goes from log archaeology to SQL.

Decision.record!(kind: "model_routing",
  input: prompt, candidates: ["glm", "sonnet"],
  chosen: "sonnet", reason: "tools required")

Why: a year of prod taught me that 'what did the agent think?' is the most-asked question. So it became a first-class table.
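
To make the "row, not log line" idea concrete, here is a small in-memory stand-in for the decisions table. The field names follow the text above; the storage, the `where` helper, and the sample values are assumptions for illustration, since the real model is presumably database-backed.

```ruby
# Hypothetical in-memory stand-in for a decisions table.
class Decision
  ATTRS = %i[kind input candidates chosen reason].freeze
  attr_reader(*ATTRS)

  @rows = []

  class << self
    attr_reader :rows

    # Refuses to store a decision that lacks any of the structured fields.
    def record!(**attrs)
      missing = ATTRS - attrs.keys
      raise ArgumentError, "missing: #{missing.join(', ')}" unless missing.empty?

      new(**attrs).tap { |row| rows << row }
    end

    # Minimal equality filter, standing in for a SQL WHERE clause.
    def where(**conditions)
      rows.select { |row| conditions.all? { |key, value| row.public_send(key) == value } }
    end
  end

  def initialize(kind:, input:, candidates:, chosen:, reason:)
    @kind, @input, @candidates, @chosen, @reason = kind, input, candidates, chosen, reason
  end
end

Decision.record!(kind: "model_routing", input: "summarize order",
                 candidates: %w[glm sonnet], chosen: "sonnet", reason: "tools required")
Decision.record!(kind: "repair_activated", input: "run 17",
                 candidates: %w[continue repair], chosen: "repair",
                 reason: "consecutive errors over threshold")

Decision.where(kind: "model_routing").each { |d| puts "#{d.chosen}: #{d.reason}" }
```

Because every row carries the candidates and the reason, "why did it pick that?" becomes a filter over structured data instead of a search through free-text logs.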

03

Human-in-the-Loop over WhatsApp.

Tier-red actions don't execute — they open a PendingConfirmation that pings me with summary + payload. I reply 'APPROVE 42' or 'REJECT 42'. The dispatcher resumes. Async, durable, auto-expiring.

PendingConfirmation.create!(
  tier: :red, tool_name: "droplet_destroy",
  action_summary: "destroy giants production")

Why: chat is the only async UX where a busy human stays responsive — and the audit trail is automatic.
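
The reply-handling half of that loop can be sketched as follows. The ConfirmationInbox class, the status values, and the 15-minute expiry window are assumptions for illustration; only the 'APPROVE 42' / 'REJECT 42' reply format and the PendingConfirmation fields come from the text above.

```ruby
# Hypothetical sketch of resolving a PendingConfirmation from a chat reply.
PendingConfirmation = Struct.new(:id, :tool_name, :action_summary, :expires_at, :status,
                                 keyword_init: true)

class ConfirmationInbox
  REPLY = /\A\s*(APPROVE|REJECT)\s+(\d+)\s*\z/i

  def initialize(confirmations)
    @by_id = confirmations.to_h { |c| [c.id, c] }
  end

  # Returns the updated confirmation (approved, rejected, or expired), or nil
  # when the text is not a valid reply, the id is unknown, or the confirmation
  # was already resolved.
  def handle(text, now: Time.now)
    match = REPLY.match(text)
    return nil unless match

    confirmation = @by_id[match[2].to_i]
    return nil unless confirmation && confirmation.status == :pending

    confirmation.status =
      if now > confirmation.expires_at
        :expired
      elsif match[1].upcase == "APPROVE"
        :approved
      else
        :rejected
      end
    confirmation
  end
end

now = Time.now
pending = PendingConfirmation.new(id: 42, tool_name: "droplet_destroy",
                                  action_summary: "destroy giants production",
                                  expires_at: now + 15 * 60, status: :pending)
inbox = ConfirmationInbox.new([pending])
puts inbox.handle("APPROVE 42", now: now).status # approved
```

A late or repeated reply cannot re-trigger the action: an expired confirmation resolves to :expired, and a confirmation that is no longer pending is ignored.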

04

Tool Registry + per-agent whitelist.

Tools are declarative Ruby classes: description, JSON Schema, execute. A global registry holds them all. Each AgentDefinition declares its allowed_tools — Mission sees 6, Marketing sees 25, Attendant sees 10. Whitelist enforced server-side.

allowed_tools: %w[
  create_order_draft confirm_order
  escalate_to_sergio get_customer_history
]

Why: least-privilege for agents. Adding a tool to the registry doesn't expose it — agents declare what they need.
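
A minimal sketch of the registry-plus-whitelist idea. For brevity, tools are registered as blocks rather than classes and the JSON Schema is omitted; the ToolRegistry API shown here is an assumption, not Bailder's actual interface.

```ruby
# Hypothetical sketch: a global tool registry with a per-agent whitelist.
class ToolRegistry
  class NotAllowed < StandardError; end

  def initialize
    @tools = {}
  end

  def register(name, description:, &execute)
    @tools[name] = { description: description, execute: execute }
  end

  # Only the whitelisted subset is ever presented to an agent.
  def visible_to(allowed_tools)
    @tools.slice(*allowed_tools)
  end

  # The whitelist is checked again at call time, so a tool name the model
  # invents or remembers cannot bypass it.
  def call(name, allowed_tools:, **args)
    raise NotAllowed, "#{name} is not whitelisted" unless allowed_tools.include?(name)

    @tools.fetch(name)[:execute].call(**args)
  end
end

registry = ToolRegistry.new
registry.register("get_customer_history", description: "Past orders for a customer") do |customer_id:|
  "history for #{customer_id}"
end
registry.register("droplet_destroy", description: "Destroy a server") do |name:|
  "destroyed #{name}"
end

attendant_tools = %w[get_customer_history]
puts registry.call("get_customer_history", allowed_tools: attendant_tools, customer_id: 7)
```

Registering `droplet_destroy` does not expose it to the attendant: it is absent from `visible_to(attendant_tools)`, and a direct call raises `ToolRegistry::NotAllowed`.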

05

Multi-agent: router, not decomposer.

Seven specialized agents (Orchestrator, Mission, Marketing, DevOps, PO, QA, and a per-brand Attendant) talk via an AgentMessage bus. The crucial design call: the orchestrator does not decompose tasks. Each agent owns its piece and decides internally whether to spawn subagents.

AgentMessage.create!(from: "devops",
  to: "po", kind: :alert,
  payload: { app: "giants", status: 500 })

Why: top-down decomposition by an LLM is brittle. Bottom-up handoffs between autonomous agents are far more robust.

06

Workers: Claude Code headless with tier guard.

When real coding is needed, the orchestrator spawns claude -p --output-format stream-json as a background worker. Tier guard hook installed via temporary settings.json. Working directory pinned to the right repo. Worker can edit Bailder itself — deploys run externally as a structural escape hatch.

WorkerSessionRunner.spawn!(
  project: :bailder, prompt: "fix bug X",
  model: "sonnet-4-6", tier_guard: true)

Why: 'agent edits itself' is a meme until you make the deploy loop external. Then it's just leverage.

07

Replay + A/B — skills versioned like code.

Any past AgentRun can be re-executed with a tweaked prompt. A/B runs the same scenario against two prompts in isolated sandboxes; reports tool calls, cost, escalation, order created. Skill evolution is data-driven, not vibes-driven.

BrandAttendanceAbTest.new(
  brand: salgadelli, scenario: msg,
  prompt_a: current, prompt_b: candidate
).call

Why: I treat the system prompt like a unit under test. Because it is.

08

Conversation-first modeling.

Conversations are first-class. Entities (Order, Mission, Decision) are derived state — projections of the conversation, not forms a user fills in. A traditional CRM is entity-first; Bailder inverts it. The model already speaks conversation; let it.

Order.new(items: agent_extracted,
  conversation: conv,
  fulfillment: derived_from_messages)

Why: every CRM I've ever seen lies because the salesperson fills the pipeline for show. Derived state can't lie.
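
One way to picture "derived state" is as a fold over conversation events. In this sketch the events are given as plain structs so the example stays self-contained; in the system described above an agent would extract them from free text. The event types, field names, and SKUs are all hypothetical.

```ruby
# Hypothetical sketch: an order as a projection over conversation events.
Event = Struct.new(:type, :data, keyword_init: true)

# Folds the events, in order, into the current order state. Nothing is
# edited by hand: replaying the same events yields the same order.
def project_order(events)
  initial = { items: Hash.new(0), fulfillment: nil, confirmed: false }
  events.each_with_object(initial) do |event, order|
    case event.type
    when :item_added      then order[:items][event.data[:sku]] += event.data[:qty]
    when :item_removed    then order[:items].delete(event.data[:sku])
    when :fulfillment_set then order[:fulfillment] = event.data[:mode]
    when :confirmed       then order[:confirmed] = true
    end
  end
end

events = [
  Event.new(type: :item_added, data: { sku: "a", qty: 50 }),
  Event.new(type: :item_added, data: { sku: "a", qty: 25 }),
  Event.new(type: :item_added, data: { sku: "b", qty: 10 }),
  Event.new(type: :item_removed, data: { sku: "b" }),
  Event.new(type: :fulfillment_set, data: { mode: :delivery })
]

order = project_order(events)
p order[:items]       # {"a"=>75}
p order[:fulfillment] # :delivery
```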

09

Multi-LLM routing — model is commodity.

Every chat call goes through OpenRouter. Cheap models for triage (GLM-5.1), Sonnet for default reasoning, Opus gated behind explicit opt-in. The choice of model itself is a Decision row.

OpenRouterClient.chat(
  model: ModelRouter.pick(task),
  context: { agent_run_id:, conversation_id: })

Why: I burned $16 once when retry auto-escalated GLM → Sonnet → Opus. Lesson learned in a table.
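
A sketch of a router that records each choice and caps retry escalation. The ladder, the `needs_tools` flag, and the opt-in rule are assumptions for illustration; the model names are short labels, not real model identifiers, and the decisions are kept in a plain array rather than a table.

```ruby
# Hypothetical sketch: model routing with recorded decisions and capped escalation.
class ModelRouter
  # Ordered cheapest first. Labels only, not real model ids.
  LADDER = %w[glm sonnet opus].freeze

  attr_reader :decisions

  def initialize(allow_opus: false)
    @allow_opus = allow_opus
    @decisions = []
  end

  # First choice: the cheap model for triage, the default model when tools are needed.
  def pick(task)
    if task[:needs_tools]
      record(task, "sonnet", "tools required")
    else
      record(task, "glm", "triage")
    end
  end

  # On retry, move up at most one rung, and never to opus without opt-in.
  def escalate(task, from:)
    index = LADDER.index(from) || raise(ArgumentError, "unknown model: #{from}")
    candidate = LADDER[index + 1]
    blocked = candidate.nil? || (candidate == "opus" && !@allow_opus)
    return record(task, from, "escalation capped") if blocked

    record(task, candidate, "retry after failure")
  end

  private

  # Every routing choice is stored with its candidates and reason.
  def record(task, chosen, reason)
    @decisions << { kind: "model_routing", input: task[:prompt],
                    candidates: LADDER, chosen: chosen, reason: reason }
    chosen
  end
end

router = ModelRouter.new
task = { prompt: "classify this message", needs_tools: false }
puts router.pick(task)                       # glm
puts router.escalate(task, from: "glm")      # sonnet
puts router.escalate(task, from: "sonnet")   # sonnet
```

With the cap in place, a retry loop can no longer climb to the most expensive model on its own; reaching it requires constructing the router with `allow_opus: true`.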

10

Audit log + Events — everything is a row.

Every tool call, shell command, webhook in/out, message sent flows through structured audit logs and an event bus. Payload as JSONB, queryable. Forensics is SQL, not grep.

AuditLog.log!(action: "post.published",
  tier: :yellow, target: brand.slug,
  payload: { post_id:, platforms: })

Why: post-mortems are a write-time investment, not a read-time scramble. Pay it forward.

Try the tier guard yourself.

Same regex set that gates every bash a worker runs in production. Type any command — see the classifier decide.

~/bailder
→ start typing. Try ls, git push, rm -rf /opt.
GREEN · autonomous YELLOW · audit & continue RED · ask human

Last 6 decisions made in production.

Pulled live from the decisions table. Every row is a moment the agent made a real choice with reasoning attached.

tool_choice_forced 2 days ago
chose write_todos
“round 1, large prompt, open model”
tool_choice_forced 2 days ago
chose write_todos
“round 1, large prompt, open model”
repair_activated 2 days ago
chose repair
“4 consecutive errors, threshold=2”
tool_choice_forced 2 days ago
chose write_todos
“round 1, large prompt, open model”
tool_choice_forced 2 days ago
chose write_todos
“round 1, large prompt, open model”
tool_choice_forced 2 days ago
chose write_todos
“round 1, large prompt, open model”

Architecture, on one screen.

WhatsApp is the cockpit. Agents are specialized routers, not decomposers. Workers run claude code headless. Everything destructive pauses through a confirmation.

   ┌──────────────────────────────────────────────────────────────┐
   │  WhatsApp  (Evolution API · 1 instance per brand)            │
   └──────────────────────────────┬───────────────────────────────┘
                                  ▼
                ┌─────────────────────────────────────┐
                │  ConversationDispatcher             │
                │    actor_kind: owner / delegate /   │
                │                customer (per brand) │
                └────────────┬────────────────────────┘
       ┌─────────────────────┼──────────────────────────────┐
       ▼                     ▼                              ▼
  ┌─────────┐         ┌────────────┐             ┌─────────────────┐
  │Orchestr.│ <─────> │ Mission /  │             │ Brand Attendant │
  │ (Sergio)│  Agent  │ Marketing /│             │ (Salgadelli,    │
  └────┬────┘ Message │ DevOps /PO/│             │  per-brand)     │
       │       bus    │ QA agents  │             └────────┬────────┘
       ▼              └─────┬──────┘                      ▼
  ┌─────────────┐           ▼                     ┌─────────────┐
  │ Tool        │     ┌──────────┐                │ Order /     │
  │ Registry    │ ──► │ Workers  │                │ Customer    │
  │ (~80 tools, │     │ claude-p │                │ Contact     │
  │  whitelist) │     │ headless │                └─────────────┘
  └─────┬───────┘     └────┬─────┘
        │                  │
        ▼                  ▼
  ┌──────────────────────────────┐         ┌──────────────────────┐
  │ TIER GUARD (green/yel/red)   │ ──red──►│ PendingConfirmation  │
  └──────────────┬───────────────┘         │ (Sergio approves on  │
                 │                         │  WhatsApp)           │
                 ▼                         └──────────────────────┘
        ┌─────────────────────────────────────┐
        │ Audit Log · Decision · Event bus    │
        │ (everything is a queryable row)     │
        └─────────────────────────────────────┘
  

Built by one engineer.
Shipped by the agent it built.

Bailder runs in production at bailder.com. The agent inside it can edit its own source. When it does, GitHub Actions runs the deploy externally — if it breaks the build, the old container keeps serving. That escape hatch is structural, not a flag.

I built it because I needed it. Multi-LLM routing, tier-classified safety, decision traces, replay, eval — every piece solves a real failure I had operating agents in production. Most of them have a war story attached. Some of them cost me money to learn.

If you're hiring for AI agent engineering and this resonates — let's talk.