01
Tier system enforced in code.
Every action — bash command, API call, tool call — gets classified by a deterministic Ruby classifier, not by the prompt. Green is autonomous, yellow audits and continues, red pauses for human approval over WhatsApp. The LLM is never asked to judge destructiveness.
RED_BASH = /\b(rm\s+-rf|git\s+push\s+.*--force|force[- ]push|git\s+reset\s+--hard|drop\s+table|droplet.*destroy)\b/i
Why: an LLM told to 'be careful' will still rm -rf when reasoning gets tangled. Safe-by-code, not safe-by-vibes.
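A minimal sketch of how such a classifier could look, assuming a plain module with one entry point. The names TierClassifier and YELLOW_BASH are hypothetical, and the patterns are assumptions: RED_BASH is written here case-insensitively, with flexible whitespace and a `git push --force` branch, and the yellow list is invented for illustration.

```ruby
# Hypothetical sketch of a deterministic tier classifier.
# TierClassifier, YELLOW_BASH and the exact patterns are assumptions.
module TierClassifier
  # Destructive commands: these pause for human approval.
  RED_BASH = /\b(rm\s+-rf|git\s+push\s+.*--force|force[- ]push|git\s+reset\s+--hard|drop\s+table|droplet.*destroy)\b/i

  # Invented yellow list: commands that run but leave an audit row.
  YELLOW_BASH = /\b(git\s+push|git\s+commit|npm\s+publish)\b/i

  # Returns :red, :yellow or :green. Red is checked first so it always wins.
  def self.classify(command)
    return :red if RED_BASH.match?(command)
    return :yellow if YELLOW_BASH.match?(command)

    :green
  end
end

TierClassifier.classify("DROP TABLE users;")    # => :red
TierClassifier.classify("git push origin main") # => :yellow
TierClassifier.classify("ls -la")               # => :green
```

The order of the checks matters: a command matching both lists is treated as red, so the classifier only ever errs toward asking a human.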
02
Decision Trace — structured artifacts, not logs.
Every meaningful decision — model choice, tool selection, scope change, repair activation — becomes a row: input, candidates considered, chosen, reason. Debugging agents goes from log archaeology to SQL.
Decision.record!(kind: "model_routing",
  input: prompt, candidates: ["glm", "sonnet"],
  chosen: "sonnet", reason: "tools required")
Why: a year of prod taught me that 'what did the agent think?' is the most-asked question. So it became a first-class table.
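To make the "row, not log line" idea concrete, here is an in-memory stand-in for a decisions table. It is a sketch under assumptions: the real thing is presumably a database-backed model, and the `where` filter and the "chosen must be a candidate" check are illustrative additions, not confirmed behavior.

```ruby
# In-memory stand-in for a decisions table; everything here is illustrative.
class Decision
  attr_reader :kind, :input, :candidates, :chosen, :reason

  @rows = []

  class << self
    attr_reader :rows

    # Validates and stores one decision row.
    def record!(kind:, input:, candidates:, chosen:, reason:)
      unless candidates.include?(chosen)
        raise ArgumentError, "chosen must be one of the candidates"
      end

      row = new(kind: kind, input: input, candidates: candidates,
                chosen: chosen, reason: reason)
      rows << row
      row
    end

    # Tiny equality filter standing in for a SQL WHERE clause.
    def where(**conditions)
      rows.select do |row|
        conditions.all? { |attribute, value| row.public_send(attribute) == value }
      end
    end
  end

  def initialize(kind:, input:, candidates:, chosen:, reason:)
    @kind = kind
    @input = input
    @candidates = candidates
    @chosen = chosen
    @reason = reason
  end
end

Decision.record!(kind: "model_routing", input: "summarize the thread",
                 candidates: %w[glm sonnet], chosen: "glm", reason: "triage only")
Decision.record!(kind: "model_routing", input: "fix the failing deploy",
                 candidates: %w[glm sonnet], chosen: "sonnet", reason: "tools required")

# "Why did it pick Sonnet?" becomes a filter instead of a log search.
Decision.where(kind: "model_routing", chosen: "sonnet").map(&:reason)
# => ["tools required"]
```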
03
Human-in-the-Loop over WhatsApp.
Tier-red actions don't execute. They open a PendingConfirmation that pings me with a summary and the payload. I reply 'APPROVE 42' or 'REJECT 42', and the dispatcher resumes. Async, durable, auto-expiring.
PendingConfirmation.create!(
  tier: :red, tool_name: "droplet_destroy",
  action_summary: "destroy giants production")
Why: chat is the only async UX where a busy human stays responsive — and the audit trail is automatic.
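The reply-handling half of that loop can be sketched as follows. This is a hypothetical, in-memory version: the `id`, `expires_at` and `status` fields, the ConfirmationReply module, and the outcome symbols are all assumptions made for illustration.

```ruby
# Hypothetical sketch: fields beyond tool_name/action_summary are assumptions.
PendingConfirmation = Struct.new(:id, :tool_name, :action_summary,
                                 :expires_at, :status, keyword_init: true) do
  def expired?(now = Time.now)
    now >= expires_at
  end
end

module ConfirmationReply
  # Accepts "APPROVE 42" or "REJECT 42", ignoring case and surrounding spaces.
  PATTERN = /\A\s*(APPROVE|REJECT)\s+(\d+)\s*\z/i

  def self.parse(text)
    match = PATTERN.match(text)
    return nil if match.nil?

    verdict = match[1].upcase == "APPROVE" ? :approved : :rejected
    { verdict: verdict, id: match[2].to_i }
  end

  # Applies a chat reply to the matching confirmation and returns the outcome.
  def self.apply(text, pending_by_id, now: Time.now)
    reply = parse(text)
    return :ignored if reply.nil?

    confirmation = pending_by_id[reply[:id]]
    return :unknown_id if confirmation.nil?
    return :already_resolved unless confirmation.status == :pending

    # An expired confirmation can no longer be approved.
    confirmation.status = confirmation.expired?(now) ? :expired : reply[:verdict]
  end
end
```

Anchoring the pattern with `\A` and `\z` means a casual message that merely contains the word "approve" is ignored rather than treated as consent.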
04
Tool Registry + per-agent whitelist.
Tools are declarative Ruby classes: description, JSON Schema, execute. A global registry holds them all. Each AgentDefinition declares its allowed_tools — Mission sees 6, Marketing sees 25, Attendant sees 10. Whitelist enforced server-side.
allowed_tools: %w[
  create_order_draft confirm_order
  escalate_to_sergio get_customer_history
]
Why: least-privilege for agents. Adding a tool to the registry doesn't expose it — agents declare what they need.
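A minimal sketch of the registry-plus-whitelist idea, assuming tools are registered as blocks for brevity (the description above says they are classes). ToolRegistry, NotAllowed, `visible_to` and `call` are hypothetical names.

```ruby
# Hypothetical sketch. Tools are registered as blocks here for brevity;
# the text describes them as classes with description, JSON Schema and execute.
class ToolRegistry
  class NotAllowed < StandardError; end

  def initialize
    @tools = {}
  end

  def register(name, description:, schema:, &execute)
    @tools[name.to_s] = { description: description, schema: schema, execute: execute }
  end

  # The tool definitions an agent is shown: only the ones on its whitelist.
  def visible_to(allowed_tools)
    @tools.slice(*allowed_tools)
  end

  # Server-side check: the whitelist is enforced again at call time,
  # so a tool name the agent was never shown cannot reach execute.
  def call(name, args, allowed_tools:)
    unless allowed_tools.include?(name.to_s)
      raise NotAllowed, "#{name} is not on this agent's whitelist"
    end

    @tools.fetch(name.to_s)[:execute].call(**args)
  end
end

registry = ToolRegistry.new
registry.register("confirm_order", description: "Confirm a draft order",
                  schema: { type: "object", required: ["order_id"] }) do |order_id:|
  "order #{order_id} confirmed"
end
registry.register("droplet_destroy", description: "Destroy a droplet",
                  schema: { type: "object", required: ["name"] }) do |name:|
  "destroyed #{name}"
end

attendant_tools = %w[confirm_order]
registry.call("confirm_order", { order_id: 7 }, allowed_tools: attendant_tools)
# => "order 7 confirmed"
```

Filtering what the model sees and checking again at call time are two separate guards; the second one is what makes the whitelist a security boundary rather than a prompt convention.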
05
Multi-agent: router, not decomposer.
Six specialized agents (Orchestrator, Mission, Marketing, DevOps, PO, QA), plus a per-brand Attendant, talk via an AgentMessage bus. Crucial design call: the orchestrator routes but does not decompose tasks. Each agent owns its piece and decides internally whether to spawn subagents.
AgentMessage.create!(from: "devops",
  to: "po", kind: :alert,
  payload: { app: "giants", status: 500 })
Why: top-down decomposition by an LLM is brittle. Bottom-up with autonomous agents that hand off is way more robust.
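One way to picture the bus is as per-agent inboxes that peers write to directly, with no central planner in between. The sketch below is an in-memory assumption (MessageBus, `publish` and `drain` are hypothetical names); a durable version would presumably be table-backed.

```ruby
# Hypothetical in-memory bus; a durable version would be table-backed.
AgentMessage = Struct.new(:from, :to, :kind, :payload, keyword_init: true)

class MessageBus
  def initialize
    # One inbox per agent, created lazily on first delivery.
    @inboxes = Hash.new { |hash, agent| hash[agent] = [] }
  end

  # Any agent can hand off to any other; nothing is decomposed centrally.
  def publish(from:, to:, kind:, payload: {})
    message = AgentMessage.new(from: from, to: to, kind: kind, payload: payload)
    @inboxes[to] << message
    message
  end

  # Returns and clears everything waiting for the given agent.
  def drain(agent)
    @inboxes.delete(agent) || []
  end
end

bus = MessageBus.new
bus.publish(from: "devops", to: "po", kind: :alert,
            payload: { app: "giants", status: 500 })
bus.drain("po").map(&:kind) # => [:alert]
```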
06
Workers: Claude Code headless with tier guard.
When real coding is needed, the orchestrator spawns claude -p --output-format stream-json as a background worker. A tier guard hook is installed via a temporary settings.json, and the working directory is pinned to the right repo. The worker can edit Bailder itself; deploys run externally as a structural escape hatch.
WorkerSessionRunner.spawn!(
  project: :bailder, prompt: "fix bug X",
  model: "sonnet-4-6", tier_guard: true)
Why: 'agent edits itself' is a meme until you make the deploy loop external. Then it's just leverage.
07
Replay + A/B — skills versioned like code.
Any past AgentRun can be re-executed with a tweaked prompt. An A/B test runs the same scenario against two prompts in isolated sandboxes and reports tool calls, cost, escalations, and whether an order was created. Skill evolution is data-driven, not vibes-driven.
BrandAttendanceAbTest.new(
  brand: salgadelli, scenario: msg,
  prompt_a: current, prompt_b: candidate
).call
Why: I treat the system prompt like a unit under test. Because it is.
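The shape of such an A/B harness can be sketched with the agent run injected as a callable, so the comparison logic stays testable without a model. PromptAbTest, RunReport and the `runner` parameter are hypothetical; in a real setup the runner would presumably be a sandboxed agent run.

```ruby
# Hypothetical sketch: the runner stands in for a sandboxed agent run.
RunReport = Struct.new(:tool_calls, :cost_usd, :escalated, :order_created,
                       keyword_init: true)

class PromptAbTest
  def initialize(scenario:, prompt_a:, prompt_b:, runner:)
    @scenario = scenario
    @prompts = { a: prompt_a, b: prompt_b }
    @runner = runner
  end

  # Runs the same scenario once per prompt and returns both reports side by side.
  def call
    @prompts.transform_values do |prompt|
      @runner.call(prompt: prompt, scenario: @scenario)
    end
  end
end
```

A fake runner (a lambda returning canned RunReport values) is enough to exercise the harness; swapping in the real runner then changes nothing about how the results are compared.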
08
Conversation-first modeling.
Conversations are first-class. Entities (Order, Mission, Decision) are derived state — projections of the conversation, not forms a user fills in. A traditional CRM is entity-first; Bailder inverts it. The model already speaks conversation; let it.
Order.new(items: agent_extracted,
  conversation: conv,
  fulfillment: derived_from_messages)
Why: every CRM I've ever seen lies because the salesperson fills the pipeline for show. Derived state can't lie.
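"Derived state" can be read as a fold over the conversation: the order is recomputed from messages instead of being edited by hand. The sketch below assumes each message already carries an agent-extracted `intent` and `data`; those fields, the intent names and OrderProjection are all hypothetical.

```ruby
# Hypothetical sketch: intents and data are assumed to be agent-extracted.
ConversationMessage = Struct.new(:role, :intent, :data, keyword_init: true)

module OrderProjection
  # Folds the conversation into order state; unknown intents are skipped.
  def self.from(messages)
    initial = { items: [], fulfillment: nil, confirmed: false }
    messages.each_with_object(initial) do |message, order|
      case message.intent
      when :add_item then order[:items] << message.data
      when :set_fulfillment then order[:fulfillment] = message.data
      when :confirm then order[:confirmed] = true
      end
    end
  end
end

conversation = [
  ConversationMessage.new(role: :customer, intent: :add_item,
                          data: { sku: "coxinha", qty: 50 }),
  ConversationMessage.new(role: :customer, intent: :smalltalk, data: nil),
  ConversationMessage.new(role: :customer, intent: :set_fulfillment, data: :delivery)
]
OrderProjection.from(conversation)[:fulfillment] # => :delivery
```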
09
Multi-LLM routing — model is commodity.
Every chat call goes through OpenRouter. Cheap models for triage (GLM-5.1), Sonnet for default reasoning, Opus gated behind explicit opt-in. The choice of model itself is a Decision row.
OpenRouterClient.chat(
  model: ModelRouter.pick(task),
  context: { agent_run_id:, conversation_id: })
Why: I burned $16 once when retry auto-escalated GLM → Sonnet → Opus. Lesson learned in a table.
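A routing policy with the Opus gate and a capped retry escalation might look like the sketch below. The Task fields, the `escalate` helper and the short labels "glm", "sonnet" and "opus" are assumptions for illustration; they are not real OpenRouter model slugs.

```ruby
# Hypothetical routing policy; labels are shorthand, not OpenRouter slugs.
module ModelRouter
  TIERS = %w[glm sonnet opus].freeze

  Task = Struct.new(:kind, :needs_tools, :allow_opus, keyword_init: true)

  # Cheap model for tool-free triage, Opus only on explicit opt-in,
  # Sonnet for everything else.
  def self.pick(task)
    return "opus" if task.kind == :deep_reasoning && task.allow_opus
    return "glm" if task.kind == :triage && !task.needs_tools

    "sonnet"
  end

  # A retry moves up at most one tier and never reaches Opus without
  # opt-in, so a retry loop cannot silently climb to the priciest model.
  def self.escalate(current, allow_opus: false)
    index = TIERS.index(current)
    raise ArgumentError, "unknown model #{current}" if index.nil?

    candidate = TIERS[index + 1]
    return current if candidate.nil?
    return current if candidate == "opus" && !allow_opus

    candidate
  end
end

ModelRouter.escalate("glm")    # => "sonnet"
ModelRouter.escalate("sonnet") # => "sonnet"
```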
10
Audit log + Events — everything is a row.
Every tool call, shell command, inbound or outbound webhook, and sent message flows through structured audit logs and an event bus. Payloads are stored as JSONB and are queryable. Forensics is SQL, not grep.
AuditLog.log!(action: "post.published",
  tier: :yellow, target: brand.slug,
  payload: { post_id:, platforms: })
Why: post-mortems are a write-time investment, not a read-time scramble. Pay it forward.
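The pairing of audit log and event bus can be sketched as one write that both stores a row and notifies subscribers. This in-memory version is an assumption: `subscribe`, the row shape, and serializing the payload with JSON (standing in for a JSONB column) are illustrative choices.

```ruby
require "json"

# Hypothetical in-memory audit log that doubles as a small event bus.
class AuditLog
  @rows = []
  @subscribers = Hash.new { |hash, action| hash[action] = [] }

  class << self
    attr_reader :rows

    # Registers a handler that is called for every row with this action.
    def subscribe(action, &handler)
      @subscribers[action] << handler
    end

    # Stores the row first, then fans it out to subscribers.
    def log!(action:, tier:, target:, payload: {})
      row = { action: action, tier: tier, target: target,
              payload: JSON.generate(payload), at: Time.now.utc }
      @rows << row
      @subscribers[action].each { |handler| handler.call(row) }
      row
    end
  end
end
```

Writing the row before notifying anyone means a failing subscriber can never cause an action to go unrecorded.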