builder-ai

What Claude skips by default

Four things Claude does without this pack

Not edge cases. What happens by default when Claude Code builds an AI feature with no quality rules.

No Eval

"Shipped on vibes"

The feature launched. Nobody measured whether it actually worked. Three weeks later, a prompt change made quality drop from 89% to 72% and nobody noticed until users started complaining — because there was no test to catch it.

Blocked by eval-before-ship — requires pass rate, baseline, failure analysis before merge

No Versioning

"Which prompt is live?"

The instruction sent to the AI was edited directly in the server settings. No history, no way to undo. When something broke, nobody could tell which of the three recent edits caused it.

Blocked by prompt-versioning — semver files, git history, one-line rollback path required

No Fallback

"3am incident"

The AI service went down. The product threw an error. Users saw a blank screen. The developer added a fallback handler at 3am. It should have been designed before the feature ever shipped.

Blocked by fallback-required — all 4 failure modes handled before ship

No Safety Review

"Day-two jailbreak"

The AI had no protection against manipulation. A user found a way to make it ignore its instructions on day two and posted it publicly. The fix took three days. A two-hour review would have caught it.

Blocked by ai-safety-review — prompt injection, hallucination risk, output safety, agentic scope

How it works

Claude must produce real proof — not just say it's done

Each rule has a required output format that must contain real values. If Claude skips testing the feature, it cannot produce a real pass rate. If it skips the safety check, it cannot produce a real safety report. There is no "it looks good" path.

eval-before-ship — completion statement

Eval complete.
Feature: document-classifier
Suite: evals/classifier/suite_v2.json (120 examples)

Metric: accuracy
Pass rate: 91.7% (threshold: 85%)
Baseline: v1.0.0 @ 88.3% (+3.4%)

Failures: 10 / 120
Pattern: ambiguous category boundary (contract vs. agreement)
Mitigation: added 8 examples to training set — retest pending

Harness: evals/run_eval.py (reproducible)
CI: eval workflow added to .github/workflows/eval.yml ✓

Verdict: PASS — feature cleared for merge ✓

Claude cannot write Pass rate: 91.7% and Baseline: v1.0.0 @ 88.3% without actually running the test. That is the enforcement.

The 8 rules

One rule per thing Claude skips by default

Three mandatory checks that block shipping if not satisfied. Five workflow rules that enforce how AI development work should be done.

eval-before-ship A real test suite with measured pass rates and failure analysis — not 'I tested it manually and it looked good' Hard Gate

ai-safety-review Four security checks: manipulation attacks, wrong AI outputs, unsafe responses, and AI doing more than intended — all reviewed before users see it Hard Gate

fallback-required What happens when the AI times out, gives a garbled response, isn't confident, or refuses to answer — all four planned before shipping Hard Gate

prompt-versioning Every instruction sent to the AI saved with version history, so you can see exactly what changed and undo it in one step Workflow

rag-pipeline-design Before building AI-powered search: check data quality, understand what users will ask, and measure how accurate the results are before going live Workflow

model-benchmarking Test at least three AI models on your specific task before committing — based on your own data, not published benchmarks Workflow

context-optimization Structured process for reducing AI costs: cut unnecessary content first, then compress, then split, then cache — with measurements before and after Workflow

ai-cost-audit Calculate actual token usage, project what costs look like at 10× users, and find the real levers before claiming any savings Workflow

The 5 specialists

Five AI roles Claude can take on

Each specialist has a narrow focus and explicit rules for what it won't do. Each produces output at a real file path — not a chat message.

Sonnet

prompt-engineer

Writes and improves AI instructions. Saves every version. Never claims quality without running a test first.

Sonnet

eval-designer

Builds test suites for AI features. Defines what 'working' means with real metrics. Writes the tests that the prompt-engineer must pass.

Opus

rag-architect

Designs AI-powered search and retrieval features. Always checks your data first. Measures how accurate the results are before the feature goes live.

Sonnet

model-selector

Runs a structured comparison of AI models for your specific task. Never picks based on name recognition or general benchmarks.

Opus

ai-safety-reviewer

Reviews every AI feature before it reaches users. Checks for 4 risk categories. Every feature gets PASS or BLOCK — nothing in between.

Before and after

What changes when enforcement exists

The difference is work done before the PR was opened, not problems caught after the feature shipped.

✗ Without builder-ai

✗ Prompt shipped without an eval suite. Quality tracked by support tickets.

✗ System prompt edited in the production env var. Rollback requires memory.

✗ API timeout throws an unhandled exception. Users see a 500 error.

✗ No safety review. Injection vector found on day two by a user.

✗ Model chosen based on "it felt better." No benchmark. No cost comparison.

✗ RAG pipeline designed without a data audit. Recall@k unknown at launch.

✓ With builder-ai

✓ 120-example eval suite. 91.7% accuracy. +3.4% over baseline. Runs in CI.

✓ prompts/classifier/v1.2.0.md in git. Rollback: change one line.

✓ All 4 failure modes handled. Graceful degradation. Alerts on fallback rate.

✓ Safety review passed. 5 injection test cases. Refusal tested. Report on file.

✓ Benchmarked 3 tiers. Mid-tier passes the bar at 10× lower cost. Documented.

✓ Data audit done. Recall@8 = 0.87. Chunking strategy justified. Baseline set.

Installation

30 seconds to enforce quality

Sparse-clone installs only the skills and agents directories. Nothing else touches your project.

macOS / Linux

curl -fsSL https://raw.githubusercontent.com/
RBraga01/builder-ai/master/install.sh | bash

Windows (PowerShell)

irm https://raw.githubusercontent.com/
RBraga01/builder-ai/master/install.ps1 | iex

The installer copies skills/ and .claude/agents/ into your project using cp -n (non-destructive — won't overwrite existing files). No global changes, no dependencies installed, no configuration required.

The builder-* ecosystem

One pack per domain

Each pack enforces one domain. All work standalone. All share the same enforcement model — Completion Statement Formats that require real values.

Foundation

Your AI assistant will
skip the eval.
This pack won't let it.

Four things Claude does without this pack

"Shipped on vibes"

"Which prompt is live?"

"3am incident"

"Day-two jailbreak"

Claude must produce real proof — not just say it's done

One rule per thing Claude skips by default

Five AI roles Claude can take on

prompt-engineer

eval-designer

rag-architect

model-selector

ai-safety-reviewer

What changes when enforcement exists

✗ Without builder-ai

✓ With builder-ai

30 seconds to enforce quality

One pack per domain

A Team

builder-design

builder-product

builder-growth

Your AI assistant willskip the eval.This pack won't let it.

Four things Claude does without this pack

"Shipped on vibes"

"Which prompt is live?"

"3am incident"

"Day-two jailbreak"

Claude must produce real proof — not just say it's done

One rule per thing Claude skips by default

Five AI roles Claude can take on

prompt-engineer

eval-designer

rag-architect

model-selector

ai-safety-reviewer

What changes when enforcement exists

✗ Without builder-ai

✓ With builder-ai

30 seconds to enforce quality

One pack per domain

A Team

builder-ai

builder-design

builder-product

builder-growth

Your AI assistant will
skip the eval.
This pack won't let it.