AI Engineering Skills Pack

Your AI assistant will skip the eval.
This pack won't let it.

When Claude Code builds AI features, it skips the quality checks. This pack makes those checks required.

Drop one folder into your project. Claude Code now must test every AI feature before shipping, keep a history of every prompt change, plan what happens when the AI fails, and check for security holes — or it cannot mark the task done.

3 Hard Gates
8 Skills
5 Agents
0 Skippable Steps

What Claude skips by default

Four things Claude skips without this pack

Not edge cases. What happens by default when Claude Code builds an AI feature with no quality rules.

No Eval

"Shipped on vibes"

The feature launched. Nobody measured whether it actually worked. Three weeks later, a prompt change made quality drop from 89% to 72% and nobody noticed until users started complaining — because there was no test to catch it.

Blocked by eval-before-ship: requires pass rate, baseline, and failure analysis before merge

No Versioning

"Which prompt is live?"

The instruction sent to the AI was edited directly in the server settings. No history, no way to undo. When something broke, nobody could tell which of the three recent edits caused it.

Blocked by prompt-versioning: requires semver files, git history, and a one-line rollback path

No Fallback

"3am incident"

The AI service went down. The product threw an error. Users saw a blank screen. The developer added a fallback handler at 3am. It should have been designed before the feature ever shipped.

Blocked by fallback-required: requires all 4 failure modes to be handled before ship

No Safety Review

"Day-two jailbreak"

The AI had no protection against manipulation. A user found a way to make it ignore its instructions on day two and posted it publicly. The fix took three days. A two-hour review would have caught it.

Blocked by ai-safety-review: covers prompt injection, hallucination risk, output safety, and agentic scope
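To make the prompt-versioning requirement concrete, here is a minimal sketch of what "semver files plus a one-line rollback" could look like. It assumes prompts are stored as files named like prompts/classifier/v1.2.0.md; the constant ACTIVE_PROMPT_VERSION and the helper names are hypothetical and are not part of the pack itself.

```python
import re
from pathlib import Path

# Hypothetical pin: rolling back means changing this one line.
ACTIVE_PROMPT_VERSION = "v1.2.0"

_SEMVER = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_version(name: str) -> tuple[int, int, int]:
    """Turn 'v1.2.0' into (1, 2, 0); reject anything that is not semver."""
    match = _SEMVER.match(name)
    if match is None:
        raise ValueError(f"not a semver prompt version: {name!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def list_versions(prompt_dir: Path) -> list[str]:
    """Return every semver-named prompt in prompt_dir, oldest first."""
    names = [p.stem for p in prompt_dir.glob("v*.md") if _SEMVER.match(p.stem)]
    # Sort numerically so v1.10.0 comes after v1.2.0.
    return sorted(names, key=parse_version)


def load_prompt(prompt_dir: Path, version: str = ACTIVE_PROMPT_VERSION) -> str:
    """Load the pinned prompt text; fail loudly if the file is missing."""
    parse_version(version)  # validates the format before touching the disk
    path = prompt_dir / f"{version}.md"
    if not path.is_file():
        raise FileNotFoundError(f"no prompt file for {version} in {prompt_dir}")
    return path.read_text(encoding="utf-8")
```

Because the live prompt is selected by a single pinned value in source control, git history shows exactly when the version changed, and reverting that one line restores the previous prompt.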

How it works

Claude must produce real proof — not just say it's done

Each rule has a required output format that must contain real values. If Claude skips testing the feature, it cannot produce a real pass rate. If it skips the safety check, it cannot produce a real safety report. There is no "it looks good" path.

eval-before-ship — completion statement
Eval complete.
Feature: document-classifier
Suite: evals/classifier/suite_v2.json (120 examples)

Metric: accuracy
Pass rate: 91.7% (threshold: 85%)
Baseline: v1.0.0 @ 88.3% (+3.4%)

Failures: 10 / 120
Pattern: ambiguous category boundary (contract vs. agreement)
Mitigation: added 8 examples to training set — retest pending

Harness: evals/run_eval.py (reproducible)
CI: eval workflow added to .github/workflows/eval.yml ✓

Verdict: PASS — feature cleared for merge ✓

Claude cannot write "Pass rate: 91.7%" and "Baseline: v1.0.0 @ 88.3%" without actually running the test. That is the enforcement.
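The arithmetic behind a statement like the one above can be sketched in a few lines. This is an illustration only, not the pack's actual evals/run_eval.py: the EvalResult type, the run_gate function, and the BLOCK wording for a failed gate are assumptions. The numbers (120 examples, 10 failures, an 85% threshold, an 88.3% baseline) come from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    total: int      # number of examples in the suite
    failures: int   # number of examples that did not pass

    @property
    def pass_rate(self) -> float:
        """Share of examples that passed, in percent."""
        if self.total <= 0:
            raise ValueError("eval suite is empty")
        return 100.0 * (self.total - self.failures) / self.total


def run_gate(result: EvalResult, threshold: float, baseline: float) -> str:
    """Format the measured values and decide whether the gate passes."""
    rate = result.pass_rate
    delta = rate - baseline  # change versus the previous version, in points
    verdict = "PASS" if rate >= threshold else "BLOCK"
    return (
        f"Pass rate: {rate:.1f}% (threshold: {threshold:g}%)\n"
        f"Baseline: {baseline:.1f}% ({delta:+.1f}%)\n"
        f"Failures: {result.failures} / {result.total}\n"
        f"Verdict: {verdict}"
    )


if __name__ == "__main__":
    print(run_gate(EvalResult(total=120, failures=10), threshold=85, baseline=88.3))
```

With 10 failures out of 120, the pass rate is 110 / 120, which rounds to 91.7%, and the difference from the 88.3% baseline rounds to +3.4 points, matching the sample statement.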


The 8 rules

One rule per thing Claude skips by default

Three mandatory checks that block shipping if not satisfied. Five workflow rules that enforce how AI development work should be done.

eval-before-ship (Hard Gate): A real test suite with measured pass rates and failure analysis, not 'I tested it manually and it looked good'
ai-safety-review (Hard Gate): Four security checks (manipulation attacks, wrong AI outputs, unsafe responses, and AI doing more than intended), all reviewed before users see it
fallback-required (Hard Gate): What happens when the AI times out, gives a garbled response, isn't confident, or refuses to answer, with all four planned before shipping
prompt-versioning (Workflow): Every instruction sent to the AI saved with version history, so you can see exactly what changed and undo it in one step
rag-pipeline-design (Workflow): Before building AI-powered search: check data quality, understand what users will ask, and measure how accurate the results are before going live
model-benchmarking (Workflow): Test at least three AI models on your specific task before committing, based on your own data, not published benchmarks
context-optimization (Workflow): Structured process for reducing AI costs: cut unnecessary content first, then compress, then split, then cache, with measurements before and after
ai-cost-audit (Workflow): Calculate actual token usage, project what costs look like at 10× users, and find the real levers before claiming any savings
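The four failure modes named under fallback-required (timeout, garbled response, low confidence, refusal) could be handled along these lines. This is a sketch under stated assumptions: call_model is a hypothetical stand-in for whatever client the project uses, the JSON response shape with label, confidence, and refused fields is invented for illustration, and the 0.7 confidence floor is arbitrary.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

CONFIDENCE_FLOOR = 0.7  # assumed threshold; tune it against your own eval suite


@dataclass(frozen=True)
class Outcome:
    label: Optional[str]                   # None means a fallback path was taken
    fallback_reason: Optional[str] = None  # which failure mode triggered it


def classify_with_fallback(text: str, call_model: Callable[[str], str]) -> Outcome:
    """Call the model and map each failure mode to an explicit fallback."""
    # Failure mode 1: the model call times out.
    try:
        raw = call_model(text)
    except TimeoutError:
        return Outcome(None, "timeout")

    # Failure modes 2 and 3: garbled output, or an explicit refusal.
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        if data.get("refused", False):
            return Outcome(None, "refusal")
        label = str(data["label"])
        confidence = float(data["confidence"])
    except (KeyError, TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return Outcome(None, "malformed_output")

    # Failure mode 4: the model answered but is not confident enough.
    if confidence < CONFIDENCE_FLOOR:
        return Outcome(None, "low_confidence")
    return Outcome(label)
```

Returning a reason code instead of raising keeps the caller in control of what the user sees, and counting the reason codes gives a fallback rate that can be monitored.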

The 5 specialists

Five AI roles Claude can take on

Each specialist has a narrow focus and explicit rules for what it won't do. Each produces output at a real file path — not a chat message.

Sonnet

prompt-engineer

Writes and improves AI instructions. Saves every version. Never claims quality without running a test first.

Sonnet

eval-designer

Builds test suites for AI features. Defines what 'working' means with real metrics. Writes the tests that the prompt-engineer must pass.

Opus

rag-architect

Designs AI-powered search and retrieval features. Always checks your data first. Measures how accurate the results are before the feature goes live.

Sonnet

model-selector

Runs a structured comparison of AI models for your specific task. Never picks based on name recognition or general benchmarks.

Opus

ai-safety-reviewer

Reviews every AI feature before it reaches users. Checks for 4 risk categories. Every feature gets PASS or BLOCK — nothing in between.


Before and after

What changes when enforcement exists

The difference is work done before the PR was opened, not problems caught after the feature shipped.

✗ Without builder-ai

Prompt shipped without an eval suite. Quality tracked by support tickets.
System prompt edited in the production env var. Rollback requires memory.
API timeout throws an unhandled exception. Users see a 500 error.
No safety review. Injection vector found on day two by a user.
Model chosen based on "it felt better." No benchmark. No cost comparison.
RAG pipeline designed without a data audit. Recall@k unknown at launch.

✓ With builder-ai

120-example eval suite. 91.7% accuracy. +3.4% over baseline. Runs in CI.
prompts/classifier/v1.2.0.md in git. Rollback: change one line.
All 4 failure modes handled. Graceful degradation. Alerts on fallback rate.
Safety review passed. 5 injection test cases. Refusal tested. Report on file.
Benchmarked 3 tiers. Mid-tier passes the bar at 10× lower cost. Documented.
Data audit done. Recall@8 = 0.87. Chunking strategy justified. Baseline set.
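For reference, a figure such as the Recall@8 above is commonly computed per query as the fraction of known-relevant documents that appear in the top k retrieved results, then averaged over all queries. A minimal sketch follows; the function names are illustrative and not taken from the pack, and it assumes each query has a hand-labelled set of relevant document IDs.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top-k retrieved IDs."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        raise ValueError("query has no labelled relevant documents")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    queries: Iterable[Tuple[Sequence[str], Set[str]]], k: int
) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in queries]
    if not scores:
        raise ValueError("no queries to evaluate")
    return sum(scores) / len(scores)
```

Setting a baseline means recording this number on a fixed query set before changing chunking or retrieval settings, so later changes can be compared against it.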

Installation

30 seconds to enforce quality

The installer uses a sparse clone to fetch only the skills and agents directories. Nothing else touches your project.

macOS / Linux
curl -fsSL https://raw.githubusercontent.com/RBraga01/builder-ai/master/install.sh | bash
Windows (PowerShell)
irm https://raw.githubusercontent.com/RBraga01/builder-ai/master/install.ps1 | iex
The installer copies skills/ and .claude/agents/ into your project using cp -n (non-destructive — won't overwrite existing files). No global changes, no dependencies installed, no configuration required.

The builder-* ecosystem

One pack per domain

Each pack enforces one domain. All work standalone. All share the same enforcement model — Completion Statement Formats that require real values.

Foundation

A Team

25 pre-configured engineering specialists, a lead orchestrator, 18 workflow skills, and a Pipeline Auditor. The base layer.

You are here

builder-ai

AI engineering quality gates. Eval, prompt versioning, fallback design, RAG pipelines, safety review.

Design

builder-design

AI UI design enforcement. States, streaming UI, prompt UX, accessibility, design tokens — the design layer AI features need.

Product

builder-product

Product quality gates. PRD, feature scoping, metric definition, research synthesis, A/B test design, AI feature validation.

Growth

builder-growth

Growth quality gates. Positioning, copy, funnel analysis, experiments, retention design, AI messaging review.