Spec-Driven Development
Build it live. Bugs first. Principles last.
The Brief
It’s 4:47 PM on a Friday. PM pings you in Slack:
"we need POST /auth/forgot-password by EOD.
just the basic flow: email, link, done."
You have an hour. You open Cursor. You reach for the AI.
This deck is what happens next — twice. Once the way most of us actually do it. Then once with a spec.
Attempt 1 — The Prompt
You type the obvious thing.
"add a forgot-password endpoint that emails the user a reset link"
13 words. Ship it.
Attempt 1 — The AI’s Output
Looks completely reasonable. Compiles. The happy path works.
async function forgotPassword(req, res) {
  const { email } = req.body
  const user = await db.users.findByEmail(email)
  if (!user) return res.status(404).json({ error: "User not found" })
  const token = crypto.randomUUID()
  await db.tokens.create({ userId: user.id, token })
  await sendEmail(user.email, `Reset link: /reset?token=${token}`)
  return res.json({ ok: true })
}
We’re going to audit this line by line. Bring a security person.
Bug #1 — Account Enumeration
Look at the two response branches.
if (!user) return res.status(404).json({ error: "User not found" }) // unknown email
return res.json({ ok: true }) // known email
The endpoint just told an attacker which emails belong to real customers. Loop it over a leaked email list and you have a customer roster.
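The leak can be stated as a check: if anything an attacker can observe differs between the two branches, the endpoint enumerates accounts. A minimal sketch, assuming only status and body are compared (timing differences are out of scope here); `ObservedResponse` and `leaksAccountExistence` are hypothetical names, not part of the codebase above.

```typescript
// Hypothetical shape: only the parts of an HTTP response an attacker can see.
interface ObservedResponse {
  status: number;
  body: string;
}

// True when the two responses differ in any attacker-visible way.
function leaksAccountExistence(
  forKnownEmail: ObservedResponse,
  forUnknownEmail: ObservedResponse,
): boolean {
  return (
    forKnownEmail.status !== forUnknownEmail.status ||
    forKnownEmail.body !== forUnknownEmail.body
  );
}

// Attempt 1: 200 { ok: true } for known emails, 404 { error } for unknown ones.
const attempt1Known: ObservedResponse = {
  status: 200,
  body: JSON.stringify({ ok: true }),
};
const attempt1Unknown: ObservedResponse = {
  status: 404,
  body: JSON.stringify({ error: "User not found" }),
};

console.log(leaksAccountExistence(attempt1Known, attempt1Unknown)); // true
```

Any endpoint where this returns true for a known/unknown pair hands out a membership oracle.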
Bug #2 — No Rate Limit
The endpoint will happily accept ten thousand requests per second.
$ for email in $(cat customer-list.txt); do
>   curl -X POST /auth/forgot-password -d "{\"email\":\"$email\"}"
> done
Mailbomb
Every customer gets a reset email they didn’t ask for.
SMTP Exhaustion
Your real password resets stop working when the provider throttles you.
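For scale, the missing guard is small. What follows is a sketch of a fixed-window counter, assuming a single process and in-memory state; `FixedWindowLimiter` is a hypothetical helper, and the limits passed to it are placeholders rather than decisions anyone has made yet.

```typescript
// In-memory fixed-window counter. Hypothetical helper, not a production limiter:
// state lives in one process and is lost on restart.
class FixedWindowLimiter {
  private readonly hits = new Map<string, { windowStart: number; count: number }>();
  private readonly max: number;
  private readonly windowMs: number;

  constructor(max: number, windowMs: number) {
    this.max = max;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false once the key has spent its budget.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.hits.get(key);
    // First request for this key, or the previous window has elapsed: start a new window.
    if (entry === undefined || now - entry.windowStart >= this.windowMs) {
      this.hits.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.max) return false;
    entry.count += 1;
    return true;
  }
}

// Placeholder limits: 2 requests per 1000 ms per key.
const demoLimiter = new FixedWindowLimiter(2, 1000);
console.log(demoLimiter.allow("203.0.113.7", 0)); // true
console.log(demoLimiter.allow("203.0.113.7", 10)); // true
console.log(demoLimiter.allow("203.0.113.7", 20)); // false
```

The key can be an IP address or an email address; a deployment with more than one process would need shared storage instead of a `Map`.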
Bug #3 — Token Never Expires
await db.tokens.create({ userId: user.id, token })
No expiresAt. No usedAt. The reset link works forever.
Leaked Screenshot
A token leaked in a screenshot 18 months ago still resets the password today.
Token Replay
A token used on Tuesday can be re-used on Wednesday by anyone who has it.
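Both failures come down to two missing fields and one missing check. A minimal sketch of that check, assuming the stored record carries `expiresAt` and `usedAt`; the `ResetTokenRecord` shape and the `checkResetToken` name are hypothetical.

```typescript
// Hypothetical stored-token shape, using the two fields the slide says are missing.
interface ResetTokenRecord {
  expiresAt: Date;
  usedAt: Date | null;
}

type TokenStatus = "ok" | "expired" | "already-used";

// The clock is passed in so the check is easy to test.
function checkResetToken(record: ResetTokenRecord, now: Date): TokenStatus {
  if (record.usedAt !== null) return "already-used"; // blocks replay
  if (now.getTime() >= record.expiresAt.getTime()) return "expired"; // blocks old leaks
  return "ok";
}

const issuedAt = new Date("2024-01-05T16:47:00Z");
const demoRecord: ResetTokenRecord = {
  expiresAt: new Date(issuedAt.getTime() + 15 * 60 * 1000),
  usedAt: null,
};

console.log(checkResetToken(demoRecord, new Date("2024-01-05T16:50:00Z"))); // "ok"
console.log(checkResetToken(demoRecord, new Date("2024-01-05T17:30:00Z"))); // "expired"
```

In a real reset handler, setting `usedAt` would also need to happen atomically with this check, or two concurrent requests could both see "ok".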
Bug #4 — No Audit Trail
After every reset request you have no answer to:
Who triggered it? (userId)
From what IP? (req.ip)
At what time? (timestamp)
Did the user complete it? (resetCompletedAt)
The Pattern
Four bugs. One root cause.
| Bug | Gap the AI guessed at |
|---|---|
| Account enumeration | “what should I respond when user is missing?” |
| No rate limit | “how often can this endpoint be called?” |
| Token never expires | “how long is the token valid for?” |
| No audit trail | “what should I record about this action?” |
The Pivot
Same model. Same prompt-length budget. Same Friday afternoon.
What if we spent 90 seconds writing the gaps down before the AI fills them in?
- List every gap
  Surface every ambiguity before the agent generates anything.
- Close each one with a decision
  Each gap gets a concrete answer, not a description.
- Hand the decisions to the agent
  Not a description of the feature, but the decisions themselves.
Three steps. Same feature. Watch the bug count change.
Step 1 — Clarify
The gaps don’t announce themselves. You hunt them.
| Gap | Decision |
|---|---|
| Response when user not found? | Same shape as success — never leak |
| Rate limit per IP? | 5 / hour |
| Rate limit per email? | 3 / hour |
| Token TTL? | 15 minutes |
| Token reusable? | No — single-use, set usedAt on first use |
| Token storage? | Hashed, never plaintext |
| Audit trail? | Always log: userId?, ip, requestedAt |
| HTTP status on success? | 202 Accepted |
| HTTP status on unknown email? | 202 Accepted (identical) |
Nine gaps. Nine decisions. Zero of these decisions are in the AI’s head.
Step 2 — Spec
Translate the table into a file the agent can read.
# specs/forgot-password.yaml
endpoint: POST /auth/forgot-password
version: "1.0"
inputs:
  - name: email
    type: Email
    required: true
invariants:
  - response is identical for known and unknown emails # no enumeration
  - max 5 requests / hour / IP
  - max 3 requests / hour / email
  - token TTL <= 15 minutes
  - token is single-use (usedAt set on first use)
  - token stored as SHA-256 hash, never plaintext
outputs:
  status: 202 Accepted
  body: { ok: true }
side_effects:
  - email sent IF user exists
  - audit log entry: { userId?, ip, requestedAt }
Each line is one closed gap. The spec is the clarify table, machine-readable.
Step 3 — Tests From the Spec
One test per closed gap. Written first. All red.
// tests/forgot-password.test.ts — every test names a spec line
it("returns identical 202 body for known and unknown emails", ...)
it("rate-limits at 5 requests / hour / IP", ...)
it("rate-limits at 3 requests / hour / email", ...)
it("rejects tokens older than 15 minutes", ...)
it("rejects already-used tokens", ...)
it("stores token as SHA-256 hash, not plaintext", ...)
it("writes audit log entry on every request, including unknown emails", ...)
Step 4 — Implement Against the Spec
Hand the agent the spec, not a description.
async function forgotPassword(req, res) {
  const email = await EmailSchema.parseAsync(req.body.email)
  await rateLimitByIp(req.ip, { max: 5, window: "1h" }) // ← invariant
  await rateLimitByEmail(email, { max: 3, window: "1h" }) // ← invariant
  const user = await db.users.findByEmail(email)
  if (user) {
    const token = crypto.randomBytes(32).toString("hex")
    await db.tokens.create({
      userId: user.id,
      tokenHash: sha256(token), // ← invariant
      expiresAt: addMinutes(new Date(), 15), // ← invariant
      usedAt: null, // ← invariant
    })
    await sendEmail(user.email, `Reset: /reset?token=${token}`) // ← side_effect
  }
  // Logged for known and unknown emails alike; userId is undefined when no user matched.
  await audit.log({ userId: user?.id, ip: req.ip, requestedAt: new Date(), action: "forgot-password" }) // ← side_effect
  return res.status(202).json({ ok: true }) // ← outputs
}
Every line traces to a spec statement. Nothing is invented.
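The handler leans on a `sha256` helper it never shows. One plausible shape for it, sketched with Node's built-in `node:crypto` module; `issueResetToken` and `matchesStoredHash` are hypothetical names added here for illustration, and the reset-side comparison is an assumption about how the stored hash would be used later.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// SHA-256 of the raw token, hex-encoded. Only this value goes in the database.
function sha256(token: string): string {
  return createHash("sha256").update(token, "utf8").digest("hex");
}

// Hypothetical helper: raw token for the email, hash for storage.
function issueResetToken(): { token: string; tokenHash: string } {
  const token = randomBytes(32).toString("hex"); // 32 random bytes, 64 hex characters
  return { token, tokenHash: sha256(token) };
}

// Hypothetical helper for the /reset side: hash what the user presents,
// then compare against the stored hash in constant time.
function matchesStoredHash(presentedToken: string, storedHash: string): boolean {
  const presented = Buffer.from(sha256(presentedToken), "hex");
  const stored = Buffer.from(storedHash, "hex");
  // timingSafeEqual throws on length mismatch, so check length first.
  return presented.length === stored.length && timingSafeEqual(presented, stored);
}

const issued = issueResetToken();
console.log(issued.token.length); // 64
console.log(matchesStoredHash(issued.token, issued.tokenHash)); // true
```

Storing only the hash means a leaked tokens table does not yield usable reset links.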
Step 5 — Verify
✓ returns identical 202 body for known and unknown emails
✓ rate-limits at 5 requests / hour / IP
✓ rate-limits at 3 requests / hour / email
✓ rejects tokens older than 15 minutes
✓ rejects already-used tokens
✓ stores token as SHA-256 hash, not plaintext
✓ writes audit log entry on every request, including unknown emails
Two Months Later — Requirements Change
Security: “passwords reset via this flow must be at least 12 characters.”
# specs/forgot-password.yaml — one new invariant
invariants:
  - new password length >= 12 chars # ← added
// src/auth/reset.ts
if (newPassword.length < 12) throw new PasswordTooShort()
await db.users.update({ id: user.id, password: hash(newPassword) })
What You Just Watched
- 13 words produced 4 bugs
  A short vague prompt let the AI make every security decision silently.
- Named the gaps
  Before the AI got to fill them in, we listed every open question.
- Closed each gap in one file
  The spec: not Slack, not memory, not comments.
- Turned each gap into a failing test
  Tests trace to spec lines, not code lines.
- Let the agent implement against the spec
  No description, no guessing. Every line traces back.
- Changed the spec first
  When requirements shifted, code followed. Always in that order.
The Mechanics Behind It
Three patterns were running underneath what you just watched.
Vertical Slicing (Tracer Bullet)
Each spec describes one narrow feature from request to storage — not a layer of the system, but the full path through all of them. The agent gets complete context in a single read: every constraint, every side effect, every layer in one file.
Because the slice is self-contained, it can be tested the moment it’s built. There is no “wire everything up later” phase where bugs are discovered.
Test-Driven Development
Tests are written from spec lines before the agent writes code. Each test names an invariant. When a test fails, it points at a spec line (which decision broke), not just at a code line.
Ralph Loop
The agent doesn’t run once and stop. It loops until it earns the commit.
while ! all_tests_pass; do
  agent --spec forgot-password.yaml
done
git commit -m "7/7 AC green"
The spec and the tests are the exit condition. The loop runs until every decision you made is satisfied.
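The same loop, sketched in TypeScript with one addition the shell version does not have: an attempt cap, so a spec the agent cannot satisfy stops the loop instead of spinning forever. The cap is an assumption of this sketch, and `loopUntilGreen`, `runAgent`, and `runTests` are hypothetical names; a real agent call would be asynchronous.

```typescript
interface TestReport {
  passed: number;
  total: number;
}

// Checks the tests first, like the shell loop, and only then runs the agent.
function loopUntilGreen(
  runAgent: (attempt: number) => void, // hypothetical: one agent pass against the spec
  runTests: () => TestReport, // hypothetical: runs the spec-derived tests
  maxAttempts: number, // assumption: the cap is not in the original loop
): { green: boolean; attempts: number } {
  let attempts = 0;
  for (;;) {
    const report = runTests();
    if (report.total > 0 && report.passed === report.total) {
      return { green: true, attempts };
    }
    if (attempts >= maxAttempts) {
      return { green: false, attempts };
    }
    attempts += 1;
    runAgent(attempts);
  }
}

// Simulated agent: each pass turns one more of the 7 tests green, starting from 4.
let simulatedPassing = 4;
const outcome = loopUntilGreen(
  () => {
    simulatedPassing = Math.min(7, simulatedPassing + 1);
  },
  () => ({ passed: simulatedPassing, total: 7 }),
  10,
);
console.log(outcome); // { green: true, attempts: 3 }
```

The commit happens only when `green` is true; a false result is a signal to revisit the spec, not to raise the cap.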
The Four Principles
Gaps Before Code
Surface every ambiguity before the agent generates anything. If a question can’t be answered, the spec isn’t done.
Spec Is the Contract
The agent reads the spec, not your memory or your Slack thread. If the agent and the spec disagree, the spec wins.
Tests Trace to Spec Lines
Every test names the invariant it protects. A failing test names a spec line — not a code line.
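One way to keep that traceable is to check it mechanically. A minimal sketch, assuming each invariant gets a short tag and a test covers an invariant when its title contains that tag in brackets; the tagging scheme and the `uncoveredInvariants` name are hypothetical, not something the spec above defines.

```typescript
// Hypothetical: an invariant from the spec plus a short tag tests can cite.
interface Invariant {
  tag: string;
  text: string;
}

// Returns the invariants that no test title cites.
function uncoveredInvariants(invariants: Invariant[], testTitles: string[]): Invariant[] {
  return invariants.filter(
    (invariant) => !testTitles.some((title) => title.includes(`[${invariant.tag}]`)),
  );
}

const specInvariants: Invariant[] = [
  { tag: "INV-1", text: "response is identical for known and unknown emails" },
  { tag: "INV-4", text: "token TTL <= 15 minutes" },
];
const titles = ["[INV-1] returns identical 202 body for known and unknown emails"];

console.log(uncoveredInvariants(specInvariants, titles).map((i) => i.tag)); // [ 'INV-4' ]
```

Run as a pre-commit or CI step, a non-empty result means a decision in the spec has no test guarding it.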
Spec Is the Changelog
When requirements change, the spec changes first. Code, tests, and prompts follow downstream. Always in that order.
Tools Mechanise This
| Tool | Who writes the spec | Best for |
|---|---|---|
| Spec Kit | You — slash commands | Solo / max control |
| BMAD | AI agents, you approve | Teams with review gates |
| Matt Pocock Skills | You + AI guidance | Lightweight Claude Code flow |
| Superpowers | You + forced gates | Reviewer-first culture |
| OpenSpec | You — but as deltas | Brownfield / legacy code |
All five enforce: gaps before code · spec is the contract · tests trace to spec · spec is the changelog.
Pick a Tool
The discipline is the same. The pen changes hands.