Spec-Driven Development

Build it live. Bugs first. Principles last.

The Brief

It’s 4:47 PM on a Friday. PM pings you in Slack:

"we need POST /auth/forgot-password by EOD.
 just the basic flow — email, link, done."

You have an hour. You open Cursor. You reach for the AI.

This deck is what happens next — twice. Once the way most of us actually do it. Then once with a spec.

Attempt 1 — The Prompt

You type the obvious thing.

"add a forgot-password endpoint that emails the user a reset link"

11 words. Ship it.

Attempt 1 — The AI’s Output

Looks completely reasonable. Compiles. The happy path works.

async function forgotPassword(req, res) {
  const { email } = req.body
  const user = await db.users.findByEmail(email)
  if (!user) return res.status(404).json({ error: "User not found" })

  const token = crypto.randomUUID()
  await db.tokens.create({ userId: user.id, token })
  await sendEmail(user.email, `Reset link: /reset?token=${token}`)

  return res.json({ ok: true })
}

We’re going to audit this line by line. Bring a security person.

Bug #1 — Account Enumeration

Look at the two response branches.

if (!user) return res.status(404).json({ error: "User not found" })  // unknown email
return res.json({ ok: true })                                         // known email

The endpoint just told an attacker which emails belong to real customers. Loop it over a leaked email list — you have a customer roster.

Bug #2 — No Rate Limit

The endpoint will happily accept ten thousand requests per second.

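# api.example.com stands in for the real host
$ for email in $(cat customer-list.txt); do
>   curl -X POST https://api.example.com/auth/forgot-password \
>     -H "Content-Type: application/json" \
>     -d "{\"email\":\"$email\"}"
> done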

Mailbomb

Every customer gets a reset email they didn’t ask for.

SMTP Exhaustion

Your real password resets stop working when the provider throttles you.

Bug #3 — Token Never Expires

await db.tokens.create({ userId: user.id, token })

No expiresAt. No usedAt. The reset link works forever.

Leaked Screenshot

A token leaked in a screenshot 18 months ago still resets the password today.

Token Replay

A token used on Tuesday can be re-used on Wednesday by anyone who has it.

Bug #4 — No Audit Trail

After every reset request you have no answer to:

Who triggered it?            (userId)
From what IP?                (req.ip)
At what time?                (timestamp)
Did the user complete it?    (resetCompletedAt)
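
The four questions map onto a small record shape. The sketch below is one possible layout, not something the deck prescribes: the type and helper names are hypothetical, and `requestedAt` stands in for the timestamp.

```typescript
// Hypothetical audit record: one field per question listed above.
interface ResetAuditEntry {
  userId: string | null;         // who triggered it (null when the email is unknown)
  ip: string;                    // from what IP
  requestedAt: Date;             // at what time
  resetCompletedAt: Date | null; // did the user complete it
}

// Create the entry when the reset is requested.
function newAuditEntry(userId: string | null, ip: string, now: Date): ResetAuditEntry {
  return { userId, ip, requestedAt: now, resetCompletedAt: null };
}

// Return a copy marked as completed once the password has actually changed.
function markCompleted(entry: ResetAuditEntry, now: Date): ResetAuditEntry {
  return { ...entry, resetCompletedAt: now };
}
```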

The Pattern

Four bugs. One root cause.

Bug	Gap the AI guessed at
Account enumeration	“what should I respond when user is missing?”
No rate limit	“how often can this endpoint be called?”
Token never expires	“how long is the token valid for?”
No audit trail	“what should I record about this action?”

The Pivot

Same model. Same prompt-length budget. Same Friday afternoon.

What if we spent 90 seconds writing the gaps down before the AI fills them in?

List every gap

Surface every ambiguity before the agent generates anything.

Close each one with a decision

Each gap gets a concrete answer, not a description.

Hand the decisions to the agent

Not a description of the feature, but the decisions themselves.

Three steps. Same feature. Watch the bug count change.

Step 1 — Clarify

The gaps don’t announce themselves. You hunt them.

Gap	Decision
Response when user not found?	Same shape as success — never leak
Rate limit per IP?	5 / hour
Rate limit per email?	3 / hour
Token TTL?	15 minutes
Token reusable?	No — single-use, set `usedAt` on first use
Token storage?	Hashed, never plaintext
Audit trail?	Always log: `userId?`, `ip`, `requestedAt`
HTTP status on success?	202 Accepted
HTTP status on unknown email?	202 Accepted (identical)

Nine gaps. Nine decisions. Zero of these decisions are in the AI’s head.

Step 2 — Spec

Translate the table into a file the agent can read.

# specs/forgot-password.yaml
endpoint: POST /auth/forgot-password
version: "1.0"

inputs:
  - name: email
    type: Email
    required: true

invariants:
  - response is identical for known and unknown emails    # no enumeration
  - max 5 requests / hour / IP
  - max 3 requests / hour / email
  - token TTL <= 15 minutes
  - token is single-use (usedAt set on first use)
  - token stored as SHA-256 hash, never plaintext

outputs:
  status: 202 Accepted
  body:   { ok: true }

side_effects:
  - email sent IF user exists
  - audit log entry: { userId?, ip, requestedAt }

Each line is one closed gap. The spec is the clarify table, machine-readable.
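
One way to keep the numbers in the spec from drifting away from the code is to mirror them in a typed object that both the handler and the tests read. This is only a sketch: the object layout, `forgotPasswordSpec`, and `tokenExpiry` are hypothetical names, not part of the YAML above.

```typescript
// Hypothetical typed mirror of specs/forgot-password.yaml.
const forgotPasswordSpec = {
  endpoint: "POST /auth/forgot-password",
  version: "1.0",
  limits: {
    perIp: { max: 5, windowMinutes: 60 },
    perEmail: { max: 3, windowMinutes: 60 },
  },
  token: { ttlMinutes: 15, singleUse: true, storage: "sha256" },
  response: { status: 202, body: { ok: true } },
} as const;

type ForgotPasswordSpec = typeof forgotPasswordSpec;

// Example consumer: derive the expiry from the spec instead of a literal 15.
function tokenExpiry(spec: ForgotPasswordSpec, issuedAt: Date): Date {
  return new Date(issuedAt.getTime() + spec.token.ttlMinutes * 60 * 1000);
}
```

If the TTL decision changes, the number changes in one place and every consumer follows.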

Step 3 — Tests From the Spec

One test per closed gap. Written first. All red.

// tests/forgot-password.test.ts — every test names a spec line

it("returns identical 202 body for known and unknown emails", ...)
it("rate-limits at 5 requests / hour / IP", ...)
it("rate-limits at 3 requests / hour / email", ...)
it("rejects tokens older than 15 minutes", ...)
it("rejects already-used tokens", ...)
it("stores token as SHA-256 hash, not plaintext", ...)
it("writes audit log entry on every request, including unknown emails", ...)
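
To make the first test concrete, here is a sketch of what its body might assert. It runs against a stand-in handler (`fakeForgotPassword`, hypothetical) and uses plain thrown errors rather than a particular test runner. Pointed at the Attempt 1 handler, the same assertions would fail on the 404.

```typescript
// Shape of what the handler returns to the client.
interface HttpResult {
  status: number;
  body: unknown;
}

// Stand-in for the real handler (hypothetical). It "sends" mail only for
// known users but returns the same result either way.
function fakeForgotPassword(email: string, knownEmails: Set<string>, outbox: string[]): HttpResult {
  if (knownEmails.has(email)) {
    outbox.push(email);
  }
  return { status: 202, body: { ok: true } };
}

// spec: "response is identical for known and unknown emails"
function assertIdenticalResponse(): void {
  const known = new Set(["alice@example.com"]);
  const outbox: string[] = [];
  const hit = fakeForgotPassword("alice@example.com", known, outbox);
  const miss = fakeForgotPassword("nobody@example.com", known, outbox);

  if (hit.status !== 202 || miss.status !== 202) {
    throw new Error("expected 202 for both known and unknown emails");
  }
  if (JSON.stringify(hit.body) !== JSON.stringify(miss.body)) {
    throw new Error("response bodies differ: enumeration is possible");
  }
}

assertIdenticalResponse();
```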

Step 4 — Implement Against the Spec

Hand the agent the spec, not a description.

async function forgotPassword(req, res) {
  const email = await EmailSchema.parseAsync(req.body.email)

  await rateLimitByIp(req.ip, { max: 5, window: "1h" })        // ← invariant
  await rateLimitByEmail(email, { max: 3, window: "1h" })       // ← invariant

  const user = await db.users.findByEmail(email)
  if (user) {
    const token = crypto.randomBytes(32).toString("hex")
    await db.tokens.create({
      userId:    user.id,
      tokenHash: sha256(token),                                  // ← invariant
      expiresAt: addMinutes(new Date(), 15),                     // ← invariant
      usedAt:    null,                                           // ← invariant
    })
    await sendEmail(user.email, `Reset: /reset?token=${token}`) // ← side_effect
  }

  await audit.log({ userId: user?.id, ip: req.ip, requestedAt: new Date(), action: "forgot-password" }) // ← side_effect
  return res.status(202).json({ ok: true })                      // ← invariant
}

Every line traces to a spec statement. Nothing is invented.
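
The handler leans on `rateLimitByIp` and `rateLimitByEmail` without showing them. As a rough illustration of the two rate-limit invariants, here is a minimal in-memory fixed-window limiter. The algorithm, the class name, and the `allow` method are assumptions; the deck does not say how its helpers work, and a real deployment would likely use a shared store.

```typescript
// Fixed-window limiter (an assumption, not the deck's implementation).
// Counts are kept in memory, keyed by IP or email.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { startedAt: number; count: number }>();
  private readonly max: number;
  private readonly windowMs: number;

  constructor(max: number, windowMs: number) {
    this.max = max;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the key is over its limit.
  allow(key: string, nowMs: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (!current || nowMs - current.startedAt >= this.windowMs) {
      // First request for this key, or the previous window has ended.
      this.windows.set(key, { startedAt: nowMs, count: 1 });
      return true;
    }
    if (current.count >= this.max) return false;
    current.count += 1;
    return true;
  }
}

const HOUR_MS = 60 * 60 * 1000;
const perIp = new FixedWindowLimiter(5, HOUR_MS);    // max 5 requests / hour / IP
const perEmail = new FixedWindowLimiter(3, HOUR_MS); // max 3 requests / hour / email
```

A fixed window is the simplest choice but permits a burst across a window boundary. A sliding window is stricter at the cost of more bookkeeping.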

Step 5 — Verify

✓  returns identical 202 body for known and unknown emails
✓  rate-limits at 5 requests / hour / IP
✓  rate-limits at 3 requests / hour / email
✓  rejects tokens older than 15 minutes
✓  rejects already-used tokens
✓  stores token as SHA-256 hash, not plaintext
✓  writes audit log entry on every request, including unknown emails

✓ CWE-203: account enumeration, closed

✓ Mailbomb: rate limit, closed

✓ Token replay: single-use + TTL, closed

✓ Audit gap: logging, closed

Two Months Later — Requirements Change

Security: “passwords reset via this flow must be at least 12 characters.”

# specs/forgot-password.yaml — one new invariant
invariants:
  - new password length >= 12 chars      # ← added

// src/auth/reset.ts
if (newPassword.length < 12) throw new PasswordTooShort()
await db.users.update({ id: user.id, password: hash(newPassword) })
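
Before the reset handler reaches the length check, it also has to honour the token invariants from the spec: hashed storage, 15-minute TTL, single use. The sketch below shows one way those checks could fit together, using an in-memory `Map` in place of `db.tokens`. The names `StoredToken`, `issueToken`, and `consumeToken` are hypothetical.

```typescript
import { createHash, randomBytes } from "node:crypto";

interface StoredToken {
  userId: string;
  tokenHash: string;
  expiresAt: Date;
  usedAt: Date | null;
}

const TOKEN_TTL_MS = 15 * 60 * 1000; // spec: token TTL <= 15 minutes

function sha256(value: string): string {
  return createHash("sha256").update(value).digest("hex");
}

// Issue a token: the plaintext goes into the email, only the hash is stored.
function issueToken(store: Map<string, StoredToken>, userId: string, now: Date): string {
  const token = randomBytes(32).toString("hex");
  const tokenHash = sha256(token);
  const expiresAt = new Date(now.getTime() + TOKEN_TTL_MS);
  store.set(tokenHash, { userId, tokenHash, expiresAt, usedAt: null });
  return token;
}

// Consume a token: returns the userId once, then null on every later attempt.
function consumeToken(store: Map<string, StoredToken>, token: string, now: Date): string | null {
  const record = store.get(sha256(token));
  if (!record) return null;                                     // unknown token
  if (record.usedAt !== null) return null;                      // already used
  if (now.getTime() > record.expiresAt.getTime()) return null;  // expired
  record.usedAt = now;                                          // single-use
  return record.userId;
}
```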

What You Just Watched

11 words produced 4 bugs

A short, vague prompt let the AI make every security decision silently.

Named the gaps

Before the AI got to fill them in, we listed every open question.

Closed each gap in one file

The spec: not Slack, not memory, not comments.

Turned each gap into a failing test

Tests trace to spec lines, not code lines.

Let the agent implement against the spec

No description, no guessing. Every line traces back.

Changed the spec first

When requirements shifted, code followed. Always in that order.

The Mechanics Behind It

Three patterns were running underneath what you just watched.

Vertical Slicing (Tracer Bullet)

Each spec describes one narrow feature from request to storage — not a layer of the system, but the full path through all of them. The agent gets complete context in a single read: every constraint, every side effect, every layer in one file.

Because the slice is self-contained, it can be tested the moment it’s built. There is no “wire everything up later” phase where bugs are discovered.

Test-Driven Development

Tests are written from spec lines before the agent writes code. Each test names an invariant. When a test fails it points at a spec line — which decision broke — not just which code line.

Ralph Loop

The agent doesn’t run once and stop. It loops until it earns the commit.

# all_tests_pass and agent are placeholders for your test command and agent CLI
while ! all_tests_pass; do
  agent --spec specs/forgot-password.yaml
done
git commit -m "7/7 AC green"

The spec and the tests are the exit condition. The loop runs until every decision you made is satisfied.
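
The same loop can be sketched in TypeScript with the agent and the test run passed in as functions, so it does not depend on any particular CLI. Both callbacks and the `maxIterations` cap are assumptions added here. The shell loop above is unbounded, and a cap is one way to stop an agent that never converges.

```typescript
interface LoopResult {
  passed: boolean;
  iterations: number;
}

// runAgent and runTests are injected (hypothetical) stand-ins for the agent
// CLI and the test command.
function ralphLoop(
  runAgent: (iteration: number) => void,
  runTests: () => boolean,
  maxIterations: number,
): LoopResult {
  let iterations = 0;
  // Same shape as the shell loop: check the tests, run the agent, repeat.
  while (!runTests()) {
    if (iterations >= maxIterations) {
      return { passed: false, iterations };
    }
    iterations += 1;
    runAgent(iterations);
  }
  return { passed: true, iterations };
}
```

Only a result with `passed: true` should lead to the commit.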

The Four Principles

Gaps Before Code

Surface every ambiguity before the agent generates anything. If a question can’t be answered, the spec isn’t done.

Spec Is the Contract

The agent reads the spec, not your memory or your Slack thread. If the agent and the spec disagree, the spec wins.

Tests Trace to Spec Lines

Every test names the invariant it protects. A failing test names a spec line — not a code line.
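
Traceability like this can be checked mechanically. The sketch below assumes each test declares which spec line it covers (the `covers` field and the function name are hypothetical) and reports any invariant that no test protects.

```typescript
interface SpecTest {
  title: string;
  covers: string; // the spec line this test protects
}

// Returns the spec lines that no test claims to cover.
function uncoveredInvariants(invariants: string[], tests: SpecTest[]): string[] {
  const covered = new Set(tests.map((t) => t.covers));
  return invariants.filter((inv) => !covered.has(inv));
}

const IDENTICAL = "response is identical for known and unknown emails";
const PER_IP = "max 5 requests / hour / IP";
const SINGLE_USE = "token is single-use (usedAt set on first use)";

const specLines = [IDENTICAL, PER_IP, SINGLE_USE];
const specTests: SpecTest[] = [
  { title: "returns identical 202 body for known and unknown emails", covers: IDENTICAL },
  { title: "rate-limits at 5 requests / hour / IP", covers: PER_IP },
];

// SINGLE_USE has no test yet, so it is the only line reported.
const missing = uncoveredInvariants(specLines, specTests);
```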

Spec Is the Changelog

When requirements change, the spec changes first. Code, tests, and prompts follow downstream. Always in that order.

Tools Mechanise This

Tool	Who writes the spec	Best for
Spec Kit	You — slash commands	Solo / max control
BMAD	AI agents, you approve	Teams with review gates
Matt Pocock Skills	You + AI guidance	Lightweight Claude Code flow
Superpowers	You + forced gates	Reviewer-first culture
OpenSpec	You — but as deltas	Brownfield / legacy code

All five enforce: gaps before code · spec is the contract · tests trace to spec · spec is the changelog.

Pick a Tool

The discipline is the same. The pen changes hands.

Spec Kit

Solo, max control. github.com/github/spec-kit

BMAD

Teams with review gates. github.com/bmad-code-org/BMAD-METHOD

Matt Pocock Skills

Lightweight Claude Code slash commands. github.com/mattpocock/skills

Superpowers

Hard gates the AI cannot skip. github.com/obra/superpowers

OpenSpec

Legacy codebases. Delta-spec, not full rewrites. github.com/Fission-AI/OpenSpec