Prompt Engineering Is Still Engineering

May 28, 2025 · 7 min read


The Prompt Is Code

I've watched three startups break production because someone edited a prompt directly in the database at 11pm to "just fix a small thing." Prompts are strings. Strings feel casual. That's the trap.

Prompts are code. They need version control, tests, review, and rollback strategies.

Version Control Your Prompts

Store prompts in your repo, not in a database or an environment variable. Every change gets a diff, a commit message, and a reviewer.

prompts/
  summarize-article/
    v1.txt
    v2.txt
    changelog.md
  extract-entities/
    v1.txt

The changelog matters more than you think. "Why did we add this sentence to the prompt?" is a question you'll ask six months from now.
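One way to consume that layout from application code is a small loader that reads a pinned version from disk, so a prompt upgrade shows up as a one-line code diff. This is a minimal sketch, assuming Node.js and the directory layout above; `loadPrompt`, `latestVersion`, and the default `prompts` root are hypothetical names, not part of any library.

```javascript
const fs = require("node:fs")
const path = require("node:path")

// Hypothetical helper: read <root>/<name>/v<version>.txt from the repo.
// Pinning the version at the call site keeps prompt upgrades explicit.
function loadPrompt(name, version, root = "prompts") {
  const file = path.join(root, name, `v${version}.txt`)
  return fs.readFileSync(file, "utf8")
}

// Hypothetical helper: find the highest vN.txt in a prompt directory,
// ignoring other files such as changelog.md.
function latestVersion(name, root = "prompts") {
  const versions = fs
    .readdirSync(path.join(root, name))
    .map((f) => /^v(\d+)\.txt$/.exec(f))
    .filter(Boolean)
    .map((m) => Number(m[1]))
  if (versions.length === 0) throw new Error(`no versions found for ${name}`)
  return Math.max(...versions)
}
```

Calling `loadPrompt("summarize-article", 2)` in production code and using `latestVersion` only in tooling is one possible split; it keeps a newly added file from silently changing behavior.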

Test Your Prompts

A prompt change that improves one case can regress five others. Build a test suite of input/expected-output pairs and run it before merging.

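import assert from "node:assert/strict"

// extractEntities is the function under test: it sends the input through
// the prompt and resolves to an array of entity strings.
const cases = [
  { input: "Apple released iPhone 17", expected: ["Apple", "iPhone 17"] },
  { input: "Elon Musk tweeted about Tesla", expected: ["Elon Musk", "Tesla"] },
]

for (const { input, expected } of cases) {
  const result = await extractEntities(input)
  // Compare sorted copies so entity order is ignored and nothing is mutated.
  assert.deepEqual([...result].sort(), [...expected].sort())
}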

LLMs are non-deterministic, so set temperature to 0 for tests. That reduces variance but does not guarantee identical outputs, so also run each case 3 times to catch flaky behavior.
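The repeat-and-compare idea can be sketched as a small helper. This is an illustration under assumptions: `runRepeated` is a hypothetical name, and `fn` stands in for any async function, such as `extractEntities`, that already calls the model with temperature 0 and resolves to an array of strings.

```javascript
// Hypothetical helper: call an async function `times` times with the same
// input and report whether every run produced the same result.
async function runRepeated(fn, input, times = 3) {
  const outputs = []
  for (let i = 0; i < times; i++) {
    outputs.push(await fn(input))
  }
  // Serialize sorted copies so array order does not count as a difference.
  const keys = outputs.map((out) => JSON.stringify([...out].sort()))
  const stable = keys.every((k) => k === keys[0])
  return { stable, outputs }
}
```

A test can then fail on `stable === false` before comparing `outputs[0]` against the expected value, which separates "the prompt is wrong" from "the prompt is flaky".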

Feature Flags for Prompts

Don't ship a new prompt to 100% of users immediately. Roll it out behind a feature flag:

  • Ship to 5% of traffic
  • Compare output quality metrics
  • Ramp if metrics hold
  • Roll back immediately if they don't

You need this infrastructure before you need it. The middle of an incident is the wrong time to build rollback tooling.
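The staged rollout described above can be approximated with deterministic bucketing: hash a stable user ID into a bucket from 0 to 99 and serve the new prompt only to buckets below the rollout percentage. This is a minimal sketch, assuming Node.js; `bucketFor`, `pickPromptVersion`, and the shape of the `rollout` object are hypothetical.

```javascript
const { createHash } = require("node:crypto")

// Map a user ID to a stable bucket in [0, 100) so the same user keeps
// seeing the same prompt version while the percentage is unchanged.
function bucketFor(userId, flagName) {
  const digest = createHash("sha256").update(`${flagName}:${userId}`).digest()
  return digest.readUInt32BE(0) % 100
}

// Hypothetical config shape: { flag, control, candidate, percent }.
// Setting percent to 0 acts as the rollback switch.
function pickPromptVersion(userId, rollout) {
  return bucketFor(userId, rollout.flag) < rollout.percent
    ? rollout.candidate
    : rollout.control
}
```

Including the flag name in the hash is a design choice: it keeps the same users from landing in the first 5% of every experiment.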

The Metrics That Matter

What does "better" mean for your prompt? Define it before you start iterating:

  • Accuracy: Does the output match a golden set?
  • Format compliance: Is the JSON valid? Does it have all required fields?
  • Latency: Did the new prompt add 500ms to every request?
  • Cost: Longer prompts = more tokens = higher bills

I track all four for every prompt change. Usually there are tradeoffs — a more accurate prompt might cost 30% more. Making that tradeoff consciously is engineering. Making it accidentally is how you get surprised by your API bill.
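To make the four metrics concrete, here is a sketch of a per-run summary. Everything in it is an assumption for illustration: the `summarizeRun` name, the record shape (`output`, `expected`, `latencyMs`, `promptTokens`, `completionTokens`), the required-field list, and the per-token prices supplied by the caller. It also assumes a non-empty list of records.

```javascript
// Each record: { output, expected, latencyMs, promptTokens, completionTokens }.
// `output` is the raw model response, which should parse as a JSON object.
function summarizeRun(records, { requiredFields, pricePerInputToken, pricePerOutputToken }) {
  let validFormat = 0
  let correct = 0
  let totalLatency = 0
  let totalCost = 0

  for (const r of records) {
    let parsed = null
    try {
      parsed = JSON.parse(r.output)
    } catch {
      // Invalid JSON counts as a format failure and cannot be correct.
    }
    const ok = parsed !== null && typeof parsed === "object" &&
      requiredFields.every((f) => f in parsed)
    if (ok) validFormat++
    // Exact match against the golden output; this simple comparison is
    // sensitive to key order.
    if (ok && JSON.stringify(parsed) === JSON.stringify(r.expected)) correct++
    totalLatency += r.latencyMs
    totalCost += r.promptTokens * pricePerInputToken +
      r.completionTokens * pricePerOutputToken
  }

  const n = records.length
  return {
    accuracy: correct / n,
    formatCompliance: validFormat / n,
    meanLatencyMs: totalLatency / n,
    totalCost,
  }
}
```

Running it once with the current prompt and once with the candidate gives two summaries to compare side by side, which makes the accuracy-versus-cost tradeoff explicit.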

The Mindset Shift

Prompts feel like writing. They're actually configuration — configuration that controls a complex stochastic system. Treat them with the same rigor you'd apply to infrastructure config: review, test, version, monitor.

The teams that do this ship better AI products and sleep better.

