Your LLM tests are flaky, slow, and expensive. Freeze them.
Every team building on LLMs eventually has the same conversation. Someone writes an integration test that calls the model for real. It works. Then three weeks later it’s red. Not because the code broke, but because the model phrased something differently. Someone slaps a retry on it. Now the suite takes eleven minutes and the CI bill has a line item nobody wants to explain. Eventually the test gets a @skip and the team is back to shipping LLM code with no tests at all.
I built promptecho because I kept having that conversation. It’s a small Python library. Record & replay for LLM API calls, in the lineage of vcrpy and nock, but rebuilt around the specific ways LLM traffic breaks the classic HTTP-VCR assumptions.
This post covers the problem it solves, what it deliberately doesn’t do, a getting-started tutorial, three real-world scenarios, how it fits into CI/CD, and where it’s going next.
The problem, precisely
Tests that hit a live LLM API have three independent failure modes, and it’s worth separating them because people tend to blur them together:
- Flakiness. LLM outputs are samples from a distribution. Even at temperature=0, providers don’t guarantee bit-identical outputs across runs, hardware, or silent model updates. An assertion like "refund" in response that passes 97% of the time sounds fine, but across a 40-test suite it means roughly seven runs in ten go red (1 − 0.97^40 ≈ 0.70) for reasons that have nothing to do with your code.
- Latency. A real chat completion is 1–30 seconds. Reasoning models are worse. A suite that should take 8 seconds takes 8 minutes, and slow suites stop being run.
- Cost. Tokens × tests × CI runs compounds quietly. Forty tests averaging 2K tokens each, on a team pushing 30 CI runs a day, is ~2.4M tokens per day spent re-asking questions you already know the answers to.
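The flakiness and cost figures are easy to sanity-check. This is a back-of-the-envelope sketch that treats the 40 tests as independent, which is my simplifying assumption and not something measured:

```python
# Back-of-the-envelope check of the flakiness and cost figures.
tests = 40
pass_rate = 0.97            # per-test pass probability quoted above
tokens_per_test = 2_000
ci_runs_per_day = 30

# Probability that at least one of the 40 tests fails in a given run,
# assuming the tests fail independently of each other.
p_suite_fails = 1 - pass_rate ** tests

# Tokens spent per day re-asking the same questions.
tokens_per_day = tests * tokens_per_test * ci_runs_per_day

print(f"suite fails in about {p_suite_fails:.0%} of runs")
print(f"{tokens_per_day:,} tokens per day")
```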
Here’s the insight that makes all three solvable at once: almost none of your test code is actually testing the model. It’s testing the code around the model. Prompt construction, response parsing, tool-call dispatch, streaming rendering, retry logic, fallbacks. That layer is fully deterministic. It only looks nondeterministic because there’s a probabilistic system wired into the middle of it.
So: record the model’s real response once, then replay it forever. The deterministic 95% of your code gets fast, free, deterministic tests. The model itself gets judged by a different tool, on a different cadence (more on that below).
What promptecho is
One decorator:
import promptecho
from anthropic import Anthropic

@promptecho.use_cassette("cassettes/summarize.yaml")
def test_summarize():
    client = Anthropic()
    msg = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=100,
        messages=[{"role": "user", "content": "Summarize: the cat sat on the mat."}],
    )
    assert "cat" in msg.content[0].text.lower()
First run: one real API call, recorded to a human-readable YAML “cassette”. Every run after: replayed from disk. No network, no tokens, no API key, no flake.
The end-to-end test that gates every release records against a local server, shuts the server down, then replays. If the response can come back with the upstream gone, the cassette is genuinely doing the work, not a partial proxy.
Why not just point vcrpy at it?
You can, and for a single provider with hand-tuned matchers, vcrpy gets you maybe 70% of the way. promptecho exists because LLM traffic breaks a generic HTTP VCR in five specific places:
1. Matching. A raw-bytes VCR matches requests byte-for-byte. LLM request bodies carry volatile noise (client-injected request IDs, re-serialized key order, whitespace that changes the bytes without changing the meaning), so replays miss. promptecho instead computes a normalized fingerprint over only the fields that determine the response:
fingerprint(request) = sha256( canonical_json(
{method, url_path, fields: pick(body, match_on)}
) )
The default match_on is ["model", "messages", "system", "tools", "tool_choice", "reasoning_effort", "reasoning", "thinking"]. Those last three matter more than they look: reasoning knobs change the response without changing the prompt. If they weren’t matched, a test using reasoning_effort="high" would silently replay the recording made for "low", a wrong-fixture bug that’s nearly impossible to catch by eye.
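To make the idea concrete, here is a minimal sketch of that kind of fingerprint. It is not promptecho's actual implementation: the function name, the exact canonical JSON settings, and the `request_id` noise field are all illustrative assumptions.

```python
import hashlib
import json

# Copied from the default match_on list described above.
MATCH_ON = ["model", "messages", "system", "tools", "tool_choice",
            "reasoning_effort", "reasoning", "thinking"]

def fingerprint(method, url_path, body, match_on=MATCH_ON):
    """Hash only the fields that determine the response (illustrative sketch)."""
    picked = {key: body[key] for key in match_on if key in body}
    canonical = json.dumps(
        {"method": method.upper(), "url_path": url_path, "fields": picked},
        sort_keys=True,            # key order no longer affects the bytes
        separators=(",", ":"),     # whitespace no longer affects the bytes
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

base = {"model": "m", "messages": [{"role": "user", "content": "hi"}]}
# Same logical request: different key order plus a volatile, unmatched field.
noisy = {"messages": [{"content": "hi", "role": "user"}], "model": "m",
         "request_id": "abc-123"}
# Same prompt, different reasoning knob.
high = dict(base, reasoning_effort="high")

same = fingerprint("POST", "/v1/messages", base) == fingerprint("POST", "/v1/messages", noisy)
differs = fingerprint("POST", "/v1/messages", base) != fingerprint("POST", "/v1/messages", high)
print(same, differs)
```

The noisy request lands on the same recording, while the request with a different reasoning setting does not.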
2. Cross-provider canonicalization. The same logical prompt has different wire shapes per provider. promptecho normalizes before fingerprinting, so these two requests map to the same recording:
# Anthropic shape
{"model": "m", "system": "be terse",
"messages": [{"role": "user", "content": "hi"}]}
# OpenAI shape — system prompt as a role message
{"model": "m",
"messages": [{"role": "system", "content": "be terse"},
{"role": "user", "content": "hi"}]}
It also knows content: "hi" ≡ content: [{"type": "text", "text": "hi"}], OpenAI’s developer role ≡ system, and an Anthropic input_schema tool definition ≡ an OpenAI function.parameters. A raw-bytes VCR fundamentally cannot do this.
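A rough sketch of what such a normalization step could look like, covering only the system-prompt and string-versus-block-list cases. The `canonicalize` function and its output shape are hypothetical, and tool definitions and block-list system prompts are left out to keep it short:

```python
def canonicalize(body):
    """Fold provider-specific request shapes into one form (hypothetical sketch)."""
    system_parts = []
    if body.get("system"):
        system_parts.append(body["system"])

    messages = []
    for msg in body.get("messages", []):
        content = msg["content"]
        # A bare string and a one-element text-block list mean the same thing.
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
        # OpenAI-style system/developer role messages become the system prompt.
        if msg["role"] in ("system", "developer"):
            system_parts.extend(
                block["text"] for block in content if block.get("type") == "text"
            )
        else:
            messages.append({"role": msg["role"], "content": content})

    out = {"model": body["model"], "messages": messages}
    if system_parts:
        out["system"] = "\n".join(system_parts)
    return out

anthropic_shape = {"model": "m", "system": "be terse",
                   "messages": [{"role": "user", "content": "hi"}]}
openai_shape = {"model": "m",
                "messages": [{"role": "system", "content": "be terse"},
                             {"role": "user", "content": "hi"}]}

print(canonicalize(anthropic_shape) == canonicalize(openai_shape))
```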
3. Streaming. Most production LLM calls are SSE streams. promptecho records the ordered event stream and re-emits it on replay, so stream=True and token-by-token iteration work identically against a cassette, including reasoning deltas and tool-call deltas.
4. Debuggable failures. When a generic VCR misses, you get “no match.” When promptecho misses in CI, you get the exact path that changed, diffed against the most similar recording in the cassette:
CassetteMiss: Cassette miss in 'cassettes/summarize.yaml' (mode=none).
The incoming request differs from the nearest recording on these fields:
messages[0].content:
recorded: summarize: the cat sat on the mat
incoming: summarize: the dog sat on the mat
If the change is intentional, re-record with mode='once' (or delete the
cassette and re-run). If not, fix the call so it matches the recorded
fingerprint.
That difference, “no match” versus “this field changed, here are both values”, is the difference between a five-second fix and twenty minutes of printf archaeology in CI logs.
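The field-level report boils down to walking two nested structures in parallel and collecting the paths where they disagree. A minimal sketch of that walk, with a hypothetical `diff_paths` helper (it treats a missing key the same as a `None` value, which a real tool would probably distinguish):

```python
def diff_paths(recorded, incoming, path=""):
    """Yield (path, recorded_value, incoming_value) for every differing leaf."""
    if isinstance(recorded, dict) and isinstance(incoming, dict):
        for key in sorted(set(recorded) | set(incoming)):
            child = f"{path}.{key}" if path else key
            yield from diff_paths(recorded.get(key), incoming.get(key), child)
    elif isinstance(recorded, list) and isinstance(incoming, list):
        for i in range(max(len(recorded), len(incoming))):
            a = recorded[i] if i < len(recorded) else None
            b = incoming[i] if i < len(incoming) else None
            yield from diff_paths(a, b, f"{path}[{i}]")
    elif recorded != incoming:
        yield path, recorded, incoming

recorded = {"model": "m", "messages": [
    {"role": "user", "content": "summarize: the cat sat on the mat"}]}
incoming = {"model": "m", "messages": [
    {"role": "user", "content": "summarize: the dog sat on the mat"}]}

changes = list(diff_paths(recorded, incoming))
for field, old, new in changes:
    print(f"{field}:\n  recorded: {old}\n  incoming: {new}")
```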
5. Secrets. Cassettes are meant to be committed, so they’re safe by default: auth headers (authorization, x-api-key, …), set-cookie, and every URL query-string value (Google-style ?key=… auth) are redacted before anything touches disk.
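For illustration, redaction of this kind can be done with the standard library alone. The header set, the placeholder string, and the function names below are assumptions of this sketch, not promptecho's actual list or output format:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SENSITIVE_HEADERS = {"authorization", "x-api-key", "set-cookie"}  # illustrative subset
PLACEHOLDER = "REDACTED"  # hypothetical placeholder value

def redact_headers(headers):
    """Replace the values of sensitive headers, matching names case-insensitively."""
    return {name: (PLACEHOLDER if name.lower() in SENSITIVE_HEADERS else value)
            for name, value in headers.items()}

def redact_url(url):
    """Keep query-string keys but replace every value."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode([(key, PLACEHOLDER) for key, _ in pairs])
    return urlunsplit(parts._replace(query=query))

print(redact_headers({"Authorization": "Bearer sk-123", "Content-Type": "application/json"}))
print(redact_url("https://example.test/v1/models?key=secret&alt=json"))
```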
What promptecho is not
Two boundaries, both deliberate:
- Not a cache. Replay matching is exact and deterministic, on purpose. It does not semantically match “different prompt, close enough”: that would reintroduce nondeterminism into the exact tool you adopted to remove it, and could silently serve the wrong recording. Semantic matching is a caching concern; this is a testing tool.
- Not an eval. Replaying a frozen response tests the deterministic code around the model. Judging whether the model’s output is good is an eval, a genuinely different job with a different cadence and budget (deepeval, promptfoo, braintrust live there). You need both. promptecho is deliberately only the first, and pretending one tool can be both is how you end up with neither.
This separation is, I’d argue, the most important architectural decision in testing LLM applications: split your test suite along the deterministic/probabilistic boundary. Deterministic code → recorded fixtures, run on every commit, free. Probabilistic behavior → evals, run on prompt/model changes, budgeted. Most teams I’ve seen struggle with LLM testing are struggling because they’re running one undifferentiated suite that’s trying to be both.
Getting started
pip install promptecho
Requires Python ≥ 3.9 and any SDK built on httpx, which covers the Anthropic and OpenAI SDKs, Mistral, Cohere v5+, google-genai, and the OpenAI SDK pointed at OpenRouter / Together / Groq / Fireworks / your own vLLM or Ollama via base_url=. If the SDK uses httpx, promptecho sees the call.
Three ways to use it:
# 1. Decorator (sync or async, both work)
@promptecho.use_cassette("cassettes/foo.yaml")
def test_foo(): ...

# 2. Context manager
with promptecho.use_cassette("cassettes/foo.yaml"):
    client.messages.create(...)

# 3. pytest fixture: auto-named cassette per test
def test_bar(promptecho_cassette):  # → cassettes/test_bar.yaml
    client.messages.create(...)
Record modes, borrowed from vcrpy so the mental model is free:
| mode | absent cassette | present cassette | use for |
|---|---|---|---|
| once (default) | record | replay | normal dev |
| none | error | replay | CI: guarantees no live calls |
| new_episodes | record | replay + record new | evolving tests |
| all | record | re-record everything | refreshing fixtures |
And the cassette itself is reviewable YAML, designed to diff cleanly in PRs:
version: 2
match_on: [model, messages, system, tools, tool_choice, reasoning_effort, reasoning, thinking]
interactions:
  - request:
      method: POST
      url: https://api.anthropic.com/v1/messages
      match_key: 7d206bed48a0bc0c
      body:
        model: claude-opus-4-8
        messages:
          - {role: user, content: "Summarize: the cat sat on the mat."}
    response:
      status: 200
      body:
        content: [{type: text, text: "A cat sat on a mat."}]
        usage: {input_tokens: 14, output_tokens: 8}
That last point is underrated: when a teammate changes a prompt, the PR shows the recorded model behavior changing, right there in the diff. Code review starts covering your LLM interactions.
Three real-world scenarios
1. The agent tool-loop you can’t test
You’ve built an agent: the model picks a tool, your code executes it, results go back, the model picks the next step. Testing this live is the worst of all three problems at once: a 5-step loop is 5 sequential API calls (slow), any step can phrase its tool call differently (flaky), and the loop runs on every CI push (expensive). Most teams end up mocking the model with hand-written responses — and hand-written mocks of a 40-field streaming tool-call payload are wrong in ways you won’t discover until production.
With promptecho, you run the agent against the real model once, and the entire multi-step trajectory freezes:
@promptecho.use_cassette("cassettes/agent_refund_flow.yaml")
def test_agent_resolves_refund():
    result = run_agent("Customer wants a refund for order #4521")
    # Every model round-trip in the loop is recorded on the first run and replayed after that.
    assert result.tool_calls == ["lookup_order", "check_refund_policy", "issue_refund"]
    assert result.final_state == "resolved"
Each loop iteration appends messages, so each step has a distinct fingerprint. The cassette holds the whole trajectory in order. What you’re now regression-testing is your dispatch logic, state machine, and tool plumbing against responses the real model actually produced, not against your guess at them. When you change the orchestration code, the suite runs in milliseconds. When you change the prompt, the fingerprint misses with a field-level diff telling you exactly what changed, which is your cue that this trajectory needs re-recording and re-review.
2. The streaming UI nobody tests
Progressive rendering, “stop generating” buttons, token-budget cutoffs, partial tool-call assembly. Streaming code paths are where LLM apps actually break, and they’re nearly untestable with conventional mocks because what matters is the sequence of events, not the final assembled string. A mock that returns the complete text in one chunk tests nothing about your streaming code.
promptecho records the SSE event stream as ordered events and replays them with the same boundaries:
@promptecho.use_cassette("cassettes/stream_render.yaml")
def test_renderer_handles_mid_word_chunks():
    chunks = []
    with client.messages.stream(
        model="claude-opus-4-8", max_tokens=200,
        messages=[{"role": "user", "content": "Explain SSE in one paragraph."}],
    ) as stream:
        for text in stream.text_stream:
            chunks.append(render_incremental(text))  # your real streaming code
    assert len(chunks) > 1  # actually exercised the streaming path
    assert is_valid_markdown("".join(chunks))  # no broken mid-token rendering
The recorded cassette preserves the real chunking the provider produced, including the awkward mid-word splits and the message_delta ordering that hand-written mocks always get too clean. Your streaming bugs reproduce deterministically, in CI, forever.
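Conceptually, recording a stream means parsing the SSE body into an ordered list of events, and replaying means writing those events back out with the same boundaries. A simplified sketch of that round trip follows; the helper names and event payloads are made up for illustration, and it ignores SSE features such as `id:` and `retry:` fields:

```python
def parse_sse(raw):
    """Split a raw SSE body into an ordered list of {"event", "data"} dicts."""
    events = []
    for block in raw.strip().split("\n\n"):
        name, data_lines = "message", []
        for line in block.split("\n"):
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                value = line[len("data:"):]
                # The SSE format drops a single leading space after the colon.
                data_lines.append(value[1:] if value.startswith(" ") else value)
        events.append({"event": name, "data": "\n".join(data_lines)})
    return events

def emit_sse(events):
    """Re-emit recorded events one at a time, preserving order and boundaries."""
    for ev in events:
        data = "".join(f"data: {line}\n" for line in ev["data"].split("\n"))
        yield f"event: {ev['event']}\n{data}\n"

# Simplified payloads: note the mid-word split between the two deltas.
raw = (
    'event: content_block_delta\n'
    'data: {"text": "Hel"}\n'
    '\n'
    'event: content_block_delta\n'
    'data: {"text": "lo"}\n'
    '\n'
    'event: message_stop\n'
    'data: {}\n'
    '\n'
)

recorded = parse_sse(raw)
replayed = "".join(emit_sse(recorded))
print(len(recorded), replayed == raw)
```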
3. The gateway migration you’re afraid of
Your team is moving inference from OpenAI’s API to OpenRouter, or to a self-hosted vLLM, or to the in-house gateway the platform team just stood up. The risky part isn’t the model. It’s every request/response assumption buried in your code, and the fact that your existing fixtures were recorded against the old host. promptecho’s fingerprint deliberately excludes the host (it matches on method + URL path + the normalized body), so a recording made against one OpenAI-compatible endpoint replays against another:
# Cassette recorded months ago against api.openai.com...
@promptecho.use_cassette("cassettes/triage.yaml", mode="none")
def test_triage_through_new_gateway():
    # ...replays against the new gateway. Same path (/v1/chat/completions),
    # same logical request → same fingerprint. Zero re-recording.
    client = OpenAI(base_url="https://gateway.internal/v1", api_key="unused-in-replay")
    r = client.chat.completions.create(
        model="meta-llama/llama-3.1-70b-instruct",
        messages=[
            {"role": "system", "content": "You triage support tickets."},
            {"role": "user", "content": TICKET},
        ],
    )
    assert triage_label(r) == "billing"
(One honest caveat: if the model id changes as part of the migration, that’s a fingerprint miss by design. model is in the match set because it determines the response. You’ll re-record those, and the field-level diff tells you exactly which ones.)
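The host-independence part is simple to picture. In this sketch, a hypothetical `match_target` helper keeps only the method and URL path, which is the behavior described above and not the library's actual code:

```python
from urllib.parse import urlsplit

def match_target(method, url):
    """Keep only the parts of the request line that are matched: method and path."""
    return method.upper(), urlsplit(url).path

old = match_target("POST", "https://api.openai.com/v1/chat/completions")
new = match_target("POST", "https://gateway.internal/v1/chat/completions")
other = match_target("POST", "https://gateway.internal/v1/embeddings")

print(old == new)    # host differs, path is the same
print(new == other)  # same host, different path
```

One consequence worth knowing: a gateway that mounts the API under a different path prefix would change the path and therefore miss.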
The body-shape canonicalization earns its keep here too: when an SDK upgrade silently switches system to the newer developer role, or starts sending content as a block list instead of a bare string, your fixtures keep matching. promptecho normalizes both spellings to the same canonical shape before fingerprinting. Shape churn you didn’t ask for stops invalidating fixtures; changes you did make surface as precise cassette misses instead of production incidents.
Fitting it into your team’s workflow and CI
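The workflow promptecho is designed around: record locally against the real API, commit the cassette next to the test, replay on every CI run, and re-record deliberately when a prompt changes.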
Three properties make this safe by default:
CI can’t make live calls. The pytest fixture defaults to mode="once" locally and mode="none" whenever the CI env var is set, so a forgotten recording fails the build instead of silently burning tokens against production quotas. No API keys in CI at all: replay needs none.
# .github/workflows/test.yml (note what's absent: any LLM API key)
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12"}
      - run: pip install -e ".[dev]"
      - run: pytest  # CI=true → mode=none → replay-only, zero tokens
Cassettes are reviewable artifacts. They live in the repo next to the tests. A prompt change shows up in the PR as both the code diff and the recorded-behavior diff. Stale fixtures after a deliberate prompt rework are a one-liner:
PROMPTECHO_MODE=all pytest tests/agents/ # re-record this directory's cassettes
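Put together, the mode the fixture ends up with can be thought of as a small resolution function. This is a sketch under assumptions: the precedence (explicit PROMPTECHO_MODE first, then the CI variable) is my guess at sensible behavior, and `default_mode` is a hypothetical name.

```python
import os

VALID_MODES = {"once", "none", "new_episodes", "all"}

def default_mode(environ=None):
    """Resolve the record mode from the environment (assumed precedence)."""
    env = os.environ if environ is None else environ
    explicit = env.get("PROMPTECHO_MODE")
    if explicit:
        if explicit not in VALID_MODES:
            raise ValueError(f"unknown record mode: {explicit!r}")
        return explicit
    # Any non-empty CI variable means replay-only; otherwise record on first run.
    return "none" if env.get("CI") else "once"

print(default_mode({}))                                        # local dev
print(default_mode({"CI": "true"}))                            # CI run
print(default_mode({"CI": "true", "PROMPTECHO_MODE": "all"}))  # deliberate re-record
```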
Recording is guarded too. A subtle failure mode in any record/replay system: a transient 429 or an expired key during recording gets baked into the cassette and replays as a green-looking test forever, masked by your app’s retry logic. promptecho’s on_record_error policy surfaces it: it defaults to "warn", and you can set it to "raise" to hard-fail your re-record pipelines:
@promptecho.use_cassette("cassettes/foo.yaml", on_record_error="raise")
def test_foo(): ...
# RecordedErrorResponse: Refusing to record HTTP 429 into 'cassettes/foo.yaml'.
# Cassette was NOT written.
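The policy itself is a small decision at write time. Here is a sketch of how such a guard could work; the 400-and-above threshold, the choice to still record under "warn", and the `check_recordable` helper are all assumptions of mine, and the exception class is a local stand-in for the one named in the output above:

```python
import warnings

class RecordedErrorResponse(Exception):
    """Local stand-in for the exception named in the example output."""

def check_recordable(status, cassette_path, on_record_error="warn"):
    """Return True if the response may be written to the cassette."""
    if status < 400:  # assumption: only 4xx/5xx responses count as errors
        return True
    if on_record_error == "raise":
        raise RecordedErrorResponse(
            f"Refusing to record HTTP {status} into {cassette_path!r}."
        )
    if on_record_error == "warn":
        warnings.warn(f"Recording HTTP {status} into {cassette_path!r}")
        return True  # assumption: the response is still recorded, just loudly
    raise ValueError(f"unknown on_record_error policy: {on_record_error!r}")

print(check_recordable(200, "cassettes/foo.yaml"))
try:
    check_recordable(429, "cassettes/foo.yaml", on_record_error="raise")
except RecordedErrorResponse as exc:
    print(exc)
```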
Per-test configuration goes through a pytest marker, so teams can tune matching without abandoning the auto-named fixture:
@pytest.mark.promptecho(match_on=["model", "messages", "temperature"])
def test_sampling_config(promptecho_cassette):
    ...
A note on honest limits, because every tool should state them: promptecho intercepts at the httpx transport layer, so SDKs on other stacks (boto3-Bedrock, HF InferenceClient) aren’t covered yet. They pass straight through to the network, with no silent degradation. One cassette is active at a time per process (a nested activation raises immediately rather than corrupting recordings; pytest-xdist is fine since workers are processes). The full support matrix with workarounds lives in SUPPORT.md.
Where this goes next
Three roadmap items, in order:
- promptecho lint: run your suite in observe-only mode and report every LLM call that isn’t covered by a cassette, per test, with the suggested fix. The inverse guarantee of mode="none": not just “recorded tests can’t go live,” but “no test goes live unnoticed.”
- A requests/urllib3 interception backend: this unlocks Bedrock via boto3 and the HuggingFace InferenceClient. The architecture already cooperates: matching, cassettes, and normalization are transport-agnostic; httpx wiring is one isolated module, and a second backend slots in beside it. (Bedrock’s model-id-in-URL-path quirk is already handled: the URL path is part of the match fingerprint.)
- Semantic snapshot assertions: a toMatchLLMSnapshot()-style sibling for the other side of the deterministic/probabilistic boundary: jest-style snapshot review for model outputs, with an optional LLM-as-judge on mismatch. The judge’s own API calls get recorded by promptecho, naturally. It’ll be a separate package, because the core’s identity (deterministic replay, nothing else) is worth protecting.
What stays out of scope, permanently: In-process models (transformers, llama-cpp). There’s no HTTP call to freeze, and becoming a generic function memoizer would dilute the one thing this tool does well. Run local models behind their OpenAI-compatible servers and promptecho covers them today.
Try it
pip install promptecho
The repo has a scenario-based TUTORIAL.md, the full support matrix, and a DESIGN.md covering the why-not-the-other-way decisions — fingerprints vs raw bytes, why semantic matching is fenced off, how SSE re-emission works.
If your client is on httpx and a call isn’t captured, that’s a bug — file an issue. If your team is stuck on Bedrock-via-boto3, tell me: that’s how the second backend gets prioritized.
MIT licensed. Cassettes safe to commit. Your CI bill will thank you.