
AgentTape

A flight recorder for Claude Code. Record, replay, diff, regression-test.

TypeScript · Node.js · Claude Code Hooks · npm · JSONL

Project Stats

Published package: npm install -g agenttape
Monorepo packages: 6, scoped @agenttape/*
Runtime deps: 0 outside Node built-ins (commander excepted)
Current version: v0.5, a weekend build

The Problem

Claude Code shows you everything it's doing while it works — every file read, every command run, every commit. That part is fine. The problem is what happens after the session ends.

The session becomes unstructured terminal text. You can scroll up. That's it. There's no way to:

Replay a past session to verify it behaves the same way
Compare this run against yesterday's run of the same task
Pipe the session into CI or a test harness
Write a regression test that fails when the agent drifts
Commit the session as a versioned artifact alongside your code

Every other layer of the software stack has solved this. Services have distributed traces. Databases have query logs. Deployments have changelogs. AI agents — supposedly the most powerful development tool we've ever built — ship none of that.

The Tape Format

Every session is recorded as a JSONL tape file — one JSON object per line, one line per event. The format is simple enough to read in a text editor and structured enough to parse, diff, and query programmatically.

JSONL Tape — example session
{"lineType":"meta","runId":"run_abc123","agent":"claude","startedAt":"2026-03-06T10:00:00Z"}
{"lineType":"event","eventType":"read_file","payload":{"path":"src/index.ts"}}
{"lineType":"event","eventType":"read_file","payload":{"path":"src/auth/middleware.ts"}}
{"lineType":"event","eventType":"file_written","payload":{"path":"src/auth/session.ts","linesChanged":42}}
{"lineType":"event","eventType":"command_executed","payload":{"command":"pnpm typecheck","exitCode":0}}
{"lineType":"event","eventType":"command_executed","payload":{"command":"pnpm test","exitCode":0}}
{"lineType":"event","eventType":"run_completed","payload":{"answer":"Done. Refactored session handling, all tests pass."}}

Each tape is a self-contained, replayable artifact. It can be committed to version control, compared to another tape, or used as a regression baseline. The runId ties all events to a single session.
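In TypeScript, the tape lines above map naturally onto a small discriminated union. This is an illustrative sketch, not the actual @agenttape/core types; parseTape is a hypothetical helper name:

```typescript
// Hypothetical types mirroring the tape lines in the example session.
type MetaLine = {
  lineType: "meta";
  runId: string;
  agent: string;
  startedAt: string;
};

type EventLine = {
  lineType: "event";
  eventType: string;
  payload: Record<string, unknown>;
};

type TapeLine = MetaLine | EventLine;

// Parse a JSONL tape: one JSON object per non-empty line.
function parseTape(jsonl: string): TapeLine[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as TapeLine);
}
```

Because each line is independent JSON, a truncated or partially written tape still parses up to the last complete line.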

Architecture

A TypeScript pnpm monorepo with six scoped packages. Each package has a single responsibility. The CLI composes them.

@agenttape/core: JSONL tape format, read/write utilities, shared TypeScript types
@agenttape/replay-engine: deterministic session replay, invariant evaluation
@agenttape/diff-engine: semantic diff between two tape runs (tool sequence, file delta, output drift)
@agenttape/test-runner: regression test runner that compares a live session against a baseline tape
@agenttape/integration-claude: Claude Code PostToolUse hook handler, stdin to tape event
@agenttape/cli: the agenttape CLI with init, record, replay, diff, and test commands

Data Flow

Claude Code (running a task)
  → PostToolUse hook fires
  → agenttape claude-hook receives the payload on stdin
  → event is appended to the JSONL tape (session.jsonl)

From a recorded tape:
  replay → HTML viewer
  diff   → vs another tape
  test   → vs a baseline

The Four Core Commands

agenttape record

1. Hook installation (one-time). agenttape hooks install adds three PostToolUse matchers to ~/.claude/settings.json, covering Write/Edit, Bash, and Read. Once installed globally, it silently no-ops on any project that hasn't been initialised.

2. Session recording. Running agenttape record --session sets AGENTTAPE_TAPE_PATH and starts Claude Code. Each hook fires, sends its payload to agenttape claude-hook on stdin, and the event is appended to the tape.

3. Auto-generated HTML viewer. When the session ends, generateTapeHtml(tape) produces a fully self-contained HTML file (no server, no build step, no dependencies) and opens it in the browser automatically.
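The append path is deliberately tiny. Here is a hedged sketch of the idea; toTapeLine and appendEvent are illustrative names, not the real @agenttape/integration-claude internals:

```typescript
import { appendFileSync } from "node:fs";

// Illustrative event shape (see the tape format above).
interface TapeEvent {
  lineType: "event";
  eventType: string;
  payload: Record<string, unknown>;
}

// Serialize one event as a single JSONL line, trailing newline included.
function toTapeLine(event: TapeEvent): string {
  return JSON.stringify(event) + "\n";
}

// Append to the active tape, or silently no-op when AGENTTAPE_TAPE_PATH
// is unset, matching the zero-interference behaviour described above.
function appendEvent(event: TapeEvent): void {
  const tapePath = process.env.AGENTTAPE_TAPE_PATH;
  if (!tapePath) return; // not recording
  appendFileSync(tapePath, toTapeLine(event), "utf8");
}
```

Appending one line per event is what makes a crashed session still leave a usable tape.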
agenttape diff

Run the same task twice, then compare. The diff engine looks at tool call sequences, file counts, and output similarity — not raw text — so minor wording changes don't trigger false positives.

CLI output
$ agenttape diff run-monday.jsonl run-tuesday.jsonl --summary

Diff result: changed
Severity: major

Tool sequence:
- monday:  read_file → write_file → run_command
- tuesday: read_file → write_file → write_file → run_command

Differences:
- [major] Tool call count changed (3 → 4)
- [minor] Output drift detected in run_completed
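The tool-sequence comparison can be sketched in a few lines. This assumes a simplified severity model (a changed call count is major, same-length drift is minor); the real diff engine also weighs file deltas and output similarity:

```typescript
// Hypothetical helper: classify the difference between two tool-call
// sequences, loosely following the severity labels in the CLI output above.
function diffToolSequence(
  a: string[],
  b: string[]
): { changed: boolean; severity: "none" | "minor" | "major" } {
  // A different number of tool calls is the strongest drift signal.
  if (a.length !== b.length) return { changed: true, severity: "major" };
  // Same length, different tools: weaker signal.
  const drift = a.some((tool, i) => tool !== b[i]);
  return drift ? { changed: true, severity: "minor" } : { changed: false, severity: "none" };
}
```

Comparing sequences rather than raw transcripts is what keeps minor wording changes from triggering false positives.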
agenttape test

Copy a tape from a session you're happy with into agent-tests/. Future sessions that deviate from that baseline — different files touched, extra commands, changed output — fail the test. You get the diff and decide if the change is intentional.

CI usage
# In package.json scripts:
"test:agent": "agenttape test agent-tests/"

# Output on failure:
FAIL  agent-tests/auth-refactor.tape.jsonl
  Expected: 3 tool calls
  Received: 5 tool calls
  [major] Unexpected files written: src/auth/legacy.ts
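The baseline check reduces to comparing a live run's events against the saved tape. checkAgainstBaseline below is a hypothetical simplification of what @agenttape/test-runner does; real failure messages and severity rules will differ:

```typescript
// Minimal event shape for this sketch.
interface Ev {
  eventType: string;
  payload: Record<string, unknown>;
}

// Compare a live run against a baseline tape; returns human-readable
// failure messages (empty array = pass), echoing the output shown above.
function checkAgainstBaseline(baseline: Ev[], live: Ev[]): string[] {
  const failures: string[] = [];
  if (baseline.length !== live.length) {
    failures.push(`Expected: ${baseline.length} tool calls / Received: ${live.length} tool calls`);
  }
  // Flag files written in the live run that the baseline never touched.
  const baseFiles = new Set(
    baseline.filter((e) => e.eventType === "file_written").map((e) => String(e.payload.path))
  );
  for (const e of live) {
    if (e.eventType === "file_written" && !baseFiles.has(String(e.payload.path))) {
      failures.push(`[major] Unexpected files written: ${e.payload.path}`);
    }
  }
  return failures;
}
```

Wired into a CI script, a non-empty failure list exits non-zero and fails the build.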

Claude Code Hook Configuration

The integration works through Claude Code's native hook system. Three PostToolUse matchers are added to the user's global settings. If AGENTTAPE_TAPE_PATH isn't set, every hook is a silent no-op — zero interference with normal Claude Code usage.

~/.claude/settings.json (added by agenttape hooks install)
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit|MultiEdit",
        "hooks": [{ "type": "command", "command": "agenttape claude-hook" }]
      },
      {
        "matcher": "Bash",
        "hooks": [{ "type": "command", "command": "agenttape claude-hook" }]
      },
      {
        "matcher": "Read",
        "hooks": [{ "type": "command", "command": "agenttape claude-hook" }]
      }
    ]
  }
}

Key Technical Decisions

Format

JSONL over JSON or SQLite

JSONL is append-only — each event is a line. You can tail it live, cat it, grep it. No schema, no migrations, no lock files. Simple enough that anyone can inspect a tape without tooling.

Dependencies

Zero runtime deps (outside commander)

A CLI tool that installs globally shouldn't pull in a node_modules tree. All tape I/O, diffing, and HTML generation use Node built-ins; commander, for arg parsing, is the only exception.

HTML Viewer

Pure function, self-contained

generateTapeHtml(tape) returns a string. No build step, no Webpack, no server. The output HTML file is an artifact you can email, commit, or open offline forever.
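A toy version of the same shape, to illustrate the design rather than the real renderer (the actual generateTapeHtml output is far richer):

```typescript
// Minimal event shape for this sketch.
interface TapeEvent {
  eventType: string;
  payload: Record<string, unknown>;
}

// Pure function: events in, one self-contained HTML string out.
// No server, no build step, no external assets.
function generateTapeHtml(events: TapeEvent[]): string {
  const rows = events
    .map((e) => `<li><code>${e.eventType}</code> ${JSON.stringify(e.payload)}</li>`)
    .join("\n");
  return `<!doctype html><html><body><ul>\n${rows}\n</ul></body></html>`;
}
```

Because the function is pure and the output is a single string, the viewer is trivially testable and the HTML file works offline forever.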

Monorepo structure

Six scoped packages

Each concern is isolated. The diff engine doesn't know about the CLI. The test runner imports the diff engine but not the Claude integration. This makes each package independently testable and usable.

Roadmap

Short Term

/agenttape slash command
VSCode extension
Better HTML viewer (search, filter, collapse)

Medium Term

Support for Cursor, Windsurf, Copilot
MCP server mode — stream events live
agenttape watch — continuous drift alerts

Longer Term

Shared tape registry / benchmarks
AI-powered diff narration
Team dashboards for agent reliability

AI agents are writing more and more of our production code. But we're testing and auditing them like it's 2010 — which is to say, barely at all. AgentTape is a start at fixing that. If you're going to trust an agent with your codebase, you should be able to replay what it did, diff it against last time, and catch regressions before they ship.

Want to build something like this?

I'm available for new projects and collaborations.

Get in touch