I let Claude dream about how Boris uses Claude Code

An experiment with Anthropic's Dreaming research preview · May 12, 2026 · 6 min read

Disclosure: this post was researched and drafted by Claude Code, the agents were Claude, the curator was Claude, and the playbook was written by Claude. The whole thing ran on free API credits I got at Anthropic's Code with Claude SF event.

$ claude --dream "how does Boris use Claude Code?" generated by Claude · powered by Claude · curated by Claude · by Boris

Anthropic gave me research-preview access to a new feature called Dreaming, inside their Claude Managed Agents product. Claude has been writing things to a memory store across many past conversations (Agent Memory, which Anthropic just launched to public beta). Dreaming reads that store plus the conversations themselves and turns it into a curated version (duplicates merged, stale entries dropped, patterns surfaced). So you get less "Claude responds to a question," more "Claude does a small research project for you while you sleep."

My use case: take the 87 Claude Code tips from Boris that I've cataloged, replay each one as its own little task where Claude tries to apply the tip, then run Dreaming over the resulting sessions. The dream's output (call it "the playbook") is a small collection of curated markdown entries, one per tip or theme. I only got through 20 of the 87 before cost forced me to cancel. The playbook came out decent anyway.

It's Claudes all the way down

Boris wrote Claude Code at Anthropic.
Boris wrote tips about how he uses Claude Code, on X.
I built howborisusesclaudecode.com entirely with Claude Code to catalog those tips.
I opened a fresh Claude Code session and asked it to write Python that calls the Anthropic API.
That Python kicked off 20 Managed Agents sessions, each one a different Claude applying one of Boris's tips to an invented task.
A Dreaming Claude read those 20 sessions and wrote the curated playbook.
Another Claude Code session read the curated playbook and drafted this post.

Claude is using Claude to study Claude Code from the guy who made Claude Code.

Anyway, onto the experiment.

The first try sounded right but made stuff up

I used two tips for the test run. The output looked fine until I checked specific claims against Boris's actual tips.

It read like a polished Claude Code manual: bullet lists, crisp invented industry claims like "Teams that do this consistently report Claude 'gets the codebase' within a few sprints," and made-up anti-patterns like "Skipping the second-Claude review when the task touches shared interfaces / public APIs." All of it was stated confidently as fact, and none of it came from Boris.

Going back to the source: Boris's original Plan Mode tip is ~80 words. The dream's version came to ~390, so that's 300+ words of plausible-sounding extrapolation ("schema design, CLI interfaces, streaming vs. batch, error surface") mixed with hallucination. The model read Boris's terse advice and filled the gaps with invention. The problem is that the output sounds like Boris, so no reader can tell which sentences came from him and which are Claude's hallucinations.

Two paragraphs of prompt fixed it

Same corpus, same model, same architecture. I only rewrote the instructions to require source attribution on every claim.

The two edits:

The little Claude that applies each tip has to label its memory entry in three sections: [Boris] for things drawn from the source, [illustrative] for the invented task it ran, [synthesis] for hedged judgments it adds itself. The illustrative section is explicitly prefixed "Illustrative simulation (invented by the applying agent — NOT a real workflow observation)."
The Dreaming Claude is told every sentence in the final playbook must end with [Boris], [illustrative], or [synthesis]. That means no fabricated industry claims and no putting words in Boris's mouth that he didn't say on X. I also put a 200-word cap per entry.
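The tagging rule is mechanical enough to check with a script. Below is a minimal sketch, not the code used in this experiment: the function names are hypothetical, the sentence splitter is deliberately naive (it will trip on abbreviations like "e.g. Foo"), and whether the tags themselves count toward the 200-word cap is my assumption (here they don't).

```python
import re

# The three tags and the 200-word cap come from the prompt described above.
TAG_RE = re.compile(r"\[(Boris|illustrative|synthesis)\]")
# Naive boundary: whitespace that follows ., ! or ? and precedes a capital letter.
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
WORD_CAP = 200


def untagged_sentences(entry: str) -> list[str]:
    """Return sentences that are not immediately followed by an allowed tag."""
    problems = []
    pos = 0
    for match in TAG_RE.finditer(entry):
        chunk = entry[pos:match.start()].strip()
        pos = match.end()
        sentences = [s for s in BOUNDARY_RE.split(chunk) if s]
        # Only the last sentence before a tag is covered by that tag.
        problems.extend(sentences[:-1])
    tail = entry[pos:].strip()
    if tail:
        # Anything after the final tag has no tag at all.
        problems.extend(s for s in BOUNDARY_RE.split(tail) if s)
    return problems


def over_word_cap(entry: str) -> bool:
    """True if the entry exceeds the cap (assumption: tags are not counted)."""
    return len(TAG_RE.sub("", entry).split()) > WORD_CAP


good = "Use Plan Mode first. [Boris] One reading is that this saves rework. [synthesis]"
bad = "Use Plan Mode first. Teams report big gains. [synthesis] Trailing claim."
print(untagged_sentences(good))
print(untagged_sentences(bad))
```

A check like this only proves a tag is present. It says nothing about whether a [Boris] tag is deserved, which still needs a comparison against the source tips.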

I re-ran the dream on the same 2 tips and compared the output line by line. The run avoided fabricated claims this time (good!). Every invented example was tagged as a simulation, and every hedged judgment was qualified with "one reading" or "likely." The output was 36% shorter.

Dreaming is steerable. With the same input and the same model, tighter instructions produced a different, better artifact.

Scaling burned ~$40 in 18 minutes

I was blissfully ignorant of the costs.

A 2-tip dream cost ~$2 and took 3 minutes. Who cares, right? But scaling to 20 tips for the actual playbook run flipped that. Claude Code estimated $15 and 10 minutes; 18 minutes in, the dream was still going. The console billing page doesn't auto-update, so checking the cost meant manually refreshing every 20 seconds and watching the balance fall. By that point ~$40 of credits had drained across the runs and it kept dropping. So I cancelled.

Two surprises in hindsight. First, Dreaming's cost doesn't scale linearly with how much you feed it, even though Claude Code told me it would. The internal loop seems to re-read everything from cache on every turn, and each new source session multiplies the loop. 10x more input cost roughly 20x more, not 10x (~$2 for 2 tips versus ~$40 for 20). An 87-tip run extrapolated at that ratio would have run well into the triple digits, somewhere between $150 and $250 by my rough math.
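For anyone who wants to redo that extrapolation, here is a rough sketch under loud assumptions. It treats ~$2 for 2 tips and ~$40 for 20 tips as the only two data points, even though the $40 was drained "across the runs" and the 20-tip dream was cancelled before finishing. It then compares a straight-line extrapolation from the 20-tip point with a power law fitted through both points. The power-law form is my assumption, not documented Dreaming pricing.

```python
import math

# (tips, dollars): rough figures from this post, not measured precisely.
small = (2, 2.0)
large = (20, 40.0)  # the run was cancelled, so treat this as approximate


def fit_power_law(p1, p2):
    """Fit cost = a * n**k exactly through two (n, cost) points."""
    (n1, c1), (n2, c2) = p1, p2
    k = math.log(c2 / c1) / math.log(n2 / n1)
    a = c1 / n1 ** k
    return a, k


a, k = fit_power_law(small, large)


def estimate(n_tips: int) -> float:
    """Power-law cost estimate in dollars for a dream over n_tips sessions."""
    return a * n_tips ** k


linear_87 = large[1] * 87 / large[0]  # scale the 20-tip cost proportionally
power_87 = estimate(87)

print(f"fitted exponent k = {k:.2f}")
print(f"87 tips, linear from the 20-tip point: ${linear_87:.0f}")
print(f"87 tips, two-point power law:          ${power_87:.0f}")
```

The linear figure (about $174) sits inside the $150 to $250 range, while the two-point power law lands somewhat above it (roughly $270). With two noisy data points neither number deserves much trust, but either way an 87-tip dream is a triple-digit run.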

Second, cancelling a dream is graceful. The 18 memory entries Claude had already written before the cancel stayed in the output store, and the rendering script gave back a real, source-attributed 9,000-word playbook from the partial run.

Watching the dream run

Watching it run felt very satisfying. The clip below replays the dream. The agent.thread_message_received events flickering past are the dream pulling in each of the 20 seed sessions one at a time. The inline cache_read=... totals climb as it runs, and the Dream complete summary at the end shows where it landed: 9.3M cache_read tokens. Same thing that was draining the credit balance.

The clip is a replay of the run, sped up 2x for screen capture. Same events Claude emitted, same colors, same Dream complete summary at the end.

What the dream wrote

Here's an excerpt of what the v2 dream produced for the bug-fixing tip. Source tags make Boris's words and Claude's synthesis visibly separable: [Boris] marks what came from the original tip, and [illustrative] marks the invented example the applying Claude ran (offset with "Illustrative scenario:" so a reader can't mistake it for a real workflow observation). A third tag, [synthesis], appears elsewhere in the playbook for the dream's hedged judgments. This particular entry didn't generate any.

What to notice: every load-bearing claim is tagged. The [Boris] tags are paraphrases of Boris's actual advice; the [illustrative] section is invented by the seed agent and clearly flagged.

Bug fixing: hand Claude the artifact, not a summary

"Don't micromanage how." [Boris]

Mechanic [Boris] — Three delegation paths: enable the Slack MCP and say "fix" on a pasted bug thread; say "Go fix the failing CI tests" without prescribing the approach; point Claude at docker logs to troubleshoot distributed systems.

Illustrative scenario: The applying agent simulated a 500-error incident with a 15-message Slack thread; on the docker logs route Claude surfaced temporal clustering (gRPC deadline exceeded on checkout→inventory) that casual reading would miss. [illustrative]

Source: bug-fixing

What I'd do differently

I'd do this again but smaller. Two or three tips per dream, with the billing page open in another tab so the cost doesn't sneak up on me again. This stuff is expensive.

But it's not about the money. The bigger lesson is how much the instructions matter. I used the same model and the same two tips both times. The v1 output sounded authoritative enough to publish, and I would have if I hadn't gone back to check Boris's actual words line by line. The v2 output is more defensible, and the difference was two paragraphs of prompt.

Next time I'd run smaller dreams in parallel. Three or four over different tip subsets, then a meta-dream to merge them. Untested, but per-dream cost stays low and the merging is what the model already does internally per dream anyway.
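A quick way to see why splitting might help: if cost grows faster than linearly with the number of tips, several small dreams cost less in total than one big one. The sketch below assumes a power law through this post's two rough data points (~$2 for 2 tips, ~$40 for 20 tips). The tip IDs are hypothetical, and the cost of the meta-dream is unknown, so it is left out.

```python
import math

# Assumed cost model: a power law through the two rough data points in this post.
# This is a guess at the shape of the curve, not a published pricing formula.
EXPONENT = math.log(40 / 2) / math.log(20 / 2)
SCALE = 2 / 2 ** EXPONENT


def dream_cost(n_tips: int) -> float:
    """Estimated dollars for one dream over n_tips seed sessions."""
    return SCALE * n_tips ** EXPONENT


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


tips = [f"tip-{i:02d}" for i in range(1, 21)]  # hypothetical tip IDs
batches = chunk(tips, 5)

single = dream_cost(len(tips))
split = sum(dream_cost(len(batch)) for batch in batches)

print(f"one dream over {len(tips)} tips:       ~${single:.0f}")
print(f"{len(batches)} dreams of {len(batches[0])} tips each: ~${split:.0f} (meta-dream not included)")
```

Whether the savings survive depends on what the meta-dream costs to merge the batch outputs, which is exactly the untested part.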

Before publishing this post I went back and spot-checked the 20-tip playbook too. I pulled every [Boris]-tagged claim that looked invented (specific URLs, file paths, percentages, wildcard examples, the named subagent files, etc.) and diffed them against Boris's actual tips. All 18 candidates passed. The tagging system held at 10x the input scale, not just on the 2-tip rehearsal. Good sign.
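Part of that spot-check can be scripted. The sketch below is one way to do it, not the process actually used here: it pulls the text in front of each [Boris] tag, looks for specifics that are easy to invent (URLs, percentages, path-like strings), and flags any that don't appear verbatim in the source tips. The regex, the function names, and the sample text are all hypothetical.

```python
import re

TAG_RE = re.compile(r"\[(Boris|illustrative|synthesis)\]")
# Specifics that are easy to fabricate: URLs, percentages, path-like strings.
SPECIFIC_RE = re.compile(
    r"https?://\S+"
    r"|\b\d+(?:\.\d+)?%"
    r"|[~\w.-]+(?:/[\w.*-]+)+"
)


def boris_claims(playbook: str) -> list[str]:
    """Return the text immediately preceding each [Boris] tag."""
    claims, pos = [], 0
    for match in TAG_RE.finditer(playbook):
        text = playbook[pos:match.start()].strip()
        pos = match.end()
        if match.group(1) == "Boris" and text:
            claims.append(text)
    return claims


def unverified_specifics(playbook: str, source: str) -> list[tuple[str, str]]:
    """List (specific, claim) pairs whose specific is absent from the source."""
    source_lower = source.lower()
    missing = []
    for claim in boris_claims(playbook):
        for match in SPECIFIC_RE.finditer(claim):
            specific = match.group(0).rstrip(".,;:)")
            if specific.lower() not in source_lower:
                missing.append((specific, claim))
    return missing


# Made-up sample text, NOT a real tip or a real playbook entry.
source = "Keep notes in docs/notes.md and update them after every PR."
playbook = (
    "Keep notes in docs/notes.md. [Boris] "
    "This cuts review time by 40%. [Boris] "
    "The agent tried it on a toy repo. [illustrative]"
)
print(unverified_specifics(playbook, source))
```

An exact-substring match is strict on purpose. A flagged item isn't necessarily fabricated, it's just a candidate for a human to read against the original tip.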

For engineers

Everything technical that didn't make this post (the SDK shape, the prompt diffs, the cost curve, the cancellation behavior, the docs gaps I bumped into) is in a structured feedback log I published alongside this post for the Anthropic Dreaming team: /dreaming.md.

The full 20-tip curated playbook (the actual dream output) is at /playbook.md.

Companion piece to the Code with Claude SF recap.