# Boris's Brain — feedback log

This is what I learned building **Boris's Brain** on Anthropic's [Dreaming](https://platform.claude.com/docs/en/managed-agents/dreams) research preview — friction, surprises, wins, and wishlist items at the level of detail an Anthropic engineer would find actionable. Published alongside the [narrative post](/dreaming). Evergreen — entries get added as more runs happen.

If you're on the Dreaming team, [DM me on X](https://x.com/Daniel_An23).

**Entry tags:** `[win]` · `[surprise]` · `[confused]` · `[bug]` · `[docs-gap]` · `[sdk]` · `[wishlist]`

---

## Run setup

- **SDK**: `anthropic-python` via Stainless research preview URL (anthropic 0.100.0).
- **Beta headers auto-set by the SDK**: `managed-agents-2026-04-01`, `dreaming-2026-04-21`.
- **Models tested**: `claude-opus-4-7` (dream), `claude-sonnet-4-6` (seed sessions).
- **Use case**: Curate a publishable playbook from Boris Cherny's 87 Claude Code tips on `howborisusesclaudecode.com` by replaying each tip as a Managed Agents session and dreaming over the resulting memory store.
- **Scale of inputs tested in this run**: 1 input memory store, dream runs over 2 and 20 seeded sessions. The 20-session dream was cancelled mid-run at ~18.5 min when token counts implied a cost run-rate past $40 and still climbing (see the cost section below for the share-attribution math).

---

## Log

### 2026-05-12

- `[docs-gap]` `[sdk]` `pip install <stainless-url>` — The research preview SDK URL `https://pkg.stainless.com/l/anthropic-python/<id>` serves a valid wheel but the URL has no `.whl` extension. `pip install <url>` downloads the file then errors:
  > `does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.`
  Workaround: `curl -sL -o /tmp/anthropic.whl <url>` then `pip install /tmp/anthropic.whl`. The auto-generated wheel filename inside the zip (`anthropic-0.100.0-py3-none-any.whl`) confirms it IS a standard wheel — the URL is just missing the extension hint pip needs. Easy fix on the Stainless side: serve with `Content-Disposition: attachment; filename="anthropic-0.100.0-py3-none-any.whl"`.
- `[confused]` `[docs-gap]` `sessions.create` — `client.beta.sessions.create` doesn't accept a `model` param. Models are bound to the agent at `agents.create` time. I'd assumed sessions could override the agent's model per-invocation (the analogue would be passing `model=` to `messages.create`). If there's a less-discoverable way to override per-session (`extra_body`?), it's not in the docs I found.
- `[win]` `[sdk]` `agents.retrieve` / `environments.retrieve` / `memory_stores.retrieve` — Idempotency is clean: each retrieve raises if the ID is stale, otherwise returns the resource. Makes setup scripts trivially safe to re-run.
- `[win]` `[sdk]` `sessions.events.stream` — Iterating the stream yields typed events with `.type` and `.model_dump()`. Trivial to write to JSONL and watch for `session.status_idle`.
- `[surprise]` `agent_toolset_20260401` — Passing `{"type": "agent_toolset_20260401", "configs": [], "default_config": {"enabled": True}}` to `agents.create` enabled all default tools. The empty `configs` array felt weird (I expected an error or a "specify what you want" message) but it worked. Worth documenting that `default_config.enabled=true` is the "give me everything" shortcut.
- `[win]` `dreams.create` — 2-session rehearsal dream completed end-to-end in **167 seconds** with the playbook in `outputs[].memory_store_id`. Faster than I expected. Status transitions: `pending` → `running` → `completed`.
- `[win]` **Output quality at small scale.** 2 input tips (`plan-mode`, `claude-md-investment`) yielded 3 well-organized output memories: `/_index.md` (theme-organized cross-linked index), `/daily-workflow.md`, `/project-setup.md`. The dream pulled in **operational examples the seed agent invented** ("the streaming CSV→JSON CLI" task) as concrete grounding in the curated playbook. That's a real signal that seeding via "apply the tip" sessions is doing meaningful work, not just summarization.
- `[confused]` `[docs-gap]` `sessions.events.stream` (on a *dream* session) — Almost all events came through as low-level spans: `span.model_request_start`, `span.model_request_end`, `thread status_running`, `thread status_idle`, plus a few bare `message` events without the `agent.` prefix. The docs example I was working from implied `agent.message` / `agent.tool_use` / `agent.tool_result` would be the dominant types — those didn't appear in the dream session stream. The visualization layer needs a different event-rendering strategy than ordinary-session docs anticipate. Worth documenting which event types ordinary sessions vs. dream sessions emit.
- `[docs-gap]` `memory_stores.memories.list` — Defaults to `view='basic'` which returns memory metadata but **`content` is `None`**. Have to pass `view='full'` to get content back. Easy miss; I spent a beat figuring out why my "playbook" markdown was empty until I noticed. Would suggest defaulting to `full` when listing small stores (heuristic: `total_size_bytes < N MB`).
- `[wishlist]` `dreams.retrieve.usage` — Reports tokens (`input`, `output`, `cache_creation`, `cache_read`) but not `$cost`. For a research preview where credit visibility matters, a `usage.cost_usd` (or equivalent) on `dreams.retrieve()` would save a lookup. The rehearsal pulled 21k output + 74k cache_creation + 693k cache_read tokens — I can compute the cost but it's a step I'd rather not.
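
The cost step from the last entry can be sketched as a small helper. This is a minimal sketch, not the SDK's behavior: `estimate_cost_usd` is a hypothetical name, the dictionary keys mirror the token classes listed above, and the rates are placeholders that must be replaced with the current price list.

```python
def estimate_cost_usd(tokens, usd_per_mtok):
    """Convert token counts to dollars using caller-supplied per-million-token rates."""
    # A missing rate raises KeyError on purpose, so no token class is silently priced at $0.
    return sum(count * usd_per_mtok[kind] / 1_000_000 for kind, count in tokens.items())


# Approximate token counts from the 2-session rehearsal dream described above.
rehearsal = {"output": 21_000, "cache_creation": 74_000, "cache_read": 693_000}

# HYPOTHETICAL rates in USD per million tokens. These are NOT real prices.
placeholder_rates = {"output": 10.0, "cache_creation": 2.0, "cache_read": 0.2}

print(f"${estimate_cost_usd(rehearsal, placeholder_rates):.2f} at placeholder rates")
```

Passing the rates in as an argument keeps the arithmetic separate from pricing, which may differ under research-preview billing.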

### 2026-05-12 — v1 → v2 dream-prompt iteration

The most important *quality* finding of the run, structured as before/after:

**v1 failure modes (with casual instructions like "produce a high-signal playbook"):**

The dream produced an authoritative-sounding artifact with **at least 5 distinct fabrication classes**:

| Class | v1 example |
| --- | --- |
| Fabricated industry trend | *"Teams that do this consistently report Claude 'gets the codebase' within a few sprints"* |
| Invented specificity / authority laundering | Boris said "complex tasks"; v1 expanded to *"schema design, CLI interfaces, streaming vs. batch, error surface"* |
| Invented anti-pattern | *"Skipping the second-Claude review when the task touches shared interfaces / public APIs"* |
| Invented statistic | *"seed CLAUDE.md with 5-10 team conventions up front; halves early correction cycles"* |
| Simulated → presented as evidence | Seed agent's simulated CSV→JSON CLI task was promoted to *"**Observation** from applying it"* in the playbook, indistinguishable from a real workflow data point |

A reader could not tell which claims came from Boris vs. the model. Length: ~330 words/entry, ~30k extrapolated for the full 87-tip corpus.

**v2 fix — two-sided prompt change:**

1. `02_seed.py` user message now requires the applying agent to write memory entries with **three labeled sections** in order: `[Boris]` (paraphrase tightly from tip body), `[illustrative]` (clearly prefixed *"Illustrative simulation (invented by the applying agent — NOT a real workflow observation)"*), `[synthesis]` (hedged judgments only).
2. `03_dream.py` `instructions` made every claim's source attribution non-negotiable: every sentence in the playbook must end with `[Boris]`, `[illustrative]`, or `[synthesis]`, with explicit "no fabricated industry claims" / "no invented specificity" / "no placeholder index sections" rules and a 200-word per-entry cap.
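
The tagging and word-cap rules above lend themselves to a mechanical check before any human audit. The following is a rough sketch under stated assumptions: the tag is assumed to follow the sentence it covers, sentence splitting is a naive punctuation heuristic (abbreviations such as "e.g." will produce false positives), and all function names are hypothetical.

```python
import re

TAG = re.compile(r"\[(Boris|illustrative|synthesis)\]")
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def _sentences(text):
    """Naive sentence split; keeps only fragments that contain a word character."""
    return [s for s in SENTENCE_BREAK.split(text.strip()) if re.search(r"\w", s)]


def untagged_sentences(entry):
    """Return sentences that are not immediately followed by a source tag."""
    parts = TAG.split(entry)  # [text, tag, text, tag, ..., trailing text]
    texts = parts[0::2]
    untagged = []
    for text in texts[:-1]:
        # The last sentence of each chunk is the one the following tag covers.
        untagged.extend(_sentences(text)[:-1])
    # Anything after the final tag has no tag at all.
    untagged.extend(_sentences(texts[-1]))
    return untagged


def word_count(entry):
    """Count words, excluding the tags themselves."""
    return len(TAG.sub(" ", entry).split())


def check_entry(entry, max_words=200):
    """Report untagged sentences and whether the entry exceeds the word cap."""
    return {"untagged": untagged_sentences(entry), "over_budget": word_count(entry) > max_words}


sample = (
    "Use plan mode for complex tasks. [Boris] "
    "One reading is that this saves rework. [synthesis] "
    "This sentence has no tag."
)
print(check_entry(sample))
```

A check like this only confirms that a tag is present. It says nothing about whether a `[Boris]` tag is truthful, which still needs a comparison against the tip source.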

**v2 result on the same 2-tip corpus:**

- **0 fabricated claims** across both entries (manually audited line-by-line against tip source).
- Every illustrative scenario is tagged `[illustrative]` and offset with `*Illustrative scenario:*` so it cannot be mistaken for evidence.
- Every synthesis claim is hedged ("One reading is…", "may not warrant the overhead").
- Index lists only themes with real entries; no anticipatory placeholders.
- 600 words total (-36% vs v1, 934).
- Wall clock: 249s (v1: 167s; longer instructions cost ~80s).
- Tokens: 38,111 output (v1: 21,450) — more output because the labeled-section format is more structured per claim. Cache hits dominate cost.

**Takeaways for the Dreaming team:**

- `[win]` Dreaming IS steerable on attribution and length. The same input corpus + tighter instructions produced a publishable artifact where v1 was unshippable.
- `[wishlist]` The default instructions example in the docs encourages free-form synthesis; for any artifact intended for publication, that default invites fabrications. Suggest adding a "high-fidelity curation" example to the docs that demonstrates source-attribution tagging + word-budget enforcement.
- `[surprise]` Output tokens went UP under the tighter prompt (38k vs 21k) even though the artifact got shorter. The structured per-claim labeling format costs more tokens than free-form prose. Worth mentioning in cost docs.
- `[wishlist]` There's no built-in way to inspect *what the dream considered as input* per output entry. Provenance metadata in `dream.outputs[]` (e.g. which session IDs contributed to each memory file) would make trust calibration much easier for downstream use.

### 2026-05-12 — source-attribution audit of the 20-tip playbook

After publishing, I spot-checked the v2 20-tip playbook output for `[Boris]`-tagged claims that *looked* like they could have been fabricated by the dream and laundered through the tag. The candidates were the kind of specific detail that's easy to invent and hard to catch: exact URLs, file paths, command syntax, named subagent files, slash command names, percentage claims, wildcard examples, and quoted opener lines I didn't immediately recognize as canonical Boris phrasing.

Initial flag list (10 items + 8 quoted lines, ~18 claims total):

- `https://slack.mcp.anthropic.com/mcp` with `type: "http"` config (looked invented)
- `git worktree add .claude/worktrees/<name> origin/main` (specific path)
- "Claude generates well-formatted code 90% of the time" (suspiciously specific number)
- `"Bash(bun run *)"` / `"Edit(/docs/**)"` as Boris's wildcard examples
- "prompt-injection detection, static analysis, sandboxing, human oversight" four-pillar permissions framing
- Aliases `za`/`zb`/`zc` (specific names)
- All five named subagent files: `build-validator.md`, `code-architect.md`, `code-simplifier.md`, `oncall-guide.md`, `verify-app.md`
- `/commit-push-pr`, `/techdebt` as Boris's named slash commands
- "native worktree support shipped in the Claude Desktop app"
- ralph-wiggum plugin / @GeoffreyHuntley attribution
- Eight quoted opener lines I couldn't place from memory (e.g. *"A good plan is really important to avoid issues down the line"*, *"Think of subagents as automations for the most common PR workflows"*, *"If you do something more than once a day, turn it into a skill or command"*)

Audit method: read each candidate's source tip file (`data/tips/part-01-tip-NN-<slug>.md`) and check for verbatim or close-paraphrase support.
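
I did this pass by hand, but the same check could be roughly automated. The sketch below is an assumption about how that might look, not the method used: `best_support` is a hypothetical helper, the 0.8 similarity threshold is arbitrary, and the sources are passed in as a dictionary of already-loaded tip text rather than read from disk.

```python
import difflib
import re


def _normalize(text):
    """Lowercase and collapse whitespace so line wrapping does not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def best_support(claim, sources, threshold=0.8):
    """Return (source_name, kind, score) for the best supporting passage, or None."""
    needle = _normalize(claim)
    n_words = len(needle.split())
    best = None
    for name, body in sources.items():
        hay = _normalize(body)
        if needle in hay:
            return (name, "verbatim", 1.0)
        words = hay.split()
        # Slide a window of the claim's length over the source and score each one.
        for start in range(max(1, len(words) - n_words + 1)):
            window = " ".join(words[start:start + n_words])
            score = difflib.SequenceMatcher(None, needle, window).ratio()
            if score >= threshold and (best is None or score > best[2]):
                best = (name, "paraphrase", score)
    return best


sources = {
    "tip-20": "Claude Code uses a sophisticated permission system with prompt injection "
              "detection, static analysis, sandboxing, and human oversight."
}
print(best_support("static analysis, sandboxing, and human oversight", sources))
```

Character-level similarity catches light rewording only. A looser paraphrase would still need a human read, so a `None` result means "check by hand", not "fabricated".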

**Result: 18 of 18 candidates verified as real Boris content.** Every flagged URL, path, percentage, wildcard, file name, slash command, and quoted line is in Boris's tips verbatim or as a close paraphrase. The four-pillar permissions framing — which read most like marketing taxonomy — turned out to be *verbatim* from tip 20 (permissions-management): *"Claude Code uses a sophisticated permission system with prompt injection detection, static analysis, sandboxing, and human oversight."*

- `[win]` The v2 source-attribution tagging system held under audit at the 20-tip scale. The v1→v2 fabrication delta I previously documented (5 distinct fabrication classes → 0 on 2 tips) was not a small-corpus artifact; the discipline transferred to a 10x larger corpus. This was the finding I most expected to be wrong after stress-testing the output.
- `[surprise]` My own prior over the audit was that I'd find at least 2–3 fabrications. The bias appears to be: claims tagged `[Boris]` that *sound* AI-generated (statistically-typical sentence structure, generic framing) feel like fabrications even when they're verbatim quotes. The "AI-toned" texture of the playbook is partly a property of Boris's tip-writing voice itself — short, declarative, structured — not evidence of fabrication.
- `[wishlist]` This experience suggests a feature for `dreams.create`: an optional `instructions.source_grounding_mode = "verbatim_or_paraphrase"` that constrains every load-bearing claim to be traceable to a specific input session, plus per-claim provenance metadata in the output store. Right now the source-attribution tags are enforced via prose instructions and held; making them a first-class API surface would let downstream tools auto-verify against the input corpus instead of requiring a human audit pass.
- `[wishlist]` Related: a `dream.outputs[].claims[]` array where each claim object carried `{text, source_session_id, source_offset, confidence}` would make audits like this one a `diff`, not a read-and-search exercise. Would be a major trust unlock for any Dreaming usage targeting publication.

The audit was not exhaustive across every line of every entry; it covered only the candidates I'd manually flagged as fabrication-shaped. Bayesian update on this: the candidates that worried me most all came back clean, so my expected fabrication rate on the unaudited claims is now low. Not zero.

### 2026-05-12 — two cost-visibility blind spots

I pulled the May 12 token-export CSV from the Cost dashboard mid-experiment to reconcile my actual spend. Two surprises:

- **`[wishlist]` Dreaming costs are billed against the `console` API key, NOT the project's API key.** The CSV's two rows for May 12: Sonnet (my seed sessions) billed to `boris-brain`, Opus (the dreams) billed to `console`. The Cost dashboard filtered to `boris-brain` showed $1.41 in token cost — which is just the seed sessions. The actual ~$48 in Dreaming spend was invisible at that filter level. A per-project budget owner can't see Dreaming spend from the dashboard at all. Suggest either (a) routing Managed Agents API calls to the user's own API key, or (b) adding a "Managed Agents (Dreaming, Sessions)" breakdown card to the Cost dashboard.
- **`[surprise]` `[wishlist]` `dream.usage` undercounts billed tokens by roughly 2-3x.** Summing `dream.usage` across all three of my dream runs (2-tip v1, 2-tip v2, 20-tip cancelled): ~10.7M cache_read, ~287k output. The CSV's billed totals for Opus 4.7 the same day: 32.5M cache_read, 654k output. So `dream.usage` reports about a third of the billed cache_read tokens and a bit under half of the billed output tokens. I assume there are coordination/planning model calls inside the pipeline that don't surface in the `usage` field. Users can't budget from `dream.usage` alone, and the gap is large enough to matter.
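
The reconciliation arithmetic can be reproduced in a few lines. The per-run figures are the `dream.usage` values from this log, and the billed totals are the rounded CSV numbers quoted above. `undercount_factors` is a hypothetical helper name.

```python
# Per-run dream.usage figures recorded in this log.
reported_runs = {
    "2-tip v1": {"output": 21_450, "cache_read": 693_137},
    "2-tip v2": {"output": 38_111, "cache_read": 695_209},
    "20-tip (cancelled)": {"output": 226_992, "cache_read": 9_296_977},
}

# Billed Opus totals for the same day, as quoted from the CSV (rounded).
billed = {"output": 654_000, "cache_read": 32_500_000}


def undercount_factors(runs, billed_totals):
    """Ratio of billed tokens to the sum of dream.usage tokens, per token class."""
    reported = {kind: sum(run[kind] for run in runs.values()) for kind in billed_totals}
    return {kind: billed_totals[kind] / reported[kind] for kind in billed_totals}


for kind, factor in undercount_factors(reported_runs, billed).items():
    print(f"{kind}: billed is {factor:.1f}x what dream.usage reports")
```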

### 2026-05-12 — dream cost does NOT scale linearly with session count

**The single most important *operational* finding of this run for anyone planning to use dreams at scale** — and it contradicts the docs' Billing section, which states *"Cost scales roughly linearly with the number and length of input sessions."*

Same `03_dream.py` instructions, same agent + environment, same per-session output volume. Only varying the number of seeded sessions in the dream input:

| Sessions | Wall clock | `dream.usage` output | `dream.usage` cache_read |
| --- | --- | --- | --- |
| 2 (v1) | 167s | 21,450 | 693,137 |
| 2 (v2) | 249s | 38,111 | 695,209 |
| 20 (cancelled mid-run) | 1115s | 226,992 | 9,296,977 |

10x input → ~6x output tokens, ~13x cache_read tokens, ~4.5x wall clock (all measured against the 2-session v2 run). Calibrating against the actual ~$48 balance drop and the share-attribution of each dream's `dream.usage` in the day's total: the 20-session run consumed roughly **$38-42** of the actual-billed spend before I cancelled it; the two 2-session runs together consumed roughly **$4-7**. That's a 6-10x cost ratio against the two small runs combined, or roughly 11-21x against a single 2-session run (~$2-3.5), for 10x input growth, on a partial run. A completed 20-session dream would have been higher; an 87-session dream extrapolates to well into triple digits.
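
One way to do a share-attribution of this kind is to split the observed balance drop in proportion to each run's share of a `dream.usage` token class. This is a sketch of that idea only: `attribute_spend` is a hypothetical helper, and choosing output tokens versus cache_read tokens as the weighting key is a judgment call that gives the low and high ends of the estimate for the 20-session run.

```python
usage_by_run = {
    "2-tip v1": {"output": 21_450, "cache_read": 693_137},
    "2-tip v2": {"output": 38_111, "cache_read": 695_209},
    "20-tip (cancelled)": {"output": 226_992, "cache_read": 9_296_977},
}


def attribute_spend(total_usd, runs, key):
    """Split total_usd across runs in proportion to each run's share of one token class."""
    total_tokens = sum(run[key] for run in runs.values())
    return {name: total_usd * run[key] / total_tokens for name, run in runs.items()}


balance_drop_usd = 48.0  # observed balance drop for the day
by_output = attribute_spend(balance_drop_usd, usage_by_run, "output")
by_cache_read = attribute_spend(balance_drop_usd, usage_by_run, "cache_read")

big = "20-tip (cancelled)"
print(f"20-session run: ${by_output[big]:.0f} (by output) to ${by_cache_read[big]:.0f} (by cache_read)")
# prints "20-session run: $38 (by output) to $42 (by cache_read)"
```

Proportional attribution assumes the unreported pipeline tokens are distributed across runs the same way the reported ones are, which I can't verify from the outside.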

For the 20-session run I had to **cancel via `client.beta.dreams.cancel(...)` after 18.5 minutes** when the cost trajectory was accelerating with no clear indicator in `dreams.retrieve()` of how much wall time or token spend remained.

Note on absolute pricing: my standard-rate cost calculation from the CSV tokens computes to ~$127, but the actual balance dropped only ~$48 — implying research-preview pricing applies a roughly 60% discount on standard Opus rates. The *ratio* claims above hold regardless of the absolute pricing curve.

Key observations:

- `[surprise]` `[wishlist]` **Cache_read is the dominant cost driver as input grows.** The 20-session dream consumed 9.3M cache_read tokens vs 695k for the 2-session dream (13x growth from 10x more sessions). The dream's internal agent re-reads context across many self-prompted turns, multiplying token cost faster than linearly with input size.
- `[surprise]` **Output volume also explodes.** 2-session dream produced a 600-word playbook. 20-session dream produced 9,036 words (15x the words from 10x the input). The model didn't honor the "200 words/entry hard cap" instruction at 20-session scale — entries averaged ~500 words. Either the cap is hard to enforce in dream context, or the instruction wasn't weighted high enough.
- `[wishlist]` **No mid-run cost ceiling / budget control.** `dreams.create` accepts `instructions` and `model` but not a `max_output_tokens`, `max_cost_usd`, or `max_wall_clock_s` budget. You're committing to whatever the dream's internal agent decides to spend. For a research-preview product where users are tracking $100 in credits, an upfront cost ceiling — even a soft `target_cost_usd` — would be transformational. Cancellation is the only safety valve and it requires you to be actively watching.
- `[win]` **`dreams.cancel(dream_id)` preserves partial work.** The 18 memory entries the dream had written to the output store before cancellation persisted; `05_compare.py` rendered a real, source-attributed 9k-word playbook from the partial output. So `cancel` is actually a clean partial-result exit, not a hard kill. Nice property — worth documenting prominently.
- `[wishlist]` **No progress signal from `dreams.retrieve()`.** Status is just `pending` / `running` / `completed` / `failed` / `canceled`. No "X of N source sessions consumed", no estimated time remaining, no streaming token-rate. Without the underlying session's event stream the dream is a black box that just charges your account. The session-event stream helps but is verbose and low-level.
- `[confused]` **Why does the dream make so many internal turns?** The session event stream during the 20-tip dream showed a repeating pattern: many cycles of `thread status_running → agent.thread_message_received → span.model_request_start → span.model_request_end → thread status_idle → message ("Acknowledged") → thread status_running → ...`. The "Acknowledged" responses on `agent.thread_message_received` suggest the dream agent is processing one source session at a time in a loop, with each loop re-reading prior context from cache. If that's the architecture, the cost growth is structural: adding tips multiplies the loop iterations. Worth documenting so users can plan accordingly.

**Practical guidance the docs should give users:**

1. The Opus 4.7 dream is currently optimized for **small, focused source-session counts**. Sweet spot in this experiment was 2-3 sessions (~$2-3, ~3 min, controllable). 20 sessions stretched the model past its comfortable budget.
2. For larger corpuses, **fan out into multiple smaller dreams** and run a meta-dream over their outputs. (Hypothesis — would need testing, but matches the "merge curated outputs" pattern the model already does internally per dream.)
3. **Always start with a 2-tip rehearsal** to validate the instructions before scaling. The v1→v2 prompt iteration would have been at least ~10x more expensive at 20 sessions than at 2.
4. **Watch the dream from a separate process** with a soft cost-cap loop — if `usage.input_tokens + usage.output_tokens + usage.cache_read_input_tokens` crosses your budget, call `dreams.cancel()` proactively. Build this into any production usage.
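
The soft cost-cap loop in item 4 can be sketched as below. It relies only on names that appear in this log (`client.beta.dreams.retrieve`, `client.beta.dreams.cancel`, the three `usage` fields, and the status strings). The exact call signatures and the `watch_dream` helper itself are assumptions, so treat this as a starting point rather than tested production code.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "canceled"}


def reported_tokens(usage):
    """Sum the usage fields named above; missing or None fields count as 0."""
    fields = ("input_tokens", "output_tokens", "cache_read_input_tokens")
    return sum(getattr(usage, field, 0) or 0 for field in fields)


def watch_dream(client, dream_id, token_budget, poll_s=15.0, sleep=time.sleep):
    """Poll a dream and cancel it once reported tokens exceed token_budget."""
    while True:
        # ASSUMPTION: retrieve/cancel take the dream ID as the first positional argument.
        dream = client.beta.dreams.retrieve(dream_id)
        if dream.status in TERMINAL_STATUSES:
            return dream.status
        if reported_tokens(getattr(dream, "usage", None)) > token_budget:
            client.beta.dreams.cancel(dream_id)
            return "canceled_by_watcher"
        sleep(poll_s)
```

Because `dream.usage` appeared to undercount billed tokens by a large factor in this run, the budget passed in should sit well below the real ceiling. The `sleep` parameter exists so the loop can be exercised against a fake client without waiting.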

---

## Distilled summary

### What worked great

1. **Dreaming as a primitive really does work.** A 2-tip dream on Opus 4.7 finished in 167s with a coherent, theme-organized output memory store and a usable playbook. With source-attribution tagging in the `instructions` it's publishable as-is.
2. **`dreams.cancel()` is graceful.** It preserves the partial output store. The 20-tip dream I cancelled at ~18.5 min still produced 18 memory entries I could render into a real playbook. That makes cancel a clean partial-result exit rather than a hard kill, an important safety property for budget-bounded users.
3. **The Managed Agents SDK pattern is clean for this use case.** `agents.create` → `environments.create` → `memory_stores.create` → `sessions.create` per tip → `sessions.events.send` → `sessions.events.stream` → `dreams.create` is exactly the right level of primitive. Idempotency via `retrieve`-before-`create` Just Works.
4. **The seeding-via-apply-the-tip pattern produces meaningfully better dream inputs than summarization would.** The dream picked up operational examples the applying agent invented and used them as concrete grounding. That's a feature-of-the-architecture finding, not just an artifact-of-this-experiment one.
5. **Steerability via `instructions` is real and large in magnitude.** Same corpus, tighter instructions, 0 fabrications vs 5 fabrication classes. The lever exists and it works.
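
The `retrieve`-before-`create` pattern from item 3 can be written once and reused for agents, environments, and memory stores. This is a generic sketch: `retrieve_or_create` is a hypothetical helper, and which exception a stale ID raises is an assumption the caller supplies through `not_found`.

```python
def retrieve_or_create(retrieve, create, resource_id=None, not_found=Exception):
    """Return the existing resource if resource_id is still valid, else create a new one."""
    if resource_id:
        try:
            return retrieve(resource_id)
        except not_found:
            # Stale or unknown ID: fall through and create a fresh resource.
            pass
    return create()


# Usage with stand-in callables; in a real script these would be SDK methods,
# e.g. retrieve=client.beta.agents.retrieve and a lambda wrapping agents.create.
store = {"agent_1": {"id": "agent_1"}}
agent = retrieve_or_create(
    retrieve=lambda rid: store[rid],
    create=lambda: {"id": "agent_new"},
    resource_id="agent_stale",
    not_found=KeyError,
)
print(agent["id"])  # prints "agent_new"
```

Narrowing `not_found` to the SDK's not-found error, rather than the default `Exception`, avoids creating duplicate resources when the failure is really a network or auth problem.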

### What surprised me

1. **Dream cost does not scale linearly with input session count.** 2 → 20 sessions = ~$2-3 → ~$38-42 on a run I cancelled partway. Cache_read tokens dominate (9.3M for 20 sessions vs ~700k for 2). This was the single biggest gap between my mental model and reality.
2. **The 200-word/entry hard cap held at 2 sessions and broke at 20.** Length-control instructions appear to compete with the model's tendency to fill space as input scope grows. Either the cap weighting is too soft, or the architecture needs an explicit per-entry token budget.
3. **Output tokens went UP under the tighter v2 prompt** (38k vs 21k for the same 2-session corpus). Structured per-claim tagging costs more tokens than free-form prose. Worth flagging in cost docs.
4. **Dream-session event streams are dominated by low-level `span.*` and `session.thread_*` events**, not the `agent.message` / `agent.tool_use` types the docs example showcases. Live visualization scripts need to filter aggressively.
5. **The seed agent's *simulated* tasks got promoted to "Observations"** in the v1 dream output — a contagious authority-laundering failure mode that's invisible unless you audit line by line against the source.

### What confused me

- `sessions.create` doesn't take a `model` param — model is locked to the agent at create time. I assumed it would override. If there's an extra-body path, it's not discoverable.
- `memory_stores.memories.list` defaults to `view='basic'` which silently returns `content=None`. Caught me once; would be a bigger gotcha for someone working through the quickstart.
- The dream session's internal loop pattern (many `agent.thread_message_received → "Acknowledged"` cycles) isn't documented anywhere I could find — and it's the key to understanding why cost scales the way it does. Knowing the architecture upfront would have changed my approach.

### Bugs and issues

- `pip install https://pkg.stainless.com/l/anthropic-python/<id>` fails with `does not appear to be a Python project` because the URL serves a valid wheel body without a `.whl` extension. Workaround: `curl` then `pip install` the local file. Fixable in Stainless's response headers.

### Wishlist (ranked by impact)

1. **`max_cost_usd` / `max_output_tokens` / `max_wall_clock_s` parameters on `dreams.create`.** Even a soft cap that surfaces a warning event when crossed would be transformational for credit-sensitive users. This is the single biggest UX improvement available short of changing the dream architecture.
2. **A progress signal on `dreams.retrieve()`** beyond `pending` / `running` / `completed` — at minimum "X of N source sessions consumed" or current token-rate.
3. **Provenance metadata in `dream.outputs[]`** — which session IDs contributed to each memory file. Trust calibration for downstream use.
4. **`usage.cost_usd` on `dreams.retrieve()`** alongside the token counts.
5. **`view='full'` default on `memory_stores.memories.list`** for small stores.

### SDK ergonomics

- `Bash`-friendly install path: the `pkg.stainless.com` URL needs `Content-Disposition: attachment; filename="anthropic-0.100.0-py3-none-any.whl"` so `pip install <url>` works straight from the beta email.
- `client.beta.sessions.create` not taking a `model` param is correct architecturally (the agent owns the model) but should be called out prominently — first-time users will reach for it.
- The session event stream's `.model_dump()` + `.type` ergonomics are great. Typed events were a delight to work with.
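
Those `.type` and `.model_dump()` ergonomics make a JSONL recorder short. The sketch below assumes only that each event exposes those two members, as described above. How the stream is opened is left out because that call's signature isn't covered in this log, and `record_events` is a hypothetical helper.

```python
import json
from collections import Counter


def record_events(events, out, stop_type="session.status_idle"):
    """Write each event as one JSON line and tally event types until stop_type is seen."""
    counts = Counter()
    for event in events:
        counts[event.type] += 1
        # default=str keeps non-JSON values (e.g. datetimes) from raising.
        out.write(json.dumps(event.model_dump(), default=str) + "\n")
        if event.type == stop_type:
            break
    return counts
```

The returned tally doubles as a quick answer to "which event types does this kind of session actually emit", which is useful when a dream session's mix differs from an ordinary session's.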

### Docs gaps

- Add a "high-fidelity curation" example to the dreams docs that demonstrates per-claim source-attribution tagging + a word budget in `instructions`. Without this default, the first-pass dream output invites fabrications — the most likely failure mode for users intending to publish dream outputs.
- Document the dream session's internal event-type mix (heavy on `span.*` and `session.thread_*`, light on `agent.message`) so visualization layers don't bet on the wrong types.
- Document the linear-input-but-superlinear-cost growth pattern and recommend the **fan-out + meta-dream** architecture for large corpuses.
- Note that `dreams.cancel()` is graceful and preserves partial output — useful safety valve worth promoting.
