Capsule — Research Project
What this is
A research project investigating whether HTML can be disciplined into a portable knowledge-artifact format with a machine-readable contract, content provenance, and a structured feedback loop — without becoming a SaaS platform, a new file format, or a new browser standard.
The project produces a spec, a reference implementation, and empirical evidence about whether the spec works in practice. The hypothesis is that the substrate (HTML) has won and what's missing is discipline, not a new format.
Started: 2026-05-15 · Current Core spec: v0.3.0 · Full spec: v0.3.2 · Repo: bigfancygarden/htmlcapsule · Site: htmlcapsule.org
Project identity
A capsule is a sealed, self-contained HTML memory object for work worth preserving. The smallest portable structured unit any kind of knowledge work can resolve into — human-readable + machine-readable + provenance-bearing in one object. Not a working format — you still edit in your text editor, design in Figma, cook in your kitchen, think in your LLM chat. A publish / preserve / share format that any domain can emit.
Every domain today has good working tools and bad publish formats. PDFs lose interactivity. PNGs lose vector data. Recipe cards lose chef's notes. Exported chats lose structure. LLM conversations lose synthesis to the archive. Capsules are designed to be the universal publish format that preserves more than the alternatives, because:
- HTML is alive (interactive, scriptable, queryable) where PDF is dead.
- The manifest is machine-readable, so the artifact is self-describing.
- Provenance travels with the artifact, not as separate metadata that gets lost.
- The data block is whatever the domain needs; the envelope is consistent.
- Self-contained: opens in any browser, archives anywhere, shares to anyone, re-loadable into any LLM.
The same outer contract serves recipes, research notes, decision briefs, journal entries, design specs, log entries, learning artifacts, project handoffs — and most importantly, the synthesis that comes out of LLM conversations that today disappears into chat archives.
Framing arc
The project's framing has sharpened through the research. Each version was less narrow than the last:
- "Compile from your private DB into shareable HTML" — too narrow; assumed a structured source.
- "Boundary object between private system and external recipient" — better; named the sharing pattern.
- "Save state for useful LLM conversations" — closer; named the most common production path.
- "Atomic unit of preserved work, across any domain" — broader, served the spec well during v0.1–v0.2.
- "Sealed, self-contained HTML memory object for work worth preserving" — current; emerged from peer review in v0.3 (see F18). Adds the human/machine/provenance trio as a differentiating wedge.
The format itself supports each framing without changes: the technical work done under earlier, narrower framings turned out to hold under broader interpretations than the ones we started with.
What this is not
Capsules are not trying to replace working tools. Recipes will still be edited in cooking apps; designs in Figma; data analysis in Jupyter; thinking in LLM chats. The capsule is the export from these tools when the work is done, not the editing surface. This is exactly the role PDFs play today — they're just lifeless. Capsules give the same role to HTML, which is alive.
Capsules are also not trying to be a universal data interchange format like JSON-LD or RDF. The capsule's outer contract is universal; the inner content is domain-specific. This split is what gives the format both portability and expressivity.
Origin
The project began from Thariq Shihipar's public observation that LLMs and agents are already producing single self-contained HTML files as their default artifact format. The substrate is winning. The question this project asks: what does it take to make those files trustworthy, that is, to give them a contract, provenance, versioning, and a structured way for recipients to respond?
Research questions
Primary: Can a one-page spec, given to an LLM as context, produce a conformant Capsule?
Secondary:
- What discipline makes HTML useful as a boundary object between private knowledge systems and external recipients?
- Where does the spec need to be strict vs. permissive?
- What's the gap between compiler-produced and LLM-produced capsules — and is that gap useful (fidelity gradient) or broken?
- Where does the format break down empirically — size, browser support, distribution friction?
- Can the recipient side respond in a structured way that the author can programmatically ingest?
- Will LLMs honestly declare themselves and their limitations when producing capsules?
- Can a deterministic compiler produced by a third party round-trip through the reference validator at full fidelity? (Substantially answered: yes — see F18's note on independent compiler-kind producers.)
Methodology
Iterative spec evolution against real artifacts:
Hypothesis → Draft spec → Build reference compiler → Compile real artifacts → External review → What broke or what felt off → Adjust spec → (back to Hypothesis)
Three classes of "real artifact" are tested:
- Compiler-produced — deterministic output from our reference Python compiler. Establishes the strict end of conformance.
- LLM-produced — capsules generated by giving the Core spec to commercial LLMs (Claude, Gemini, ChatGPT) and asking them to produce a capsule on a real topic. Establishes the loose-but-honest end.
- Hand-written / hybrid — the spec itself was originally dogfooded as a capsule. Tests whether the format can document itself.
External review at each iteration: code review on the implementation side, plus design review from independent LLM agents and (in v0.3) from third-party producers building compiler-kind capsules against the spec.
The spec can only loosen (backward-compatible additions) unless a major breaking issue is found. Tightening would invalidate prior artifacts and would also discourage LLMs from producing capsules at all.
Findings
F1: The Core spec works as an LLM prompt
Experiment: Pasted CAPSULE_CORE.md (one-page short spec, ~120 lines) into fresh Claude, Gemini, and ChatGPT sessions with one prompt: "Using this spec, can you give me a summary of [public regulatory topic] as a Capsule?"
Round 1 result: Three structurally compatible capsules. All passed validation with 18/21 pass + 3 warn + 0 fail (identical pattern). Each opened in a browser, rendered correctly, declared itself honestly as generator.kind: "llm", included working exports, and presented a useful summary.
Round 2 result (same day, more specific prompt): Same pattern, plus prompt specificity successfully disambiguated topic interpretation.
Conclusion: Yes. The Core spec works as an LLM prompt. The format propagates through being readable and useful, not through enforcement.
F2: LLMs deviate from the spec toward honesty
The most significant finding. Across both experiment rounds, LLMs consistently disagreed with the spec in five specific places. In every case, the LLMs were objectively more honest than the spec required:
| Spec field | What spec said | What LLMs reached for | Why LLMs were right |
|---|---|---|---|
| source.origin | Constant "private_database" | "web_research", "public_documents", "official_public_sources" | An LLM synthesizing from public content has no private database |
| source.snapshot_type | Database-flavored enum | "synthesis", "research_summary", "bounded_public_legislative_summary" | A summary isn't a "portable_excerpt" |
| synthesis.kind | ai_extraction/ai_summarization/etc. | "llm", "llm_summary", "web_summary" | The natural words are clearer |
| type | Strict enum | "summary", "briefing" | None of the original types described what the capsule actually was |
| feedback_payload | Required rating/comments/suggestions only | Structured form with position/concern/notes | Real feedback isn't always a 1-5 rating |
In every case the spec was adjusted to accept the more honest values. The pattern: usage shapes spec, not the other way around. The spec is a description of what disciplined capsules look like, not a prescription that LLMs must obey.
F3: The fidelity gradient is real and useful
The validator distinguishes three result tiers (pass / warn / fail). Compiler-produced capsules pass strict. LLM-produced capsules pass degraded — typically missing the integrity block (no canonical-JSON content hash) and triggering a capability-marker heuristic false-negative.
This is a designed feature, not a workaround. Recipients of an LLM-produced capsule can see exactly what's verified and what isn't. They can calibrate trust appropriately. A compiler-produced capsule comes with cryptographic integrity; an LLM-produced one comes with structural conformance.
Conclusion: The format works for multiple production paths with different trust profiles. The validator's tier system is the load-bearing piece that makes this possible.
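A minimal sketch of how per-check results could roll up into the strict and degraded verdicts described above. The check names and the `roll_up` function are hypothetical illustrations, not the reference validator's API; only the pass / warn / fail tiers come from the project.

```python
from collections import Counter


def roll_up(results):
    """Collapse per-check results ("pass" / "warn" / "fail") into one verdict."""
    counts = Counter(results.values())
    if counts["fail"]:
        verdict = "fail"
    elif counts["warn"]:
        verdict = "pass (degraded)"
    else:
        verdict = "pass (strict)"
    return verdict, dict(counts)


# Hypothetical check names for an LLM-produced capsule with no integrity block.
llm_capsule = {
    "manifest_present": "pass",
    "integrity_block": "warn",
    "capability_markers": "warn",
}
print(roll_up(llm_capsule))
```

The point of the sketch is that a warning never hides: the recipient gets both the verdict and the per-tier counts.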
F4: Capability honesty is enforceable
The spec rule "a capsule must implement every capability it declares" was tested against LLM-produced capsules. In every case, declared capabilities matched implemented capabilities:
- One LLM declared ["about", "copy_as_json"], implemented exactly those two
- Another declared five capabilities, implemented all five
- A third declared export_response, built an actual feedback form with response export
No LLM over-declared. This is meaningful because it shows the LLMs treated the capabilities list as a contract, not as aspirational marketing. Implementation honesty is a property the format can preserve even when LLMs are the producers.
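The declared-versus-implemented comparison reduces to a set difference. The sketch below assumes the list of implemented capabilities has already been extracted from the HTML; how the validator detects them is not shown here, and `capability_gaps` is a hypothetical helper name.

```python
def capability_gaps(declared, implemented):
    """Return (over_declared, undeclared) capability names, each sorted.

    over_declared: listed in the manifest but not found in the capsule.
    undeclared:    found in the capsule but missing from the manifest.
    """
    declared_set, implemented_set = set(declared), set(implemented)
    over_declared = sorted(declared_set - implemented_set)
    undeclared = sorted(implemented_set - declared_set)
    return over_declared, undeclared


# The first capsule described above: two declared, two implemented.
print(capability_gaps(["about", "copy_as_json"], ["about", "copy_as_json"]))
```

An empty first list is the "no over-declaration" result reported for all three capsules.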
F5: The format scales empirically through 13 MB
Experiment: Synthetic capsules at three sizes (1.35 MB / 6.6 MB / 13.15 MB) with embedded base64 blobs to simulate photo albums.
Result:
- Browser parse + JSON parse scales linearly at ~5 MB/sec.
- 13 MB capsule loads in 123ms total, settles to ~14 MB JS heap.
- Sub-millisecond interaction (tab switches, filter changes) on the 13 MB capsule.
- JSON.stringify of the full data block: 15ms (well under perceptible-jank threshold).
Conclusion (at time of finding): The 15 MB hard cap in the spec is correctly positioned. Browser performance isn't the bottleneck. Distribution is — Gmail's 25 MB attachment limit is the real ceiling, hit before browser strain.
Updated by F20 (2026-05-21): A real production Mintel capsule arrived at 13.7 MB and several real production channels (MinDev hosting, AirDrop, Slack, cloud-storage links) have no equivalent of the email-attachment constraint. Spec v0.3.3 raised the hard cap to 20 MB and added a 15 MB soft warning specifically for email-attachment compatibility — the 15 MB number was always proxying for email-friendliness, not browser strain. The conclusion above still holds; the cap moved up because the distribution-channel landscape has more than one shape.
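The two thresholds can be sketched as a size check. This is an illustration under stated assumptions: that "MB" means 1024 × 1024 bytes and that the limits are exclusive. Neither detail is confirmed by the text above, and `size_tier` is a hypothetical name.

```python
MB = 1024 * 1024            # assumption: binary megabytes
SOFT_WARN_BYTES = 15 * MB   # email-attachment friendliness
HARD_CAP_BYTES = 20 * MB    # hard cap as of spec v0.3.3


def size_tier(n_bytes):
    """Classify a capsule file size as "pass", "warn", or "fail"."""
    if n_bytes > HARD_CAP_BYTES:
        return "fail"
    if n_bytes > SOFT_WARN_BYTES:
        return "warn"
    return "pass"


print(size_tier(int(13.7 * MB)))  # the 13.7 MB production capsule from F20
```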
F6: An LLM built half the feedback loop unprompted
The most surprising finding. In round 2, one LLM received only the Core spec and a one-line prompt. It produced a capsule with:
- export_response capability declared
- A structured feedback form (position dropdown, concern dropdown, notes textarea)
- A buildResponseExport() function emitting valid response.json with capsule_reference linking back to the originating capsule
The recipient side of the feedback loop was implemented end-to-end by the LLM, without us telling it to. This was always part of the spec's design, but it wasn't part of the prompt. The LLM reached for the architecture.
Reinforcement: A later meta-capsule (produced under v0.1.2 with the standard one-line prompt) invented a spec_compliance_self_check field — an array grading the capsule against all eleven Core rules with pass/n/a and a per-rule note. The LLM cited rule 11 ("Runtime JS string-literal rule") by number. The numbered-rule format introduced in v0.1.2 is being consumed as machine-readable structure, not just human guidance.
F7: Structured response payloads are mostly tally bait; notes carry the meaning
Experiment: A recipient opened an LLM-produced capsule, filled out its built-in feedback form (position dropdown + concern dropdown + notes textarea), and exported response.json.
Result: The structured fields (position, most_important_issue) contained little information that wasn't already in the notes field. The notes carried the actual meaning — the reasoning, the nuance, the position. The structured fields were essentially redundant.
Generalization: Structured response fields are aggregation infrastructure. They earn their weight when you have many respondents — you can tally positions, group by issue, scan notes within each group. For a single respondent, structured fields are decoration; notes are the response.
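To make the aggregation point concrete, here is a small sketch that tallies positions and groups notes by issue across several responses. The field names position, most_important_issue, and notes are the ones from the experiment; the flat dictionary layout and the sample values are invented for illustration and are not the actual response.json structure.

```python
from collections import Counter, defaultdict


def aggregate(responses):
    """Tally positions and collect notes under each reported issue."""
    tally = Counter(r["position"] for r in responses)
    notes_by_issue = defaultdict(list)
    for r in responses:
        notes_by_issue[r["most_important_issue"]].append(r["notes"])
    return dict(tally), dict(notes_by_issue)


# Invented sample responses.
responses = [
    {"position": "support", "most_important_issue": "cost", "notes": "Fine if phased in."},
    {"position": "oppose", "most_important_issue": "cost", "notes": "Budget is already tight."},
    {"position": "support", "most_important_issue": "timeline", "notes": "Start next quarter."},
]
tally, notes_by_issue = aggregate(responses)
print(tally)
print(notes_by_issue["cost"])
```

With one respondent the tally is trivially a single count, which is the sense in which the structured fields are decoration at n=1.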
Implication for the spec: The response_schema_version envelope is correct. The eight response types are probably more than needed; the real axes are (per-record vs. whole-capsule) and (structured-for-aggregation vs. prose-only). The feedback_payload schema was correctly loosened in this iteration to allow arbitrary fields — its rigidity was preventing the most common real use case.
Implication for the build: The "import side as registry + database ingestion" framing was overstated. What's actually useful is much lighter — an archive + a pair viewer (open response + originating capsule side-by-side). The author still does the qualitative reading; the system doesn't try to merge or auto-process.
F9: The single-document data shape is the natural LLM choice for conversation summaries
Observation across ~20 personal-use conversation-summary capsules: Almost every one used the single-document shape from §4.1 of the full spec — a top-level JSON object whose keys are themes (summary, key_takeaways, decision_matrix, quick_recommendations, etc.) — rather than the records[] shape.
The specific top-level keys vary per topic — that's expected and good. The shape definition isn't "must contain key X"; it's "top-level object with thematic named sections, each appropriate to the content." LLMs reach for this shape unprompted when summarizing a conversation; they reach for records[] when producing decision boards or list-shaped artifacts (the compiler templates).
Implication: Section 4.1's two shapes correctly carve the space. The example in the spec for the single-document shape is one possible filling; LLMs invent their own thematic keys per topic, which is the intended behavior.
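A rough sketch of how a tool might tell the two shapes apart. It assumes the records shape is a JSON object with a top-level "records" array; the exact definition lives in §4.1 of the full spec, which is not reproduced here, so treat `data_shape` as a hypothetical helper.

```python
def data_shape(data):
    """Guess which of the two data-block shapes a parsed JSON value uses."""
    if isinstance(data, dict) and isinstance(data.get("records"), list):
        return "records"
    if isinstance(data, dict):
        return "single-document"
    return "unknown"


# A conversation summary with thematic top-level keys.
summary = {"summary": "...", "key_takeaways": ["..."], "decision_matrix": []}
# A list-shaped artifact such as a decision board.
board = {"records": [{"option": "A"}, {"option": "B"}]}

print(data_shape(summary), data_shape(board))
```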
F10: The format absorbs primary artifacts (not just syntheses)
Experiment: Build capsules that are the work product, not summaries of one. Specifically: print-targeted 8.5×11 property-scale claim maps (both an illustrative synthetic one and one built from a public claims GeoJSON snapshot).
Result: Both validate cleanly (same shape as chat-summary capsules). No new failure modes appeared in the domain switch. The format absorbed:
- A new manifest type
- Inline SVG rendering (~300 lines of runtime drawing claim polygons, graticule, scale bar, north arrow)
- Print-targeted CSS (@page size: letter portrait)
- Honest provenance for non-real data (generator.kind: "llm", synthesis.kind: "illustrative_synthesis" where appropriate)
A third data shape emerged on its own: the map capsules' data block isn't records[] and isn't single-document. It's a feature collection: a property metadata header + bbox + per-feature-class arrays. This is the natural GIS / GeoJSON-ish shape.
Implication for the spec: Section 4.1's two-shape carve (records / single-document) may want a third bucket called "feature collection" for geospatial / typed-feature-set domains. Documented as the seed of the domain.exploration_map schema in DOMAIN_CAPSULES.md.
F11: The hybrid producer pattern is the most reliable production path for real-data capsules
Observation: Three production paths (A to C below) have produced capsules in this project; pure hand-coding (D) is listed for completeness:
| Path | Who writes HTML | Bug surface | Pattern |
|---|---|---|---|
| A. Pure LLM in chat | LLM session | High (rule 11 bug class, manifest drift) | One-off content |
| B. Pure Python compiler + templates | Reference compiler + per-type template dir | Zero (deterministic) | Records-shaped artifacts |
| C. LLM-authored Python generator | One Python script per artifact class, written by LLM, then frozen | Zero (deterministic shell + real data) | Recurring real-data artifacts |
| D. Pure human handcoding | n/a regularly | n/a | Rare |
Path C is the new one. The LLM writes a Python generator once (with all the HTML, CSS, JS frozen as Python strings + a render_body() function), then the generator runs from real data on demand.
Why it works: the runtime JS is the same code every time, reviewed once, frozen. The manifest fields are computed by Python (validator-clean). The data block contains real data. Path A's recurring failures — JS string-literal bugs (the rule 11 bug class), manifest drift, capability marker mismatches — all disappear because the LLM never re-generates the shell.
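A stripped-down sketch of the Path C pattern: a frozen shell held as a Python string, a render_body() function, and real data supplied per run. The shell here is far smaller than a real one (no manifest, CSP, or runtime JS), and the `capsule-data` id, the `build_capsule` name, and the recipe-like record are assumptions made for the example.

```python
import html
import json

# Frozen shell: reviewed once, never regenerated by the LLM.
# A real shell with inline CSS or JS would need its literal braces doubled for str.format.
SHELL = """<!doctype html>
<html lang="en"><head><meta charset="utf-8"><title>{title}</title></head>
<body><main id="capsule-root">{body}</main>
<script type="application/json" id="capsule-data">{data}</script>
</body></html>"""


def render_body(record):
    """Render the readable content at build time, escaping all text."""
    items = "".join(f"<li>{html.escape(step)}</li>" for step in record["steps"])
    return f"<h1>{html.escape(record['title'])}</h1><ol>{items}</ol>"


def build_capsule(record):
    """Fill the frozen shell with one record's rendered body and JSON data."""
    # Escape "</" so the JSON can never close the surrounding script element.
    data = json.dumps(record, sort_keys=True).replace("</", "<\\/")
    return SHELL.format(
        title=html.escape(record["title"]),
        body=render_body(record),
        data=data,
    )


page = build_capsule({"title": "Tea & toast", "steps": ["Boil <water>", "Toast bread"]})
print(len(page))
```

Because the shell never changes between runs, only render_body() and the data vary, which is where the near-zero per-instance bug surface comes from.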
Cost: Adding a new artifact class (e.g. a recipe capsule, a journal entry capsule) requires writing a new generator. Per-instance cost is near zero.
Implication: For recurring content (photos, claim maps, perhaps recipes/journals/decisions), Path C is the right default. Path A stays useful for one-off chat-summary capsules where the per-instance content is bespoke. Path B (the reference compiler) is the seed and the validator's intellectual reference, but produces fewer capsules in practice than C.
F12: Photo-shaped capsules — one artifact, one capsule (atomic-unit framing in its purest form)
Build: Example photograph capsules — one image, embedded as base64 in an <img src="data:image/jpeg;base64,..."> tag. Plus an associated voice memo (m4a/AAC) embedded similarly. Plus metadata: caption, people[], location (lat/lon + accuracy), date (value + precision + is_approximate), tags, alt_text.
Architectural pivot mid-build: the first attempt packed multiple photos as records[] inside a single album-capsule. That conflicted with the project's atomic-unit thesis — a photograph is itself an atomic unit of preserved work, not a row in a parent file. Rewrote to one-capsule-per-image; the album becomes the index listing them, not a container holding them.
Manifest signal: new type: "photograph", new collection field referencing the conceptual album by name (loose linkage, no parent file). The included_records is always 1.
Data shape: single-document with a top-level photo object containing the photograph's metadata + (originally) the data URIs. After F14's refactor, the data URIs live in the HTML <img> and <audio> tags directly, and the JSON data block is metadata-only.
F13: First real CSP loosening — media-src data: for embedded audio
Background: all prior CSPs across the corpus had been identical:
default-src 'none'; style-src 'unsafe-inline'; script-src 'unsafe-inline';
img-src data:; connect-src 'none'; base-uri 'none'; form-action 'none';
That permits inline base64 images via data: URIs but not audio (audio falls back to default-src 'none' and is blocked).
Change: added media-src data: to the photo capsule's CSP. One directive. It does not open the door to external audio — default-src 'none' and connect-src 'none' still block remote media. The capsule remains sealed; only inline base64 audio is permitted.
This was the first feature-driven CSP change in the format. Documented in the spec as the canonical pattern: if your capsule has embedded audio or video, add media-src data:. Don't broaden it further.
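The canonical pattern can be sketched as a small builder that starts from the baseline policy and appends media-src data: only when the capsule embeds audio or video. The directive strings are the ones quoted above; `build_csp` and its parameter are hypothetical names.

```python
BASELINE_CSP = [
    "default-src 'none'",
    "style-src 'unsafe-inline'",
    "script-src 'unsafe-inline'",
    "img-src data:",
    "connect-src 'none'",
    "base-uri 'none'",
    "form-action 'none'",
]


def build_csp(has_embedded_media=False):
    """Return the CSP string, adding media-src data: only when needed."""
    directives = list(BASELINE_CSP)
    if has_embedded_media:
        # Permits inline base64 audio/video only; remote media stays blocked.
        directives.append("media-src data:")
    return "; ".join(directives) + ";"


print(build_csp(has_embedded_media=True))
```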
Format choice for audio: AAC in M4A container (.m4a). Universal browser support, best compression-to-quality ratio. Python's mimetypes.guess_type() claims .m4a is audio/mp4a-latm, which browsers reject (LATM is a different stream format). Required an explicit .m4a → audio/mp4 mapping in the build script.
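A minimal sketch of the explicit mapping, assuming the build script resolves MIME types by file extension before embedding. The helper names are hypothetical; the override table is consulted first so the result does not depend on what mimetypes reports on a given system.

```python
import base64
import mimetypes
import os

# Explicit overrides win over mimetypes, which may report audio/mp4a-latm for .m4a.
MIME_OVERRIDES = {".m4a": "audio/mp4"}


def media_type(filename):
    """Resolve a MIME type, preferring the explicit override table."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in MIME_OVERRIDES:
        return MIME_OVERRIDES[ext]
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


def data_uri(filename, raw):
    """Encode raw bytes as a base64 data: URI with the resolved MIME type."""
    payload = base64.b64encode(raw).decode("ascii")
    return f"data:{media_type(filename)};base64,{payload}"


print(data_uri("memo.m4a", b"abc"))
```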
F14: Capsules are archives, not apps — the JS-render-everything failure mode
The biggest learning of the early sessions. Discovered when a photo capsule was AirDropped to iPhone and "didn't load properly."
Root cause: iOS Files preview (the QuickLook HTML viewer) doesn't execute inline JavaScript, or restricts it severely. The chat-LLM capsules — and, by pattern-copying, the first version of the photo capsule — were 100% JS-rendered: the static HTML had empty containers (<h2 id="title"></h2>, <figure id="photo-frame"></figure>) and runtime JS filled them on load. With JS disabled or restricted, the capsule rendered as a near-blank page.
Honest acknowledgment: the pattern had been copied from the existing chat-LLM corpus without examining whether it fit. The thesis says "capsules are archives, portable across decades, self-contained." The implementation said "tiny single-page app that needs my runtime to be useful." Mismatch.
Architectural fix: progressive enhancement. Move all rendering to build time in Python. The static HTML, as written to disk, already contains the rendered artifact (image, audio, caption, metadata, description, tags, alt-text, manifest dump). JavaScript shrinks to ~3 KB of button click handlers (Print / Copy / Download). With JS fully disabled, the capsule still renders the full content; the three buttons just don't respond.
Spec response — Core v0.1.3, rule 12: promoted the principle to a numbered first-class rule, mirroring rule 11's structure (mechanical instruction + WRONG/RIGHT code example). Same hypothesis as rule 11 — LLMs follow syntax-level mechanical rules better than content-level prose guidance.
Validator response: check_progressive_enhancement heuristic — counts visible text inside <main id="capsule-root"> after stripping <script> and <style> blocks and HTML tags. Under 200 chars, the capsule is flagged. WARN, not FAIL — existing JS-rendered fixtures remain validatable; the warning signals they don't follow the v0.1.3 convention.
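The heuristic as described can be approximated in a few lines. This is a regex-based sketch, not the reference validator's implementation: it assumes whitespace is collapsed before counting and does not decode HTML entities, and the function names are illustrative.

```python
import re

VISIBLE_TEXT_THRESHOLD = 200  # characters, per the description above


def visible_text_len(html_text):
    """Count visible characters inside <main id="capsule-root">."""
    match = re.search(
        r'<main\b[^>]*\bid="capsule-root"[^>]*>(.*?)</main>',
        html_text,
        re.S | re.I,
    )
    if not match:
        return 0
    # Drop script and style blocks, then strip the remaining tags.
    inner = re.sub(r"<(script|style)\b.*?</\1\s*>", "", match.group(1), flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", inner)
    return len(" ".join(text.split()))


def progressive_enhancement_tier(html_text):
    """WARN (never FAIL) when the static HTML carries too little text."""
    return "warn" if visible_text_len(html_text) < VISIBLE_TEXT_THRESHOLD else "pass"


js_rendered = '<main id="capsule-root"><h2 id="title"></h2><script>render();</script></main>'
print(progressive_enhancement_tier(js_rendered))
```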
Implication for the project's identity: the failure was the most informative thing in the corpus that session. The atomic-unit framing isn't just a slogan — it has implementation consequences. Archives must be readable by any HTML renderer, not just one that runs the producer's specific JS.
F14 follow-up: Rule 12 propagation result — first batches under v0.1.3
Experiment: Produce fresh batches of conversation-summary capsules through the same LLM pipeline that produced the v0.1.0–v0.1.2 capsules, this time with the v0.1.3 Core attached. Two batches of five capsules each (10 total), spanning unrelated topical domains.
Result: 10/10 PASS rule 12. Every capsule pre-renders its full readable content (title, summary, takeaways, tables, glossary, source URLs, conversation transcripts, manifest dump in <details>) directly in <main id="capsule-root">. JS shrunk to button handlers in every one.
Visible-text counts inside capsule-root (validator threshold: 200 chars) ranged from ~6,000 to ~13,400 — every capsule cleared the threshold by 30× to 67×.
Rule-12 trajectory (mirrors rule 11's trajectory table):
| Batch | Spec version | Mitigation | Rule 12 PASS rate |
|---|---|---|---|
| 1–20 + early | v0.1.0 – v0.1.2 | none (pattern not yet recognized) | 0/23 |
| Batch A (5) | v0.1.3 | promoted to numbered rule 12 + WRONG/RIGHT code example | 5/5 |
| Batch B (5) | v0.1.3 | (same) | 5/5 |
Epistemic update after second batch: the result replicates. Two consecutive batches, 10/10 PASS, same producer, spanning 10 unrelated topical domains. Within-producer replication is solid; cross-producer confirmation is still the remaining open evidence gap before broad generalization.
Hypothesis confirmed: the "deeper instinct to build a tiny app" did not persist when rule 12 was promoted to a numbered rule with a code example. The same model that produced the JS-render-everything capsules in the earlier batches immediately switched to progressive enhancement when given the v0.1.3 Core.
F15: Mobile responsiveness is a CSS-layer concern, not a format-layer one
Trigger: After F14's fix, an AirDropped photo capsule rendered on iPhone but looked like a thumbnail of an 8.5in letter page in a 375px viewport. Tiny. Required pinch-zoom to read.
Fix: mobile-first responsive CSS — same HTML body, three CSS modes:
- Default (mobile / narrow): fluid layout, touch-friendly buttons, readable typography (no sub-12px sizes), stacked title block.
- @media (min-width: 900px): switches to the 8.5×11 letterhead view (fixed page dimensions in inches, two-column grids, desktop typography scale).
- @media print: locks to letter portrait independent of viewport.
Key insight: the 8.5×11 page is a print target, not a screen requirement. The screen view can be fluid. Conflating the two was the design mistake.
Implication for the spec: This is implementation detail, not a Core rule. No spec change needed. Worth a note in the full spec's UI section that capsules should be screen-readable on any viewport size, with the 8.5×11 form factor reserved for print output.
F16: Chat-LLM capsules embed source-conversation images when the conversation is image-grounded
Pattern across two batches under v0.1.3: When the source conversation included an image (a chart, screenshot, diagram, photo), the LLM embedded that image inline in the resulting capsule as a data:image/...;base64,... URI. Each used the same spontaneously-invented embedded_media data-block field structure (kind / description / filename / mime_type / embedded_as).
| Batch | Source image type | Capsule file size | CSP change required |
|---|---|---|---|
| Batch A | Screenshot of a public chart | ~254 KB | No (img-src data: already in baseline) |
| Batch B | Chart/document from a public source | ~2.2 MB | No |
Epistemic state: n=2 from same producer. Cross-producer confirmation still pending. But the within-producer pattern is consistent enough to treat as expected behavior, not anomaly.
Implications for the spec: no new rule warranted. The format already absorbs this:
- CSP img-src data: (in place since v0.1.0) permits the inline embedding
- Data block is free-form JSON, so the embedded_media field is admissible
- File-size cap (15 MB) is well above 2.2 MB
The spec's documentation now notes that when conversations include images, embedding the source image as a data: URI is an established pattern. The embedded_media data-block field (or a similar shape) is recognized as a recommended convention.
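A sketch of what one embedded_media entry and its matching inline image could look like. The five field names are the ones the LLM invented; the values used for kind and embedded_as, and the `embed_image` helper, are guesses made for illustration rather than values observed in the corpus.

```python
import base64
import html


def embed_image(filename, mime_type, description, raw):
    """Return (embedded_media entry, <img> tag) for one source image."""
    payload = base64.b64encode(raw).decode("ascii")
    entry = {
        "kind": "image",            # assumed value
        "description": description,
        "filename": filename,
        "mime_type": mime_type,
        "embedded_as": "data_uri",  # assumed value
    }
    alt = html.escape(description, quote=True)
    tag = f'<img src="data:{mime_type};base64,{payload}" alt="{alt}">'
    return entry, tag


entry, tag = embed_image("chart.png", "image/png", 'Q1 "revenue" chart', b"abc")
print(entry["filename"], len(tag))
```

Keeping the bytes in the img tag and only metadata in the data block matches the baseline CSP, since img-src data: already allows it.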
F17: Prompt-fragment-only Core revisions are a valid spec-evolution mode
Background: all Core revisions to date (v0.1.1, v0.1.2, v0.1.3) introduced or promoted at least one numbered rule. v0.1.4 was the first that didn't. It added only prompt-fragment guidance:
- "Be thorough about real content": a paragraph pushing back against LLM brevity-truncation, with explicit permission to include all takeaways / sources / caveats / open questions, and an explicit prohibition on inventing content the conversation didn't produce.
- "Capture sources and links": a paragraph recommending a structured sources array in the data block, with a shape example.
Neither is a rule. Neither has validator enforcement. Both are producer-behavior hints in the prompt fragment that producers actually see.
Why these are worth a Core version bump: the prompt fragment IS the Core to producers. If we silently amend it, the version line lies — producers under "v0.1.3" would actually see different content than the v0.1.3 fragment captured by git tag. The two self-documenting fields (source.spec_received, source.prompt_received) would lose meaning if the content of a given version drifted.
So: every change that producers will see gets a version bump. Rule changes get major attention. Guidance changes get minor attention.
Hypothesis: prompt-fragment guidance will work similarly to rule promotions — explicit, mechanical, included alongside the numbered rules in the producer's context, with examples. The "numbered rule + WRONG/RIGHT code example" pattern addresses mechanical failures. Prompt-fragment guidance addresses underexplored options — behaviors that aren't broken but aren't being chosen. Different mechanism, different bar. Worth tracking both separately.
F8: The atomic-unit framing explains everything we've built
Reflection rather than experiment. Across multiple framings the project has tried — "compile from private DB", "boundary object", "save state for LLM chats" — the format itself didn't need to change. Each framing was the same format viewed through a narrower lens. The framing that explains all the previous ones is: a capsule is the atomic unit of preserved work.
Evidence supporting the broader framing:
| Domain | Working tool | Existing publish format | What capsule preserves |
|---|---|---|---|
| Decision-making | Spreadsheets, meetings | PDF / email thread | Per-option records, evidence, decisions |
| News annotation | Browser + memory | Forwarded link | Article + extracted claims + verdicts |
| Research synthesis | LLM chat | Copy-paste into doc | Synthesis + sources + provenance |
| Recipes | Cooking apps / notebook | Recipe card | Ingredients + steps + scaling + notes |
| Journal | Notion / paper journal | Locked in app | Entry + mood + context |
| Map / geospatial | QGIS / GIS tools | PNG / map service | Features + layers + popups |
| Logs | System logs | Text dump | Events + context + severity |
In every row, the existing publish format loses something the working format had. PDFs lose interactivity. PNGs lose vector data. Recipe cards lose the chef's notes. Capsules preserve more because they're alive (HTML + structured data + provenance + UI).
The atomic property matters because:
- Atomic units are searchable individually
- Atomic units compose into larger structures via parents[] (the capsule forked from another, the capsule that responds to another)
- Atomic units have their own provenance, not inherited from a container
- Atomic units survive movement between systems
Consequence for the project's identity: capsules are to preserved work output what JSON is to data interchange. A universal portable envelope that any domain can fill with appropriate content. F18 sharpens the framing further into "memory object" but the atomic-unit point remains the structural argument.
F18: Peer review (2026-05-19) — sharpest framing, landscape position, and trust-model gaps
A peer-review pass on the v0.3.2 state of the project produced three things worth recording in the research log: a sharper one-sentence thesis, a 2026 landscape position, and an explicit naming of the format's open trust-model questions.
Sharpest framing. The strongest one-sentence definition that emerged from review:
"A capsule is a sealed, self-contained HTML memory object for work worth preserving."
"Memory object" is doing real work in this sentence. It captures the human-readable + machine-readable + provenance-bearing trio in one noun phrase — the property no neighboring format provides simultaneously. PDF is human-only, JSON export is machine-only, MHTML lacks a manifest, ZIP lacks rendering, .docx lacks a programmatic data block, Notion exports are platform-dependent. The previous framing ("atomic unit of preserved work") remains internally accurate but lacks a differentiating wedge. The new framing has been adopted in README, CAPSULE_CORE.md, and index.html.
The second insight: multi-producer interop is the strongest empirical claim the format makes. LLMs (Claude, ChatGPT, Gemini), deterministic compilers (third-party build scripts), and human authors all produce the same envelope shape. That's what makes capsules different from yet another save format. Personal/team memory is the most accessible adoption vector; multi-producer interop is the differentiator. Don't narrow positioning to wave-one adoption.
The first independent compiler-kind producer (a third-party Python build script) shipped capsules that round-trip through the reference validator at 26/26 in v0.3. Crucially, the producer re-derived the integrity-hash recipe from spec prose alone (§9.1.1) without reading the validator source, and produced bit-identical hashes on first attempt. This is the spec earning its keep as a normative document.
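For readers unfamiliar with canonical-JSON hashing, the general technique looks like the sketch below. It is not the §9.1.1 recipe, which is not reproduced in this log. Sorted keys, compact separators, UTF-8 encoding, SHA-256, the "sha256:" prefix, and hashing manifest and data together are all assumptions chosen for illustration, as is the question of which manifest fields get excluded before hashing.

```python
import hashlib
import json


def canonical_json(obj):
    """Serialize deterministically: sorted keys, no insignificant whitespace."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def content_hash(manifest, data):
    """Hash the canonical form of manifest + data (assumed layout)."""
    payload = canonical_json({"manifest": manifest, "data": data})
    return "sha256:" + hashlib.sha256(payload).hexdigest()


# Key order differs between the two manifests; the hash should not.
a = content_hash({"type": "summary", "title": "x"}, {"k": [1, 2]})
b = content_hash({"title": "x", "type": "summary"}, {"k": [1, 2]})
print(a == b)
```

Determinism of the serialization step is what allows two independent implementations to arrive at bit-identical hashes from prose alone.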
2026 landscape position. Neighbors mapped:
| Neighbor | Layer | Relationship |
|---|---|---|
| HTML artifacts (Thariq / Blake Crosley) | Live agent output / control surface | Aligned but upstream — capsules are the seal step downstream |
| Durable interactive artifacts (AgentPatterns) | Workspace objects | Aligned but platform-bound; capsules are portable across tools |
| Intermediate artifacts in agentic systems (arXiv) | Multi-agent internal state | Same instinct, systems-internal scope |
| ARA agent-native research artifacts | Research deliverables | Heavier research-world cousin |
| RO-Crate | Sealed research packages | Direct competing format — capsules differ in single-file constraint |
| WACZ/WARC | Web archives | Different layer (archived web, not authored work) |
| C2PA / Content Credentials | Signed media provenance | Complementary trust layer, not format competition |
| Agent manifests (agent.json, JSON Agents) | Agents themselves | Adjacent 2026 instinct (machine-readable manifests around AI) |
Strategic conclusion: HTML is unlikely to be usurped soon as the rendering substrate. The likely future is HTML remaining the human-inspectable surface while JSON / RO-Crate / C2PA-style metadata wrap around or live inside it. Web Bundles were the only direct technical challenger; their IETF draft is stale and Chrome removed the navigation experiment in 2023. Capsules are betting on the stable layer.
Open trust-model gap. The current spec answers "what is this? where does it claim to come from?". It does not answer "did the claimed author actually publish these exact bytes?". The UUID asserts identity but doesn't enforce it — anyone can ship a modified capsule with the same UUID.
A full trust story would require four pieces:
- Two-hash split. `content_hash` (canonical manifest + data, survives DOM round-trip) + `file_hash` (raw bytes, doesn't). Lets a recipient verify two different questions independently.
- Author signing, identity-anchored via a Sigstore/Fulcio-style OIDC issuance. Without identity infrastructure, "signed by author" is just another lie waiting to happen.
- Transparency log (Sigstore/Rekor-shaped). Append-only public record of signed releases, detecting same-UUID-different-content games and backdating.
- Out-of-band verification. Capsule never calls home (Rule 2 preserved). The QR code already embedded in the capsule (Core convention) resolves on the recipient's phone/reader to a verification URL that queries the transparency log. Friction lives on the verifier's side; the capsule stays mute.
Three trust tiers would emerge: Self-describing (current baseline), Signed, Logged.
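The two-hash split can be illustrated with a minimal sketch. It assumes sorted-key compact JSON as a stand-in for the real canonicalization recipe in §9.1.1, which is not reproduced in this log; the function names mirror the bullet above and the sample manifest is hypothetical.

```python
import hashlib
import json


def content_hash(manifest: dict, data: dict) -> str:
    """Hash a canonical serialization of manifest + data.

    ASSUMPTION: sorted-key compact JSON stands in for the spec's actual
    canonicalization recipe (section 9.1.1), which is not shown here.
    """
    canonical = json.dumps(
        {"manifest": manifest, "data": data},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


def file_hash(raw: bytes) -> str:
    """Hash the exact bytes on disk; any re-serialization changes it."""
    return "sha256:" + hashlib.sha256(raw).hexdigest()


# Hypothetical sample content: same logical manifest, different key order.
manifest = {"uuid": "example-uuid", "title": "Demo"}
reordered = {"title": "Demo", "uuid": "example-uuid"}
data = {"items": [1, 2, 3]}

# Two byte-level renderings that a DOM round-trip could plausibly produce.
file_a = b"<html><!-- v1 --></html>"
file_b = b"<html>\n<!-- v1 -->\n</html>"

same_content = content_hash(manifest, data) == content_hash(reordered, data)
same_bytes = file_hash(file_a) == file_hash(file_b)
```

The content hash answers "is this the same logical capsule?", while the file hash answers "are these the exact published bytes?". The two can disagree without either being wrong.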
Decision: parked, not built. No reported real-world tampering incident exists in the corpus or among independent producers. Building infrastructure ahead of empirical pressure would be exactly the "spec gravity before daily-use pressure" failure mode the peer review explicitly warned against. Captured in spec/CAPSULE_SPEC.md Appendix E.6 as a v0.5+ candidate.
Two strategic risks named in the review, now internalized as ongoing discipline:
- Spec gravity. Every spec addition should be triggered by a real producer/consumer hitting a real problem. v0.4 candidates (E.1–E.8 in the parked-direction appendix) should be pressure-tested against this rule before any v0.4 work. The corpus is empirical evidence; spec additions that don't respond to empirical gaps are anticipatory engineering.
- Trust theatre. Hashes / manifests / capabilities are useful only if they stay honest and legible. The strongest trust signal isn't "this validates perfectly" — it's "you can see what produced it, what data is inside, what was omitted, and what actions are actually supported." The blind re-derivation of the integrity-hash recipe by an independent producer (producing a bit-identical hash from §9.1.1 prose alone, no peeking at validator source) is the bar for trust signals earning their keep through actual second-party verification rather than self-validation.
Both risks are now ongoing discipline rather than one-time fixes.
F19: Design-tool integration experiment — Claude Design with CAPSULE_CORE.md attached
Experiment. Asked a design tool (Claude Design, claude.ai/design) to produce a landing-page design and export the result as a Capsule per Core v0.3.0, with CAPSULE_CORE.md attached as conversation context. The session produced three relevant artifacts, with quite different structural shapes — all worth recording. Two were valid; one was not.
Note on finding evolution. This finding was substantially revised after the model self-corrected its export choice. An initial draft treated the bundler-wrapped 52 KB file as "the model's capsule output" and framed the discrepancy as "two verifiers checking different criteria." The model's reply clarified — and our reference validator confirmed — that the model first wrote a 40 KB file that does validate cleanly (24/25 pass, 0 fail), then a separate "Save as standalone HTML" step ran a general-purpose bundler over that valid file and produced the 52 KB shell. So the finding is not "spec-aware intent, non-conforming output" — it's "spec-aware conforming output, destroyed by a downstream pipeline step that should not have run." The revised version below is the accurate record.
Output A — three design-variation files from the canvas (dc-card wrapped).
These were the JSX/HTML mockups exported earlier in the same session, before the user asked for a capsule. Each file:
- Was 4.8–5.7 MB on disk, ~75% of which was per-element inline `style="accent-color:auto; align-content:normal; ..."` CSS resets that Claude Design's canvas applies (≈10 KB per element × ~385 elements). These are normalization for the design canvas, not the design itself.
- Wrapped visible content in `<div class="dc-card" data-om-id="...">`, Claude Design's canvas-card container.
- Had two `<style>` blocks: one in the head (woff2 fonts as `data:` URIs) and one inside the body (the actual design CSS, with a redundant `@import url('https://fonts.googleapis.com/...')` that violated Rule 2 even though the same fonts were already embedded as `data:` URIs above).
- Had no `id="capsule-*"` blocks of any kind. Pure design exports: design mockups, not capsules.
A structural transformation script (strip resets, strip data-om-*, strip the redundant @import, unwrap dc-card, merge <style> blocks into <style id="capsule-style">, wrap content in <main id="capsule-root">, inject <script id="capsule-manifest"> + <script id="capsule-data"> + <script id="capsule-runtime">, add CSP meta) converted each to a valid capsule at 25/25, ~1.5 MB final (essentially the embedded woff2 fonts plus thin content). The visual design was preserved bit-for-bit; the bloat removed was Claude Design's canvas safety net, not the design itself.
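Two of the stripping steps named above (removing `data-om-*` attributes and the redundant `@import`) can be sketched with regular expressions. This is an illustration under stated assumptions, not the project's actual script: the function name and sample markup are hypothetical, and the reset-stripping, unwrapping, and injection steps are left out.

```python
import re


def strip_canvas_residue(html: str) -> str:
    """Remove canvas-only attributes and the redundant font @import.

    ASSUMPTION: attribute values are double-quoted, as in the
    data-om-id="..." example quoted in this finding.
    """
    # Drop data-om-* attributes such as data-om-id="...".
    html = re.sub(r'\s+data-om-[\w-]+\s*=\s*"[^"]*"', "", html)
    # Drop @import url(...) rules, which would trigger a network fetch.
    html = re.sub(r"@import\s+url\([^)]*\)\s*;?", "", html)
    return html


# Hypothetical miniature of a canvas export.
sample = (
    "<style>@import url('https://fonts.example/css'); p{margin:0}</style>"
    '<div class="dc-card" data-om-id="n1"><p data-om-id="n2">Hello</p></div>'
)
cleaned = strip_canvas_residue(sample)
```

A production version would more likely walk a parsed DOM than pattern-match on text; regexes are used here only to keep the sketch short.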
Output B1 — the model's actual capsule output (40 KB, pre-bundler).
After the user asked explicitly for "the landing page as a Capsule per the attached Core spec," the model wrote a 40 KB single HTML file. Validated against the reference validator:
24/25 pass, 1 warn, 0 fail.
| Validator check | Result | Notes |
|---|---|---|
| HTML5 doctype | PASS | — |
| `<html>` + `<body>` tags | PASS | — |
| All required sections present | PASS | All five id="capsule-*" blocks at the byte level |
| No external resource references | PASS | Zero network fetches |
| Manifest section parseable | PASS | Honest provenance (see below) |
| All required manifest fields | PASS | — |
| Manifest field types | PASS | — |
| Capsule version present | PASS | capsule_version: "8.0.0" |
| Recommended manifest fields | PASS | — |
| `generator.kind` recognized | PASS | "llm" |
| `source` block | PASS | origin: "authored", snapshot fields populated |
| `privacy` block | PASS | external_dependencies: false |
| `spec_version` recognized | PASS | 0.3.0 |
| `spec_version` ↔ `source.spec_received` agree | PASS | — |
| `capabilities` include `about` + one export | PASS | — |
| Data section parseable | PASS | — |
| Content hash verifies | PASS | No integrity block (optional for LLM-kind) — passes by absence |
| Field format patterns | PASS | — |
| All capabilities have impl markers (heuristic) | WARN | copy_as_prompt — implementation exists but the function name doesn't match the validator's marker regex (false negative on a soft check). |
| Runtime JS strings well-formed | PASS | — |
| Content pre-rendered in HTML | PASS | 5388 chars of visible text in <main id="capsule-root"> |
| File size under 15 MB | PASS | 41,724 bytes |
This is a third independent producer kind reaching conformance — joining the reference compiler (generator.kind: "compiler") and the hand-authored landing (generator.kind: "human" / "hybrid"). All three producer kinds in the spec's interop claim are now empirically demonstrated.
What the model got right structurally:
- Five reserved IDs at the byte level: `capsule-manifest`, `capsule-data`, `capsule-style`, `capsule-root`, `capsule-runtime`, all present in the parsed-as-written file with no JavaScript needing to run.
- `<main id="capsule-root">` populated with 5,388 chars of pre-rendered visible text, which directly satisfies Rule 12 without ambiguity.
- Honest provenance: `generator.kind: "llm"`, `name: "claude.ai"`, `version: "claude"`. Declined to guess its own model ID and noted the user could pin tighter (e.g., `"claude-opus-4-7"`).
- Correctly handled the Rule 2 / Google Fonts conflict before writing: picked system-font fallback (`ui-monospace, SF Mono, Cascadia Code, Menlo, Consolas, DejaVu Sans Mono`) rather than `@import` or fetched fonts. The right call.
- Skipped QR code per spec guidance: the qrcode library wasn't available in its environment, so it followed the spec's "don't fake a QR by hand" directive rather than producing a wrong one.
- Capabilities declared with implementation intent: `about`, `copy_as_json`, `copy_as_markdown`, `copy_as_prompt`, `download_json`, `download_capsule`, `print_to_pdf`. Added a self-check that warns to the console for any declared-but-not-implemented capability (Rule 7 self-audit baked into the file).
- Sensible defaults: `source.origin: "authored"` rather than `"private_database"`, accessibility nods, print stylesheet.
The single validator warn was a heuristic false-negative on copy_as_prompt — the function existed but didn't match the marker regex pattern. Zero hard failures. The pre-bundler file is a deployable, conforming capsule.
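The "five reserved IDs at the byte level" property can be checked without executing any JavaScript. The sketch below uses only the standard-library HTML parser; it is not the reference validator's implementation, and the helper names and sample documents are hypothetical.

```python
from html.parser import HTMLParser

RESERVED_IDS = {
    "capsule-manifest",
    "capsule-data",
    "capsule-style",
    "capsule-root",
    "capsule-runtime",
}


class IdCollector(HTMLParser):
    """Collect every id attribute seen while parsing, without running JS."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def missing_reserved_ids(html: str) -> set[str]:
    """Return the reserved IDs that are absent from the as-written markup."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return RESERVED_IDS - collector.ids


# Hypothetical miniature documents for illustration.
conforming = (
    "<!doctype html><html><head>"
    '<script id="capsule-manifest">{}</script>'
    '<script id="capsule-data">{}</script>'
    '<style id="capsule-style"></style></head>'
    '<body><main id="capsule-root">Hello</main>'
    '<script id="capsule-runtime"></script></body></html>'
)
shell = '<html><body><div id="__bundler_loading">Unpacking</div></body></html>'
```

Because the parser treats `<script>` contents as opaque text, IDs that exist only inside a serialized template are not counted, which matches the parse-time framing used in this finding.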
Output B2 — the same file after "Save as standalone HTML" ran (52 KB, post-bundler).
The user then clicked "Save as standalone HTML." Claude Design's bundler — a general-purpose pipeline built to inline external assets for designs that aren't self-contained — ran over the already-self-contained B1 and wrapped it in a single-page-app hydration shell:
```html
<head>
  <style>… thumbnail + loading styles only …</style>
  <noscript>This page requires JavaScript to display.</noscript>
</head>
<body>
  <div id="__bundler_thumbnail">… social-preview SVG …</div>
  <div id="__bundler_loading">Unpacking…</div>
  <script>
    // 6 KB bundler that:
    //   reads script[type="__bundler/manifest"]
    //   reads script[type="__bundler/template"]
    //   base64-decodes + gzip-decompresses assets
    //   fetch()-rewrites blob URLs
    //   replaces the thumbnail with the actual content via DOM injection
  </script>
  <script type="__bundler/manifest">… base64-encoded assets …</script>
  <script type="__bundler/template">… HTML template as JSON …</script>
</body>
```
Validator score: 4/10 pass, 1 warn, 5 fail. Required sections are missing (the bundler uses `script[type="__bundler/*"]` instead of `id="capsule-*"`), `fetch(s.src)` in the asset-assembly step violates Rule 2, the manifest is unfindable, and content is not pre-rendered (zero visible text in `<main id="capsule-root">` because no such element exists at parse time).
Architecturally, this is exactly the failure mode Rule 12 was written to catch: content packed into JS, rehydrated on load, body empty at parse time. Open the file with JavaScript disabled (iOS Files preview, email previewer, archive viewer, old browser) and the "Unpacking…" loading placeholder never resolves; the only other thing shown is the `<noscript>` fallback. Same shape as F14's JS-render-everything failure pattern, but inverted into a deliberate hydration architecture.
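The "body empty at parse time" condition can be approximated by counting the visible text inside `<main id="capsule-root">` without running scripts. This is a sketch of the idea only: the class and function names are hypothetical, it assumes a single well-nested `<main>`, and the reference validator's heuristic may differ.

```python
from html.parser import HTMLParser


class RootTextCounter(HTMLParser):
    """Count visible text inside <main id="capsule-root"> at parse time."""

    # Elements whose text content is not rendered as page text.
    SKIP = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self.in_root = False
        self.skip_depth = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "main" and dict(attrs).get("id") == "capsule-root":
            self.in_root = True
        elif self.in_root and tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "main":
            self.in_root = False
        elif self.in_root and tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_root and not self.skip_depth:
            self.chars += len(data.strip())


def visible_root_chars(html: str) -> int:
    counter = RootTextCounter()
    counter.feed(html)
    counter.close()
    return counter.chars


# Hypothetical miniatures of the B1 shape and the B2 shape.
pre_rendered = (
    '<body><main id="capsule-root"><h1>Title</h1><p>Body text</p>'
    "<script>var x = 1;</script></main></body>"
)
hydration_shell = (
    '<body><div id="__bundler_loading">Unpacking</div>'
    '<script type="__bundler/template">'
    '{"html": "<main id=capsule-root>Hi</main>"}</script></body>'
)
```

In the second sample the capsule root exists only as a string inside a script, so the parser never sees a `<main>` element and the count stays at zero.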
The mechanism is innocent — Claude Design's bundler exists for a legitimate purpose (inlining externally-referenced assets into a transportable single file). The bug is that it should be skipped, not run, when the input is already capsule-shaped. Running a "make this self-contained" pipeline over a file that is already self-contained is destructive, not idempotent.
The actual integration boundary: a process-ordering issue, not a verifier-criteria mismatch.
An earlier draft framed B1 ↔ B2 as a discrepancy between two verifiers checking different things. That was wrong. The clarified mechanism (confirmed by the model and re-checked against our validator):
- The model wrote B1 and ran its own verification on the as-written bytes. B1 validates against both the model's verifier and our reference validator. Both agree it's a conforming capsule.
- The user then triggered a separate "Save as standalone HTML" step that ran the bundler over B1, producing B2. The verification gate had already cleared on B1; it did not re-run on B2.
- B2 fails both verifiers, because it is structurally not a capsule: it's a bundler shell that contains a capsule template inside a `__bundler/template` block.
So the lesson generalizes as: "verify-before-mutate, but always re-verify after any pipeline step that touches the artifact." A multi-step export pipeline that mutates the artifact between verifications can ship a file that the verifier never actually checked. In the Claude Design case, the verifier ran at the right point in the conversation flow but not at the right point in the file-mutation flow.
(This is closer in spirit to the build-pipeline / artifact-signing problem in software supply chains than to a spec-interpretation disagreement. The signature/verification gate has to be the last thing that touches the artifact before it leaves the producer, or there is a window where the artifact and the gate disagree.)
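The re-verify-after-mutation discipline can be expressed as a small pipeline runner. The sketch below is a toy under stated assumptions: `run_pipeline`, `toy_verify`, and `toy_bundler` are hypothetical names, and the one-line verifier stands in for a real validator run.

```python
from typing import Callable, Iterable

Step = Callable[[bytes], bytes]
Verifier = Callable[[bytes], bool]


class VerificationError(Exception):
    """Raised when an artifact fails verification at any pipeline stage."""


def run_pipeline(artifact: bytes, steps: Iterable[Step], verify: Verifier) -> bytes:
    """Apply each step, re-verifying whenever a step changes the bytes."""
    if not verify(artifact):
        raise VerificationError("input failed verification")
    for step in steps:
        mutated = step(artifact)
        if mutated != artifact and not verify(mutated):
            name = getattr(step, "__name__", repr(step))
            raise VerificationError(f"step {name!r} produced a non-conforming artifact")
        artifact = mutated
    return artifact


# Toy stand-ins (ASSUMPTIONS, not the real validator or bundler).
def toy_verify(artifact: bytes) -> bool:
    return b'id="capsule-root"' in artifact


def noop(artifact: bytes) -> bytes:
    return artifact


def toy_bundler(artifact: bytes) -> bytes:
    # Packs the whole file into a script payload, hiding the capsule root.
    return b"<script>" + artifact.hex().encode() + b"</script>"


b1 = b'<main id="capsule-root">Hello</main>'
```

With this shape, the verification gate is always the last thing to touch the artifact: `run_pipeline(b1, [noop], toy_verify)` returns `b1` unchanged, while adding `toy_bundler` to the steps raises instead of shipping an unchecked file.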
What the model self-corrected.
After being shown the validator score on B2 alongside the diagnosis, the model's reply was sharp: "You're right, and the diagnosis is accurate… The file is already standalone. Running the bundler will wrap it in an SPA hydration shell that violates Rules 2, 3, and 12. Skipping the bundler — the file you have is the deliverable." It correctly identified the actual deployable as B1, named the bundler as the source of the destruction, and rephrased the original "two verifiers" framing as a process bug ("I shipped two files in sequence and only validated the first").
This is itself a relevant data point for multi-producer interop: a spec-aware model, when given the empirical evidence, can self-diagnose the integration boundary and correctly route around it.
Implications for multi-producer interop:
- All three producer kinds are now empirically demonstrated. Compiler (`compile.py`), LLM (B1, claude.ai/design with `CAPSULE_CORE.md`), and hand-authored / hybrid (the canonical landing page) all produce files passing the reference validator. The interop claim in the README is now backed by an independent third producer.
- The model can produce conforming capsules from the Core spec prompt alone. No special tooling, no per-capsule template, no human cleanup step: `CAPSULE_CORE.md` plus a description of the desired content was sufficient.
- The downstream pipeline is the integration risk, not the model. A spec-aware producer can be undone by a generic post-processing step that doesn't know about the spec. The mitigation is process discipline (re-verify after every mutation), not a rule change.
- Self-containment is necessary but not sufficient for capsule-compliance. B2 is self-contained at runtime; it is still not a capsule. The five-required-blocks contract + reserved IDs + pre-rendered-in-HTML + no-network-at-render are what the format means by "capsule," beyond mere bundling.
Implications for the spec:
- No rule change motivated. Rule 12 caught exactly what it was designed to catch. The spec's normative content is unchanged by this finding.
- A formal compatibility note lives in Appendix E.10 as a v0.4 candidate. It documents the bundler-incompatibility pattern, names the integration point (skip the bundler on already-capsule-shaped input; or, equivalently, the bundler's input is the integration point, not its output), and articulates the verify-before-mutate / re-verify-after-mutate process discipline.
- The conversion bridge (the structural-transformation script from the Output A path) is reusable for future canvas-shaped exports from any tool with a similar `dc-card`-style wrapper.
What we did with the outputs:
- Output A (dc-card raw HTML from the canvas): converted three design-variation files to valid capsules at 25/25 each, ~1.5 MB after stripping the canvas-reset bloat. Deployable.
- Output B1 (the model's pre-bundler 40 KB file): originally validated 24/25, 0 fail, with one heuristic warn. After a validator improvement motivated by this finding (see postscript below), it validates 25/25, 0 fail. Documented here as the empirical LLM-producer exemplar. Not shipped as the project's canonical landing (we already have one), but recorded as evidence of the LLM-producer path working end-to-end.
- Output B2 (post-bundler 52 KB shell): not shipped; failed validation; serves as the empirical evidence for E.10 and for the "skip the bundler on capsule-shaped input" guidance.
The clean record: the LLM-producer path works. A spec-aware model with the Core spec attached can produce a conforming capsule on the first try. The integration risk is downstream pipeline steps that mutate the artifact after verification has cleared — the mitigation is process, not a rule change.
Postscript — validator improvement motivated by this finding.
The original 1/25 warn was a heuristic false-negative on copy_as_prompt. The model used a cleaner Rule 7 verification pattern than our reference examples:
- Manifest: `"capabilities": ["copy_as_prompt", …]`
- DOM binding: `<button data-capsule-action="copy_as_prompt">copy prompt fragment</button>`
- JS handler: `var actions = { copy_as_prompt: function () { … } };`
Same literal string in three places — the most direct manifest-to-implementation link possible, auditable by eyeball without needing a regex translation table from copy_as_prompt → copyPrompt → btn-copy-prompt (our reference examples' three-name convention). The validator's marker regex (copy[-_]?prompt) didn't anticipate this convention and false-negatived the cleaner one.
When asked, the model offered to rename the handler to match our regex. We declined: the right fix was the validator, not the file. We added two uniform patterns to the marker-check that apply to every known capability automatically:
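```python
import re

escaped_cap = re.escape(cap)  # cap is the capability name, e.g. "copy_as_prompt"
clean_convention_patterns = [
    # DOM binding: data-capsule-action="<cap>"
    rf'data-capsule-action\s*=\s*["\']{escaped_cap}["\']',
    # JS handler: <cap>: function
    rf'\b{escaped_cap}\s*:\s*function\b',
]
```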
Both patterns are specific to implementation context — the data-attribute only appears in HTML markup, and the : function form requires the function keyword, which cannot appear in JSON. No false-positive risk on declared-but-unimplemented capabilities (Rule 7's actual guarantee is preserved bit-for-bit).
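The false-positive claim can be checked directly against the three places the literal string appears. The sketch below reuses the two patterns quoted above; the wrapper function `clean_convention_hits` and the sample strings are hypothetical.

```python
import re


def clean_convention_hits(cap: str, source: str) -> list[bool]:
    """Report which of the two marker patterns match `source` for `cap`."""
    escaped_cap = re.escape(cap)
    patterns = [
        rf'data-capsule-action\s*=\s*["\']{escaped_cap}["\']',
        rf'\b{escaped_cap}\s*:\s*function\b',
    ]
    return [re.search(p, source) is not None for p in patterns]


# The three places the capability name appears (samples from this finding).
manifest_json = '"capabilities": ["copy_as_prompt", "about"]'
markup = '<button data-capsule-action="copy_as_prompt">copy prompt fragment</button>'
handler = "var actions = { copy_as_prompt: function () { } };"

# The older marker regex quoted in this postscript, for comparison.
old_marker = re.compile(r"copy[-_]?prompt")
```

A declaration in the manifest alone matches neither pattern, so a declared-but-unimplemented capability still fails the marker check, while either the DOM binding or the handler is enough to register as an implementation marker.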
Result: the Claude Design file now validates 25/25 clean. All existing examples (the canonical landing, briefing_example.html, implementation_notes_example.html, the three converted theme files) still validate at their previous scores. The patch is strictly additive.
The lesson generalizes: when an independent producer finds a cleaner convention than the reference examples, the right response is to recognize the cleaner convention in the tooling, not to demand the producer rename. This is the difference between a spec that ossifies around its reference implementation and one that improves through external pressure. The patch isn't adding spec surface area — it's improving the validator's ability to recognize compliance. The spec discipline principle ("the corpus drives the spec; spec inflation runs the other direction") cuts toward the patch, not against it.
F20: First publicly-fetchable Mintel production capsule validates spec at scale
Date: 2026-05-21
Mintel now publicly serves a real production exploration_map capsule via MinDev. First time the project has end-to-end validated a production third-party capsule (not LLM-corpus, not sanitized example) against the reference validator.
The capsule:
- URL: `https://mindev.ca/api/c/9357a933-7ce1-4061-9488-2ca61d81bded/raw`
- Type: `domain.exploration_map`
- Title: "Copper Dome — BC · Project location"
- Size: 13.73 MB (99.43% data block; GeoJSON for 47 claim polygons)
- Generator: `mintel/build_exploration_map_capsule v0.1.0`, `kind: "compiler"`
- Integrity: `sha256:60282cbd…`, `hash_scope: "data+manifest"`; content hash verifies clean
- Validator result: 26/26 PASS, 0 warn, 0 fail
Five empirical findings:
1. The 15 MB cap was always a proxy for email-friendliness, not browser strain. F5 set the cap at 15 MB from Gmail's 25 MB attachment limit. Mintel's distribution channel is MinDev hosting (no equivalent of the email cap), and the empirical desktop parse ceiling is well above 15 MB. v0.3.3 splits the constraints: hard cap raised to 20 MB, with a 15 MB soft warning explicitly for email-attachment compatibility. The number that was always proxying for one thing now names two things.
2. Rule 12 vs. visualization geometry — image-fallback resolves E.5. The Copper Dome capsule pre-renders chrome (title, legend, north arrow, info panel, attribution, QR code; 1,373 chars visible) but draws polygons into an empty <svg id="map-svg"> container at runtime. The validator's surrounding-text heuristic passes; strictly the data-bearing content depends on JS — iOS Files preview would show an empty white box where the map should be.
The principled resolution (now documented in spec §2.3 "Carve-out for visualization geometry"): visualization geometry rendered into a pre-declared named container is allowed IF a static image rendering is embedded as the JS-disabled fallback in the same container. Preserves Rule 12's intent (content IS in the HTML — as a raster) while accommodating geometry that can't reasonably be pre-rendered as static markup. The image rendering is typically free — it's the same raster the pipeline already produces for non-capsule deliverables (PDF/JPEG exports). One extra <img> element and a one-line visibility toggle in runtime.
E.5 was parked specifically waiting for this case. v0.3.3 ships the resolution.
3. MinDev's hosting model is now empirically demonstrated. The MinDev response includes:
```
x-capsule-content-hash: sha256:60282cbdad54708f...
x-capsule-uuid: 9357a933-7ce1-4061-9488-2ca61d81bded
```
The host attests independently via response headers without modifying the file body — "wrap, don't modify" per Appendix B distribution guidance and E.7 (MinDev pattern). First publicly-fetchable example of this hosting model. Caveat: header attestation is honest about what the host computed; it's not a signature from the original author (that's still E.6 signing territory).
4. The compiler-kind integrity path works end-to-end on real production data. Full integrity block present (content_hash + hash_scope: "data+manifest"), generator.kind: "compiler", and the validator confirms the hash verifies on a 13.7 MB file. Mintel re-derived the integrity-hash recipe from spec prose alone — bit-identical hashes (noted earlier in F18; this finding adds concrete production-capsule evidence at scale).
5. Custom namespace use is exemplary. The x-mintel block (project_id, project_version_id, project_version_number) uses the x- extension prefix correctly per E.3's recommendation. Consumers that don't know about Mintel ignore the block; domain-specific consumers can dereference back to the source.
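The image-fallback carve-out from finding 2 lends itself to a mechanical check: does the named container hold an embedded raster at parse time? The sketch below is one possible reading, not the spec's normative test. The container id `map-container`, the requirement that the image be a `data:` URI (inferred from Rule 2's no-network constraint), and all helper names are assumptions.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class FallbackFinder(HTMLParser):
    """Look for an embedded <img> inside the element with a given id."""

    def __init__(self, container_id: str) -> None:
        super().__init__()
        self.container_id = container_id
        self.depth = 0  # greater than 0 while inside the container
        self.found = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self.depth:
            if tag == "img" and (attr.get("src") or "").startswith("data:image/"):
                self.found = True
            if tag not in VOID:
                self.depth += 1
        elif attr.get("id") == self.container_id and tag not in VOID:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID:
            self.depth -= 1


def has_image_fallback(html: str, container_id: str) -> bool:
    finder = FallbackFinder(container_id)
    finder.feed(html)
    finder.close()
    return finder.found


# Hypothetical markup; "map-container" is an assumed wrapper id.
with_fallback = (
    '<div id="map-container"><img src="data:image/png;base64,AAAA" alt="Map">'
    '<svg id="map-svg"></svg></div>'
)
without_fallback = '<div id="map-container"><svg id="map-svg"></svg></div>'
image_outside = (
    '<img src="data:image/png;base64,AAAA">'
    '<div id="map-container"><svg id="map-svg"></svg></div>'
)
```

The depth tracking assumes well-nested markup; a real check would probably want a more forgiving parser.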
Spec moves landed in v0.3.3:
- §2.3 Rendering Model: image-fallback carve-out for visualization geometry, with worked example
- §6.3 Size Limits: hard cap 15 MB → 20 MB; 15-20 MB soft-warn tier added for email-friendliness
- §14 Validation: list item 11 updated to reflect the new cap and the soft warn
- §16.2 Out of scope: boundary list mention updated
- `domain.exploration_map`: image-fallback as required convention; file-size note updated
- E.5: resolved (moved from parked-items to shipped)
- `compiler/validate.py`: `MAX_FILE_SIZE` 15 → 20 MB; file-size check now emits a soft-warn note when the file is between 15 MB and 20 MB
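The two-tier size rule can be sketched as follows. This is an illustration rather than the code in `compiler/validate.py`: only `MAX_FILE_SIZE` is named in the log, while `SOFT_WARN_BYTES`, `size_check`, the binary definition of a megabyte, and the exact boundary behaviour are assumptions.

```python
MB = 1024 * 1024  # ASSUMPTION: binary megabytes; the spec may mean 10**6 bytes
SOFT_WARN_BYTES = 15 * MB  # hypothetical name for the email-friendliness tier
MAX_FILE_SIZE = 20 * MB    # hard cap after the v0.3.3 change


def size_check(num_bytes: int) -> tuple[str, str]:
    """Return (status, note) for a capsule of the given size.

    ASSUMPTION: both thresholds are exclusive upper bounds on "fine".
    """
    if num_bytes > MAX_FILE_SIZE:
        return "FAIL", "file exceeds the 20 MB hard cap"
    if num_bytes > SOFT_WARN_BYTES:
        return "PASS", "soft warn: over 15 MB, may not fit email attachment limits"
    return "PASS", ""


# Sizes taken from this log: the 41,724-byte B1 file and the 13.73 MB capsule.
small = size_check(41_724)
copper_dome = size_check(int(13.73 * MB))
between = size_check(17 * MB)
too_big = size_check(21 * MB)
```

Keeping the soft tier as a note on a passing check, rather than a separate failure state, matches the log's framing that email-friendliness is advisory while the hard cap is a conformance limit.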
Open questions remaining:
- The header-attestation pattern (`x-capsule-content-hash`, `x-capsule-uuid`) should be formalized as a "host contract" if/when there are multiple MinDev-shaped hosts. Currently lives as MinDev convention; would benefit from doc-only canonicalization in a future patch.
- Mobile browser parsing above 15 MB is undertested. F5's linear-scaling result was desktop-only. Worth an empirical test on iOS Safari and Android Chrome at the new 15-20 MB range before the cap is taken as fully load-tested.
- The "compact variant" idea (a view-only capsule without the full GeoJSON, just the rendered image + minimal manifest) is interesting for view-only sharing and could shrink a 13.7 MB capsule to ~50 KB. Lacks current empirical pressure but worth flagging.
- Legacy compiler templates (`templates/decision_board`, `templates/news_capsule`) still don't pre-render data-bearing content. Now that the image-fallback carve-out exists, they could either adopt the pattern or be documented as historical. Not urgent.
F21: Independent convergence on the host-contract pattern (MinDev + htmlbin)
Date: 2026-05-21
Two independent hosting layers have converged on the same shape for serving Capsule-style HTML artifacts, without coordination between them:
- MinDev (private, Mintel-tied; serves the F20 Copper Dome capsule and other Mintel-produced exploration_map capsules).
- htmlbin.dev (public, agent-first; launched ~May 17-18, 2026 by Utkarsh Sengar, Cloudflare D1 + KV stack). Independent project; not aware of htmlcapsule at launch.
Shared shape observed:
| Aspect | MinDev | htmlbin.dev |
|---|---|---|
| Short URL identity | UUID — mindev.ca/api/c/{uuid} | Slug — htmlbin.dev/p/{slug} |
| `/raw` byte-identical endpoint | /api/c/{uuid}/raw | /p/{slug}/raw |
| Host chrome | Recedes to a left rail | Small header + footer attribution |
| Authorship attribution | Response headers (x-capsule-content-hash, x-capsule-uuid) | Footer text ("content authored by the agent that uploaded it") |
| Content mutation | None — serves uploaded bytes byte-identically | None — serves uploaded bytes byte-identically |
| Validates on upload | Yes (against Capsule spec) | No (accepts any self-contained HTML) |
| Visibility | Private | Public, OAuth-gated first publish |
Why this matters:
The convergence is empirical evidence that the host-contract pattern referenced in Appendix E.7 ("hosting-platform auth gates per the MinDev pattern... the platform controls delivery; the capsule itself doesn't gate its internal contents") is a real shape that independent producers reach on their own. The format/host split — the format defines the artifact, the host serves it — appears stable across implementations.
This is the empirical pressure the project was waiting for to formalize a host contract beyond the single-implementor MinDev reference. The "what a host should do (and not do)" doc that was previously parked can now be drafted as a description of an observed convention across two independent implementations, not a proposal made in a vacuum.
Practical implication — the format is hosting-agnostic, demonstrably:
A valid Capsule can be hosted on MinDev, on htmlbin, or self-hosted, with no format change. Hosts adopt the format optionally; the format imposes nothing on the host beyond "serve the bytes you received." The "format-not-platform" stance is now concrete and verifiable, not aspirational.
Spec moves to consider:
- A new doc, `spec/HOSTING.md` or `spec/HOST_CONTRACT.md`, describing the observed pattern: short URL identity + `/raw` byte-identical endpoint + minimal host chrome + (optional) integrity attestation in response headers + no content mutation. Not normative; documentary. Cites MinDev and htmlbin as the two convergent implementations.
- Possibly update Appendix B (Distribution Guidance) to add htmlbin alongside MinDev as a concrete hosting example.
- E.7 (MinDev pattern reference) can be annotated as having independent empirical support; no resolution change needed.
Open questions:
- Will htmlbin add integrity-attestation headers (`x-capsule-content-hash` style) over time as Capsule-format artifacts get hosted there? If so, the convergence deepens: every host independently arrives at the full pattern, including the attestation layer.
- Will the host contract crystallize as a formal spec, or stay descriptive? Probably the latter for now: formalizing would require multiple host implementations to agree on protocol particulars (header names, slug format, etc.), which would need outreach. Descriptive documentation captures the observation without prescribing.
- Should the project provide a "verify a hosted Capsule" mode in `validate.py`? Currently the validator works on local files; could optionally fetch a URL, recompute the integrity hash, and check it against MinDev's `x-capsule-content-hash` header attestation. Small addition, real utility for recipients who want to verify they got what the host claims they got.
- Worth tracking whether other hosts emerge in this space over the next quarter. Two implementations is convergence; three or more is a pattern that probably deserves formal documentation.
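The "verify a hosted Capsule" idea reduces to comparing a locally recomputed hash with the host's header. The sketch below takes the already-fetched body and headers as arguments so it stays network-free. `check_host_attestation` is a hypothetical name, and the hash function is injected because the real `data+manifest` recipe (§9.1.1) is not reproduced in this log; the demo uses a raw-bytes SHA-256 purely as a stand-in.

```python
import hashlib
import hmac
from typing import Callable, Mapping


def check_host_attestation(
    body: bytes,
    headers: Mapping[str, str],
    compute_content_hash: Callable[[bytes], str],
) -> bool:
    """Compare a locally recomputed hash with the host's header attestation."""
    # Header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    attested = lowered.get("x-capsule-content-hash")
    if attested is None:
        return False
    local = compute_content_hash(body)
    return hmac.compare_digest(local.encode(), attested.strip().encode())


def raw_sha256(body: bytes) -> str:
    """Stand-in hash (ASSUMPTION): the spec hashes data+manifest, not raw bytes."""
    return "sha256:" + hashlib.sha256(body).hexdigest()


# Hypothetical response for illustration.
body = b"<html>demo capsule</html>"
good_headers = {"X-Capsule-Content-Hash": raw_sha256(body)}

matches = check_host_attestation(body, good_headers, raw_sha256)
tampered = check_host_attestation(body + b"!", good_headers, raw_sha256)
unattested = check_host_attestation(body, {}, raw_sha256)
```

A match only shows that the recipient's bytes agree with what the host computed; as noted in F20, it is not a signature from the original author.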
Cross-references:
- F20 — the MinDev side of the convergence, with the Copper Dome capsule.
- PRECEDENTS.md "Current voices in HTML-for-AI" — Utkarsh / htmlbin added as a hosting-layer voice; the three-position picture extended to acknowledge format-layer vs. hosting-layer slots.
- Appendix E.7 in spec/CAPSULE_SPEC.md — the "MinDev pattern" reference that hinted at this convention; F21 is the empirical validation.
F22: Independent convergence on the live-editing layer pattern (html-docs + workplane)
Date: 2026-05-21
Two independent live-editing tools shipped in approximately the same window (mid-May 2026) with substantially the same workflow shape, without coordinating. This is the parallel finding to F21 (host-pattern convergence): two layer-level patterns have now been observed converging independently within this project's first two weeks of running. The pattern of convergence is itself becoming a recurring methodological observation.
The two tools:
| | html-docs.com | workplane.co |
|---|---|---|
| Creator | Raunaq Bhutoria (Meta engineer; @raunaqbn) | Matan (GitHub: matanrak; based in Israel) |
| Repo | Not public | work-plane/workplane-skills (MIT); /workplane repo linked but 404s |
| Org created | (n/a — closed SaaS) | work-plane GitHub org created 2026-03-29 |
| Most recent push | (closed) | 2026-05-20 (workplane-skills) |
| Tagline | "Create beautiful docs and webpages with your Agents." | "Turn AI outputs into live pages." / "The working plane between AI and humans." (README) |
| Open source | No (SaaS, closed) | Partial — agent skill is open MIT; main service may be closed |
| Agent integration | Claude Code skill + MCP server + HTTP API; 6 named tools (publish, publish_file, update, read, comment, list_comments) | MCP-first; works with Claude Code, Codex, Cursor, Devin, Claude Desktop |
| Account gates | Required for some imports | Free for individuals; no account required for commenters |
| Endorsements on homepage | Karpathy, Thariq, Ryan Carson | None visible |
Shared workflow shape (the actual convergence):
1. Agent generates HTML/markdown
2. Publish to the tool → live URL with stable identity
3. Humans review with inline comments
4. Agent reads the comments and revises
5. Iteration loop continues until "if good, go build" (Raunaq's framing)
Both tools implement steps 1-5 with MCP as a primary integration path and inline comments as the review surface.
Differences (mostly orthogonal to the workflow):
- html-docs is closed/SaaS; workplane is partially open
- html-docs has high-profile endorsements; workplane has none visible
- html-docs requires accounts for some imports; workplane is free without account
- html-docs is HTML-first; workplane lists markdown, HTML, and screenshots
- Creator visibility differs sharply: Raunaq is named and uses html-docs.com publicly in his Meta workflow; Matan is anonymous on the workplane.co homepage and only surfaces via GitHub commit history
Why this matters:
The pattern (agent ↔ human review loop with publish-and-comment as the primitive) is now empirically observable as something multiple independent producers reach for. Just like F21 named the hosting-layer convergence (short URL + /raw + minimal chrome + honest attribution), F22 names the live-editing-layer convergence:
- Publish endpoint that accepts agent-generated content
- Stable URL identity per published doc
- Inline comments as the review surface
- Agent integration via MCP (in both implementations)
- Version history of the iteration
The MCP common denominator is itself notable — both tools lead with MCP integration, which suggests MCP adoption is the enabling substrate for this layer's convergence. Without a standard agent-to-service protocol, each tool would have to ship bespoke integrations; with MCP, the same skill works against any host.
This is the "canvas step" Capsule explicitly doesn't compete with. F22 names that the canvas step is a real, reproducible layer in the lifecycle — not a one-off product idea.
The composition story is now empirically backed at every layer:
- Live editing: html-docs + workplane converge on the pattern (F22)
- Format / seal: Capsule (this project; multi-producer interop already validated across LLMs + Mintel)
- Hosting: MinDev + htmlbin converge on the pattern (F21)
- Discovery: llms.txt (one major implementation; adoption signal via Chrome Lighthouse rather than convergence)
Four lifecycle layers; convergence-pattern findings at three of them. The format-and-host split + the editing-and-format split + the host-and-discovery split are all real.
Spec implications: None directly. Capsule occupies the seal step downstream of the live-editing layer; the live-editing layer doesn't need Capsule's discipline because the artifact is still mutating. The composition is what matters, not Capsule mandating anything in the upstream layer.
Open questions:
- What happens when a Workplane or html-docs doc graduates to a sealed Capsule? Neither tool currently has Capsule export. Would there be value in proposing a "freeze to Capsule" capability? Not yet — empirical pressure not there.
- Will the live URLs themselves become canonical (no need to seal)? Or will users still want a sealed downstream artifact for archival? Depends on durability of the live-editing tools over years.
- The MCP-as-enabling-substrate observation deserves its own follow-up — is MCP the common denominator across multiple converging patterns? Worth checking against F21 (do MinDev/htmlbin both use MCP-style standardization for upload?).
- Workplane's "the working plane between AI and humans" framing is sharper than html-docs.com's positioning; worth borrowing the layer name itself — possibly rename "live editing layer" to "working plane" in the project's terminology going forward. Defer until the term is stress-tested.
- Will a third live-editing implementation appear, validating this as a proper pattern (per F21's "three or more is a pattern that probably deserves formal documentation" framing)? Track.
Cross-references:
- F21 — the parallel hosting-layer convergence; same shape of finding
- PRECEDENTS.md — Raunaq / html-docs.com + Matan / Workplane entries added to "Current voices in HTML-for-AI"; position table grew to 9 rows
- voices/README.md — queue tracking and graduation rule
F23: URN-not-URL QR encoding — empirical validation of a deliberate spec choice
Date: 2026-05-21
The Core spec (CAPSULE_CORE.md Rule 4 supplementary QR-code guidance) recommends embedding a QR code that encodes urn:uuid:<uuid> — the URN form, not a live URL. The reasoning at the time was that URNs are non-resolvable but honest about being non-resolvable, while URLs encode a host's distribution policy and that policy can change without the format changing. A real-world incident on 2026-05-21 validated this reasoning concretely.
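As a rough illustration of the URN form, the payload string can be built with the Python standard library alone. The function names and the example UUID below are illustrative, not taken from the spec, and rendering the payload into an actual QR image is left out.

```python
import uuid


def qr_payload_for(capsule_uuid: str) -> str:
    """Return the urn:uuid:<uuid> string a QR code would encode."""
    # uuid.UUID rejects malformed input and normalizes to lowercase hyphenated form.
    return uuid.UUID(capsule_uuid).urn


def is_urn_payload(payload: str) -> bool:
    """True when a scanned payload parses as urn:uuid:<uuid>."""
    prefix = "urn:uuid:"
    if not payload.startswith(prefix):
        return False
    try:
        uuid.UUID(payload[len(prefix):])
    except ValueError:
        return False
    return True
```

A scanner that receives such a payload learns the capsule's identity but nothing about where to fetch it, which is the property the guidance is after.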
What happened:
Mintel's build_exploration_map_capsule.py had been encoding https://mindev.ca/c/<uuid> in the on-map QR (the rationale: a phone scanning a printed map could land directly on the live capsule). This was a deviation from the spec — fine in isolation because MinDev was a known host and the URLs worked at the time.
On 2026-05-21 MinDev shipped a security-driven schema change that removed the public visibility tier entirely. Existing public rows migrated to org; the mindev.ca/c/<uuid> URL pattern now returns 403 {"error":"forbidden"} to anonymous callers. Org members keep access via Firebase auth; external recipients need a share-token URL (mindev.ca/api/c/share/<token>) instead.
Immediate consequence: every previously-printed Mintel map carrying a QR pointing at the live URL now resolves to a 403 for any anonymous scanner. The QR didn't break structurally — it still scans, still produces a URL — but the URL has changed semantic meaning. Was: "fetch this capsule". Now: "fetch this capsule if you happen to be authenticated to the right org on the device scanning the code". The producer (Mintel) had no way to know in advance that this change would happen on the host side; the printed maps in the wild can't be recalled.
What this validates:
- URN form for the QR is the right default, not the URL form. URN is honest about being a pointer-without-resolution-guarantee. URL encodes an assumption about host behavior that the format has no business making.
- The format/host split documented in spec/HOSTING.md is real, not theoretical. Format-layer artifacts (capsules) should not bake in host-layer policy decisions (visibility tiers, auth gating, resolution semantics) because those decisions belong to the host and can change without the format changing. The capsule's bytes are identical before and after MinDev's change; what changed was who-can-fetch-them — a pure host concern.
- The deliberate spec choice was correct, even though the URN form is "less convenient" for the immediate scan-and-view use case. Convenience that's contingent on host policy isn't durable; honest pointers that require an extra step are durable. The convenience-vs-durability trade-off has now been demonstrated empirically, not just argued abstractly.
The fallback pattern that does work:
If a producer wants the QR to resolve to a live capsule via the URL form, the right path is:
- Producer asks host to mint a share token at upload time (the Mintel-side ask currently flagged in MINTEL_TODOS.md: ?mint_share_token=true on the upload endpoint, returning a share_url in the response)
- Producer encodes https://<host>/api/c/share/<token> in the QR
- This URL is anonymous-resolvable by design, has revocation, has audit, has expiry, has view-cap — and survives host policy changes because the share-token endpoint exists specifically for anonymous resolution
The URN form remains the right default; the share-URL form is opt-in for cases where the producer explicitly wants anonymous resolution AND has minted a token at build time AND has accepted the share-token's audit/revocation tradeoffs.
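A build script could express that default-plus-opt-in rule roughly as follows. This is a sketch under assumptions: choose_qr_payload and its parameters are hypothetical names, and the share URL shape simply mirrors the https://<host>/api/c/share/<token> pattern quoted above.

```python
import uuid
from typing import Optional


def choose_qr_payload(capsule_uuid: str, host: Optional[str] = None,
                      share_token: Optional[str] = None) -> str:
    """URN by default; share URL only when a token was minted at build time."""
    if host and share_token:
        # Opt-in path: the share-token endpoint exists for anonymous resolution.
        return f"https://{host}/api/c/share/{share_token}"
    # Safe default: a pointer that makes no promise about host behavior.
    return uuid.UUID(capsule_uuid).urn
```

Passing a host without a token still yields the URN, so the URL form cannot be reached by accident.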
Spec implications:
None directly — the spec already says URN. This is post-hoc empirical validation, not a spec change. A one-paragraph addition to spec/HOSTING.md (landed alongside this finding) names visibility tiers as host-side policy and cites this case as the canonical example of why format artifacts shouldn't bake in resolution-semantics assumptions.
Open question:
What should a producer's build script do for capsules whose host visibility is org (where anonymous scan won't resolve)? Three reasonable patterns, currently unsettled across implementations:
- Always encode the URN (safe default; recipient has to type or paste UUID into a host UI to view).
- Encode the URL but add an alt-text/caption like "Sign in to to view" so a scanner knows what to expect.
- Encode the share-URL when a token has been minted (opt-in to anonymous resolution; requires producer to have requested a share token at upload time).
Currently the canonical convention to recommend isn't settled. Worth tracking whether other compiler-kind producers reach for one shape vs. another — if a second independent producer makes a different choice and ships, the convergence (or divergence) becomes a future finding.
Methodological note — the agent-to-agent collaboration pattern:
This finding emerged from a Claude-on-MinDev-side conversation pushed through to a Claude-on-Mintel-side conversation via the user as a human-router. Each agent owned its own system's concerns: MinDev's agent diagnosed the threat model + drove the schema change + posted prod verification; Mintel's agent audited the producer-side fallout + flagged the QR-encoding gap + committed to build-script patches in its own domain. The htmlcapsule project's record (this finding) is then the third surface that absorbs the cross-domain learning. Worth tracking as a pattern: multi-agent + human-router collaboration is producing real research artifacts (this F-finding) faster than a single-agent loop probably would.
Refinement (2026-05-21, F24): The URN-as-default recommendation in this finding is correct for producers without signal about host commitments. The case where a producer knows their target host has declared registry compliance opens a different reasonable choice — encoding the URL becomes a calibrated bet against a published contract rather than a gamble. F24 introduces the host vs. registry distinction and sketches a Capsule Registry Compliance v1 contract in spec/HOSTING.md. The default for general-purpose producers (and for the Mintel build script today, since MinDev has not declared compliance) remains URN; the option to encode URLs becomes available when the destination host has declared compliance.
Cross-references:
- CAPSULE_CORE.md Rule 4 supplementary QR guidance — the deliberate URN-not-URL spec choice
- spec/HOSTING.md "Visibility tiers as host-side policy" — the addition that names this as the canonical example
- F20 — the Mintel Copper Dome capsule observed there used the URL form in its QR
- F21 — the broader host-contract pattern; visibility tiers are one axis hosts vary on
- F24 — the synthesis that refines this finding
F24: Host vs. registry — the missing commitment layer
Date: 2026-05-21
F23 documented the empirical case where Mintel's URL-encoded QR codes broke after MinDev removed the public visibility tier. The first reading of that finding was: URN is the right default; URL was a deviation that bit Mintel. In a conversation following the F23 commit, the maintainer pushed back with a sharper question: at build time, Mintel knows it's uploading to MinDev; the URL form is more useful than URN for the recipient; the failure mode isn't producer error, it's that the host (MinDev) hadn't committed to keeping the URL working. The refined synthesis: the project's format/host split has been treated as "format and host are independent strangers," but in real workflows producers and hosts often want to be coordinated via published contracts. The format itself stays agnostic; some hosts may want to declare more.
The naming move: host vs. registry
- A host serves capsules. No commitments beyond "the bytes go out the way they came in."
- A registry is a host that commits to keeping serving them in a particular way — stable URL patterns, visibility honor, deprecation discipline, attestation headers, no surprise breaking changes.
Hosts can choose to remain just hosts or to declare themselves registries (by publishing a compliance statement at a well-known location). The format takes no position; producers and recipients gain a signal they can act on.
What changes about F23's "URN is the safe default":
- F23's recommendation is correct for the case it documented — producers without signal about host commitments should default to URN because URL form is a bet on unspecified host behavior.
- F23's recommendation is incomplete. When producers know their target host has declared registry compliance — including commitments about URL stability and visibility-tier preservation — encoding the URL becomes a calibrated bet against a published contract, not a guess.
- The Mintel/MinDev case wasn't "Mintel made a mistake by deviating from spec." It was "Mintel made a reasonable bet on MinDev behavior that MinDev hadn't promised to keep." The fix isn't "always use URN" — it's "encode URN by default; encode URL when the host has declared compliance and visibility commitments are part of the declaration."
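The refined rule ("URN by default; URL when the host has declared compliance and visibility commitments are part of the declaration") might look like this in a producer's build step. The declaration shape is an assumption: the source does not define a schema, so the "commitments" key and the names "stable_url" and "visibility" are hypothetical.

```python
import uuid
from typing import Mapping, Optional

# Hypothetical commitment names; no declaration schema is defined in the source.
REQUIRED_COMMITMENTS = {"stable_url", "visibility"}


def host_supports_url_encoding(declaration: Optional[Mapping]) -> bool:
    """True when a host's declaration covers URL stability and visibility."""
    if not declaration:
        return False  # no signal: treat the destination as a plain host
    commitments = set(declaration.get("commitments", []))
    return REQUIRED_COMMITMENTS <= commitments


def qr_payload(capsule_uuid: str, capsule_url: str,
               declaration: Optional[Mapping] = None) -> str:
    """Encode the URL only against a declared contract; otherwise the URN."""
    if host_supports_url_encoding(declaration):
        return capsule_url
    return uuid.UUID(capsule_uuid).urn
```

A declaration that promises stable URLs but says nothing about visibility still falls back to the URN, which is the gap the MinDev incident exposed.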
Sketched Capsule Registry Compliance v1 contract (not yet adopted by any host):
- Stable URL pattern. <host>/<prefix>/<uuid-or-slug>. Pattern doesn't change without a major version bump + redirect period.
- /raw byte-identical endpoint at the URL + /raw. Never mutates the body.
- Visibility commitment is part of the contract. Whatever visibility tier a capsule is uploaded under is honored for the capsule's lifetime, OR migration is announced with notice. Removing a tier without grandfathering existing capsules is a breaking change.
- Host-attestation headers (x-capsule-content-hash, x-capsule-uuid) on every /raw response.
- Honest deprecation. Breaking changes get a public changelog + deprecation window + migration path. Surprise policy changes that break in-the-wild artifacts are out of compliance.
- Capsule immutability. The registry serves the bytes it received. No mutation, no re-rendering, no injection.
Full sketch with proposed well-known location (<host>/.well-known/capsule-compliance.json) and adoption status is in spec/HOSTING.md under "Hosts vs. registries — the optional commitment layer."
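A recipient could check the host-attestation headers from the contract sketch along these lines. The header names come from the list above; the hash encoding does not. Treating x-capsule-content-hash as a lowercase SHA-256 hex digest is an assumption, and verify_raw_response is a hypothetical helper name.

```python
import hashlib
from typing import Mapping


def verify_raw_response(headers: Mapping[str, str], body: bytes,
                        expected_uuid: str) -> bool:
    """Check the attestation headers on a /raw response against its body."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    claimed_hash = normalized.get("x-capsule-content-hash")
    claimed_uuid = normalized.get("x-capsule-uuid")
    if claimed_hash is None or claimed_uuid is None:
        return False
    # Assumption: the hash header carries a lowercase SHA-256 hex digest.
    actual_hash = hashlib.sha256(body).hexdigest()
    return claimed_hash == actual_hash and claimed_uuid == expected_uuid
```

A mismatch means either the body was mutated in transit or the host is not serving the bytes it received, which is what the immutability commitment rules out.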
Mapping the MinDev incident onto the proposed contract:
- MinDev was operating as a host, not a registry, at the time of the public removal.
- The change was security-correct but registry-breaking: existing-public capsules' anonymous resolvability was removed without grandfathering or notice.
- If MinDev had been operating under compliance v1, the public-removal would have required either grandfathering existing capsules at their original visibility OR a major version bump + migration period.
- MinDev can retroactively declare compliance v1 (with the recent change being framed as the v0→v1 migration event itself) or refuse to claim it. Producers like Mintel benefit either way: a declared host is safe to encode URLs against; an undeclared host is not.
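The either/or condition in that mapping reduces to a small predicate. This is only a restatement of the rule as sketched here, with hypothetical parameter names; it is not part of any adopted contract.

```python
def tier_removal_is_compliant(grandfathers_existing: bool,
                              major_version_bump: bool,
                              migration_period: bool) -> bool:
    """Would removing a visibility tier stay within the sketched v1 contract?"""
    # Path 1: existing capsules keep their original visibility.
    # Path 2: the change ships as a major version with a migration period.
    return grandfathers_existing or (major_version_bump and migration_period)
```

Under this reading, the MinDev change as described (no grandfathering, no notice) evaluates to False, while the same change shipped with a migration window would pass.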
Why the host-vs-registry distinction matters more broadly:
The project's layer picture (format / live-editing / hosting / discovery) treats each layer as independent. The compliance layer adds a coordination axis: within a layer, implementations can choose to coordinate via published contracts. Registry compliance is one example; spec/HOSTING.md's descriptive host-contract pattern is another, weaker example. This is how the open web works generally — browsers treat URLs as untrusted by default, but sites can opt into stronger trust by adopting HTTPS / HSTS / CSP / etc. The Capsule project can offer the same opt-in for hosts.
The format/host split stays correct as the baseline; the compliance layer is the upgrade path for hosts that want to be more than baseline.
Spec implications:
- New section in spec/HOSTING.md: "Hosts vs. registries — the optional commitment layer" sketches the compliance contract. Stays descriptive (matching HOSTING.md's overall disposition) — defines what a host could commit to, doesn't force any host to commit.
- F23's "URN-only default" recommendation is refined here, not retracted. Default for producers without signal remains URN. Case where a producer has signal (registry compliance declared) opens the URL option as a calibrated bet.
- No Core spec change. CAPSULE_CORE.md Rule 4 supplementary QR guidance still says URN — that remains the right default for the format itself, which has no opinion on which hosts produce capsules and where they end up. The format stays agnostic; the registry-compliance layer is opt-in at the host's surface, not at the format's.
Open questions:
- Self-declared vs. third-party verified? Self-declared (host publishes its own /.well-known/capsule-compliance.json) has lower friction; third-party verified has stronger guarantees. No empirical pressure yet to pick. Lean toward self-declared for v1 — easier to bootstrap.
- Version-bump discipline for the contract itself? If compliance v1 ships and then needs revision, what's the upgrade path for hosts that have declared v1? Standard semver-shaped questions; deferrable until at least one host signs on.
- Should the format carry a signal about which compliance level its host declared — e.g., a manifest field naming the registry's declared compliance? Honestly leaning no: that would couple format to host, exactly the thing the project deliberately avoids. The compliance declaration belongs at the host's surface (its /.well-known/, its docs), not in the capsule's bytes.
- First adoption? MinDev is the natural first candidate — its recent security change can be framed as the v0→v1 migration event. htmlbin is a second candidate; the personal-sharing host the maintainer is planning is a third. If multiple hosts adopt the same compliance level voluntarily, the contract crystallizes into something a future capsule producer can rely on across hosts.
Methodological note — the pushback was the finding:
F24 didn't come from a tool, a capsule, or an external piece. It came from the maintainer pushing back on F23's framing during a follow-up conversation: "isn't what we are dancing around is the registry being htmlcapsule-spec compliant?" That single sentence reframed F23 from "Mintel made a mistake" into "the project lacks a host-commitment layer." Worth tracking as a research-method observation: the project's most useful conceptual moves are sometimes made by the maintainer pushing back on a finding's first framing, not by new external pressure. F23's empirical event was necessary but not sufficient; the synthesis required the conversational refinement.
Cross-references:
- F23 — the precipitating finding; URN-as-default refined here, not retracted
- spec/HOSTING.md "Hosts vs. registries" — the compliance contract sketch this finding motivated
- F21 — host-pattern convergence; the compliance contract makes explicit what F21 observed implicitly
F25: ChatGPT producer-population reads Core supplementary guidance reliably; aesthetic adapts to content domain; legacy "Artifact Capsule" wording persists in user-side prompt templates
Date: 2026-05-21
A batch of 7+ ChatGPT-generated capsules (GPT-5.5 Thinking; conversation summaries across varied domains: hands-free coding workflows, geological target reinterpretation, Indigenous-rights conversation, design-award fit assessment, propane fire-pit purchase brief, Kia Sedona vs pickup decision, Swedish mining permits, Colombian pension-refund letter) was reviewed against Core v0.3.0. All were produced from the user's prompt template "Produce an Artifact Capsule per the Core spec (attached) summarizing this conversation."
This is the largest single-batch empirical sample of a single LLM producer kind working from Core v0.3.0 to date. Five distinct findings emerged.
1. All five required blocks present. Rule 2 (no network) and Rule 12 (pre-rendered content) honored across every capsule in the batch. Multi-producer interop validated yet again at scale.
2. Rule 4 supplementary QR-code guidance followed faithfully across the population. Every capsule embeds a QR encoded as urn:uuid:<uuid> (per F23's URN-not-URL choice), placed top-right in the header, sized 88×88 px (Core suggested 80–96 px), with image-rendering: pixelated, a data:image/png;base64,... URI, alt="QR code for capsule UUID <uuid>", and a <figcaption> showing the UUID's first 8 chars. This is independent reproduction of supplementary-guidance compliance, not just compliance with the twelve numbered rules. When Core writes implementation-recipe-shape guidance — specific placement, exact sizing, a Python qrcode-library code example — LLM producers follow it precisely. This strengthens the F18/F19 multi-producer interop claim into a sharper version: Core's supplementary sections are load-bearing in practice, when they're written as recipes.
3. Where Core leaves implementation open, producers diverge with their own conventions:
- Integrity hash: Core explicitly excludes integrity from required content (it lists "Content hash protocol and integrity verification" under "What this short spec does NOT cover"). Producers correctly skip the integrity.content_hash field. Behavior matches design — not a conformance gap, just integrity remaining opt-in for capsules that don't need tamper-detection.
- Rule 7 verification convention: Core says "Capabilities don't lie. Every capability declared in the manifest must have a working implementation" but doesn't specify HOW to mark the implementations. The validator's data-capsule-action="<cap>" heuristic was added in v0.3.2 (surfaced by external compiler-kind producers and recorded in v0.3.2 release notes) — but that convention lives only in the validator, not in Core. LLM producers in this batch each invent their own pattern (data-action="copy-json", id="copy-json", plain DOMReady listeners). The capabilities ARE implemented and DO work in the browser; only the validator's auto-verification heuristic misses them. Research signal: if a Rule 7 verification convention should reach LLM producers, it needs to be in Core (or in the produce-prompt fragment), not only in the validator. A validator heuristic alone doesn't propagate back to producer kinds that aren't being validated by that exact tool.
4. Producer aesthetic adapts to content domain. Genuinely new observation. Capsules in the batch use distinctly different visual treatments per subject:
- Geological target report → warm earth tones (sand/cream), serif headings, sample-card layout that nods at field-notebook aesthetics
- Propane fire-pit purchase brief → warm cream/orange/brown palette, fire-appropriate vibe, decision-matrix table
- Legal identification-update letter → neutral clean palette, legal-document neutrality
- Sweden mining permits → cream with green accents (tonal nod to Swedish flag)
- Voice/coding workflow → tech-blue with sans, system-font, dashboard feel
- Indigenous-rights conversation → cool grey/blue editorial
- Vehicle-decision brief (Sedona vs pickup for camping) → warm woodsy/camping cream-brown
- DNDA fit assessment → modern blue-grey editorial
The format constrains structure (five blocks, twelve rules, no network) but does not constrain visual design at all, and producers exploit that to make Capsules feel domain-appropriate. The aesthetic is part of what's being archived — a reader opening a geological capsule five years from now will see it in the visual register the producer thought matched the subject, which is itself a form of preservation. This is unspecified-but-useful emergent producer behavior. For project posture: if a future "house theme" became tempting (one stylesheet to rule them all), this is the data point that argues against constraining it. Producers treating Capsules as design objects (not just data containers) is doing useful preservation work that a uniform stylesheet would erase.
5. Legacy "Artifact Capsule" terminology persists in user-side prompt templates. All capsules in the batch have prompt_received containing "Produce an Artifact Capsule per the Core spec (attached)…" — using the v0.1 name that was renamed to just "Capsule" in v0.2 (see GLOSSARY.md and spec/CAPSULE_SPEC.md naming-history notes). The Core spec itself uses "Capsule" everywhere — its own produce-prompt template (CAPSULE_CORE.md §"How to ask an LLM to produce a capsule") says "Produce a Capsule" — so the legacy term is propagating via the user's stored prompt template, not via the spec. Project response (this commit): added an explicit "use the canonical name" reminder immediately above Core's produce-prompt section, with a back-reference to this finding. Doesn't change spec rules; closes the loop by making the canonical name unmissable to anyone templating their own prompts. The producer-side field values are accepted under legacy v0.2 compatibility per the naming notes in the full spec.
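The Rule 7 gap described in finding 3 is easy to reproduce with a marker scan. The following is a rough sketch of what a data-capsule-action heuristic might do, not the reference validator's code; ActionMarkerCollector and unmarked_capabilities are hypothetical names.

```python
from html.parser import HTMLParser


class ActionMarkerCollector(HTMLParser):
    """Collect every data-capsule-action value found in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.markers = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names before calling this hook.
        for name, value in attrs:
            if name == "data-capsule-action" and value:
                self.markers.add(value)


def unmarked_capabilities(html: str, declared: list) -> list:
    """Return declared capabilities that have no matching marker in the HTML."""
    collector = ActionMarkerCollector()
    collector.feed(html)
    collector.close()
    return [cap for cap in declared if cap not in collector.markers]
```

A capability wired up through data-action or a plain id is reported as unmarked here even though it may work in the browser, which matches the behavior observed in this batch.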
Cross-references:
- F18 — peer-review framing of multi-producer interop
- F19 — Claude Design as first independent LLM-kind producer reaching conformance from Core alone
- F20 — Mintel as first compiler-kind production capsule
- F23 — URN-not-URL QR encoding choice; this batch is the largest sample confirming producers respect that default
- CAPSULE_CORE.md "How to ask an LLM to produce a capsule" — the produce-prompt where the "use canonical name" reminder was added
F26: Core spec accommodates 10 MB domain-specific media capsules without rule changes
Date: 2026-05-21
Source: One-off domain-specific song capsule experiment — Paul McCartney & Wings, "Nineteen Hundred and Eighty-Five" (1973). A 7.6 MB MP3 plus Wikipedia-sourced metadata (personnel, role on Band on the Run, covers, critical reception, live history, composition genesis quote) plus a transcribed lyric sheet, sealed as a 10.16 MB self-contained HTML capsule (UUID e26b58da-a3b2-4675-aa33-78511ad93e60, currently at capsule_version 1.1.0). Shipped 25/25 against the reference validator on first build with zero spec changes required.
Finding. Core spec v0.3.0 plus the existing supplementary recipes (QR convention, CSP defaults, capability vocabulary) is sufficient for domain-specific binary-media capsules at the 10 MB scale. Spec held at every dimension tested:
- Domain via type field: type: "song" — Core accepts arbitrary domain values without modification; the producer-population pattern from F25 extends straightforwardly to media domains.
- CSP delta is minimal: media-src data: is the only addition required to the default recipe (default-src 'none' baseline; img-src data: already present for QR codes). No new directives.
- Capability vocabulary extends naturally: media.play_audio and media.download_audio follow the established <domain>.<action> convention added in v0.3.2; Rule 7 markers (data-capsule-action="media.play_audio" on the <audio> element) were validated heuristically as expected.
- Size limits hold under-cap: 10.16 MB sits under both the 15 MB soft-warn tier and the 20 MB hard cap. Reference validator does not penalize binary-heavy capsules under-cap; size scaling is graceful.
- Round-trip extraction works: the media.download_audio capability recovers the embedded MP3 byte-identically via a dataUriToBlob runtime helper — the file is genuinely portable, not just embedded-for-display. The downstream "this is just a file" mental model from F21 holds even for 10 MB media.
- Aesthetic adapts to content domain: producer (hybrid: Claude Opus 4.7 + maintainer) selected warm earth tones (#f4ebd9 background, #b8421a accent) appropriate to a 1970s rock recording. Extends F25's "aesthetic adapts to content domain" finding from text-only ChatGPT capsules into hybrid-produced media capsules — the pattern is producer-kind-agnostic.
- Version semantics in practice: the song capsule went v1.0.0 (audio + metadata only) → v1.1.0 (added transcribed lyric sheet) on the same UUID via capsule_version bump alone. No parents[] chaining was needed because nothing was distributed between versions.
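The round-trip and size-tier points can be illustrated with a build-side analogue in Python. This is a sketch under assumptions: the capsule's own dataUriToBlob helper runs in the browser and is not reproduced here, the function names are hypothetical, and treating "MB" as decimal megabytes with exclusive thresholds is a guess the source does not settle.

```python
import base64

MB = 1_000_000  # assumption: decimal megabytes; the source does not say which
SOFT_WARN_BYTES = 15 * MB
HARD_CAP_BYTES = 20 * MB


def to_data_uri(payload: bytes, mime_type: str = "audio/mpeg") -> str:
    """Embed binary media as a base64 data: URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def from_data_uri(data_uri: str) -> bytes:
    """Recover the original bytes from a base64 data: URI."""
    header, _, encoded = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data: URI")
    return base64.b64decode(encoded)


def size_tier(capsule_bytes: int) -> str:
    """Classify a capsule's size against the soft-warn and hard-cap tiers."""
    if capsule_bytes > HARD_CAP_BYTES:
        return "over-cap"
    if capsule_bytes > SOFT_WARN_BYTES:
        return "soft-warn"
    return "ok"
```

Because base64 is lossless, decoding the data URI returns exactly the bytes that were embedded, which is the byte-identical property the download capability relies on.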
Implication for the spec. Core is not under-specified for binary-media capsules at this scale. No new rules needed; no Core changes triggered. The fidelity gradient between LLM-produced and compiler-produced capsules (per F25) remains the open work, not size or domain scaling.
Implication for parked Appendix E.11 fields. The song-with-lyrics-added scenario lived through the exact use case the parked supersedes[] / derived_from[] / change_summary fields (raised by external review, parked in spec Appendix E.11 pending real-producer pressure) would address — same UUID, content change worth signaling to downstream holders, current solution is just a capsule_version bump. Since the capsule was not distributed between v1.0.0 and v1.1.0, the parked fields stayed parked correctly: empirical pressure point is now recorded for the next time a producer needs to signal "this supersedes my previously-shared v1.0.0" without minting a new UUID.
Cross-reference. The producer for this capsule was the in-conversation Claude Opus 4.7 hybrid pattern (the same producer pattern as the project's landing page itself, per its generator block). This is the first F-finding from a deliberately one-off, domain-specific, copyright-laden capsule that was not committed to the public repo — a different empirical-pressure source than F25's open-corpus producer population, and a useful complement.
Related findings:
- F19 — Claude Design as first independent LLM-kind producer reaching conformance from Core alone
- F20 — Mintel as first compiler-kind production capsule
- F25 — producer-population reads supplementary guidance; aesthetic adapts to content domain (the source claim this finding extends to media)
- Spec Appendix E.11 — parked supersedes[] / derived_from[] / change_summary fields awaiting empirical pressure
F27: The landing-page genre tension for applied-research projects resolves by splitting, not merging
Date: 2026-05-22
Source. The May 2026 landing-page exploration arc on this project — from index.html v10.x through v13.0.0, plus four comparison sketches (landing-sketch.html v1/v2, research-sketch.html v1/v2, positioning-sketch.html) — and three independent external reads (devil's-advocate critique pass, the in-flight Claude landing-agent's hero pick during the parallel-sketch experiment, and a ChatGPT Deep Research site survey). The maintainer captured the tension directly during the arc: "it's hard to create a landing page for something which is, at its heart, research, albeit applied."
Finding. A landing page for an applied-research project pays a real cost trying to do both jobs at once. Landing pages convert (one claim, one CTA, one demo, optimize for click); research pages persuade (cite everything, walk the argument, optimize for "you can verify this"). When a single page tries to do both, it pays both costs and converts on neither. The exploration arc tried all three pure-genre commitments plus the hybrid before settling:
- Hybrid (research narrative + landing elements in one page) — index.html through v12.0.0. Numbered Observations / Questions / Answers + nine hero candidates + CTAs + research apparatus all on one surface. The genre tension was visible to every reader: research apparatus showed through landing veneer; landing apparatus interrupted research depth. Both genres paid for the other.
- Pure landing (Stripe / Linear stripped) — landing-sketch.html / landing-sketch-v2.html. Conversion-shaped, ~9% of the prose volume. Lost the research argument; "research project" signal collapsed to "yet another file format."
- Pure research-paper (NeRF-style academic) — research-sketch.html / research-sketch-v2.html. Author block / abstract / numbered findings / methods / related work / cite-this-work. Lost the conversion shape; the word "Abstract" reads as "not for you" to non-research audiences.
- Synthesis (positioning-led, lifecycle-diagram-centered) — positioning-sketch.html. Pain-first hero ("Your AI work shouldn't die when the chat closes."), lifecycle SVG as the centerpiece. The most novel of the single-page options; still asks one URL to carry both audiences.
The resolution that worked. The two-page split — listed as "Option B" / "Option D" during the exploration but consistently underweighted because splitting feels like the hedge move. It isn't. The production landing (index.html at v13.0.0, UUID 7d1a1ac8) is the pure-landing commit, optimized for conversion. The full research-narrative is preserved as a separately-accessible page (exploration.html, UUID 881fed04), optimized for depth. Same UUID lineage (via parents[]); distinct identities. Each page is genre-pure; each page pays only its own genre's cost.
Implication. The framing "decide between landing-genre and research-genre" was wrong all along — it presumed one URL. The right framing was "decide which page is which." The genre tension dissolves when you stop asking one URL to carry both audiences.
Generalization. This pattern likely transfers to any applied-research project with a mixed audience (technical / general / research-leaning). Front door optimized for "what is this and why should I care in 30 seconds"; deep page optimized for "I'm bought in and want the full argument with citations." Cross-link explicitly. Don't try to merge.
Method observation. Three independent reads converged on the split: the devil's-advocate critique, the landing-agent's hero pick (which selected "HTML you can keep." as the strongest single claim, implying genre commitment), and the ChatGPT Deep Research review (which framed the project as research that doesn't need a sales-y landing). When multiple independent reads converge on a structural conclusion that you'd been resisting (because it feels like a hedge), that convergence is a stronger signal than any single read. Worth tracking as a methodological pattern: external review convergence on a structural decision is empirical pressure even when the decision feels like cowardice.
Related findings:
- F18 — peer review as a source of structural framing pressure (same kind of empirical-pressure pattern operating at the spec level)
- F24 — same split-instead-of-merge pattern at the hosting layer; when two roles are tangled, split first
- F25 — the maintainer's research-method post-mortem on external-LLM review as a recurring source of design pressure
- [CHANGELOG [Landing decision, v13.0.0]](CHANGELOG.md): the operational record of the commit, including the parents[] chain and the role assignment for exploration.html
- design/proposal.html: the design memo from Claude Design that synthesized the landing direction (anti-context-loss pain framing + lifecycle layer + indigo brand) before the commitment
F28: Producers reach for Capsule-shape independently when given the idiom but not the spec — empirical pressure for discoverable onboarding
Date: 2026-05-22
Source. Review of a ChatGPT-produced MIDI capsule POC (Mozart Lacrimosa, ~220 KB; preserved at capsule-midi/proofs/lacrimosa-chatgpt-poc.html). The user asked ChatGPT to "make a DAW-like HTML capsule from this MIDI" without attaching Core as a prompt fragment.
Finding. Without Core attached, the LLM producer (ChatGPT in this case) independently reached for the Capsule idiom — single-file HTML, embedded JSON manifest, schema declaration (capsule_schema: "midi-stem-capsule-v0.1"), parents[] array (with composition reference), sha256 of source bytes, honest license_note with "verify before redistribution" caveat — but missed the Capsule specifics:
- ❌ Single <script id="capsule-json"> block containing both manifest + data, instead of the five separate spec-required blocks (capsule-manifest, capsule-data, capsule-style, capsule-root, capsule-runtime)
- ❌ No integrity hash block
- ❌ No data-capsule-action markers on UI buttons (Rule 7: declared capabilities have no implementation-binding convention)
- ❌ No CSP <meta> block
- ❌ Empty <div id="lanes"> / <div id="facts"> populated only at runtime (Rule 12 borderline)
Validator result: 5/10 pass, 1 warn, 4 fail — the basic-shape checks pass (HTML5 doctype, html/body, no external network refs, well-formed runtime JS, under-cap size), but every structural check fails (5-block requirement, manifest section parseable, data section parseable, content hash verifies).
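The five-block structural check that the POC fails can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the reference validator's code: it assumes the five blocks are identified by their id attributes, and the names missing_blocks and REQUIRED_BLOCK_IDS are hypothetical.

```python
from html.parser import HTMLParser

# Block names taken from the finding above; matching them by id is an assumption.
REQUIRED_BLOCK_IDS = (
    "capsule-manifest",
    "capsule-data",
    "capsule-style",
    "capsule-root",
    "capsule-runtime",
)


class _IdCollector(HTMLParser):
    """Collects every id attribute seen on a start tag."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def missing_blocks(html_text):
    """Return the required block ids that do not appear in the document."""
    collector = _IdCollector()
    collector.feed(html_text)
    return [block_id for block_id in REQUIRED_BLOCK_IDS if block_id not in collector.ids]


# A stand-in for the POC's shape: one combined JSON block, runtime-populated divs.
poc_like = (
    '<!doctype html><html><body>'
    '<script id="capsule-json" type="application/json">{}</script>'
    '<div id="lanes"></div><div id="facts"></div>'
    '</body></html>'
)
print(missing_blocks(poc_like))
```

For a document shaped like the POC, every one of the five ids is reported missing, which matches the "reached for the shape but missed the specifics" pattern.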
Companion to F25. F25 observed producers with Core attached reliably follow supplementary guidance. F28 observes producers without Core attached reach for the shape but miss the specifics. Together: Core works when attached as a prompt fragment; when not attached, the idiom is reached for organically but the specifics are reinvented.
Implication for the spec — discoverable onboarding is empirically warranted. The Capsule shape is a real attractor — LLMs reach for it even without prompting — but they can't reproduce the structural specifics without seeing them. Possible spec-level responses:
- Extend /llms.txt to publish Core as a paragraph-level summary plus a link to the full Core, so any LLM doing web research on htmlcapsule.org lands on the discipline naturally. Cost: small. Benefit: every LLM that's done its own research has Core in context.
- Publish a one-page "Producer starter kit" (Core + minimal example + the most common producer mistakes: 5-block vs single-json, missing Rule 7 markers, etc.) at a stable URL discoverable from llms.txt. Cost: medium. Benefit: producers without Core fall back to a clear failure mode (the starter kit) rather than reinventing.
- Document the "reached for the shape but missed the specifics" failure pattern in spec/CAPSULE_SPEC.md as a known gap, with the response being "attach Core; without Core attached, expect 5/10 at the validator." Cost: very small. Benefit: sets accurate expectations.
The maintainer's pick (per capsule-midi/FEEDBACK.md): option 1 is the smallest and most discoverable. Worth doing as part of the next operational pass.
Methodological side-finding. This is the second time a producer-side experiment has yielded research-record material that crosses back into spec design. The pattern is now visible:
producer attempts a domain → hits a friction → friction is logged in producer's FEEDBACK.md → harvested into htmlcapsule's RESEARCH.md as an F-finding → may trigger a spec change
This is the cross-project memory pattern the producer projects (capsule-midi, Shasta, capsule-photo, Mintel) use to feed empirical pressure back into the spec without unilaterally inventing changes. Worth naming as a deliberate methodology — call it upstream feedback discipline. The producer projects own the friction; the spec project owns the response.
Related findings:
- F19 — Claude Design reached conformance from Core alone (Core attached → producer succeeds)
- F25 — ChatGPT-with-Core-attached reads supplementary guidance reliably (Core attached → producer succeeds at specifics)
- F26 — Core spec accommodates 10 MB domain-specific media capsules without rule changes (the song capsule experiment; same Lacrimosa POC seeded this line of research)
- capsule-midi: the producer project that surfaced this finding; raised in its FEEDBACK.md as item F-A before being filed here
F29: iOS QuickLook surfaces graceful degradation as a first-class spec principle, not just a Rule 12 implication
Date: 2026-05-22
Source. Two pieces of empirical pressure converging:
- The capsule-midi v0.2.0 producer template added a <noscript> warning naming iOS Files-app QuickLook explicitly: "Audio playback & interactivity require JavaScript. On iOS: this file is currently in the Files-app preview. Tap the share icon and choose Open in Safari to enable playback." That's a producer-side adaptation to a real distribution-environment constraint.
- An external strategic-review discussion (preserved in the parent chat thread) argued at length that iOS QuickLook is the canonical hostile environment Capsules should design for, not against, and proposed promoting graceful degradation to a first-class design principle with manifest-level declarations and per-domain guidance.
The actual environment. iOS Files / Mail / Messages / AirDrop / iCloud Drive / Notes preview surfaces route HTML attachments through Apple's Quick Look framework. Quick Look is a passive preview system — it renders HTML/CSS but does not execute <script> tags. This is a defensible security posture (untrusted attachment HTML running JS from every preview surface would create real attack vectors) but it means a capsule whose substance lives in the runtime fails the iOS-preview first impression.
Finding. The spec already covers most of this but doesn't surface it as the design discipline it's pointing at. What's already there:
- Rule 12 (CAPSULE_CORE.md): pre-rendered content must exist in HTML before JS runs.
- spec/CAPSULE_SPEC.md §2.3 Rendering Model: explicitly names "iOS Files / QuickLook preview" as a JS-restricted target environment; documents the image-fallback pattern with worked example; articulates "interactive archive (permitted) vs app (forbidden)" with the JS-off litmus test.
- domain.exploration_map in DOMAIN_CAPSULES.md: image-fallback for visualization geometry is documented as a per-domain pattern.
What's missing:
- No machine-readable fallbacks manifest field. Producers handle fallbacks ad-hoc in HTML; consumers (validators, registry viewers, downstream tooling) can't programmatically discover "this capsule has a preview-audio fallback at index X."
- Per-domain fallback guidance only formalized for domain.exploration_map. Other domains (domain.midi_stem, domain.song, domain.photo) need explicit guidance about what their JS-off representation should be.
- The three-mode taxonomy is implicit. §2.3 articulates the JS-off litmus but doesn't name the architectural framing the pasted discussion landed on: a capsule should degrade from runtime (full JS app) → document (readable artifact) → preview (consumable media or static representation).
- iOS QuickLook is mentioned but not centered as the canonical hostile environment to design for.
Architectural alternatives evaluated and rejected (so the rejection is on the record):
- Package format (.capsule / .dawcapsule / .zip with index.html). Violates the load-bearing single-file promise. The whole point of the format is that the artifact passes through any document-passing surface (email, AirDrop, USB, Slack attachment, browser save) as one file. Splitting into a folder structure forfeits that.
- Native iOS Capsule Viewer app. Out of scope. The project has stayed format-only by design; a canonical viewer app would compete with "open in any browser" and create platform lock-in.
- Hosted viewer as required runtime. Reasonable as a downstream tool but breaks the offline / one-file promise if the capsule requires the viewer to be useful. Fine as an "open in browser → richer interaction" escape hatch; not fine as a precondition.
The principle worth promoting. The pasted discussion's sharpest framing:
A capsule should never become useless when JavaScript is unavailable. It should degrade from app → document → preview.
The spec says this in two paragraphs of §2.3; this single sentence is the version worth elevating to a section tagline.
Implication for v0.3.6. Three concrete additions queued for the next spec release:
- Generalize §2.3 image-fallback into a domain-agnostic JS-off fallback pattern. Add the tagline above. Add iOS QuickLook as the named canonical environment.
- Add a recommended (not required) fallbacks manifest field. Shape: { preview_audio, poster_image, static_summary_present, requires_js_for, preview_mode_description }. All optional. Lets producers declare what's there without forcing a structure on producers who don't have anything to fall back to.
- Per-domain fallback guidance in DOMAIN_CAPSULES.md. For each domain (existing + idea-queue): name the recommended JS-off representation. Examples: domain.midi_stem → bundled rendered audio mix as <audio controls>; domain.song → the embedded MP3 already IS the fallback (explicit note); domain.photo → the image itself is the fallback; domain.exploration_map → already documented (image-fallback for geometry).
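To make the proposed fallbacks field concrete, here is a rough sketch of how a consumer might read it. The five key names come from the shape listed above; everything else is an assumption for illustration: the value types, the helper names unknown_fallback_keys and declared_fallbacks, and the example manifest are hypothetical, since the field has not been specified yet.

```python
# Key names from the proposed shape; all are optional.
FALLBACK_FIELDS = {
    "preview_audio",
    "poster_image",
    "static_summary_present",
    "requires_js_for",
    "preview_mode_description",
}


def unknown_fallback_keys(manifest):
    """Return keys under manifest['fallbacks'] that are outside the proposed shape."""
    fallbacks = manifest.get("fallbacks", {})  # the whole field may be absent
    return sorted(set(fallbacks) - FALLBACK_FIELDS)


def declared_fallbacks(manifest):
    """Return the proposed keys a producer actually declared with a truthy value."""
    fallbacks = manifest.get("fallbacks", {})
    return sorted(key for key in FALLBACK_FIELDS if fallbacks.get(key))


# Hypothetical domain.midi_stem manifest; the value types are assumed, not specified.
midi_manifest = {
    "fallbacks": {
        "preview_audio": True,
        "static_summary_present": True,
        "requires_js_for": ["playback controls"],
        "preview_mode_description": "Rendered audio mix via <audio controls>",
    }
}

print(declared_fallbacks(midi_manifest))
print(unknown_fallback_keys({"fallbacks": {"poster": "cover.png"}}))
```

Because every key is optional, a manifest with no fallbacks field yields empty lists from both helpers, which keeps the field recommended rather than required.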
Methodological observation. The pattern that produced this finding is now recurring: the capsule-midi producer-side adaptation preceded the spec change. The <noscript> block in templates/capsule.html.tpl was the producer's response to a real environment constraint; the spec catches up by formalizing the principle. This is the upstream feedback discipline named in F28 working in the opposite direction: not "spec change first, producer follows" but "producer adapts to environment first, spec generalizes the pattern." Both directions are healthy and worth tracking.
Related findings:
- F28 — the upstream feedback discipline named; this finding is its first deliberate application in the producer-adapts-first direction
- F20: the image-fallback carve-out in domain.exploration_map was the precursor pattern that this finding generalizes
- capsule-midi/templates/capsule.html.tpl: the producer template with the iOS-QuickLook <noscript> warning that surfaced the gap
- spec/CAPSULE_SPEC.md §2.3: the existing rendering-model section that the v0.3.6 generalization will extend
Open questions
In rough priority:
Q1: Does the atomic-unit framing hold across genuinely different domains? (Substantially answered)
The format has working artifacts in at least five domains:
| Domain | Data shape | Production path | Status |
|---|---|---|---|
| Decision board | records[] | Compiler | working (reference template) |
| News annotation | records[] | Compiler | working (reference template) |
| Conversation synthesis | single-document | Pure LLM in chat | working (~30+ capsules across multiple batches) |
| Property-scale map | feature collection | Hybrid (build script) | working (illustrative + real-data instances) |
| Photograph + audio note | single-document with photo object | Hybrid (build script) | working |
| Implementation notes | single-document | LLM or hybrid | documented in DOMAIN_CAPSULES.md (Thariq-pattern) |
| Design system | single-document | LLM or hybrid | documented in DOMAIN_CAPSULES.md (Thariq-pattern) |
| Exploration map | feature collection w/ raster option | Compiler | documented in DOMAIN_CAPSULES.md (third-party producer) |
Eight documented domains, three production paths, three data shapes, all sharing the same five-block envelope. The framing holds. Remaining open question is whether more exotic domains strain the format (journal entries, recipes, scanned letters, voice-only notes, video clips, log files).
Q2: Can the author-side archive be light and still useful?
The previous "biggest gap" framing put the import-side build as a heavyweight registry + ingestion system. F7 dissolved most of that — the lightweight version (SQLite archive + pair viewer) handles the actual common case. Still unbuilt; still a candidate next concrete build.
Q3: How does the format behave under cross-browser file:// constraints?
All browser testing to date has been via local HTTP. Safari, Firefox, and Chrome have different file:// security policies. Specifically: clipboard API availability, localStorage/IndexedDB behavior, inline font and SVG handling under strict CSPs. The format should work identically on file:// and http:// per spec — empirically this is undertested.
Q4: Does the spec need a content-hash protocol that LLMs can actually compute?
The canonical-JSON content hash is unreproducible by LLMs (which don't reliably canonicalize JSON). LLM-produced capsules omit it. The spec correctly degrades to a warning, but this means LLM-produced capsules are fundamentally less verifiable than compiler-produced ones. Is there a hash protocol that an LLM could plausibly compute correctly? Open.
Q5: Will the fidelity gradient hold under adversarial use?
What if an LLM produces a capsule that claims generator.kind: "compiler" (i.e., lies about its production path)? The validator can't catch this — it's a self-declared field. A capsule that claims to be compiler-produced but has malformed integrity hash would fail integrity verification, but a capsule that just omits the integrity block and claims to be compiler-produced would pass with a warning. The trust model assumes good faith. Real-world deployment may not have it. The E.6 transparency-log direction would partly address this.
Q6: How big does the spec need to be?
The full CAPSULE_SPEC.md is ~1500 lines including v0.4 candidates (Appendix E). The Core is ~120 lines. The Core demonstrably works as an LLM prompt. Does the full spec earn its weight, or could it be trimmed without loss? Open question for a future audit.
Recurring LLM authoring failures
Across multiple personal-capsule batches (20+ capsules across four spec versions), several classes of bug have recurred.
Primary recurring failure: string-literal escape errors in markdown export functions
The pattern: LLMs reach for newline characters when generating string-building JavaScript and get the escape level wrong. Either over-escape ("\\n" becomes literal backslash-n in output) or under-escape (raw line terminator inside a non-template string literal, which is a SyntaxError that kills the entire runtime silently).
The validator originally couldn't catch this because the runtime is treated as opaque text by the manifest/data parser path. A capsule with a broken runtime could pass 18/21 + 3 warn + 0 fail while having zero working buttons.
Trajectory across spec versions:
| Batch | Spec version | Mitigation | Bug recurrence |
|---|---|---|---|
| 1–5 | v0.1.0 | none | 1/5 |
| 6–10 | v0.1.0 | none | 2/5 |
| 11–15 | v0.1.1 | prose tip in prompt fragment | 1/5 |
| 16–20 | v0.1.2 | promoted to numbered rule 11 + WRONG/RIGHT code example | 0/5 |
Finding: Promoting the rule from prose guidance to a numbered first-class rule with a concrete code example dropped recurrence from 1/5 to 0/5 in the next batch. All five v0.1.2 capsules used backtick template literals for the markdown export. One batch isn't proof, and the two unmitigated batches moved the wrong way (1/5 to 2/5), but from the first mitigation onward the trajectory (2/5 → 1/5 → 0/5) is monotone improvement and consistent with the hypothesis that LLMs follow mechanical syntax-level rules better than content-level "be careful" prose.
Belt-and-suspenders mitigation in v0.1.2: the validator also grew a regex check for the bug pattern (join("/join(' followed by a raw line terminator) inside the runtime block.
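A check of this kind can be approximated with a single stdlib regex. The sketch below is an illustration of the described pattern only; the pattern text and the name has_broken_join are hypothetical and are not copied from compiler/validate.py.

```python
import re

# join(" or join(' followed directly by a raw line terminator.
BROKEN_JOIN = re.compile(r"""join\(["'][\r\n]""")


def has_broken_join(runtime_js):
    """True when the runtime text contains the under-escaped join pattern."""
    return BROKEN_JOIN.search(runtime_js) is not None


# The Python literals below stand in for JavaScript source text.
wrong = 'const md = lines.join("\n");'     # real newline inside the JS string: SyntaxError
right = 'const md = lines.join("\\n");'    # the JS source keeps the two-character escape
template = 'const md = lines.join(`\n`);'  # a raw newline is legal in a template literal

for label, source in [("wrong", wrong), ("right", right), ("template", template)]:
    print(label, has_broken_join(source))
```

Only the first sample is flagged. The template-literal form passes, consistent with the backtick approach the v0.1.2 capsules used. Like any regex heuristic, this catches the one named pattern and nothing else.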
Secondary recurring failure: spec_version cargo-cult from example block
A separate, lower-stakes authoring slip appeared in some LLM batches. The LLM correctly recorded source.spec_received: "v0.1.2 · 2026-05-16" (the Core version line it actually received) but set manifest.spec_version: "0.1.0" — cargo-culted from the example manifest block in the Core, which still showed the old version.
Three mitigations landed together:
- Core's example manifest bumped to match the current spec_version so producers see the right value to copy.
- Rule 4's spec_received example reminds producers that the two fields should match.
- Validator added a cross-check: when both spec_version and source.spec_received are present, they must agree on the version.
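The cross-check can be sketched as follows. This is a minimal version under stated assumptions: it assumes both fields carry a dotted three-part version somewhere in their string (as in the "v0.1.2 · 2026-05-16" example above), and the name versions_agree and its None return for "check does not apply" are hypothetical choices, not the validator's actual behaviour.

```python
import re

# Dotted three-part version such as 0.1.2; dates use hyphens, so they do not match.
VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def versions_agree(manifest):
    """Return True or False when both fields are present, None when the check is skipped."""
    spec_version = manifest.get("spec_version")
    spec_received = manifest.get("source", {}).get("spec_received")
    if not spec_version or not spec_received:
        return None  # one field missing: nothing to compare
    declared = VERSION_RE.search(spec_version)
    received = VERSION_RE.search(spec_received)
    if not declared or not received:
        return False  # a field is present but carries no recognizable version
    return declared.group() == received.group()


# The cargo-cult case described above: received v0.1.2, declared 0.1.0.
cargo_cult = {
    "spec_version": "0.1.0",
    "source": {"spec_received": "v0.1.2 · 2026-05-16"},
}
print(versions_agree(cargo_cult))
```

Skipping the check when either field is absent follows the wording above ("when both ... are present"), so capsules that omit the optional field are not penalized.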
Tertiary recurring failure: JS-render-everything pattern (the constrained-renderer problem)
The most architecturally significant failure. Discovered in the photo capsule when AirDropped to iPhone — see F14 for full writeup. Spec response: Core v0.1.3 rule 12 — render content in the HTML at build time, not at runtime. Same numbered-rule + WRONG/RIGHT-example pattern that dropped the rule 11 bug class to 0/5. Empirically validated on two consecutive batches under v0.1.3 (10/10 PASS).
Quaternary recurring failure (mild): over-broad CSP directives
Pattern across two v0.1.3 batches: ~30% of capsules add defensive CSP directives (media-src, font-src, blob:) that the capsule doesn't actually use.
Severity: mild. Over-broad CSPs don't break anything — they just permit more than the capsule actually exercises. From a security standpoint they're still very restrictive (everything is 'none' or data: only — no host allowed). From a self-documentation standpoint they over-promise.
Spec response (still deferred): the pattern is consistent but consistently low-severity. No Core/spec change motivated yet. If a capsule ever declared 'self' or a host (which would be a real loosening), that would warrant a rule. Over-declaring directives that stay data:-only doesn't.
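The distinction between harmless over-declaration and a real loosening can be illustrated with a small CSP parser. This is a sketch under assumptions: it treats 'self', host sources, and network schemes as loosening, and treats data: and blob: as local. The names parse_csp and real_loosenings and the example policies are hypothetical.

```python
LOCAL_SCHEMES = {"data:", "blob:"}  # assumed to count as over-declaration at worst


def parse_csp(policy):
    """Split a CSP string into {directive: [sources]}."""
    directives = {}
    for part in policy.split(";"):
        tokens = part.split()
        if tokens:
            directives[tokens[0].lower()] = tokens[1:]
    return directives


def real_loosenings(policy):
    """Return directives whose sources include 'self', a host, or a network scheme."""
    flagged = {}
    for name, sources in parse_csp(policy).items():
        risky = [
            source
            for source in sources
            # Quoted keywords other than 'self' (e.g. 'none') are not hosts.
            if source == "'self'"
            or (not source.startswith("'") and source not in LOCAL_SCHEMES)
        ]
        if risky:
            flagged[name] = risky
    return flagged


over_broad = "default-src 'none'; img-src data:; media-src data: blob:; font-src data:"
loosened = "default-src 'none'; img-src data: https://cdn.example.com; script-src 'self'"

print(real_loosenings(over_broad))
print(real_loosenings(loosened))
```

The first policy, which mirrors the defensive pattern described above, flags nothing. The second flags both the host and 'self', the two cases the text says would warrant a rule.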
Variance across runs (and what we can and can't control)
After producing 30+ LLM capsules across formal experiment rounds plus personal-use captures, the variance pattern is now clear:
Between producers (different models): Quality differs systematically. Thinking / extended-reasoning variants (Claude extended thinking, ChatGPT "Thinking" modes, Gemini deep-think) produce noticeably more careful capsules than standard variants — better personal-use defaults, light+dark themes, working markdown exports, CSP headers, richer data structures. This is repeatable and large enough to be worth noting prominently. The Core spec now includes a note encouraging thinking-mode use when available.
Within producer (same model, different runs): Real but smaller variance. Same model with same prompt produces different layouts, different CSS aesthetics, sometimes includes/omits the optional synthesis block. This is intrinsic LLM sampling variance (temperature), generally not user-controllable on web UIs. It is fine. The structural invariants (manifest, data, runtime, validation) hold across all the variance. Each capsule is still a valid capsule. We cannot expect bit-identical reproduction across runs and shouldn't aim for it — the variance is informative about how robust the format is to natural production noise.
Content-aware defaulting (correct behavior, not variance): Thinking variants correctly read social meaning of the conversation and set visibility accordingly. A conversation about sensitive content → visibility: "private", contains_private_data: true. A conversation about generic intellectual content → visibility: "shared". This isn't variance — it's the LLM doing context-aware honest defaulting on its own. Worth preserving as expected behavior.
Self-documenting capsules
Two optional manifest fields turn capsules into a self-documenting research record:
- source.spec_received: the Core version string the producer was given (e.g., "v0.3.0 · 2026-05-19")
- source.prompt_received: the verbatim prompt
For LLM-produced capsules, these are encouraged. They let future readers correlate output with the spec version and prompt that produced it, without external bookkeeping.
The Core itself is version-stamped (first line of CAPSULE_CORE.md). Material changes bump the version and date. Git tags (core-v0.1.0 through core-v0.3.0) preserve historical versions retrievable via git show core-vX.Y.Z:CAPSULE_CORE.md.
Notable methodology choices
These weren't obvious at the start but proved important:
- Reference implementation is Python stdlib only. No pip install required. Accessibility for adopters matters more than performance.
- Validator is heuristic by design. Capability detection uses regex patterns. False negatives are possible. This was a deliberate choice once we recognized that the long-term real validator is going to be an LLM, not our Python script. The Python validator is a seed and a teaching artifact, not the endpoint.
- Spec evolution is empirical. Usage drives; thesis judges. This is the most load-bearing methodological choice in the project.
Usage drives: we don't design rules from a chair. Every spec move so far has been triggered by an empirical observation in the LLM corpus or the production pipeline — never by "this would be good design." The spec is the trailing indicator of what producers actually do, never the leading edge.
Thesis judges: when we observe something, the question is does this serve "memory object for work worth preserving" or undermine it? The answer determines the direction of the spec move:
| Observation type | Move | Examples |
|---|---|---|
| Honest deviation (LLM reaches for a more accurate value) | Loosen — the spec was too narrow | source.origin: "web_research", synthesis.kind: "llm", loosened enums |
| Recurring failure (mechanical bug, broken rendering, lost meaning) | Tighten — add a numbered rule that names the failure | rule 11 (JS newline), rule 12 (JS-render-everything) |
| Emergent convention (LLMs invent a useful pattern unprompted) | Document — recognize it as a recommended convention without making it required | embedded_media field, sources array (now in §4.1.2 of the full spec) |
| Underexplored option (a useful behavior LLMs aren't choosing on their own) | Add prompt-fragment guidance — no new rule, just explicit permission/encouragement | v0.1.4 thoroughness + sources guidance |
Loosening, tightening, documenting, and guiding aren't opposites. They're four flavors of the same reactive mechanism, applied to different kinds of observation. The thesis is the constant; the spec is always catching up.
Why this matters: most spec design is generative — decide what the right way is, force practice to conform. That model produces specs that ossify and lose contact with reality. The reactive model produces specs that stay current with how producers actually behave. Same model as Markdown/CommonMark, HTML/WHATWG, Python idiom-layer/PEPs.
Limits this principle has, that we should be honest about:
- Bootstrap problem. v0.1.0 had to be something before any usage existed. The initial draft was unavoidably generative. Every revision since has been reactive.
- Requires a clear thesis. Without "memory object for work worth preserving" as the arbiter, we couldn't tell honest deviation from broken artifact. The thesis is doing real work; the principle would collapse without it.
- Requires willingness to unwind. If a rule we added turns out to be wrong, we have to remove it. v0.3 demonstrated this: capsule_id (slug) and related[] were deprecated when their consumer-side use case didn't materialize.
- Slow under pressure. When you want to build a new path NOW, the reactive principle says "watch what you build, then formalize." That's slower than designing the framework up front. We have to be willing to accept the slower path.
This is the project's first-rank methodological commitment.
- Spec-evolution mechanism: "numbered rule + WRONG/RIGHT code example." When an LLM-authoring failure recurs and has a mechanical (syntax-level / architectural) fix, the working pattern for propagating the fix is:
- Document the failure with empirical evidence (multi-batch trajectory data).
- Promote the principle to a numbered Core rule (not a prose tip in the prompt fragment).
- Include a concrete code example showing WRONG vs RIGHT.
- Bump the Core version and re-test on the next batch.
This has now worked twice empirically:
| Rule | Failure class | Pre-numbered mitigation | Post-numbered result |
|---|---|---|---|
| 11 (v0.1.2) | JS string-literal newlines | prose tip in prompt fragment → 1/5 still failing | numbered rule + WRONG/RIGHT → 0/5 failing in next batch |
| 12 (v0.1.3) | JS-render-everything | no prior mitigation (pattern not recognized) | numbered rule + WRONG/RIGHT → 10/10 passing across two batches |
Two cases isn't a strong statistical sample, but the mechanism is consistent with the broader observation that LLMs reliably follow mechanical, syntactically-explicit rules better than they follow content-level advice. Worth treating as the default spec-evolution pattern going forward.
What this is NOT: a license to add more rules. Each numbered rule consumes prompt budget and cognitive load on the producer side. The bar for adding a rule remains "empirically recurring failure with no other available mitigation."
Project artifacts
| Artifact | Role |
|---|---|
| CAPSULE_CORE.md | One-page short spec, designed for LLM prompts (currently v0.3.0) |
| spec/CAPSULE_SPEC.md | Full normative spec (currently v0.3.2) |
| spec/DOMAIN_CAPSULES.md | Per-domain schemas (implementation_notes, design_system, exploration_map) |
| spec/SYSTEM_ARCHITECTURE.md | The four-layer architecture (private system / compiler / artifact / format profile) |
| spec/manifest.schema.json | JSON Schema for the manifest block |
| spec/response.schema.json | JSON Schema for response envelopes |
| spec/examples/ | Canonical example capsules (briefing, implementation_notes) |
| compiler/compile.py | Reference compiler, stdlib-only |
| compiler/validate.py | Reference validator (26 checks at v0.3.2) |
| templates/decision_board/ | First template: per-option decisions with verdict export |
| templates/news_capsule/ | Second template: annotated article with claims/entities/sources |
| examples/ | Sanitized JSON inputs for the compiler templates |
| GLOSSARY.md | Vocabulary, four-layer table, phase status |
| PRECEDENTS.md | Positioning against RO-Crate, TiddlyWiki, MPEG-21, C2PA, etc. |
| index.html | Project landing page, itself a valid Capsule |
| Git tags core-v0.1.0 … core-v0.3.0 | Historical Core versions retrievable via git show core-vX.Y.Z:CAPSULE_CORE.md |
Reproducibility
To rerun the LLM experiment yourself:
- Open a fresh chat with the LLM of your choice (Claude, Gemini, ChatGPT, or any model capable of reading attached files).
- Attach CAPSULE_CORE.md.
- Ask: "Using this spec, give me a summary of [topic] as a Capsule."
- Save the resulting HTML.
- Run python3 compiler/validate.py <file>.html to check conformance.
Expected result: roughly 22/25 pass with 3 warns (missing integrity block, capability-marker false negative). Different pattern? That's a finding — either the spec drifted, the LLM behaviour changed, or you've found a new edge case.
To re-derive the integrity hash from spec prose alone (as one independent producer did):
- Read spec/CAPSULE_SPEC.md §9.1.1 ("Content Hash Recipe — normative").
- Implement the canonical-JSON serialization + placeholder substitution rules in your language of choice.
- Compute the hash for the worked example given in the spec.
- Compare against the expected hash also given in the spec.
If your implementation produces the expected hash bit-identical, the spec is doing its job as a normative document. If it doesn't, the spec has a gap.
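For orientation, the general shape of a canonical-JSON hash with placeholder substitution looks roughly like the sketch below. It is not the §9.1.1 recipe and will not reproduce the spec's test vector. The canonical form used here (sorted keys, compact separators, UTF-8), the placeholder value, the integrity.content_hash field path, and the manifest + data payload layout are all assumptions made for illustration. Follow the spec text for the real rules.

```python
import copy
import hashlib
import json

# Assumed placeholder; the real value is whatever §9.1.1 specifies.
PLACEHOLDER = "sha256:" + "0" * 64


def canonical_json(value):
    """Assumed canonical form: sorted keys, compact separators, UTF-8 bytes."""
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def content_hash(manifest, data):
    """Hash manifest + data with the hash field itself swapped for a placeholder."""
    manifest_copy = copy.deepcopy(manifest)
    # Hypothetical field path; the substitution keeps the hash from depending on itself.
    manifest_copy.setdefault("integrity", {})["content_hash"] = PLACEHOLDER
    payload = canonical_json({"manifest": manifest_copy, "data": data})
    return "sha256:" + hashlib.sha256(payload).hexdigest()


manifest = {"title": "Example", "spec_version": "0.3.0"}
data = {"records": [{"id": 1}]}

first = content_hash(manifest, data)
manifest["integrity"] = {"content_hash": first}  # store the hash in the manifest
print(first == content_hash(manifest, data))     # True: the stored hash does not alter the result
```

The placeholder step is what lets a verifier recompute the hash from a capsule that already contains it. Sorting keys is what makes the result independent of the order in which a producer happened to emit fields.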
Status
As of v0.3.2 (2026-05-20):
- Core spec v0.3.0 — twelve rules. Five rounds of loosening / additions based on empirical findings:
- v0.1.1: rule 11 first draft (string-literal mitigation in prompt fragment)
- v0.1.2: rule 11 promoted to numbered rule with WRONG/RIGHT example; data shape clarifications; spec_version self-doc fields
- v0.1.3: rule 12 added (render content in HTML, not at runtime) — empirically validated on two consecutive batches
- v0.1.4: prompt-fragment additions (no new rules): thoroughness guidance + structured sources array recommendation
- v0.1.5–v0.1.8: minor patches (QR code convention, snapshot_id prefix callout)
- v0.2.0: schema rename (capsule_id / capsule_version canonical; artifact_id / artifact_version deprecated but accepted)
- v0.3.0: added parents[] for hard provenance; deprecated capsule_id slug and related[] field; spec-gravity discipline formalized
- Full spec v0.3.2 — doc-only patches on top of v0.3.0:
- v0.3.1: normative content-hash recipe with verifiable test vector (§9.1.1); "Inspecting a served capsule" preamble
- v0.3.2: download_capsule standard capability with implementation pattern (§5.1.1)
- Reference validator at 26 checks. New checks since v0.1.0: runtime JS string-literal regex, spec_version ↔ source.spec_received cross-agreement, progressive enhancement heuristic, parents[] format checks, deprecation notes for capsule_id and related[].
- Templates: 2 compiler templates (decision_board, news_capsule).
- Independent producers shipped: at least one third-party deterministic compiler producing generator.kind: "compiler" capsules that validate clean at 26/26 against the reference validator. The producer re-derived the integrity-hash recipe from spec prose alone and produced bit-identical hashes on first attempt.
- Domains covered: decision boards, news annotations, conversation summaries, property-scale geospatial maps, photographs with audio attachments, image-grounded conversations, implementation notes, design systems, exploration maps. Multiple data shapes, multiple production paths.
- CSP: one feature-driven loosening landed (media-src data: for embedded audio). All other CSP directives unchanged since the format's launch.
- Empirical size scaling tested through 13 MB (synthetic, F5) and 13.7 MB (real production Mintel capsule, F20). Hard cap raised from 15 MB to 20 MB in spec v0.3.3 with a 15 MB soft warning for email-attachment compatibility.
- Parked v0.4+ directions (Appendix E of full spec): remove deprecated fields, compiler-kind UUIDv5 carve-out, reconsider ai_usage_guidance in domain capsules, hash-algorithm flexibility, author signing + transparency log, password-protected encrypted capsules, validator refinement for non-resource-loading <link> tags. None built; each waits for empirical pressure. E.5 (Rule 12 vs. legacy templates) was resolved in v0.3.3 via the image-fallback carve-out documented in §2.3; see F20.
Biggest unbuilt piece: author-side import tooling (registry + import.py). The producer side has matured significantly and the consumer side hasn't moved. The lightweight version (SQLite archive + pair viewer per F7) remains a candidate next concrete build.
Biggest untested area: cross-browser file:// behavior across Safari, Firefox, and Chrome. The format should work identically on file:// and http:// per spec — empirically this remains undertested.
How to read this project
This is a research project that produces a working spec and reference implementation as primary artifacts. The spec is the hypothesis. The fixtures (compiled and LLM-produced capsules) are the evidence. The findings document (this file) is the running narrative of what we've learned. Every commit message is part of the research log — the "why" of each change is preserved in git history.
The project does not have a single "result" or a release date. It's a working investigation. The most likely failure mode is spec inflation (the long spec grows beyond what anyone reads) and the second most likely is import-side abandonment (we keep polishing the producer side while the consumer side stays unbuilt). Both are explicitly tracked as risks.
The project is not trying to invent something. It's trying to articulate the discipline that's missing from a practice already underway.