Capsule — Research Project

What this is

A research project investigating whether HTML can be disciplined into a portable knowledge-artifact format with a machine-readable contract, content provenance, and a structured feedback loop — without becoming a SaaS platform, a new file format, or a new browser standard.

The project produces a spec, a reference implementation, and empirical evidence about whether the spec works in practice. The hypothesis is that the substrate (HTML) has won and what's missing is discipline, not a new format.

Started: 2026-05-15
Current Core spec: v0.3.0 · Full spec: v0.3.2
Repo: bigfancygarden/htmlcapsule · Site: htmlcapsule.org

Project identity

A capsule is a sealed, self-contained HTML memory object for work worth preserving. It is the smallest portable structured unit that any kind of knowledge work can resolve into: human-readable, machine-readable, and provenance-bearing in one object. It is not a working format: you still edit in your text editor, design in Figma, cook in your kitchen, and think in your LLM chat. It is a publish / preserve / share format that any domain can emit.

Every domain today has good working tools and bad publish formats. PDFs lose interactivity. PNGs lose vector data. Recipe cards lose chef's notes. Exported chats lose structure. LLM conversations lose synthesis to the archive. Capsules are designed to be the universal publish format that preserves more than the alternatives, because:

The same outer contract serves recipes, research notes, decision briefs, journal entries, design specs, log entries, learning artifacts, project handoffs — and most importantly, the synthesis that comes out of LLM conversations that today disappears into chat archives.

Framing arc

The project's framing has sharpened through the research. Each version was less narrow than the last:

  1. "Compile from your private DB into shareable HTML" — too narrow; assumed a structured source.
  2. "Boundary object between private system and external recipient" — better; named the sharing pattern.
  3. "Save state for useful LLM conversations" — closer; named the most common production path.
  4. "Atomic unit of preserved work, across any domain" — broader, served the spec well during v0.1–v0.2.
  5. "Sealed, self-contained HTML memory object for work worth preserving" — current; emerged from peer review in v0.3 (see F18). Adds the human/machine/provenance trio as a differentiating wedge.

The format itself supports each framing without changes: the technical work from earlier iterations turned out to hold up under broader interpretations than the ones we started with.

What this is not

Capsules are not trying to replace working tools. Recipes will still be edited in cooking apps; designs in Figma; data analysis in Jupyter; thinking in LLM chats. The capsule is the export from these tools when the work is done, not the editing surface. This is exactly the role PDFs play today — they're just lifeless. Capsules give the same role to HTML, which is alive.

Capsules are also not trying to be a universal data interchange format like JSON-LD or RDF. The capsule's outer contract is universal; the inner content is domain-specific. This split is what gives the format both portability and expressivity.

Origin

The project started from Thariq Shihipar's public observation that LLMs and agents are already producing single self-contained HTML files as their default artifact format. The substrate is winning. The question this project asks: what does it take to make those files trustworthy, that is, to give them a contract, provenance, versioning, and a structured way for recipients to respond?

Research questions

Primary: Can a one-page spec, given to an LLM as context, produce a conformant Capsule?

Secondary:

  1. What discipline makes HTML useful as a boundary object between private knowledge systems and external recipients?
  2. Where does the spec need to be strict vs. permissive?
  3. What's the gap between compiler-produced and LLM-produced capsules — and is that gap useful (fidelity gradient) or broken?
  4. Where does the format break down empirically — size, browser support, distribution friction?
  5. Can the recipient side respond in a structured way that the author can programmatically ingest?
  6. Will LLMs honestly declare themselves and their limitations when producing capsules?
  7. Can a deterministic compiler produced by a third party round-trip through the reference validator at full fidelity? (Substantially answered: yes — see F18's note on independent compiler-kind producers.)

Methodology

Iterative spec evolution against real artifacts:

Hypothesis → Draft spec → Build reference compiler → Compile real artifacts
   ↑                                                          ↓
   |                                                    External review
   |                                                          ↓
   |←─── Adjust spec ←─── What broke or what felt off ←──────┘

Three classes of "real artifact" are tested:

  1. Compiler-produced — deterministic output from our reference Python compiler. Establishes the strict end of conformance.
  2. LLM-produced — capsules generated by giving the Core spec to commercial LLMs (Claude, Gemini, ChatGPT) and asking them to produce a capsule on a real topic. Establishes the loose-but-honest end.
  3. Hand-written / hybrid — the spec itself was originally dogfooded as a capsule. Tests whether the format can document itself.

External review at each iteration: code review on the implementation side, plus design review from independent LLM agents and (in v0.3) from third-party producers building compiler-kind capsules against the spec.

The spec can only loosen (backward-compatible additions) unless a major breaking issue is found. Tightening would invalidate prior artifacts and would also discourage LLMs from producing capsules at all.

Findings

F1: The Core spec works as an LLM prompt

Experiment: Pasted CAPSULE_CORE.md (one-page short spec, ~120 lines) into fresh Claude, Gemini, and ChatGPT sessions with one prompt: "Using this spec, can you give me a summary of [public regulatory topic] as a Capsule?"

Round 1 result: Three structurally compatible capsules. All passed validation with 18/21 pass + 3 warn + 0 fail (identical pattern). Each opened in a browser, rendered correctly, declared itself honestly as generator.kind: "llm", included working exports, and presented a useful summary.

Round 2 result (same day, more specific prompt): Same pattern, plus prompt specificity successfully disambiguated topic interpretation.

Conclusion: Yes. The Core spec works as an LLM prompt. The format propagates through being readable and useful, not through enforcement.

F2: LLMs deviate from the spec toward honesty

The most significant finding. Across both experiment rounds, LLMs consistently disagreed with the spec in five specific places. In every case, the LLMs were objectively more honest than the spec required:

| Spec field | What spec said | What LLMs reached for | Why LLMs were right |
|---|---|---|---|
| source.origin | Constant "private_database" | "web_research", "public_documents", "official_public_sources" | An LLM synthesizing from public content has no private database |
| source.snapshot_type | Database-flavored enum | "synthesis", "research_summary", "bounded_public_legislative_summary" | A summary isn't a "portable_excerpt" |
| synthesis.kind | ai_extraction/ai_summarization/etc. | "llm", "llm_summary", "web_summary" | The natural words are clearer |
| type | Strict enum | "summary", "briefing" | None of the original types described what the capsule actually was |
| feedback_payload | Required rating/comments/suggestions only | Structured form with position/concern/notes | Real feedback isn't always a 1-5 rating |

In every case the spec was adjusted to accept the more honest values. The pattern: usage shapes spec, not the other way around. The spec is a description of what disciplined capsules look like, not a prescription that LLMs must obey.

F3: The fidelity gradient is real and useful

The validator distinguishes three result tiers (pass / warn / fail). Compiler-produced capsules pass strict. LLM-produced capsules pass degraded — typically missing the integrity block (no canonical-JSON content hash) and triggering a capability-marker heuristic false-negative.

This is a designed feature, not a workaround. Recipients of an LLM-produced capsule can see exactly what's verified and what isn't. They can calibrate trust appropriately. A compiler-produced capsule comes with cryptographic integrity; an LLM-produced one comes with structural conformance.

Conclusion: The format works for multiple production paths with different trust profiles. The validator's tier system is the load-bearing piece that makes this possible.
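The tier logic described above can be sketched as follows. This is a minimal illustration, not the reference validator: the `CheckResult` type, the `summarize` function, and the check names are hypothetical, and the mapping "any fail means fail, any warn means degraded, otherwise strict" is an assumption inferred from the strict/degraded wording in this finding.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass
class CheckResult:
    name: str  # hypothetical check name, not taken from the reference validator
    tier: Tier


def summarize(results: list[CheckResult]) -> dict:
    """Count results per tier and derive an overall verdict."""
    counts = {tier.value: 0 for tier in Tier}
    for result in results:
        counts[result.tier.value] += 1
    if counts["fail"]:
        overall = "fail"
    elif counts["warn"]:
        overall = "degraded"  # structurally conformant, but not fully verified
    else:
        overall = "strict"
    return {"counts": counts, "overall": overall}
```

Under these assumptions, the 18 pass + 3 warn + 0 fail pattern from F1 would summarize as "degraded", which is the signal a recipient uses to calibrate trust.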

F4: Capability honesty is enforceable

The spec rule "a capsule must implement every capability it declares" was tested against LLM-produced capsules. In every case, declared capabilities matched implemented capabilities:

No LLM over-declared. This is meaningful because it shows the LLMs treated the capabilities list as a contract, not as aspirational marketing. Implementation honesty is a property the format can preserve even when LLMs are the producers.
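One way to picture the capability-honesty rule is as a set comparison between what the manifest declares and what the capsule is observed to implement. The sketch below assumes both sides are already available as lists of capability names; how the reference validator detects implemented capabilities is not shown here, and the function name and the example names ("print", "copy", "download") are illustrative only.

```python
def capability_gaps(declared: list[str], implemented: list[str]) -> dict:
    """Compare declared capabilities against detected ones."""
    declared_set, implemented_set = set(declared), set(implemented)
    return {
        # Declared but not implemented: this is what the spec rule forbids.
        "over_declared": sorted(declared_set - implemented_set),
        # Implemented but not declared: informational, not a rule violation here.
        "undeclared": sorted(implemented_set - declared_set),
    }


gaps = capability_gaps(["print", "copy"], ["print", "copy", "download"])
```

An empty `over_declared` list corresponds to the result reported in this finding.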

F5: The format scales empirically through 13 MB

Experiment: Synthetic capsules at three sizes (1.35 MB / 6.6 MB / 13.15 MB) with embedded base64 blobs to simulate photo albums.

Result:

Conclusion (at time of finding): The 15 MB hard cap in the spec is correctly positioned. Browser performance isn't the bottleneck. Distribution is — Gmail's 25 MB attachment limit is the real ceiling, hit before browser strain.

Updated by F20 (2026-05-21): A real production Mintel capsule arrived at 13.7 MB and several real production channels (MinDev hosting, AirDrop, Slack, cloud-storage links) have no equivalent of the email-attachment constraint. Spec v0.3.3 raised the hard cap to 20 MB and added a 15 MB soft warning specifically for email-attachment compatibility — the 15 MB number was always proxying for email-friendliness, not browser strain. The conclusion above still holds; the cap moved up because the distribution-channel landscape has more than one shape.
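The two thresholds from the v0.3.3 update can be sketched as a simple size check. This assumes decimal megabytes (1 MB = 1,000,000 bytes) and strict "greater than" comparisons, neither of which is stated in this log; the function and constant names are hypothetical.

```python
HARD_CAP_BYTES = 20 * 1_000_000   # hard cap: capsule is rejected above this
SOFT_WARN_BYTES = 15 * 1_000_000  # soft warning: may not fit email attachments


def check_size(size_bytes: int) -> str:
    """Return 'fail', 'warn', or 'pass' for a capsule of the given size."""
    if size_bytes > HARD_CAP_BYTES:
        return "fail"
    if size_bytes > SOFT_WARN_BYTES:
        return "warn"
    return "pass"
```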

F6: An LLM built half the feedback loop unprompted

The most surprising finding. In round 2, one LLM received only the Core spec and a one-line prompt. It produced a capsule with:

The recipient side of the feedback loop was implemented end-to-end by the LLM, without us telling it to. This was always part of the spec's design, but it wasn't part of the prompt. The LLM reached for the architecture.

Reinforcement: A later meta-capsule (produced under v0.1.2 with the standard one-line prompt) invented a spec_compliance_self_check field — an array grading the capsule against all eleven Core rules with pass/n/a and a per-rule note. The LLM cited rule 11 ("Runtime JS string-literal rule") by number. The numbered-rule format introduced in v0.1.2 is being consumed as machine-readable structure, not just human guidance.

F7: Structured response payloads are mostly tally bait; notes carry the meaning

Experiment: A recipient opened an LLM-produced capsule, filled out its built-in feedback form (position dropdown + concern dropdown + notes textarea), and exported response.json.

Result: The structured fields (position, most_important_issue) contained little information that wasn't already in the notes field. The notes carried the actual meaning — the reasoning, the nuance, the position. The structured fields were essentially redundant.

Generalization: Structured response fields are aggregation infrastructure. They earn their weight when you have many respondents — you can tally positions, group by issue, scan notes within each group. For a single respondent, structured fields are decoration; notes are the response.
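To illustrate why the structured fields only pay off with many respondents, here is a small aggregation sketch. It assumes each response.json has been loaded into a dict with the `position`, `most_important_issue`, and `notes` fields mentioned above; the function name and the "unspecified" fallback are assumptions.

```python
from collections import Counter, defaultdict


def tally_responses(responses: list[dict]) -> tuple[Counter, dict]:
    """Tally positions and group free-text notes by the issue each respondent chose."""
    positions = Counter(r.get("position", "unspecified") for r in responses)
    notes_by_issue = defaultdict(list)
    for r in responses:
        note = r.get("notes", "").strip()
        if note:
            issue = r.get("most_important_issue", "unspecified")
            notes_by_issue[issue].append(note)
    return positions, dict(notes_by_issue)
```

With a single response, the tally is a one-entry counter and the grouped notes are just the notes, which matches the observation that the structured fields are redundant at n=1.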

Implication for the spec: The response_schema_version envelope is correct. The eight response types are probably more than needed; the real axes are (per-record vs. whole-capsule) and (structured-for-aggregation vs. prose-only). The feedback_payload schema was correctly loosened in this iteration to allow arbitrary fields — its rigidity was preventing the most common real use case.

Implication for the build: The "import side as registry + database ingestion" framing was overstated. What's actually useful is much lighter — an archive + a pair viewer (open response + originating capsule side-by-side). The author still does the qualitative reading; the system doesn't try to merge or auto-process.

F9: The single-document data shape is the natural LLM choice for conversation summaries

Observation across ~20 personal-use conversation-summary capsules: Almost every one used the single-document shape from §4.1 of the full spec — a top-level JSON object whose keys are themes (summary, key_takeaways, decision_matrix, quick_recommendations, etc.) — rather than the records[] shape.

The specific top-level keys vary per topic — that's expected and good. The shape definition isn't "must contain key X"; it's "top-level object with thematic named sections, each appropriate to the content." LLMs reach for this shape unprompted when summarizing a conversation; they reach for records[] when producing decision boards or list-shaped artifacts (the compiler templates).

Implication: Section 4.1's two shapes correctly carve the space. The example in the spec for the single-document shape is one possible filling; LLMs invent their own thematic keys per topic, which is the intended behavior.
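A rough classifier for the two shapes might look like the sketch below. It assumes the records shape is a top-level object carrying a `records` list and treats any other non-empty object as single-document; the exact definitions in §4.1 of the full spec are not reproduced here, so read this as an illustration of the distinction rather than the spec's test.

```python
def classify_data_shape(data: object) -> str:
    """Guess which of the two data shapes a parsed data block uses."""
    if isinstance(data, dict) and isinstance(data.get("records"), list):
        return "records"
    if isinstance(data, dict) and data:
        # Thematic named sections; the specific keys vary per topic.
        return "single-document"
    return "unknown"
```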

F10: The format absorbs primary artifacts (not just syntheses)

Experiment: Build capsules that are the work product, not summaries of one. Specifically: print-targeted 8.5×11 property-scale claim maps (both an illustrative synthetic one and one built from a public claims GeoJSON snapshot).

Result: Both validate cleanly (same shape as chat-summary capsules). No new failure modes appeared in the domain switch. The format absorbed:

A third data shape emerged on its own: the map capsules' data block isn't records[] and isn't single-document. It's a feature collection: a property metadata header + bbox + per-feature-class arrays. This is the natural GIS / GeoJSON-ish shape.

Implication for the spec: Section 4.1's two-shape carve (records / single-document) may want a third bucket called "feature collection" for geospatial / typed-feature-set domains. Documented as the seed of the domain.exploration_map schema in DOMAIN_CAPSULES.md.

F11: The hybrid producer pattern is the most reliable production path for real-data capsules

Observation: Three production paths have produced capsules in this project:

| Path | Who writes HTML | Bug surface | Pattern |
|---|---|---|---|
| A. Pure LLM in chat | LLM session | High (rule 11 bug class, manifest drift) | One-off content |
| B. Pure Python compiler + templates | Reference compiler + per-type template dir | Zero (deterministic) | Records-shaped artifacts |
| C. LLM-authored Python generator | One Python script per artifact class, written by LLM, then frozen | Zero (deterministic shell + real data) | Recurring real-data artifacts |
| D. Pure human handcoding | n/a regularly | n/a | Rare |

Path C is the new one. The LLM writes a Python generator once (with all the HTML, CSS, JS frozen as Python strings + a render_body() function), then the generator runs from real data on demand.

Why it works: the runtime JS is the same code every time, reviewed once, frozen. The manifest fields are computed by Python (validator-clean). The data block contains real data. Path A's recurring failures — JS string-literal bugs (the rule 11 bug class), manifest drift, capability marker mismatches — all disappear because the LLM never re-generates the shell.
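The Path C pattern can be sketched as a frozen shell plus a `render_body()` function. Everything below is a simplified stand-in: the shell markup, the `capsule-data` element id, the `build_capsule` name, and the recipe-like record fields are hypothetical, and a real generator would also emit the manifest, CSP, and runtime JS.

```python
import html
import json

# Frozen shell: reviewed once and never regenerated (placeholder markup).
SHELL = """<!doctype html>
<html lang="en">
<head><meta charset="utf-8"><title>{title}</title></head>
<body>
<main id="capsule-root">
{body}
</main>
<script type="application/json" id="capsule-data">{data_json}</script>
</body>
</html>
"""


def render_body(record: dict) -> str:
    """Render the readable content at build time, escaping all real data."""
    items = "\n".join(f"<li>{html.escape(str(step))}</li>" for step in record["steps"])
    return f"<h1>{html.escape(record['title'])}</h1>\n<ol>\n{items}\n</ol>"


def build_capsule(record: dict) -> str:
    """Fill the frozen shell with one record's rendered body and data block."""
    # Escape "</" so the JSON payload cannot close the script element early.
    data_json = json.dumps(record, sort_keys=True).replace("</", "<\\/")
    return SHELL.format(
        title=html.escape(record["title"]),
        body=render_body(record),
        data_json=data_json,
    )
```

Because only `record` changes between runs, the per-instance output differs in data but not in shell code, which is the property this finding credits for removing Path A's failure classes.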

Cost: Adding a new artifact class (e.g. a recipe capsule, a journal entry capsule) requires writing a new generator. Per-instance cost is near zero.

Implication: For recurring content (photos, claim maps, perhaps recipes/journals/decisions), Path C is the right default. Path A stays useful for one-off chat-summary capsules where the per-instance content is bespoke. Path B (the reference compiler) is the seed and the validator's intellectual reference, but produces fewer capsules in practice than C.

F12: Photo-shaped capsules — one artifact, one capsule (atomic-unit framing in its purest form)

Build: Example photograph capsules — one image, embedded as base64 in an <img src="data:image/jpeg;base64,..."> tag. Plus an associated voice memo (m4a/AAC) embedded similarly. Plus metadata: caption, people[], location (lat/lon + accuracy), date (value + precision + is_approximate), tags, alt_text.

Architectural pivot mid-build: the first attempt packed multiple photos as records[] inside a single album-capsule. That conflicted with the project's atomic-unit thesis — a photograph is itself an atomic unit of preserved work, not a row in a parent file. Rewrote to one-capsule-per-image; the album becomes the index listing them, not a container holding them.

Manifest signal: new type: "photograph", new collection field referencing the conceptual album by name (loose linkage, no parent file). The included_records is always 1.

Data shape: single-document with a top-level photo object containing the photograph's metadata + (originally) the data URIs. After F14's refactor, the data URIs live in the HTML <img> and <audio> tags directly, and the JSON data block is metadata-only.

F13: First real CSP loosening — media-src data: for embedded audio

Background: all prior CSPs across the corpus had been identical:

default-src 'none'; style-src 'unsafe-inline'; script-src 'unsafe-inline';
img-src data:; connect-src 'none'; base-uri 'none'; form-action 'none';

That permits inline base64 images via data: URIs but not audio (audio falls back to default-src 'none' and is blocked).

Change: added media-src data: to the photo capsule's CSP. One directive. It does not open the door to external audio — default-src 'none' and connect-src 'none' still block remote media. The capsule remains sealed; only inline base64 audio is permitted.

This was the first feature-driven CSP change in the format. Documented in the spec as the canonical pattern: if your capsule has embedded audio or video, add media-src data:. Don't broaden it further.
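As a sketch of the pattern, a build script could assemble the policy from the baseline directives and add `media-src data:` only when the capsule embeds audio or video. The directive values come from the baseline quoted above; the function name and the `has_embedded_media` flag are hypothetical.

```python
BASELINE_CSP = {
    "default-src": "'none'",
    "style-src": "'unsafe-inline'",
    "script-src": "'unsafe-inline'",
    "img-src": "data:",
    "connect-src": "'none'",
    "base-uri": "'none'",
    "form-action": "'none'",
}


def build_csp(has_embedded_media: bool = False) -> str:
    """Return the CSP string, adding media-src data: only when needed."""
    directives = dict(BASELINE_CSP)  # copy so the baseline is never mutated
    if has_embedded_media:
        directives["media-src"] = "data:"
    return " ".join(f"{name} {value};" for name, value in directives.items())
```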

Format choice for audio: AAC in M4A container (.m4a). Universal browser support, best compression-to-quality ratio. Python's mimetypes.guess_type() claims .m4a is audio/mp4a-latm, which browsers reject (LATM is a different stream format). Required an explicit .m4a → audio/mp4 mapping in the build script.
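The explicit mapping can be sketched as an override table consulted before `mimetypes.guess_type()`. The helper names and the fallback to `application/octet-stream` are assumptions; the actual build script is not shown in this log.

```python
import base64
import mimetypes
from pathlib import Path

# Extensions where the guessed type is known to be unsuitable for browsers.
MIME_OVERRIDES = {".m4a": "audio/mp4"}


def mime_for(filename: str) -> str:
    """Pick a MIME type, preferring explicit overrides over mimetypes' guess."""
    suffix = Path(filename).suffix.lower()
    if suffix in MIME_OVERRIDES:
        return MIME_OVERRIDES[suffix]
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


def to_data_uri(filename: str, payload: bytes) -> str:
    """Encode raw bytes as a base64 data: URI suitable for <img> or <audio>."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_for(filename)};base64,{encoded}"
```

Checking the override first keeps the result independent of whatever the local platform's MIME tables report for `.m4a`.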

F14: Capsules are archives, not apps — the JS-render-everything failure mode

The biggest learning of the early sessions. Discovered when a photo capsule was AirDropped to iPhone and "didn't load properly."

Root cause: iOS Files preview (the QuickLook HTML viewer) doesn't execute inline JavaScript, or restricts it severely. The chat-LLM capsules — and, by pattern-copying, the first version of the photo capsule — were 100% JS-rendered: the static HTML had empty containers (<h2 id="title"></h2>, <figure id="photo-frame"></figure>) and runtime JS filled them on load. With JS disabled or restricted, the capsule rendered as a near-blank page.

Honest acknowledgment: the pattern had been copied from the existing chat-LLM corpus without examining whether it fit. The thesis says "capsules are archives, portable across decades, self-contained." The implementation said "tiny single-page app that needs my runtime to be useful." Mismatch.

Architectural fix: progressive enhancement. Move all rendering to build time in Python. The static HTML, as written to disk, already contains the rendered artifact (image, audio, caption, metadata, description, tags, alt-text, manifest dump). JavaScript shrinks to ~3 KB of button click handlers (Print / Copy / Download). With JS fully disabled, the capsule still renders the full content; the three buttons just don't respond.

Spec response — Core v0.1.3, rule 12: promoted the principle to a numbered first-class rule, mirroring rule 11's structure (mechanical instruction + WRONG/RIGHT code example). Same hypothesis as rule 11 — LLMs follow syntax-level mechanical rules better than content-level prose guidance.

Validator response: check_progressive_enhancement heuristic — counts visible text inside <main id="capsule-root"> after stripping <script> and <style> blocks and HTML tags. Under 200 chars, the capsule is flagged. WARN, not FAIL — existing JS-rendered fixtures remain validatable; the warning signals they don't follow the v0.1.3 convention.
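The heuristic described above can be approximated with a few regular expressions. This is a sketch under stated assumptions, not the reference validator's code: it uses regex matching rather than a real HTML parser, collapses whitespace before counting, and does not decode HTML entities; the function names are hypothetical.

```python
import re

MIN_VISIBLE_CHARS = 200  # threshold quoted in the validator description


def visible_text_length(html_text: str) -> int:
    """Count visible characters inside <main id="capsule-root">."""
    match = re.search(
        r'<main\b[^>]*\bid="capsule-root"[^>]*>(.*?)</main>',
        html_text,
        re.DOTALL | re.IGNORECASE,
    )
    if match is None:
        return 0
    inner = match.group(1)
    # Drop script and style blocks entirely, then strip remaining tags.
    inner = re.sub(r"<(script|style)\b.*?</\1\s*>", "", inner,
                   flags=re.DOTALL | re.IGNORECASE)
    inner = re.sub(r"<[^>]+>", "", inner)
    return len(" ".join(inner.split()))


def check_progressive_enhancement_sketch(html_text: str) -> str:
    """Return 'warn' when too little content is pre-rendered, else 'pass'."""
    return "warn" if visible_text_length(html_text) < MIN_VISIBLE_CHARS else "pass"
```

A JS-rendered capsule with empty containers scores near zero here, while a capsule rendered at build time clears the threshold easily.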

Implication for the project's identity: the failure was the most informative thing in the corpus that session. The atomic-unit framing isn't just a slogan — it has implementation consequences. Archives must be readable by any HTML renderer, not just one that runs the producer's specific JS.

F14 follow-up: Rule 12 propagation result — first batches under v0.1.3

Experiment: Produce fresh batches of conversation-summary capsules through the same LLM pipeline that produced the v0.1.0–v0.1.2 capsules, this time with the v0.1.3 Core attached. Two batches of five capsules each (10 total), spanning unrelated topical domains.

Result: 10/10 PASS rule 12. Every capsule pre-renders its full readable content (title, summary, takeaways, tables, glossary, source URLs, conversation transcripts, manifest dump in <details>) directly in <main id="capsule-root">. JS shrunk to button handlers in every one.

Visible-text counts inside capsule-root (validator threshold: 200 chars) ranged from ~6,000 to ~13,400 — every capsule cleared the threshold by 30× to 67×.

Rule-12 trajectory (mirrors rule 11's trajectory table):

| Batch | Spec version | Mitigation | Rule 12 PASS rate |
|---|---|---|---|
| 1–20 + early | v0.1.0 – v0.1.2 | none (pattern not yet recognized) | 0/23 |
| Batch A (5) | v0.1.3 | promoted to numbered rule 12 + WRONG/RIGHT code example | 5/5 |
| Batch B (5) | v0.1.3 | (same) | 5/5 |

Epistemic update after second batch: the result replicates. Two consecutive batches, 10/10 PASS, same producer, spanning 10 unrelated topical domains. Within-producer replication is solid; cross-producer confirmation is still the remaining open evidence gap before broad generalization.

Hypothesis confirmed: the "deeper instinct to build a tiny app" did not persist when rule 12 was promoted to a numbered rule with a code example. The same model that produced the JS-render-everything capsules in the earlier batches immediately switched to progressive enhancement when given the v0.1.3 Core.

F15: Mobile responsiveness is a CSS-layer concern, not a format-layer one

Trigger: After F14's fix, an AirDropped photo capsule rendered on iPhone but looked like a thumbnail of an 8.5in letter page in a 375px viewport. Tiny. Required pinch-zoom to read.

Fix: mobile-first responsive CSS — same HTML body, three CSS modes:

  1. Default (mobile / narrow): fluid layout, touch-friendly buttons, readable typography (no sub-12px sizes), stacked title block.
  2. @media (min-width: 900px): switches to the 8.5×11 letterhead view — fixed page dimensions in inches, two-column grids, desktop typography scale.
  3. @media print: locks to letter portrait independent of viewport.

Key insight: the 8.5×11 page is a print target, not a screen requirement. The screen view can be fluid. Conflating the two was the design mistake.

Implication for the spec: This is an implementation detail, not a Core rule, so no spec change is needed. It is worth a note in the full spec's UI section that capsules should be screen-readable on any viewport size, with the 8.5×11 form factor reserved for print output.

F16: Chat-LLM capsules embed source-conversation images when the conversation is image-grounded

Pattern across two batches under v0.1.3: When the source conversation included an image (a chart, screenshot, diagram, photo), the LLM embedded that image inline in the resulting capsule as a data:image/...;base64,... URI. Each used the same spontaneously-invented embedded_media data-block field structure (kind / description / filename / mime_type / embedded_as).

| Batch | Source image type | Capsule file size | CSP change required |
|---|---|---|---|
| Batch A | Screenshot of a public chart | ~254 KB | No (img-src data: already in baseline) |
| Batch B | Chart/document from a public source | ~2.2 MB | No |

Epistemic state: n=2 from same producer. Cross-producer confirmation still pending. But the within-producer pattern is consistent enough to treat as expected behavior, not anomaly.

Implications for the spec: no new rule warranted. The format already absorbs this:

The spec's documentation now notes that when conversations include images, embedding the source image as a data: URI is an established pattern. The embedded_media data-block field (or a similar shape) is recognized as a recommended convention.

F17: Prompt-fragment-only Core revisions are a valid spec-evolution mode

Background: all Core revisions to date (v0.1.1, v0.1.2, v0.1.3) introduced or promoted at least one numbered rule. v0.1.4 was the first that didn't. It added only prompt-fragment guidance:

  1. "Be thorough about real content" — a paragraph pushing back against LLM brevity-truncation, with explicit permission to include all takeaways / sources / caveats / open questions, and an explicit floor on inventing content the conversation didn't produce.
  2. "Capture sources and links" — a paragraph recommending a structured sources array in the data block, with a shape example.

Neither is a rule. Neither has validator enforcement. Both are producer-behavior hints in the prompt fragment that producers actually see.

Why these are worth a Core version bump: the prompt fragment IS the Core to producers. If we silently amend it, the version line lies — producers under "v0.1.3" would actually see different content than the v0.1.3 fragment captured by git tag. The two self-documenting fields (source.spec_received, source.prompt_received) would lose meaning if the content of a given version drifted.

So: every change that producers will see gets a version bump. Rule changes get major attention. Guidance changes get minor attention.

Hypothesis: prompt-fragment guidance will work similarly to rule promotions — explicit, mechanical, included alongside the numbered rules in the producer's context, with examples. The "numbered rule + WRONG/RIGHT code example" pattern addresses mechanical failures. Prompt-fragment guidance addresses underexplored options — behaviors that aren't broken but aren't being chosen. Different mechanism, different bar. Worth tracking both separately.

F8: The atomic-unit framing explains everything we've built

Reflection rather than experiment. Across multiple framings the project has tried — "compile from private DB", "boundary object", "save state for LLM chats" — the format itself didn't need to change. Each framing was the same format viewed through a narrower lens. The framing that explains all the previous ones is: a capsule is the atomic unit of preserved work.

Evidence supporting the broader framing:

| Domain | Working tool | Existing publish format | What capsule preserves |
|---|---|---|---|
| Decision-making | Spreadsheets, meetings | PDF / email thread | Per-option records, evidence, decisions |
| News annotation | Browser + memory | Forwarded link | Article + extracted claims + verdicts |
| Research synthesis | LLM chat | Copy-paste into doc | Synthesis + sources + provenance |
| Recipes | Cooking apps / notebook | Recipe card | Ingredients + steps + scaling + notes |
| Journal | Notion / paper journal | Locked in app | Entry + mood + context |
| Map / geospatial | QGIS / GIS tools | PNG / map service | Features + layers + popups |
| Logs | System logs | Text dump | Events + context + severity |

In every row, the existing publish format loses something the working format had. PDFs lose interactivity. PNGs lose vector data. Recipe cards lose the chef's notes. Capsules preserve more because they're alive (HTML + structured data + provenance + UI).

The atomic property matters because:

Consequence for the project's identity: capsules are to preserved work output what JSON is to data interchange. A universal portable envelope that any domain can fill with appropriate content. F18 sharpens the framing further into "memory object" but the atomic-unit point remains the structural argument.

F18: Peer review (2026-05-19) — sharpest framing, landscape position, and trust-model gaps

A peer-review pass on the v0.3.2 state of the project produced three things worth recording in the research log: a sharper one-sentence thesis, a 2026 landscape position, and an explicit naming of the format's open trust-model questions.

Sharpest framing. The strongest one-sentence definition that emerged from review:

"A capsule is a sealed, self-contained HTML memory object for work worth preserving."

"Memory object" is doing real work in this sentence. It captures the human-readable + machine-readable + provenance-bearing trio in one noun phrase — the property no neighboring format provides simultaneously. PDF is human-only, JSON export is machine-only, MHTML lacks a manifest, ZIP lacks rendering, .docx lacks a programmatic data block, Notion exports are platform-dependent. The previous framing ("atomic unit of preserved work") remains internally accurate but lacks a differentiating wedge. The new framing has been adopted in README, CAPSULE_CORE.md, and index.html.

The second insight: multi-producer interop is the strongest empirical claim the format makes. LLMs (Claude, ChatGPT, Gemini), deterministic compilers (third-party build scripts), and human authors all produce the same envelope shape. That's what makes capsules different from yet another save format. Personal/team memory is the most accessible adoption vector; multi-producer interop is the differentiator. Don't narrow positioning to wave-one adoption.

The first independent compiler-kind producer (a third-party Python build script) shipped capsules that round-trip through the reference validator at 26/26 in v0.3. Crucially, the producer re-derived the integrity-hash recipe from spec prose alone (§9.1.1) without reading the validator source, and produced bit-identical hashes on first attempt. This is the spec earning its keep as a normative document.

2026 landscape position. Neighbors mapped:

| Neighbor | Layer | Relationship |
|---|---|---|
| HTML artifacts (Thariq / Blake Crosley) | Live agent output / control surface | Aligned but upstream; capsules are the seal step downstream |
| Durable interactive artifacts (AgentPatterns) | Workspace objects | Aligned but platform-bound; capsules are portable across tools |
| Intermediate artifacts in agentic systems (arXiv) | Multi-agent internal state | Same instinct, systems-internal scope |
| ARA agent-native research artifacts | Research deliverables | Heavier research-world cousin |
| RO-Crate | Sealed research packages | Direct competing format; capsules differ in single-file constraint |
| WACZ/WARC | Web archives | Different layer (archived web, not authored work) |
| C2PA / Content Credentials | Signed media provenance | Complementary trust layer, not format competition |
| Agent manifests (agent.json, JSON Agents) | Agents themselves | Adjacent 2026 instinct (machine-readable manifests around AI) |

Strategic conclusion: HTML is unlikely to be usurped soon as the rendering substrate. The likely future is HTML remaining the human-inspectable surface while JSON / RO-Crate / C2PA-style metadata wrap around or live inside it. Web Bundles were the only direct technical challenger; their IETF draft is stale and Chrome removed the navigation experiment in 2023. Capsules are betting on the stable layer.

Open trust-model gap. The current spec answers "what is this? where does it claim to come from?". It does not answer "did the claimed author actually publish these exact bytes?". The UUID asserts identity but doesn't enforce it — anyone can ship a modified capsule with the same UUID.

A full trust story would require four pieces:

  1. Two-hash split. content_hash (canonical manifest+data, survives DOM round-trip) + file_hash (raw bytes, doesn't). Lets a recipient verify two different questions independently.
  2. Author signing, identity-anchored via a Sigstore/Fulcio-style OIDC issuance. Without identity infrastructure, "signed by author" is just another lie waiting to happen.
  3. Transparency log (Sigstore/Rekor-shaped). Append-only public record of signed releases, detecting same-UUID-different-content games and backdating.
  4. Out-of-band verification. Capsule never calls home (Rule 2 preserved). The QR code already embedded in the capsule (Core convention) resolves on the recipient's phone/reader to a verification URL that queries the transparency log. Friction lives on the verifier's side; the capsule stays mute.
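The two-hash split in item 1 can be sketched as follows. This is a minimal illustration, not the spec's integrity recipe: the canonical form used here (compact JSON with sorted keys over manifest plus data), the helper names, and the sample values are all assumptions made for the example.

```python
import hashlib
import json


def content_hash(manifest: dict, data: dict) -> str:
    """Hash a canonical serialization of manifest + data.

    Assumption: canonical form = compact, sorted-key JSON. The real
    recipe is whatever the spec defines; this only shows the idea.
    """
    canonical = json.dumps(
        {"manifest": manifest, "data": data},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(canonical).hexdigest()


def file_hash(raw_bytes: bytes) -> str:
    """Hash the exact bytes of the file as shipped."""
    return "sha256:" + hashlib.sha256(raw_bytes).hexdigest()


# Hypothetical sample values.
manifest = {"uuid": "00000000-0000-0000-0000-000000000000", "title": "Example"}
data = {"items": [1, 2, 3]}

# Same manifest with keys in a different order, as a re-serialization might produce.
reordered = dict(reversed(list(manifest.items())))

# Two byte-level renderings that differ only in whitespace.
file_a = b"<html><body><main>hello</main></body></html>"
file_b = b"<html>\n<body>\n<main>hello</main>\n</body>\n</html>"

same_content = content_hash(manifest, data) == content_hash(reordered, data)
same_bytes = file_hash(file_a) == file_hash(file_b)
print(same_content, same_bytes)
```

The content hash answers "is this the same manifest and data?", while the file hash answers "are these the exact bytes that were published?". A recipient can check either question without the other.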

Three trust tiers would emerge: Self-describing (current baseline), Signed, Logged.

Decision: parked, not built. No reported real-world tampering incident exists in the corpus or among independent producers. Building infrastructure ahead of empirical pressure would be exactly the "spec gravity before daily-use pressure" failure mode the peer review explicitly warned against. Captured in spec/CAPSULE_SPEC.md Appendix E.6 as a v0.5+ candidate.

Two strategic risks named in the review, now internalized as ongoing discipline:

Both risks are now ongoing discipline rather than one-time fixes.

F19: Design-tool integration experiment — Claude Design with CAPSULE_CORE.md attached

Experiment. Asked a design tool (Claude Design, claude.ai/design) to produce a landing-page design and export the result as a Capsule per Core v0.3.0, with CAPSULE_CORE.md attached as conversation context. The session produced three relevant artifacts, with quite different structural shapes — all worth recording. Two were valid; one was not.

Note on finding evolution. This finding was substantially revised after the model self-corrected its export choice. An initial draft treated the bundler-wrapped 52 KB file as "the model's capsule output" and framed the discrepancy as "two verifiers checking different criteria." The model's reply clarified — and our reference validator confirmed — that the model first wrote a 40 KB file that does validate cleanly (24/25 pass, 0 fail), then a separate "Save as standalone HTML" step ran a general-purpose bundler over that valid file and produced the 52 KB shell. So the finding is not "spec-aware intent, non-conforming output" — it's "spec-aware conforming output, destroyed by a downstream pipeline step that should not have run." The revised version below is the accurate record.

Output A — three design-variation files from the canvas (dc-card wrapped).

These were the JSX/HTML mockups exported earlier in the same session, before the user asked for a capsule. Each file:

A structural transformation script (strip resets, strip data-om-*, strip the redundant @import, unwrap dc-card, merge <style> blocks into <style id="capsule-style">, wrap content in <main id="capsule-root">, inject <script id="capsule-manifest"> + <script id="capsule-data"> + <script id="capsule-runtime">, add CSP meta) converted each to a valid capsule at 25/25, ~1.5 MB final (essentially the embedded woff2 fonts plus thin content). The visual design was preserved bit-for-bit; the bloat removed was Claude Design's canvas safety net, not the design itself.

Output B1 — the model's actual capsule output (40 KB, pre-bundler).

After the user asked explicitly for "the landing page as a Capsule per the attached Core spec," the model wrote a 40 KB single HTML file. Validated against the reference validator:

24/25 pass, 1 warn, 0 fail.

| Validator check | Result | Notes |
| --- | --- | --- |
| HTML5 doctype | PASS | |
| <html> + <body> tags | PASS | |
| All required sections present | PASS | All five id="capsule-*" blocks at the byte level |
| No external resource references | PASS | Zero network fetches |
| Manifest section parseable | PASS | Honest provenance (see below) |
| All required manifest fields | PASS | |
| Manifest field types | PASS | |
| Capsule version present | PASS | capsule_version: "8.0.0" |
| Recommended manifest fields | PASS | |
| generator.kind recognized | PASS | "llm" |
| source block | PASS | origin: "authored", snapshot fields populated |
| privacy block | PASS | external_dependencies: false |
| spec_version recognized | PASS | 0.3.0 |
| spec_version and source.spec_received agree | PASS | |
| capabilities include about + one export | PASS | |
| Data section parseable | PASS | |
| Content hash verifies | PASS | No integrity block (optional for LLM-kind); passes by absence |
| Field format patterns | PASS | |
| All capabilities have impl markers (heuristic) | WARN | copy_as_prompt: implementation exists but the function name doesn't match the validator's marker regex (false negative on a soft check). |
| Runtime JS strings well-formed | PASS | |
| Content pre-rendered in HTML | PASS | 5388 chars of visible text in <main id="capsule-root"> |
| File size under 15 MB | PASS | 41,724 bytes |

This is a third independent producer kind reaching conformance — joining the reference compiler (generator.kind: "compiler") and the hand-authored landing (generator.kind: "human" / "hybrid"). All three producer kinds in the spec's interop claim are now empirically demonstrated.

What the model got right structurally:

The single validator warn was a heuristic false-negative on copy_as_prompt — the function existed but didn't match the marker regex pattern. Zero hard failures. The pre-bundler file is a deployable, conforming capsule.

Output B2 — the same file after "Save as standalone HTML" ran (52 KB, post-bundler).

The user then clicked "Save as standalone HTML." Claude Design's bundler — a general-purpose pipeline built to inline external assets for designs that aren't self-contained — ran over the already-self-contained B1 and wrapped it in a single-page-app hydration shell:

<head>
  <style>… thumbnail + loading styles only …</style>
  <noscript>This page requires JavaScript to display.</noscript>
</head>
<body>
  <div id="__bundler_thumbnail">… social-preview SVG …</div>
  <div id="__bundler_loading">Unpacking…</div>
  <script>
    // 6 KB bundler that:
    //   reads script[type="__bundler/manifest"]
    //   reads script[type="__bundler/template"]
    //   base64-decodes + gzip-decompresses assets
    //   fetch()-rewrites blob URLs
    //   replaces the thumbnail with the actual content via DOM injection
  </script>
  <script type="__bundler/manifest">… base64-encoded assets …</script>
  <script type="__bundler/template">… HTML template as JSON …</script>
</body>

Validator score: 4/10 pass, 1 warn, 5 fail. Required sections are missing (the bundler uses script[type="__bundler/*"] instead of id="capsule-*"), fetch(s.src) in the asset-assembly step violates Rule 2, the manifest is unfindable, and content is not pre-rendered (zero visible text in <main id="capsule-root"> because no such element exists at parse time).

Architecturally, this is exactly the failure mode Rule 12 was written to catch: content packed into JS, rehydrated on load, body empty at parse time. Open the file with JavaScript disabled (iOS Files preview, email previewer, archive viewer, old browser) and the loading state never resolves; all the reader gets is the "Unpacking…" placeholder and the <noscript> fallback. Same shape as F14's JS-render-everything failure pattern, but inverted into a deliberate hydration architecture.
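The parse-time difference between B1 and B2 can be shown with a small check. This is a sketch of the idea behind a Rule 12 test, not the reference validator's code: the class name, helper name, and sample strings are hypothetical, and only the element id comes from the text above.

```python
from html.parser import HTMLParser


class RootTextCounter(HTMLParser):
    """Count text present inside <main id="capsule-root"> without running any JS."""

    def __init__(self) -> None:
        super().__init__()
        self.main_depth = 0  # > 0 while inside the capsule-root <main>
        self.skip = 0        # > 0 while inside <script>/<style> within it
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "main" and (self.main_depth or dict(attrs).get("id") == "capsule-root"):
            self.main_depth += 1
        elif tag in ("script", "style") and self.main_depth:
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "main" and self.main_depth:
            self.main_depth -= 1
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # Script and style bodies are not visible text, so they are not counted.
        if self.main_depth and not self.skip:
            self.chars += len(data.strip())


def visible_chars(html_text: str) -> int:
    parser = RootTextCounter()
    parser.feed(html_text)
    parser.close()
    return parser.chars


# Hypothetical miniature stand-ins for B1 (pre-rendered) and B2 (hydration shell).
pre_rendered = '<body><main id="capsule-root"><h1>Title</h1><p>Body text</p></main></body>'
hydration_shell = (
    '<body><div id="__bundler_loading">Unpacking…</div>'
    "<script>/* injects content later */</script></body>"
)

print(visible_chars(pre_rendered), visible_chars(hydration_shell))
```

The shell scores zero because the element that would hold the content does not exist until the bundler script runs.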

The mechanism is innocent — Claude Design's bundler exists for a legitimate purpose (inlining externally-referenced assets into a transportable single file). The bug is that it should be skipped, not run, when the input is already capsule-shaped. Running a "make this self-contained" pipeline over a file that is already self-contained is destructive, not idempotent.

The actual integration boundary: a process-ordering issue, not a verifier-criteria mismatch.

An earlier draft framed B1 ↔ B2 as a discrepancy between two verifiers checking different things. That was wrong. The clarified mechanism (confirmed by the model and re-checked against our validator):

So the lesson generalizes as: "verify-before-mutate, but always re-verify after any pipeline step that touches the artifact." A multi-step export pipeline that mutates the artifact between verifications can ship a file that the verifier never actually checked. In the Claude Design case, the verifier ran at the right point in the conversation flow but not at the right point in the file-mutation flow.

(This is closer in spirit to the build-pipeline / artifact-signing problem in software supply chains than to a spec-interpretation disagreement. The signature/verification gate has to be the last thing that touches the artifact before it leaves the producer, or there is a window where the artifact and the gate disagree.)
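The ordering rule can be sketched as an export function in which validation is the last thing to see the bytes. Everything here is a toy under stated assumptions: `export`, `toy_validate`, and `toy_bundler` are hypothetical names, and the validator is a one-line stand-in for the real one.

```python
from typing import Callable, Iterable

Step = Callable[[bytes], bytes]
Validator = Callable[[bytes], bool]


def export(artifact: bytes, steps: Iterable[Step], validate: Validator) -> bytes:
    """Run every mutating step, then validate the exact bytes that will ship."""
    for step in steps:
        artifact = step(artifact)
    if not validate(artifact):
        raise ValueError("final artifact failed validation; refusing to ship")
    return artifact


def toy_validate(artifact: bytes) -> bool:
    # Stand-in check: the capsule root element must exist at parse time.
    return b'<main id="capsule-root">' in artifact


def toy_bundler(artifact: bytes) -> bytes:
    # Stand-in for a wrapper that packs the input into a script-only shell.
    size = str(len(artifact)).encode()
    return b"<body><script>/* packed: " + size + b" bytes */</script></body>"


b1 = b'<body><main id="capsule-root">content</main></body>'

# No mutating steps: the validated bytes are the shipped bytes.
shipped = export(b1, steps=[], validate=toy_validate)

# A mutating step after the author's check is caught, because the gate runs last.
try:
    export(b1, steps=[toy_bundler], validate=toy_validate)
    bundled_rejected = False
except ValueError:
    bundled_rejected = True

print(shipped == b1, bundled_rejected)
```

The design choice is that no code path returns bytes the validator has not seen, which closes the window described above.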

What the model self-corrected.

After being shown the validator score on B2 alongside the diagnosis, the model's reply was sharp: "You're right, and the diagnosis is accurate… The file is already standalone. Running the bundler will wrap it in an SPA hydration shell that violates Rules 2, 3, and 12. Skipping the bundler — the file you have is the deliverable." It correctly identified the actual deployable as B1, named the bundler as the source of the destruction, and rephrased the original "two verifiers" framing as a process bug ("I shipped two files in sequence and only validated the first").

This is itself a relevant data point for multi-producer interop: a spec-aware model, when given the empirical evidence, can self-diagnose the integration boundary and correctly route around it.

Implications for multi-producer interop:

Implications for the spec:

What we did with the outputs:

The clean record: the LLM-producer path works. A spec-aware model with the Core spec attached can produce a conforming capsule on the first try. The integration risk is downstream pipeline steps that mutate the artifact after verification has cleared — the mitigation is process, not a rule change.

Postscript — validator improvement motivated by this finding.

The original 1/25 warn was a heuristic false-negative on copy_as_prompt. The model used a cleaner Rule 7 verification pattern than our reference examples:

Same literal string in three places: the most direct manifest-to-implementation link possible, auditable by eyeball without needing a regex translation table from copy_as_prompt → copyPrompt → btn-copy-prompt (our reference examples' three-name convention). The validator's marker regex (copy[-_]?prompt) didn't anticipate this convention and produced a false negative on the cleaner one.

When asked, the model offered to rename the handler to match our regex. We declined: the right fix was the validator, not the file. We added two uniform patterns to the marker-check that apply to every known capability automatically:

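import re

# cap is one capability name declared in the manifest, e.g. "copy_as_prompt".
escaped_cap = re.escape(cap)
clean_convention_patterns = [
    # HTML markup: data-capsule-action="<cap>"
    rf'data-capsule-action\s*=\s*["\']{escaped_cap}["\']',
    # Runtime JS object member: <cap>: function
    rf'\b{escaped_cap}\s*:\s*function\b',
]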

Both patterns are specific to implementation context — the data-attribute only appears in HTML markup, and the : function form requires the function keyword, which cannot appear in JSON. No false-positive risk on declared-but-unimplemented capabilities (Rule 7's actual guarantee is preserved bit-for-bit).
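A quick way to see the claim is to run the two patterns against an implementation and against a bare declaration. The patterns are the ones quoted above; the helper `has_impl_marker` and the three sample strings are hypothetical, written only for this illustration.

```python
import re

cap = "copy_as_prompt"
escaped_cap = re.escape(cap)
clean_convention_patterns = [
    rf'data-capsule-action\s*=\s*["\']{escaped_cap}["\']',
    rf'\b{escaped_cap}\s*:\s*function\b',
]


def has_impl_marker(text: str) -> bool:
    """True if either clean-convention pattern appears in the text."""
    return any(re.search(pattern, text) for pattern in clean_convention_patterns)


# Implementation contexts: HTML markup and a runtime handler table.
markup = '<button data-capsule-action="copy_as_prompt">Copy as prompt</button>'
runtime = "const actions = { copy_as_prompt: function () { /* ... */ } };"

# Declaration only: the capability is listed in JSON but nothing implements it.
manifest_only = '{"capabilities": ["about", "copy_as_prompt"]}'

for sample in (markup, runtime, manifest_only):
    print(has_impl_marker(sample))
```

The JSON declaration does not match either pattern, so a capability that is declared but never implemented still fails the marker check.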

Result: the Claude Design file now validates 25/25 clean. All existing examples (the canonical landing, briefing_example.html, implementation_notes_example.html, the three converted theme files) still validate at their previous scores. The patch is strictly additive.

The lesson generalizes: when an independent producer finds a cleaner convention than the reference examples, the right response is to recognize the cleaner convention in the tooling, not to demand the producer rename. This is the difference between a spec that ossifies around its reference implementation and one that improves through external pressure. The patch isn't adding spec surface area — it's improving the validator's ability to recognize compliance. The spec discipline principle ("the corpus drives the spec; spec inflation runs the other direction") cuts toward the patch, not against it.

F20: First publicly-fetchable Mintel production capsule validates spec at scale

Date: 2026-05-21

Mintel now publicly serves a real production exploration_map capsule via MinDev. First time the project has end-to-end validated a production third-party capsule (not LLM-corpus, not sanitized example) against the reference validator.

The capsule:

Five empirical findings:

1. The 15 MB cap was always a proxy for email-friendliness, not browser strain. F5 set the cap at 15 MB from Gmail's 25 MB attachment limit. Mintel's distribution channel is MinDev hosting (no equivalent of the email cap), and the empirical desktop parse ceiling is well above 15 MB. v0.3.3 splits the constraints: hard cap raised to 20 MB, with a 15 MB soft warning explicitly for email-attachment compatibility. The number that was always proxying for one thing now names two things.
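The split into two thresholds might look like this in a size check. A minimal sketch: the function name and result strings are hypothetical, and treating "MB" as decimal megabytes is an assumption, since the text does not say which unit the validator uses.

```python
# Assumption: decimal megabytes (1 MB = 1,000,000 bytes).
HARD_CAP_BYTES = 20 * 1_000_000
EMAIL_WARN_BYTES = 15 * 1_000_000


def check_file_size(size_bytes: int) -> tuple[str, str]:
    """Return (result, note) for a capsule of the given size."""
    if size_bytes > HARD_CAP_BYTES:
        return ("FAIL", "over the 20 MB hard cap")
    if size_bytes > EMAIL_WARN_BYTES:
        return ("WARN", "over 15 MB; may not fit as an email attachment")
    return ("PASS", "within both limits")


for size in (41_724, 16_000_000, 21_000_000):
    print(size, check_file_size(size))
```

Keeping the two numbers as separate named constants makes it explicit which constraint each one stands for.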

2. Rule 12 vs. visualization geometry — image-fallback resolves E.5. The Copper Dome capsule pre-renders chrome (title, legend, north arrow, info panel, attribution, QR code; 1,373 chars visible) but draws polygons into an empty <svg id="map-svg"> container at runtime. The validator's surrounding-text heuristic passes; strictly the data-bearing content depends on JS — iOS Files preview would show an empty white box where the map should be.

The principled resolution (now documented in spec §2.3 "Carve-out for visualization geometry"): visualization geometry rendered into a pre-declared named container is allowed IF a static image rendering is embedded as the JS-disabled fallback in the same container. Preserves Rule 12's intent (content IS in the HTML — as a raster) while accommodating geometry that can't reasonably be pre-rendered as static markup. The image rendering is typically free — it's the same raster the pipeline already produces for non-capsule deliverables (PDF/JPEG exports). One extra <img> element and a one-line visibility toggle in runtime.
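One way a checker could test for the fallback is to look for an embedded image inside the named container at parse time. This is a sketch under assumptions, not the spec's or validator's actual check: the container id `map-container`, the class and function names, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FallbackFinder(HTMLParser):
    """Look for an embedded <img> inside the element with a given id."""

    def __init__(self, container_id: str) -> None:
        super().__init__()
        self.container_id = container_id
        self.depth = 0      # nesting depth inside the container (0 = outside)
        self.found = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self.depth and tag == "img" and (attr.get("src") or "").startswith("data:image/"):
            self.found = True
        if tag in VOID_TAGS:
            return
        if self.depth or attr.get("id") == self.container_id:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def has_static_fallback(html_text: str, container_id: str) -> bool:
    finder = FallbackFinder(container_id)
    finder.feed(html_text)
    finder.close()
    return finder.found


# Hypothetical markup: a raster fallback next to the runtime-drawn SVG.
with_fallback = (
    '<div id="map-container">'
    '<img src="data:image/png;base64,AAAA" alt="Map">'
    '<svg id="map-svg"></svg>'
    "</div>"
)
# Same image, but outside the container, so it does not count.
without_fallback = (
    '<div id="map-container"><svg id="map-svg"></svg></div>'
    '<img src="data:image/png;base64,AAAA" alt="Map">'
)

print(has_static_fallback(with_fallback, "map-container"))
print(has_static_fallback(without_fallback, "map-container"))
```

Requiring a data: URI keeps the check consistent with the no-network rule: the fallback has to be in the file, not fetched.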

E.5 was parked specifically waiting for this case. v0.3.3 ships the resolution.

3. MinDev's hosting model is now empirically demonstrated. The MinDev response includes:

x-capsule-content-hash: sha256:60282cbdad54708f...
x-capsule-uuid: 9357a933-7ce1-4061-9488-2ca61d81bded

The host attests independently via response headers without modifying the file body — "wrap, don't modify" per Appendix B distribution guidance and E.7 (MinDev pattern). First publicly-fetchable example of this hosting model. Caveat: header attestation is honest about what the host computed; it's not a signature from the original author (that's still E.6 signing territory).

4. The compiler-kind integrity path works end-to-end on real production data. Full integrity block present (content_hash + hash_scope: "data+manifest"), generator.kind: "compiler", and the validator confirms the hash verifies on a 13.7 MB file. Mintel re-derived the integrity-hash recipe from spec prose alone — bit-identical hashes (noted earlier in F18; this finding adds concrete production-capsule evidence at scale).

5. Custom namespace use is exemplary. The x-mintel block (project_id, project_version_id, project_version_number) uses the x- extension prefix correctly per E.3's recommendation. Consumers that don't know about Mintel ignore the block; domain-specific consumers can dereference back to the source.

Spec moves landed in v0.3.3:

Open questions remaining:

F21: Independent convergence on the host-contract pattern (MinDev + htmlbin)

Date: 2026-05-21

Two independent hosting layers have converged on the same shape for serving Capsule-style HTML artifacts, without coordination between them:

  1. MinDev (private, Mintel-tied; serves the F20 Copper Dome capsule and other Mintel-produced exploration_map capsules).
  2. htmlbin.dev (public, agent-first; launched ~May 17-18, 2026 by Utkarsh Sengar, Cloudflare D1 + KV stack). Independent project; not aware of htmlcapsule at launch.

Shared shape observed:

| Aspect | MinDev | htmlbin.dev |
| --- | --- | --- |
| Short URL identity | UUID: mindev.ca/api/c/{uuid} | Slug: htmlbin.dev/p/{slug} |
| /raw byte-identical endpoint | /api/c/{uuid}/raw | /p/{slug}/raw |
| Host chrome | Recedes to a left rail | Small header + footer attribution |
| Authorship attribution | Response headers (x-capsule-content-hash, x-capsule-uuid) | Footer text ("content authored by the agent that uploaded it") |
| Content mutation | None; serves uploaded bytes byte-identically | None; serves uploaded bytes byte-identically |
| Validates on upload | Yes (against Capsule spec) | No (accepts any self-contained HTML) |
| Visibility | Private | Public, OAuth-gated first publish |

Why this matters:

The convergence is empirical evidence that the host-contract pattern referenced in Appendix E.7 ("hosting-platform auth gates per the MinDev pattern... the platform controls delivery; the capsule itself doesn't gate its internal contents") is a real shape that independent producers reach on their own. The format/host split — the format defines the artifact, the host serves it — appears stable across implementations.

This is the empirical pressure the project was waiting for to formalize a host contract beyond the single-implementor MinDev reference. The "what a host should do (and not do)" doc that was previously parked can now be drafted as a description of an observed convention across two independent implementations, not a proposal made in a vacuum.

Practical implication — the format is hosting-agnostic, demonstrably:

A valid Capsule can be hosted on MinDev, on htmlbin, or self-hosted, with no format change. Hosts adopt the format optionally; the format imposes nothing on the host beyond "serve the bytes you received." The "format-not-platform" stance is now concrete and verifiable, not aspirational.

Spec moves to consider:

Open questions:

Cross-references:

F22: Independent convergence on the live-editing layer pattern (html-docs + workplane)

Date: 2026-05-21

Two independent live-editing tools shipped in approximately the same window (mid-May 2026) with substantially the same workflow shape, without coordinating. This is the parallel finding to F21 (host-pattern convergence): two layer-level patterns have now been observed converging independently within this project's first two weeks of running. The pattern of convergence is itself becoming a recurring methodological observation.

The two tools:

| | html-docs.com | workplane.co |
| --- | --- | --- |
| Creator | Raunaq Bhutoria (Meta engineer; @raunaqbn) | Matan (GitHub: matanrak; based in Israel) |
| Repo | Not public | work-plane/workplane-skills (MIT); /workplane repo linked but 404s |
| Org created | (n/a; closed SaaS) | work-plane GitHub org created 2026-03-29 |
| Most recent push | (closed) | 2026-05-20 (workplane-skills) |
| Tagline | "Create beautiful docs and webpages with your Agents." | "Turn AI outputs into live pages." / "The working plane between AI and humans." (README) |
| Open source | No (SaaS, closed) | Partial: agent skill is open MIT; main service may be closed |
| Agent integration | Claude Code skill + MCP server + HTTP API; 6 named tools (publish, publish_file, update, read, comment, list_comments) | MCP-first; works with Claude Code, Codex, Cursor, Devin, Claude Desktop |
| Account gates | Required for some imports | Free for individuals; no account required for commenters |
| Endorsements on homepage | Karpathy, Thariq, Ryan Carson | None visible |

Shared workflow shape (the actual convergence):

  1. Agent generates HTML/markdown
  2. Publish to the tool → live URL with stable identity
  3. Humans review with inline comments
  4. Agent reads the comments and revises
  5. Iteration loop continues until "if good, go build" (Raunaq's framing)

Both tools implement steps 1-5 with MCP as a primary integration path and inline comments as the review surface.

Differences (mostly orthogonal to the workflow):

Why this matters:

The pattern (agent ↔ human review loop with publish-and-comment as the primitive) is now empirically observable as something multiple independent producers reach for. Just like F21 named the hosting-layer convergence (short URL + /raw + minimal chrome + honest attribution), F22 names the live-editing-layer convergence:

The MCP common denominator is itself notable — both tools lead with MCP integration, which suggests MCP adoption is the enabling substrate for this layer's convergence. Without a standard agent-to-service protocol, each tool would have to ship bespoke integrations; with MCP, the same skill works against any host.

This is the "canvas step" Capsule explicitly doesn't compete with. F22 names that the canvas step is a real, reproducible layer in the lifecycle — not a one-off product idea.

The composition story is now empirically backed at every layer:

Four lifecycle layers; convergence-pattern findings at three of them. The format-and-host split + the editing-and-format split + the host-and-discovery split are all real.

Spec implications: None directly. Capsule occupies the seal step downstream of the live-editing layer; the live-editing layer doesn't need Capsule's discipline because the artifact is still mutating. The composition is what matters, not Capsule mandating anything in the upstream layer.

Open questions:

Cross-references:

F23: URN-not-URL QR encoding — empirical validation of a deliberate spec choice

Date: 2026-05-21

The Core spec (CAPSULE_CORE.md Rule 4 supplementary QR-code guidance) recommends embedding a QR code that encodes urn:uuid:<uuid> — the URN form, not a live URL. The reasoning at the time was that URNs are non-resolvable but honest about being non-resolvable, while URLs encode a host's distribution policy and that policy can change without the format changing. A real-world incident on 2026-05-21 validated this reasoning concretely.

What happened:

Mintel's build_exploration_map_capsule.py had been encoding https://mindev.ca/c/<uuid> in the on-map QR (the rationale: a phone scanning a printed map could land directly on the live capsule). This was a deviation from the spec — fine in isolation because MinDev was a known host and the URLs worked at the time.

On 2026-05-21 MinDev shipped a security-driven schema change that removed the public visibility tier entirely. Existing public rows migrated to org; the mindev.ca/c/<uuid> URL pattern now returns 403 {"error":"forbidden"} to anonymous callers. Org members keep access via Firebase auth; external recipients need a share-token URL (mindev.ca/api/c/share/<token>) instead.

Immediate consequence: every previously-printed Mintel map carrying a QR pointing at the live URL now resolves to a 403 for any anonymous scanner. The QR didn't break structurally — it still scans, still produces a URL — but the URL has changed semantic meaning. Was: "fetch this capsule". Now: "fetch this capsule if you happen to be authenticated to the right org on the device scanning the code". The producer (Mintel) had no way to know in advance that this change would happen on the host side; the printed maps in the wild can't be recalled.

What this validates:

The fallback pattern that does work:

If a producer wants the QR to resolve to a live capsule via the URL form, the right path is:

  1. Producer asks host to mint a share token at upload time (the Mintel-side ask currently flagged in MINTEL_TODOS.md: ?mint_share_token=true on the upload endpoint, returning a share_url in the response)
  2. Producer encodes https://<host>/api/c/share/<token> in the QR
  3. This URL is anonymous-resolvable by design, has revocation, has audit, has expiry, has view-cap — and survives host policy changes because the share-token endpoint exists specifically for anonymous resolution

The URN form remains the right default; the share-URL form is opt-in for cases where the producer explicitly wants anonymous resolution AND has minted a token at build time AND has accepted the share-token's audit/revocation tradeoffs.
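The default-plus-opt-in rule can be written as a small payload chooser. A minimal sketch: the function name `qr_payload` and the example share URL are hypothetical; the `urn:uuid:` form comes from Python's standard `uuid` module.

```python
import uuid
from typing import Optional


def qr_payload(capsule_uuid: str, share_url: Optional[str] = None) -> str:
    """Return the string to encode in the capsule's QR code.

    Default: the URN form, which makes no promise about any host.
    Opt-in: a share URL, only when the producer minted a token at build time.
    """
    if share_url is not None:
        return share_url
    return uuid.UUID(capsule_uuid).urn


default_payload = qr_payload("9357a933-7ce1-4061-9488-2ca61d81bded")
shared_payload = qr_payload(
    "9357a933-7ce1-4061-9488-2ca61d81bded",
    share_url="https://host.example/api/c/share/EXAMPLE-TOKEN",  # hypothetical
)
print(default_payload)
print(shared_payload)
```

Parsing the input through `uuid.UUID` also rejects malformed identifiers before they end up printed on a map.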

Spec implications:

None directly — the spec already says URN. This is post-hoc empirical validation, not a spec change. A one-paragraph addition to spec/HOSTING.md (landed alongside this finding) names visibility tiers as host-side policy and cites this case as the canonical example of why format artifacts shouldn't bake in resolution-semantics assumptions.

Open question:

What should a producer's build script do for capsules whose host visibility is org (where anonymous scan won't resolve)? Three reasonable patterns, currently unsettled across implementations:

  1. Always encode the URN (safe default; recipient has to type or paste UUID into a host UI to view).
  2. Encode the URL but add an alt-text/caption like "Sign in to view" so a scanner knows what to expect.
  3. Encode the share-URL when a token has been minted (opt-in to anonymous resolution; requires producer to have requested a share token at upload time).

Currently the canonical convention to recommend isn't settled. Worth tracking whether other compiler-kind producers reach for one shape vs. another — if a second independent producer makes a different choice and ships, the convergence (or divergence) becomes a future finding.

Methodological note — the agent-to-agent collaboration pattern:

This finding emerged from a Claude-on-MinDev-side conversation pushed through to a Claude-on-Mintel-side conversation via the user as a human-router. Each agent owned its own system's concerns: MinDev's agent diagnosed the threat model + drove the schema change + posted prod verification; Mintel's agent audited the producer-side fallout + flagged the QR-encoding gap + committed to build-script patches in its own domain. The htmlcapsule project's record (this finding) is then the third surface that absorbs the cross-domain learning. Worth tracking as a pattern: multi-agent + human-router collaboration is producing real research artifacts (this F-finding) faster than a single-agent loop probably would.

Refinement (2026-05-21, F24): The URN-as-default recommendation in this finding is correct for producers without signal about host commitments. The case where a producer knows their target host has declared registry compliance opens a different reasonable choice — encoding the URL becomes a calibrated bet against a published contract rather than a gamble. F24 introduces the host vs. registry distinction and sketches a Capsule Registry Compliance v1 contract in spec/HOSTING.md. The default for general-purpose producers (and for the Mintel build script today, since MinDev has not declared compliance) remains URN; the option to encode URLs becomes available when the destination host has declared compliance.

Cross-references:

F24: Host vs. registry — the missing commitment layer

Date: 2026-05-21

F23 documented the empirical case where Mintel's URL-encoded QR codes broke after MinDev removed the public visibility tier. The first reading of that finding was: URN is the right default; URL was a deviation that bit Mintel. In a conversation following the F23 commit, the maintainer pushed back with a sharper question: at build time, Mintel knows it's uploading to MinDev; the URL form is more useful than URN for the recipient; the failure mode isn't producer error, it's that the host (MinDev) hadn't committed to keeping the URL working. The refined synthesis: the project's format/host split has been treated as "format and host are independent strangers," but in real workflows producers and hosts often want to be coordinated via published contracts. The format itself stays agnostic; some hosts may want to declare more.

The naming move: host vs. registry

Hosts can choose to remain just hosts or to declare themselves registries (by publishing a compliance statement at a well-known location). The format takes no position; producers and recipients gain a signal they can act on.

What changes about F23's "URN is the safe default":

Sketched Capsule Registry Compliance v1 contract (not yet adopted by any host):

  1. Stable URL pattern. <host>/<prefix>/<uuid-or-slug>. Pattern doesn't change without a major version bump + redirect period.
  2. /raw byte-identical endpoint at the URL + /raw. Never mutates the body.
  3. Visibility commitment is part of the contract. Whatever visibility tier a capsule is uploaded under is honored for the capsule's lifetime, OR migration is announced with notice. Removing a tier without grandfathering existing capsules is a breaking change.
  4. Host-attestation headers (x-capsule-content-hash, x-capsule-uuid) on every /raw response.
  5. Honest deprecation. Breaking changes get a public changelog + deprecation window + migration path. Surprise policy changes that break in-the-wild artifacts are out of compliance.
  6. Capsule immutability. The registry serves the bytes it received. No mutation, no re-rendering, no injection.

Full sketch with proposed well-known location (<host>/.well-known/capsule-compliance.json) and adoption status is in spec/HOSTING.md under "Hosts vs. registries — the optional commitment layer."
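Some of the sketched commitments are mechanically checkable from a single /raw response. The following is an illustration only: no host has adopted the contract, the function names are hypothetical, and the check covers just items 2, 4, and 6 (byte identity and header presence), not the policy commitments.

```python
REQUIRED_HEADERS = ("x-capsule-content-hash", "x-capsule-uuid")


def compliance_url(host: str) -> str:
    """Proposed well-known location for a host's compliance statement."""
    return f"https://{host}/.well-known/capsule-compliance.json"


def check_raw_response(uploaded: bytes, body: bytes, headers: dict[str, str]) -> list[str]:
    """Return a list of problems found in one /raw response (empty = none found)."""
    problems = []
    if body != uploaded:
        problems.append("body differs from uploaded bytes (items 2 and 6)")
    present = {name.lower() for name in headers}  # header names are case-insensitive
    for name in REQUIRED_HEADERS:
        if name not in present:
            problems.append(f"missing attestation header {name} (item 4)")
    return problems


# Hypothetical sample data.
uploaded = b"<!doctype html><html><body>capsule</body></html>"
good_headers = {
    "X-Capsule-Content-Hash": "sha256:" + "0" * 64,
    "X-Capsule-UUID": "00000000-0000-0000-0000-000000000000",
}

clean = check_raw_response(uploaded, uploaded, good_headers)
dirty = check_raw_response(uploaded, uploaded + b"<!-- injected -->", {})
print(compliance_url("host.example"))
print(clean, dirty)
```

Items 1, 3, and 5 (URL stability, visibility commitments, honest deprecation) are promises over time and cannot be verified from one response; they are what the published statement is for.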

Mapping the MinDev incident onto the proposed contract:

Why the host-vs-registry distinction matters more broadly:

The project's layer picture (format / live-editing / hosting / discovery) treats each layer as independent. The compliance layer adds a coordination axis: within a layer, implementations can choose to coordinate via published contracts. Registry compliance is one example; spec/HOSTING.md's descriptive host-contract pattern is another, weaker example. This is how the open web works generally — browsers treat URLs as untrusted by default, but sites can opt into stronger trust by adopting HTTPS / HSTS / CSP / etc. The Capsule project can offer the same opt-in for hosts.

The format/host split stays correct as the baseline; the compliance layer is the upgrade path for hosts that want to be more than baseline.

Spec implications:

Open questions:

Methodological note — the pushback was the finding:

F24 didn't come from a tool, a capsule, or an external piece. It came from the maintainer pushing back on F23's framing during a follow-up conversation: "isn't what we are dancing around is the registry being htmlcapsule-spec compliant?" That single sentence reframed F23 from "Mintel made a mistake" into "the project lacks a host-commitment layer." Worth tracking as a research-method observation: the project's most useful conceptual moves are sometimes made by the maintainer pushing back on a finding's first framing, not by new external pressure. F23's empirical event was necessary but not sufficient; the synthesis required the conversational refinement.

Cross-references:

F25: ChatGPT producer-population reads Core supplementary guidance reliably; aesthetic adapts to content domain; legacy "Artifact Capsule" wording persists in user-side prompt templates

Date: 2026-05-21

A batch of 7+ ChatGPT-generated capsules (GPT-5.5 Thinking; conversation summaries across varied domains: hands-free coding workflows, geological target reinterpretation, Indigenous-rights conversation, design-award fit assessment, propane fire-pit purchase brief, Kia Sedona vs pickup decision, Swedish mining permits, Colombian pension-refund letter) was reviewed against Core v0.3.0. All were produced from the user's prompt template "Produce an Artifact Capsule per the Core spec (attached) summarizing this conversation."

This is the largest single-batch empirical sample of a single LLM producer kind working from Core v0.3.0 to date. Five distinct findings emerged.

1. All five required blocks present. Rule 2 (no network) and Rule 12 (pre-rendered content) honored across every capsule in the batch. Multi-producer interop validated yet again at scale.

2. Rule 4 supplementary QR-code guidance followed faithfully across the population. Every capsule embeds a QR encoded as urn:uuid:<uuid> (per F23's URN-not-URL choice), placed top-right in the header, sized 88×88 px (Core suggested 80–96 px), with image-rendering: pixelated, a data:image/png;base64,... URI, alt="QR code for capsule UUID <uuid>", and a <figcaption> showing the UUID's first 8 chars. This is independent reproduction of supplementary-guidance compliance, not just compliance with the twelve numbered rules. When Core writes implementation-recipe-shape guidance — specific placement, exact sizing, a Python qrcode-library code example — LLM producers follow it precisely. This strengthens the F18/F19 multi-producer interop claim into a sharper version: Core's supplementary sections are load-bearing in practice, when they're written as recipes.
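The observed QR block can be reproduced as a small rendering helper. This is a sketch of the markup the batch converged on, not Core's own example: the function name, the `capsule-qr` class, and the placeholder base64 string are assumptions, while the size, alt text, image-rendering rule, and 8-character caption come from the observations above.

```python
import html


def qr_figure(capsule_uuid: str, png_base64: str, size_px: int = 88) -> str:
    """Render the header QR block as an HTML string.

    png_base64 is the already-encoded PNG of a QR code for urn:uuid:<uuid>;
    generating that image is out of scope for this sketch.
    """
    safe_uuid = html.escape(capsule_uuid)
    short_id = html.escape(capsule_uuid[:8])
    return (
        '<figure class="capsule-qr">'
        f'<img src="data:image/png;base64,{png_base64}" '
        f'width="{size_px}" height="{size_px}" '
        'style="image-rendering: pixelated" '
        f'alt="QR code for capsule UUID {safe_uuid}">'
        f"<figcaption>{short_id}</figcaption>"
        "</figure>"
    )


# Placeholder image data; a real capsule would embed the actual PNG bytes.
markup = qr_figure("9357a933-7ce1-4061-9488-2ca61d81bded", "AAAA")
print(markup)
```

Because the image is a data: URI, the block adds no network dependency and stays within Rule 2.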

3. Where Core leaves implementation open, producers diverge with their own conventions:

4. Producer aesthetic adapts to content domain. Genuinely new observation. Capsules in the batch use distinctly different visual treatments per subject:

The format constrains structure (five blocks, twelve rules, no network) but does not constrain visual design at all, and producers exploit that to make Capsules feel domain-appropriate. The aesthetic is part of what's being archived — a reader opening a geological capsule five years from now will see it in the visual register the producer thought matched the subject, which is itself a form of preservation. This is unspecified-but-useful emergent producer behavior. For project posture: if a future "house theme" became tempting (one stylesheet to rule them all), this is the data point that argues against constraining it. Producers treating Capsules as design objects (not just data containers) is doing useful preservation work that a uniform stylesheet would erase.

5. Legacy "Artifact Capsule" terminology persists in user-side prompt templates. All capsules in the batch have prompt_received containing "Produce an Artifact Capsule per the Core spec (attached)…" — using the v0.1 name that was renamed to just "Capsule" in v0.2 (see GLOSSARY.md and spec/CAPSULE_SPEC.md naming-history notes). The Core spec itself uses "Capsule" everywhere — its own produce-prompt template (CAPSULE_CORE.md §"How to ask an LLM to produce a capsule") says "Produce a Capsule" — so the legacy term is propagating via the user's stored prompt template, not via the spec. Project response (this commit): added an explicit "use the canonical name" reminder immediately above Core's produce-prompt section, with a back-reference to this finding. Doesn't change spec rules; closes the loop by making the canonical name unmissable to anyone templating their own prompts. The producer-side field values are accepted under legacy v0.2 compatibility per the naming notes in the full spec.
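The QR convention described in finding 2 can be sketched as a small markup builder. This is a minimal illustration, not Core's own recipe: the function names, the `capsule-qr` class, and the all-zero example UUID are hypothetical, and the PNG bytes are assumed to come from a separate QR generator that has already encoded the `urn:uuid:` payload.

```python
import base64
import html
import uuid


def qr_payload(capsule_uuid: str) -> str:
    """The string the QR should encode: a URN, not a URL."""
    return "urn:uuid:" + str(uuid.UUID(capsule_uuid))


def qr_figure_html(capsule_uuid: str, png_bytes: bytes, size_px: int = 88) -> str:
    """Build header <figure> markup for a capsule's QR code.

    png_bytes is assumed to be a PNG that already encodes qr_payload(capsule_uuid);
    generating the QR image itself is outside this sketch.
    """
    canonical = str(uuid.UUID(capsule_uuid))  # validates and normalizes the UUID
    data_uri = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    alt = html.escape(f"QR code for capsule UUID {canonical}", quote=True)
    return (
        '<figure class="capsule-qr">'  # hypothetical class name
        f'<img src="{data_uri}" width="{size_px}" height="{size_px}" '
        f'alt="{alt}" style="image-rendering: pixelated">'
        f"<figcaption>{canonical[:8]}</figcaption>"  # first 8 chars of the UUID
        "</figure>"
    )


example_uuid = "00000000-0000-4000-8000-000000000000"  # illustrative only
markup = qr_figure_html(example_uuid, b"placeholder-png-bytes")
```

The 88 px default sits inside the 80–96 px range Core suggests; top-right placement in the header is left to the surrounding stylesheet.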

Cross-references:

F26: Core spec accommodates 10 MB domain-specific media capsules without rule changes

Date: 2026-05-21

Source: One-off domain-specific song capsule experiment — Paul McCartney & Wings, "Nineteen Hundred and Eighty-Five" (1973). A 7.6 MB MP3 plus Wikipedia-sourced metadata (personnel, role on Band on the Run, covers, critical reception, live history, composition genesis quote) plus a transcribed lyric sheet, sealed as a 10.16 MB self-contained HTML capsule (UUID e26b58da-a3b2-4675-aa33-78511ad93e60, currently at capsule_version 1.1.0). Shipped 25/25 against the reference validator on first build with zero spec changes required.

Finding. Core spec v0.3.0 plus the existing supplementary recipes (QR convention, CSP defaults, capability vocabulary) is sufficient for domain-specific binary-media capsules at the 10 MB scale. Spec held at every dimension tested:

Implication for the spec. Core is not under-specified for binary-media capsules at this scale. No new rules needed; no Core changes triggered. The fidelity gradient between LLM-produced and compiler-produced capsules (per F25) remains the open work, not size or domain scaling.

Implication for parked Appendix E.11 fields. The song-with-lyrics-added scenario lived through the exact use case the parked supersedes[] / derived_from[] / change_summary fields (raised by external review, parked in spec Appendix E.11 pending real-producer pressure) would address — same UUID, content change worth signaling to downstream holders, current solution is just a capsule_version bump. Since the capsule was not distributed between v1.0.0 and v1.1.0, the parked fields stayed parked correctly: empirical pressure point is now recorded for the next time a producer needs to signal "this supersedes my previously-shared v1.0.0" without minting a new UUID.

Cross-reference. The producer for this capsule was the in-conversation Claude Opus 4.7 hybrid pattern (the same producer pattern as the project's landing page itself, per its generator block). This is the first F-finding from a deliberately one-off, domain-specific, copyright-laden capsule that was not committed to the public repo — a different empirical-pressure source than F25's open-corpus producer population, and a useful complement.

Related findings:

F27: The landing-page genre tension for applied-research projects resolves by splitting, not merging

Date: 2026-05-22

Source. The May 2026 landing-page exploration arc on this project — from index.html v10.x through v13.0.0, plus four comparison sketches (landing-sketch.html v1/v2, research-sketch.html v1/v2, positioning-sketch.html) — and three independent external reads (devil's-advocate critique pass, the in-flight Claude landing-agent's hero pick during the parallel-sketch experiment, and a ChatGPT Deep Research site survey). The maintainer captured the tension directly during the arc: "it's hard to create a landing page for something which is, at its heart, research, albeit applied."

Finding. A landing page for an applied-research project pays a real cost trying to do both jobs at once. Landing pages convert (one claim, one CTA, one demo, optimize for click); research pages persuade (cite everything, walk the argument, optimize for "you can verify this"). When a single page tries to do both, it pays both costs and converts on neither. The exploration arc tried the hybrid, both pure-genre commitments, and a single-page synthesis before settling:

  1. Hybrid (research narrative + landing elements in one page) — index.html through v12.0.0. Numbered Observations / Questions / Answers + nine hero candidates + CTAs + research apparatus all on one surface. The genre tension was visible to every reader: research apparatus showed through landing veneer; landing apparatus interrupted research depth. Both genres paid for the other.
  2. Pure landing (Stripe / Linear stripped) — landing-sketch.html / landing-sketch-v2.html. Conversion-shaped, ~9% of the prose volume. Lost the research argument; "research project" signal collapsed to "yet another file format."
  3. Pure research-paper (NeRF-style academic) — research-sketch.html / research-sketch-v2.html. Author block / abstract / numbered findings / methods / related work / cite-this-work. Lost the conversion shape; the word "Abstract" reads as "not for you" to non-research audiences.
  4. Synthesis (positioning-led, lifecycle-diagram-centered) — positioning-sketch.html. Pain-first hero ("Your AI work shouldn't die when the chat closes."), lifecycle SVG as the centerpiece. The most novel of the single-page options; still asks one URL to carry both audiences.

The resolution that worked. The two-page split — listed as "Option B" / "Option D" during the exploration but consistently underweighted because splitting feels like the hedge move. It isn't. The production landing (index.html at v13.0.0, UUID 7d1a1ac8) is the pure-landing commit, optimized for conversion. The full research-narrative is preserved as a separately-accessible page (exploration.html, UUID 881fed04), optimized for depth. Same UUID lineage (via parents[]); distinct identities. Each page is genre-pure; each page pays only its own genre's cost.

Implication. The framing "decide between landing-genre and research-genre" was wrong all along — it presumed one URL. The right framing was "decide which page is which." The genre tension dissolves when you stop asking one URL to carry both audiences.

Generalization. This pattern likely transfers to any applied-research project with a mixed audience (technical / general / research-leaning). Front door optimized for "what is this and why should I care in 30 seconds"; deep page optimized for "I'm bought in and want the full argument with citations." Cross-link explicitly. Don't try to merge.

Method observation. Three independent reads converged on the split: the devil's-advocate critique, the landing-agent's hero pick (which selected "HTML you can keep." as the strongest single claim, implying genre commitment), and the ChatGPT Deep Research review (which framed the project as research that doesn't need a sales-y landing). When multiple independent reads converge on a structural conclusion that you'd been resisting (because it feels like a hedge), that convergence is a stronger signal than any single read. Worth tracking as a methodological pattern: external review convergence on a structural decision is empirical pressure even when the decision feels like cowardice.

Related findings:

F28: Producers reach for Capsule-shape independently when given the idiom but not the spec — empirical pressure for discoverable onboarding

Date: 2026-05-22

Source. Review of a ChatGPT-produced MIDI capsule POC (Mozart Lacrimosa, ~220 KB; preserved at capsule-midi/proofs/lacrimosa-chatgpt-poc.html). The user asked ChatGPT to "make a DAW-like HTML capsule from this MIDI" without attaching Core as a prompt fragment.

Finding. Without Core attached, the LLM producer (ChatGPT in this case) independently reached for the Capsule idiom — single-file HTML, embedded JSON manifest, schema declaration (capsule_schema: "midi-stem-capsule-v0.1"), parents[] array (with composition reference), sha256 of source bytes, honest license_note with "verify before redistribution" caveat — but missed the Capsule specifics:

Validator result: 5/10 pass, 1 warn, 4 fail — the basic-shape checks pass (HTML5 doctype, html/body, no external network refs, well-formed runtime JS, under-cap size), but every structural check fails (5-block requirement, manifest section parseable, data section parseable, content hash verifies).

Companion to F25. F25 observed producers with Core attached reliably follow supplementary guidance. F28 observes producers without Core attached reach for the shape but miss the specifics. Together: Core works when attached as a prompt fragment; when not attached, the idiom is reached for organically but the specifics are reinvented.

Implication for the spec — discoverable onboarding is empirically warranted. The Capsule shape is a real attractor — LLMs reach for it even without prompting — but they can't reproduce the structural specifics without seeing them. Possible spec-level responses:

  1. Extend /llms.txt to publish Core as a paragraph-level summary plus a link to the full Core, so any LLM doing web research on htmlcapsule.org lands on the discipline naturally. Cost: small. Benefit: every LLM that's done its own research has Core in context.
  2. Publish a one-page "Producer starter kit" — Core + minimal example + the most common producer mistakes (5-block vs single-json, missing Rule 7 markers, etc.) — at a stable URL discoverable from llms.txt. Cost: medium. Benefit: producers without Core fall back to a clear failure mode (the starter kit) rather than reinventing.
  3. Document the "reached for the shape but missed the specifics" failure pattern in spec/CAPSULE_SPEC.md as a known gap, with the response being "attach Core; without Core attached, expect 5/10 at the validator." Cost: very small. Benefit: sets accurate expectations.

The maintainer's pick (per capsule-midi/FEEDBACK.md): option 1 is the smallest and most discoverable. Worth doing as part of the next operational pass.

Methodological side-finding. This is the second time a producer-side experiment has yielded research-record material that crosses back into spec design. The pattern is now visible:

producer attempts a domain → hits a friction → friction is logged in producer's FEEDBACK.md → harvested into htmlcapsule's RESEARCH.md as an F-finding → may trigger a spec change

This is the cross-project memory pattern the producer projects (capsule-midi, Shasta, capsule-photo, Mintel) use to feed empirical pressure back into the spec without unilaterally inventing changes. Worth naming as a deliberate methodology — call it upstream feedback discipline. The producer projects own the friction; the spec project owns the response.

Related findings:

F29: iOS QuickLook surfaces graceful degradation as a first-class spec principle, not just a Rule 12 implication

Date: 2026-05-22

Source. Two pieces of empirical pressure converging:

The actual environment. iOS Files / Mail / Messages / AirDrop / iCloud Drive / Notes preview surfaces route HTML attachments through Apple's Quick Look framework. Quick Look is a passive preview system — it renders HTML/CSS but does not execute <script> tags. This is a defensible security posture (untrusted attachment HTML running JS from every preview surface would create real attack vectors) but it means a capsule whose substance lives in the runtime fails the iOS-preview first impression.
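A rough way to approximate what a script-less preview surface would show is to extract the text that survives when scripts never run. The sketch below is an assumption-laden approximation, not the reference validator's check and not a model of Quick Look itself: `js_off_text` is a hypothetical helper, and it simply ignores the contents of `<script>`, `<style>`, `<template>`, and `<title>`.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text outside elements a passive renderer would not display."""

    SKIP = {"script", "style", "template", "title"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def js_off_text(html_source: str) -> str:
    """Return the text a reader would still see if no script ever executed."""
    parser = _VisibleText()
    parser.feed(html_source)
    parser.close()
    return " ".join(parser.chunks)


runtime_only = (
    '<html><body><div id="app"></div>'
    '<script>document.getElementById("app").textContent = "Hello";</script>'
    "</body></html>"
)
prerendered = (
    "<html><body><h1>Lacrimosa</h1>"
    "<noscript>Audio preview below.</noscript>"
    "<script>init()</script></body></html>"
)
```

An empty result for `runtime_only` is the failure mode described above: the substance lives entirely in the runtime, so a passive preview has nothing to show.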

Finding. The spec already covers most of this but doesn't surface it as the design discipline it's pointing at. What's already there:

What's missing:

  1. No machine-readable fallbacks manifest field. Producers handle fallbacks ad-hoc in HTML; consumers (validators, registry viewers, downstream tooling) can't programmatically discover "this capsule has a preview-audio fallback at index X."
  2. Per-domain fallback guidance only formalized for domain.exploration_map. Other domains (domain.midi_stem, domain.song, domain.photo) need explicit guidance about what their JS-off representation should be.
  3. The three-mode taxonomy is implicit. §2.3 articulates the JS-off litmus but doesn't name the architectural framing the pasted discussion landed on: a capsule should degrade from runtime (full JS app) → document (readable artifact) → preview (consumable media or static representation).
  4. iOS QuickLook is mentioned but not centered as the canonical hostile environment to design against.

Architectural alternatives evaluated and rejected (so the rejection is on the record):

The principle worth promoting. The pasted discussion's sharpest framing:

A capsule should never become useless when JavaScript is unavailable. It should degrade from app → document → preview.

The spec says this in two paragraphs of §2.3; this single sentence is the version worth elevating to a section tagline.

Implication for v0.3.6. Three concrete additions queued for the next spec release:

  1. Generalize §2.3 image-fallback into a domain-agnostic JS-off fallback pattern. Add the tagline above. Add iOS QuickLook as the named canonical environment.
  2. Add a recommended (not required) fallbacks manifest field. Shape: { preview_audio, poster_image, static_summary_present, requires_js_for, preview_mode_description }. All optional. Lets producers declare what's there without forcing a structure on producers who don't have anything to fall back to.
  3. Per-domain fallback guidance in DOMAIN_CAPSULES.md. For each domain (existing + idea-queue): name the recommended JS-off representation. Examples: domain.midi_stem → bundled rendered audio mix as <audio controls>; domain.song → the embedded MP3 already IS the fallback (explicit note); domain.photo → the image itself is the fallback; domain.exploration_map → already documented (image-fallback for geometry).
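The proposed `fallbacks` field from item 2 can be sketched as an all-optional typed dictionary. The five key names come from the shape listed above; the value types, the comments on each field, and the `unknown_fallback_keys` helper are assumptions made for illustration, since the spec text has not yet fixed them.

```python
from typing import TypedDict


class Fallbacks(TypedDict, total=False):
    """Recommended, all-optional manifest field (value types are assumptions)."""

    preview_audio: str            # assumed: reference to a bundled audio element
    poster_image: str             # assumed: reference to a static image
    static_summary_present: bool  # assumed: True when pre-rendered prose exists
    requires_js_for: list[str]    # assumed: features unavailable without JS
    preview_mode_description: str


KNOWN_KEYS = frozenset(Fallbacks.__annotations__)


def unknown_fallback_keys(fallbacks: dict) -> list[str]:
    """Return keys outside the proposed shape; an empty dict is valid."""
    return sorted(set(fallbacks) - KNOWN_KEYS)


midi_example: Fallbacks = {
    "preview_audio": "#rendered-mix",      # hypothetical element id
    "requires_js_for": ["piano roll"],     # hypothetical feature name
}
```

Because every key is optional, a producer with nothing to fall back to can declare an empty object or omit the field entirely.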

Methodological observation. The pattern that produced this finding is now recurring: the capsule-midi producer-side adaptation preceded the spec change. The <noscript> block in templates/capsule.html.tpl was the producer's response to a real environment constraint; the spec catches up by formalizing the principle. This is the upstream feedback discipline named in F28 working in the opposite direction: not "spec change first, producer follows" but "producer adapts to environment first, spec generalizes the pattern." Both directions are healthy and worth tracking.

Related findings:

Open questions

In rough priority:

Q1: Does the atomic-unit framing hold across genuinely different domains? (Substantially answered)

The format has working artifacts in at least five domains:

| Domain | Data shape | Production path | Status |
| --- | --- | --- | --- |
| Decision board | records[] | Compiler | working (reference template) |
| News annotation | records[] | Compiler | working (reference template) |
| Conversation synthesis | single-document | Pure LLM in chat | working (~30+ capsules across multiple batches) |
| Property-scale map | feature collection | Hybrid (build script) | working (illustrative + real-data instances) |
| Photograph + audio note | single-document with photo object | Hybrid (build script) | working |
| Implementation notes | single-document | LLM or hybrid | documented in DOMAIN_CAPSULES.md (Thariq-pattern) |
| Design system | single-document | LLM or hybrid | documented in DOMAIN_CAPSULES.md (Thariq-pattern) |
| Exploration map | feature collection w/ raster option | Compiler | documented in DOMAIN_CAPSULES.md (third-party producer) |

Eight documented domains, three production paths, three data shapes, all sharing the same five-block envelope. The framing holds. Remaining open question is whether more exotic domains strain the format (journal entries, recipes, scanned letters, voice-only notes, video clips, log files).

Q2: Can the author-side archive be light and still useful?

The previous "biggest gap" framing cast the import-side build as a heavyweight registry + ingestion system. F7 dissolved most of that: the lightweight version (SQLite archive + pair viewer) handles the actual common case. Still unbuilt; still a candidate next concrete build.

Q3: How does the format behave under cross-browser file:// constraints?

All browser testing to date has been via local HTTP. Safari, Firefox, and Chrome have different file:// security policies. Specifically: clipboard API availability, localStorage/IndexedDB behavior, inline font and SVG handling under strict CSPs. The format should work identically on file:// and http:// per spec — empirically this is undertested.

Q4: Does the spec need a content-hash protocol that LLMs can actually compute?

The canonical-JSON content hash is unreproducible by LLMs (which don't reliably canonicalize JSON). LLM-produced capsules omit it. The spec correctly degrades to a warning, but this means LLM-produced capsules are fundamentally less verifiable than compiler-produced ones. Is there a hash protocol that an LLM could plausibly compute correctly? Open.
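To make concrete what an LLM would have to reproduce exactly, here is a minimal sketch of one common canonicalization (sorted keys, no insignificant whitespace, UTF-8) followed by SHA-256. It is not the normative recipe in §9.1.1: the placeholder-substitution step is omitted, and the wrapper object and `sha256:` prefix are assumptions.

```python
import hashlib
import json


def canonical_json(value) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace, as UTF-8."""
    return json.dumps(
        value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def content_hash(data: dict, manifest: dict) -> str:
    """SHA-256 over canonical data + manifest (wrapper layout is an assumption)."""
    payload = canonical_json({"data": data, "manifest": manifest})
    return "sha256:" + hashlib.sha256(payload).hexdigest()


# Key order in the input must not change the hash; value order inside lists must.
first = content_hash({"b": 1, "a": [1, 2]}, {"title": "x"})
second = content_hash({"a": [1, 2], "b": 1}, {"title": "x"})
reordered_list = content_hash({"a": [2, 1], "b": 1}, {"title": "x"})
```

Every byte of the serialization matters, which is why a producer that emits text token by token rather than running code struggles to get the digest right.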

Q5: Will the fidelity gradient hold under adversarial use?

What if an LLM produces a capsule that claims generator.kind: "compiler" (i.e., lies about its production path)? The validator can't catch this because it's a self-declared field. A capsule that claims to be compiler-produced but has a malformed integrity hash would fail integrity verification, but a capsule that just omits the integrity block and claims to be compiler-produced would pass with a warning. The trust model assumes good faith. Real-world deployment may not have it. The E.6 transparency-log direction would partly address this.
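The gap can be stated as a three-way outcome. The sketch below models only the behavior described in the paragraph above; it is not the reference validator's code, and the `integrity` and `content_hash` key names are assumptions.

```python
def integrity_outcome(manifest: dict, recomputed_hash: str) -> str:
    """Return 'pass', 'warn', or 'fail' for the integrity portion of validation."""
    integrity = manifest.get("integrity")  # assumed key for the integrity block
    if integrity is None:
        # Omission only warns, whatever generator.kind claims.
        return "warn"
    if integrity.get("content_hash") != recomputed_hash:  # assumed field name
        return "fail"
    return "pass"


claims_compiler = {"generator": {"kind": "compiler"}}
omitted = integrity_outcome(claims_compiler, "sha256:abc")
malformed = integrity_outcome(
    {**claims_compiler, "integrity": {"content_hash": "sha256:bad"}}, "sha256:abc"
)
```

Note that `generator.kind` never enters the decision, which is exactly why a false "compiler" claim paired with an omitted integrity block goes unchallenged.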

Q6: How big does the spec need to be?

The full CAPSULE_SPEC.md is ~1500 lines including v0.4 candidates (Appendix E). The Core is ~120 lines. The Core demonstrably works as an LLM prompt. Does the full spec earn its weight, or could it be trimmed without loss? Open question for a future audit.

Recurring LLM authoring failures

Across multiple personal-capsule batches (20+ capsules across four spec versions), several classes of bug have recurred.

Primary recurring failure: string-literal escape errors in markdown export functions

The pattern: LLMs reach for newline characters when generating string-building JavaScript and get the escape level wrong. They either over-escape ("\\n" becomes a literal backslash-n in the output) or under-escape (a raw line terminator inside a non-template string literal, which is a SyntaxError that silently kills the entire runtime).

The validator originally couldn't catch this because the runtime is treated as opaque text by the manifest/data parser path. A capsule with a broken runtime could pass 18/21 + 3 warn + 0 fail while having zero working buttons.

Trajectory across spec versions:

| Batch | Spec version | Mitigation | Bug recurrence |
| --- | --- | --- | --- |
| 1–5 | v0.1.0 | none | 1/5 |
| 6–10 | v0.1.0 | none | 2/5 |
| 11–15 | v0.1.1 | prose tip in prompt fragment | 1/5 |
| 16–20 | v0.1.2 | promoted to numbered rule 11 + WRONG/RIGHT code example | 0/5 |

Finding: Promoting the rule from prose guidance to a numbered first-class rule with a concrete code example dropped recurrence from 1/5 to 0/5 in the next batch. All five v0.1.2 capsules used backtick template literals for the markdown export. One batch isn't proof, but recurrence fell with each mitigation (2/5 → 1/5 → 0/5), which is consistent with the hypothesis that LLMs follow mechanical syntax-level rules better than content-level "be careful" prose.

Belt-and-suspenders mitigation in v0.1.2: the validator also grew a regex check for the bug pattern (join(" or join(' followed by a raw line terminator) inside the runtime block.
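A check of this kind can be sketched in a few lines. The pattern below is an assumption about what such a regex might look like, not the one shipped in compiler/validate.py, and `has_raw_newline_join` is a hypothetical name.

```python
import re

# join(" or join(' immediately followed by a raw line terminator.
RAW_NEWLINE_IN_JOIN = re.compile(r"""join\(\s*["'](?:\r\n|\r|\n)""")


def has_raw_newline_join(runtime_js: str) -> bool:
    """True when the runtime source contains the under-escape bug pattern."""
    return RAW_NEWLINE_IN_JOIN.search(runtime_js) is not None


# Under-escaped: the JS source contains a real line break inside the quotes.
broken_js = 'const md = lines.join("\n");'
# Correct escape: the JS source contains backslash + n.
escaped_js = 'const md = lines.join("\\n");'
# Template literal: a real line break inside backticks is valid JavaScript.
template_js = "const md = lines.join(`\n`);"
```

The check is deliberately narrow: it catches the under-escape case that kills the runtime, while the over-escape case (a literal backslash-n in the output) is syntactically valid and needs a different test.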

Secondary recurring failure: spec_version cargo-cult from example block

A separate, lower-stakes authoring slip appeared in some LLM batches. The LLM correctly recorded source.spec_received: "v0.1.2 · 2026-05-16" (the Core version line it actually received) but set manifest.spec_version: "0.1.0" — cargo-culted from the example manifest block in the Core, which still showed the old version.
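The slip is mechanically detectable by comparing the two fields. The sketch below is a hypothetical consistency check, not necessarily one of the mitigations the project adopted; it assumes `source` is nested inside the manifest and that `spec_received` starts with a semantic version string.

```python
import re


def spec_version_mismatch(manifest: dict) -> bool:
    """True when spec_version disagrees with the version in source.spec_received."""
    received = manifest.get("source", {}).get("spec_received", "")
    match = re.search(r"v?(\d+\.\d+\.\d+)", received)
    declared = str(manifest.get("spec_version", "")).lstrip("v")
    # With no parseable spec_received there is nothing to compare against.
    return bool(match) and match.group(1) != declared


cargo_culted = {
    "spec_version": "0.1.0",
    "source": {"spec_received": "v0.1.2 · 2026-05-16"},
}
```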

Two mitigations landed together:

Tertiary recurring failure: JS-render-everything pattern (the constrained-renderer problem)

The most architecturally significant failure. Discovered in the photo capsule when AirDropped to iPhone — see F14 for full writeup. Spec response: Core v0.1.3 rule 12 — render content in the HTML at build time, not at runtime. Same numbered-rule + WRONG/RIGHT-example pattern that dropped the rule 11 bug class to 0/5. Empirically validated on two consecutive batches under v0.1.3 (10/10 PASS).

Quaternary recurring failure (mild): over-broad CSP directives

Pattern across two v0.1.3 batches: ~30% of capsules add defensive CSP directives (media-src, font-src, blob:) that the capsule doesn't actually use.

Severity: mild. Over-broad CSPs don't break anything — they just permit more than the capsule actually exercises. From a security standpoint they're still very restrictive (everything is 'none' or data: only — no host allowed). From a self-documentation standpoint they over-promise.

Spec response (still deferred): the pattern is consistent but consistently low-severity. No Core/spec change motivated yet. If a capsule ever declared 'self' or a host (which would be a real loosening), that would warrant a rule. Over-declaration that stays data:-only doesn't.
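The distinction between harmless over-declaration and a real loosening can be sketched as a small policy scan. This is an illustrative check under stated assumptions, not part of the reference validator: `real_loosenings` is a hypothetical name, only `*-src` directives are inspected, and `data:` and `blob:` are treated as the only acceptable unquoted sources.

```python
SAFE_SCHEMES = {"data:", "blob:"}


def parse_csp(policy: str) -> dict[str, list[str]]:
    """Split a CSP string into {directive: [source, ...]}; first duplicate wins."""
    directives: dict[str, list[str]] = {}
    for part in policy.split(";"):
        tokens = part.split()
        if tokens:
            directives.setdefault(tokens[0].lower(), tokens[1:])
    return directives


def real_loosenings(policy: str) -> list[tuple[str, str]]:
    """Return (directive, source) pairs that allow 'self' or a network host."""
    found = []
    for name, sources in parse_csp(policy).items():
        if not name.endswith("-src"):
            continue  # skip non-fetch directives such as sandbox
        for src in sources:
            is_keyword = src.startswith("'") and src.endswith("'")
            if src == "'self'" or not (is_keyword or src.lower() in SAFE_SCHEMES):
                found.append((name, src))
    return found


over_broad = (
    "default-src 'none'; img-src data:; media-src data: blob:; "
    "style-src 'unsafe-inline'; font-src data:"
)
loosened = "default-src 'none'; img-src 'self' data:; connect-src https://example.com"
```

Under this sketch the over-broad policy produces no findings, matching the "mild" severity above, while `'self'` or any host source is surfaced.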

Variance across runs (and what we can and can't control)

After producing 30+ LLM capsules across formal experiment rounds plus personal-use captures, the variance pattern is now clear:

Between producers (different models): Quality differs systematically. Thinking / extended-reasoning variants (Claude extended thinking, ChatGPT "Thinking" modes, Gemini deep-think) produce noticeably more careful capsules than standard variants — better personal-use defaults, light+dark themes, working markdown exports, CSP headers, richer data structures. This is repeatable and large enough to be worth noting prominently. The Core spec now includes a note encouraging thinking-mode use when available.

Within producer (same model, different runs): Real but smaller variance. Same model with same prompt produces different layouts, different CSS aesthetics, sometimes includes/omits the optional synthesis block. This is intrinsic LLM sampling variance (temperature), generally not user-controllable on web UIs. It is fine. The structural invariants (manifest, data, runtime, validation) hold across all the variance. Each capsule is still a valid capsule. We cannot expect bit-identical reproduction across runs and shouldn't aim for it — the variance is informative about how robust the format is to natural production noise.

Content-aware defaulting (correct behavior, not variance): Thinking variants correctly read social meaning of the conversation and set visibility accordingly. A conversation about sensitive content → visibility: "private", contains_private_data: true. A conversation about generic intellectual content → visibility: "shared". This isn't variance — it's the LLM doing context-aware honest defaulting on its own. Worth preserving as expected behavior.

Self-documenting capsules

Two optional manifest fields turn capsules into a self-documenting research record:

For LLM-produced capsules, these are encouraged. They let future readers correlate output with the spec version and prompt that produced it, without external bookkeeping.

The Core itself is version-stamped (first line of CAPSULE_CORE.md). Material changes bump the version and date. Git tags (core-v0.1.0 through core-v0.3.0) preserve historical versions retrievable via git show core-vX.Y.Z:CAPSULE_CORE.md.

Notable methodology choices

These weren't obvious at the start but proved important:

Usage drives: we don't design rules from a chair. Every spec move so far has been triggered by an empirical observation in the LLM corpus or the production pipeline — never by "this would be good design." The spec is the trailing indicator of what producers actually do, never the leading edge.

Thesis judges: when we observe something, the question is does this serve "memory object for work worth preserving" or undermine it? The answer determines the direction of the spec move:

| Observation type | Move | Examples |
| --- | --- | --- |
| Honest deviation (LLM reaches for a more accurate value) | Loosen: the spec was too narrow | source.origin: "web_research", synthesis.kind: "llm", loosened enums |
| Recurring failure (mechanical bug, broken rendering, lost meaning) | Tighten: add a numbered rule that names the failure | rule 11 (JS newline), rule 12 (JS-render-everything) |
| Emergent convention (LLMs invent a useful pattern unprompted) | Document: recognize it as a recommended convention without making it required | embedded_media field, sources array (now in §4.1.2 of the full spec) |
| Underexplored option (a useful behavior LLMs aren't choosing on their own) | Add prompt-fragment guidance: no new rule, just explicit permission/encouragement | v0.1.4 thoroughness + sources guidance |

Loosening, tightening, documenting, and guiding aren't opposites. They're four flavors of the same reactive mechanism, applied to different kinds of observation. The thesis is the constant; the spec is always catching up.

Why this matters: most spec design is generative — decide what the right way is, force practice to conform. That model produces specs that ossify and lose contact with reality. The reactive model produces specs that stay current with how producers actually behave. Same model as Markdown/CommonMark, HTML/WHATWG, Python idiom-layer/PEPs.

Limits this principle has, that we should be honest about:

  1. Bootstrap problem. v0.1.0 had to be something before any usage existed. The initial draft was unavoidably generative. Every revision since has been reactive.
  2. Requires a clear thesis. Without "memory object for work worth preserving" as the arbiter, we couldn't tell honest deviation from broken artifact. The thesis is doing real work; the principle would collapse without it.
  3. Requires willingness to unwind. If a rule we added turns out to be wrong, we have to remove it. v0.3 demonstrated this — capsule_id (slug) and related[] were deprecated when their consumer-side use case didn't materialize.
  4. Slow under pressure. When you want to build a new path NOW, the reactive principle says "watch what you build, then formalize." That's slower than designing the framework up front. We have to be willing to accept the slower path.

This is the project's first-rank methodological commitment.

  1. Document the failure with empirical evidence (multi-batch trajectory data).
  2. Promote the principle to a numbered Core rule (not a prose tip in the prompt fragment).
  3. Include a concrete code example showing WRONG vs RIGHT.
  4. Bump the Core version and re-test on the next batch.

This has now worked twice empirically:

| Rule | Failure class | Pre-numbered mitigation | Post-numbered result |
| --- | --- | --- | --- |
| 11 (v0.1.2) | JS string-literal newlines | prose tip in prompt fragment → 1/5 still failing | numbered rule + WRONG/RIGHT → 0/5 failing in next batch |
| 12 (v0.1.3) | JS-render-everything | no prior mitigation (pattern not recognized) | numbered rule + WRONG/RIGHT → 10/10 passing across two batches |

Two cases isn't a strong statistical sample, but the mechanism is consistent with the broader observation that LLMs reliably follow mechanical, syntactically-explicit rules better than they follow content-level advice. Worth treating as the default spec-evolution pattern going forward.

What this is NOT: a license to add more rules. Each numbered rule consumes prompt budget and cognitive load on the producer side. The bar for adding a rule remains "empirically recurring failure with no other available mitigation."

Project artifacts

| Artifact | Role |
| --- | --- |
| CAPSULE_CORE.md | One-page short spec, designed for LLM prompts (currently v0.3.0) |
| spec/CAPSULE_SPEC.md | Full normative spec (currently v0.3.2) |
| spec/DOMAIN_CAPSULES.md | Per-domain schemas (implementation_notes, design_system, exploration_map) |
| spec/SYSTEM_ARCHITECTURE.md | The four-layer architecture (private system / compiler / artifact / format profile) |
| spec/manifest.schema.json | JSON Schema for the manifest block |
| spec/response.schema.json | JSON Schema for response envelopes |
| spec/examples/ | Canonical example capsules (briefing, implementation_notes) |
| compiler/compile.py | Reference compiler, stdlib-only |
| compiler/validate.py | Reference validator (26 checks at v0.3.2) |
| templates/decision_board/ | First template: per-option decisions with verdict export |
| templates/news_capsule/ | Second template: annotated article with claims/entities/sources |
| examples/ | Sanitized JSON inputs for the compiler templates |
| GLOSSARY.md | Vocabulary, four-layer table, phase status |
| PRECEDENTS.md | Positioning against RO-Crate, TiddlyWiki, MPEG-21, C2PA, etc. |
| index.html | Project landing page, itself a valid Capsule |
| Git tags core-v0.1.0 through core-v0.3.0 | Historical Core versions retrievable via git show core-vX.Y.Z:CAPSULE_CORE.md |

Reproducibility

To rerun the LLM experiment yourself:

  1. Open a fresh chat with the LLM of your choice (Claude, Gemini, ChatGPT, or any model capable of reading attached files).
  2. Attach CAPSULE_CORE.md.
  3. Ask: "Using this spec, give me a summary of [topic] as a Capsule."
  4. Save the resulting HTML.
  5. Run python3 compiler/validate.py <file>.html to check conformance.

Expected result: roughly 22/25 pass with 3 warns (missing integrity block, capability-marker false negative). Different pattern? That's a finding — either the spec drifted, the LLM behaviour changed, or you've found a new edge case.

To re-derive the integrity hash from spec prose alone (as one independent producer did):

  1. Read spec/CAPSULE_SPEC.md §9.1.1 ("Content Hash Recipe — normative").
  2. Implement the canonical-JSON serialization + placeholder substitution rules in your language of choice.
  3. Compute the hash for the worked example given in the spec.
  4. Compare against the expected hash also given in the spec.

If your implementation produces the expected hash bit-identical, the spec is doing its job as a normative document. If it doesn't, the spec has a gap.

Status

As of v0.3.2 (2026-05-20):

Biggest unbuilt piece: author-side import tooling (registry + import.py). The producer side has matured significantly and the consumer side hasn't moved. The lightweight version (SQLite archive + pair viewer per F7) remains a candidate next concrete build.

Biggest untested area: cross-browser file:// behavior across Safari, Firefox, and Chrome. The format should work identically on file:// and http:// per spec — empirically this remains undertested.

How to read this project

This is a research project that produces a working spec and reference implementation as primary artifacts. The spec is the hypothesis. The fixtures (compiled and LLM-produced capsules) are the evidence. The findings document (this file) is the running narrative of what we've learned. Every commit message is part of the research log — the "why" of each change is preserved in git history.

The project does not have a single "result" or a release date. It's a working investigation. The most likely failure mode is spec inflation (the long spec grows beyond what anyone reads) and the second most likely is import-side abandonment (we keep polishing the producer side while the consumer side stays unbuilt). Both are explicitly tracked as risks.

The project is not trying to invent something. It's trying to articulate the discipline that's missing from a practice already underway.

About this page · manifest · exports

This is a sealed HTML Capsule per Core spec v0.3.0. Five required inline blocks, no network dependencies, integrity hash over data + manifest. The content above is rendered from RESEARCH.md by the deterministic compiler/build_md_capsules.py at the time of the last source change.
