Voices · Archived 2026-05-20 · Andrej Karpathy

On HTML as the current best default for LLM output

A four-step progression — raw text → markdown → HTML → eventually interactive neural simulations — and a practical tip: ask the LLM to structure its response as HTML.

Author: Andrej Karpathy (@karpathy)
Venue: X (formerly Twitter)
Source: x.com/karpathy/status/2053872850101285137
Original date: not verified at archive time
Archived by: B. F. Garden (htmlcapsule)
Method: verbatim transcription from public post

— begin quote —

This works really well btw, at the end of your query ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc.

More generally, imo audio is the human-preferred input to AIs but vision (images/animations/video) is the preferred output from them. Around a ~third of our brains are a massively parallel processor dedicated to vision, it is the 10-lane superhighway of information into brain. As AI improves, I think we'll see a progression that takes advantage:

1) raw text (hard/effortful to read)

2) markdown (bold, italic, headings, tables, a bit easier on the eyes) <-- current default

3) HTML (still procedural with underlying code, but a lot more flexibility on the graphics, layout, even interactivity) <-- early but forming new good default

...4,5,6,...

n) interactive neural videos/simulations

Imo the extrapolation (though the technology doesn't exist just yet) ends in some kind of interactive videos generated directly by a diffusion neural net. Many open questions as to how exact/procedural "Software 1.0" artifacts (e.g. interactive simulations) may be woven together with neural artifacts (diffusion grids), but generally something in the direction of the recently viral https://x.com/zan2434/status/2046982383430496444.

There are also improvements necessary and pending at the input. Audio nor text nor video alone are not enough, e.g. I feel a need to point/gesture to things on the screen, similar to all the things you would do with a person physically next to you and your computer screen.

TLDR The input/output mind meld between humans and AIs is ongoing and there is a lot of work to do and significant progress to be made, way before jumping all the way into neuralink-esque BCIs and all that. For what's worth exploring at the current stage, hot tip try ask for HTML.

— end quote —

Why this is in the archive

The reason Capsule exists at all is the step Karpathy names as #3: HTML, "early but forming new good default." Capsule isn't trying to invent that step — Anthropic's tools (Claude artifacts, Claude Design), the broader LLM-emitting-HTML pattern, and the substrate observation made it inevitable. Capsule's job is the discipline layer ON the HTML the AIs are already producing: a manifest at the top, a contract on the file, content pre-rendered so it still works when scripts don't, capabilities that don't lie, no network calls. A spec for what "a usable HTML artifact" means when LLMs are the ones writing it.

The progression also names something the Capsule project tries to take seriously: today's load-bearing step is HTML, but it isn't the last step. There will be #4, #5, #n. A format that tries to be the final word on AI-generated artifacts will age badly. Capsule is deliberately scoped to the current step — sealed HTML files with a contract — and the spec discipline ("no addition without empirical pressure") is designed to let the format evolve or be superseded honestly rather than calcify.

The practical "hot tip try ask for HTML" is also exactly what Capsule's produce prompt does — except the prompt asks for HTML conforming to a 12-rule contract, so the file you get back is sealed and inspectable rather than just "a file in your browser."
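
One of the properties named above, "no network calls," lends itself to a mechanical check. What follows is a minimal sketch and not Capsule's actual validator or any of its 12 rules: the names `find_remote_refs`, `RemoteRefFinder`, `FETCHING_ATTRS` and `REMOTE_PREFIXES` are hypothetical, and the check is an assumption about what "sealed" might mean in practice. It only inspects tag attributes using Python's standard `html.parser`, so it would not catch `fetch()` calls inside scripts or `url(...)` references inside CSS.

```python
from html.parser import HTMLParser

# Hypothetical list: attributes that commonly make a browser fetch a resource.
FETCHING_ATTRS = {"src", "href", "action", "poster", "data"}
# Prefixes treated as "remote". Relative paths and data: URIs are not flagged.
REMOTE_PREFIXES = ("http://", "https://", "//")


class RemoteRefFinder(HTMLParser):
    """Collects (tag, attribute, value) triples that point at a remote URL."""

    def __init__(self):
        super().__init__()
        self.remote_refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Valueless attributes arrive as None and are skipped.
            if name not in FETCHING_ATTRS or not value:
                continue
            # Assumption: a plain <a href> is navigation the reader chooses,
            # not a fetch the file performs on load, so it is not flagged.
            if tag == "a" and name == "href":
                continue
            if value.strip().lower().startswith(REMOTE_PREFIXES):
                self.remote_refs.append((tag, name, value))


def find_remote_refs(html_text):
    """Return every remote reference found in the given HTML string."""
    finder = RemoteRefFinder()
    finder.feed(html_text)
    finder.close()
    return finder.remote_refs
```

An empty list from `find_remote_refs` means no tag attribute points off the file; it does not prove the file makes no network calls, since script and stylesheet bodies are not inspected.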

Manifest