nice-ass-captions

Burn stylish captions into videos using whisper and ffmpeg. Runs locally. Powered by ASS™.

Features

Word-level highlighting — the active word is bright, the rest are dim
Rounded semi-transparent background box
Optional palette-derived text and box colors with accessible contrast — one pair for the whole video, or colors that follow the imagery block by block
Accurate local transcription via whisper.cpp running on-device (Apple Silicon Metal acceleration)
No API keys, no cloud, no data leaves your machine
One command from raw video to captioned video

How it works

1. Extract audio: ffmpeg pulls a 16 kHz mono WAV from the input video.
2. Transcribe: whisper-cli transcribes the audio with word-level timestamps via DTW forced alignment, writing a .wts script with per-word timing windows.
3. Parse and normalize: the .wts is parsed to extract word timestamps; contractions, punctuation, and compound proper nouns (e.g. "Cloudflare") are merged into single tokens.
4. Generate ASS: words are grouped into ~5-word chunks; each chunk becomes one ASS Dialogue line using \1a alpha animations and zero-duration \t() transitions to highlight the active word; the background is a \p1 vector-drawn rounded rectangle.
5. Burn in: ffmpeg with libass renders the ASS onto the video frames via the subtitles= video filter.
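
The first and last steps above (extract audio and burn in) are plain ffmpeg invocations. As a minimal sketch, they could be assembled like this. The helper names extract_audio_cmd and burn_in_cmd are hypothetical, and the exact flags caption.py passes may differ. In particular, copying the audio stream unchanged is an assumption.

```python
import shlex
from pathlib import Path


def extract_audio_cmd(video: Path, wav: Path) -> list[str]:
    """Build an ffmpeg command that writes a 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", str(video),
        "-vn",                 # ignore the video stream
        "-ar", "16000",        # 16 kHz sample rate
        "-ac", "1",            # mono
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        str(wav),
    ]


def burn_in_cmd(video: Path, ass: Path, output: Path) -> list[str]:
    """Build an ffmpeg command that renders an ASS file onto the frames."""
    return [
        "ffmpeg", "-y",
        "-i", str(video),
        # Simple relative filenames work as-is; paths containing ':' or ','
        # need escaping for the filter graph parser.
        "-vf", f"subtitles={ass}",
        "-c:a", "copy",        # assumption: pass the audio through unchanged
        str(output),
    ]


if __name__ == "__main__":
    video = Path("video.mp4")
    print(shlex.join(extract_audio_cmd(video, Path("video.wav"))))
    print(shlex.join(burn_in_cmd(video, Path("video.ass"),
                                 Path("video-captioned.mp4"))))
```

Building the commands as lists keeps them safe to pass to subprocess.run without shell quoting.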

The ASS format

Advanced SubStation Alpha (ASS) is an open subtitle format that has been the standard for high-quality fansubbing since 2003. It is far more expressive than SRT — supporting per-word timing, animations, transparency, rotation, blur, and vector drawing.

libass is the open source rendering library for ASS. It is used by mpv, VLC, and most serious media players, and by ffmpeg when ffmpeg is built with libass support. nice-ass-captions uses libass via ffmpeg's subtitles= video filter to burn captions directly into the video frame.

The specific ASS features this tool uses:

\p1 — vector drawing commands for the rounded rectangle background
\1a — per-word primary alpha (transparency) control
\t(t1,t2,tags) — zero-duration transitions to snap opacity at precise millisecond offsets
\an2\pos(x,y) — absolute bottom-center positioning so all lines in a chunk render at the same location
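
To make the tags above concrete, here is a rough sketch of how one chunk of timed words could be turned into a single Dialogue line. It is not the code in caption.py: the function names, the (text, start, end) tuple shape, and the "Default" style name are assumptions for illustration, and the background box drawing is left out.

```python
def ass_time(seconds: float) -> str:
    """Format seconds as the H:MM:SS.cc timestamp used by ASS events."""
    cs = round(seconds * 100)
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, cs = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"


def chunk_words(words, size=5):
    """Split a list of words into consecutive groups of at most `size`."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def dialogue_line(chunk, x, y, dim="&H99&", bright="&H00&"):
    """Build one Dialogue line for a chunk of (text, start_s, end_s) words.

    Each word starts dim, snaps to bright when it begins, and snaps back to
    dim when it ends. \\t offsets are milliseconds from the line start.
    """
    line_start, line_end = chunk[0][1], chunk[-1][2]
    parts = []
    for text, start, end in chunk:
        on = round((start - line_start) * 1000)
        off = round((end - line_start) * 1000)
        tags = (
            f"\\1a{dim}"
            f"\\t({on},{on},\\1a{bright})"   # zero-duration: snap to bright
            f"\\t({off},{off},\\1a{dim})"    # zero-duration: snap back to dim
        )
        parts.append("{" + tags + "}" + text)
    position = "{" + f"\\an2\\pos({x},{y})" + "}"
    return (
        f"Dialogue: 0,{ass_time(line_start)},{ass_time(line_end)},"
        f"Default,,0,0,0,," + position + " ".join(parts)
    )


if __name__ == "__main__":
    words = [("Hello", 1.0, 1.5), ("world", 1.5, 2.0)]
    for chunk in chunk_words(words):
        print(dialogue_line(chunk, 960, 1000))
```

Because the whole chunk lives in one Dialogue line at a fixed \pos, the text never shifts while the highlight moves from word to word.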

The Aegisub editor and its ASS tag reference are the best resources for understanding what ASS can do.

Prerequisites

brew install whisper-cpp ffmpeg-full

whisper-cpp provides whisper-cli. ffmpeg-full is the Homebrew variant that includes libass support for subtitle rendering. The standard ffmpeg formula does not include libass.
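
If you are unsure whether your ffmpeg build includes libass, one way to check is to look for the subtitles filter in the output of ffmpeg -filters. The sketch below is an illustrative helper, not part of nice-ass-captions. The function names are hypothetical, and it assumes each filter row lists flags first and the filter name second.

```python
import shutil
import subprocess


def has_subtitles_filter(filters_output: str) -> bool:
    """Return True if `ffmpeg -filters` output lists a `subtitles` filter."""
    for line in filters_output.splitlines():
        fields = line.split()
        # Assumed row layout: "<flags> <name> <inputs->outputs> <description>"
        if len(fields) >= 2 and fields[1] == "subtitles":
            return True
    return False


def check_prerequisites() -> list[str]:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    for tool in ("whisper-cli", "ffmpeg"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    if shutil.which("ffmpeg") is not None:
        result = subprocess.run(
            ["ffmpeg", "-hide_banner", "-filters"],
            capture_output=True, text=True,
        )
        if not has_subtitles_filter(result.stdout):
            problems.append("ffmpeg has no subtitles filter (built without libass?)")
    return problems
```

Calling check_prerequisites() returns an empty list when both tools are installed and the subtitles filter is available.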

Usage

# Clone and run
git clone https://github.com/zeke/nice-ass-captions
cd nice-ass-captions
uv run caption video.mp4
uv run caption video.mp4 --position top

Output is saved to video-captioned.mp4 in the same directory as the input.

Use --colorize to pull text and background colors from the video imagery. See Colorize for the two modes.

uv run caption video.mp4 --colorize global
uv run caption video.mp4 --colorize per-chunk

Use --prompt to give whisper.cpp spelling and vocabulary hints before transcription, or --transcript to pass a script file for the same purpose:

uv run caption video.mp4 --prompt "Cloudflare, Browser Rendering, WebAssembly, Zeke"
uv run caption video.mp4 --transcript script.txt

Good prompt terms include:

Product names, company names, project names, and people names
Acronyms, technical terms, jargon, and uncommon words
Words with expected casing or punctuation, such as Workers AI, don't, or you're

Keep the prompt short and comma-separated. It is context, not a script. The prompt biases transcription toward those terms, but it does not force them into the output.

Use --transcript when you have the raw script or captions. The file is passed to whisper.cpp as an initial prompt with --carry-initial-prompt, so it can improve spellings and punctuation across segments. It is still guidance, not forced alignment.

Colorize

By default captions use the built-in white-on-black style. --colorize instead derives the text and background colors from the video imagery. It has two modes.

uv run caption video.mp4 --colorize global
uv run caption video.mp4 --colorize per-chunk

--colorize global samples frames across the whole video, builds a dominant-color palette, and picks one text/box pair for the entire run. --colorize per-chunk samples a frame at each caption block's timestamp and picks a pair per block, so the colors track what is on screen as the captions advance.

Both modes share the same accessibility rules:

Text is checked against the semi-transparent box composited over the sampled pixels, and must clear the MIN_CONTRAST ratio (4.5:1, WCAG AA for normal text).
Among readable pairs, saturated and visibly colored combinations are preferred over washed-out near-white or near-black ones.
If the sampled palette cannot produce a readable pair, it falls back to a safer high-contrast pair.
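
The contrast rule above can be sketched with the standard WCAG 2.x relative-luminance and contrast-ratio formulas. This is an illustration of the check as described, not the code in caption.py. The function names and the RGB-tuple inputs are assumptions.

```python
def composite(fg, bg, opacity):
    """Alpha-blend fg over bg; opacity 1.0 means fg fully covers bg."""
    return tuple(opacity * f + (1.0 - opacity) * b for f, b in zip(fg, bg))


def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB color with 0-255 channels."""
    def linear(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(a, b):
    """WCAG contrast ratio between two colors, from 1.0 up to 21.0."""
    la, lb = relative_luminance(a), relative_luminance(b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def is_readable(text, box, box_opacity, sampled, min_contrast=4.5):
    """Check text against the box as it appears over the sampled pixels."""
    effective_box = composite(box, sampled, box_opacity)
    return contrast_ratio(text, effective_box) >= min_contrast


if __name__ == "__main__":
    white, black = (255, 255, 255), (0, 0, 0)
    print(round(contrast_ratio(white, black), 1))
    print(is_readable(white, black, 0.72, sampled=white))
```

Compositing first matters because a semi-transparent box over a bright frame is lighter than the box color alone, which lowers the contrast that light text actually gets.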

Because per-chunk samples the frames each block actually appears over, its contrast check reflects the real background more closely than the global check does. Colors currently change abruptly between blocks (hard cuts); smooth color tweening between blocks may come later.

When colorizing, the console prints the chosen colors, the box opacity, and the measured contrast ratio. In global mode this is a single line; in per-chunk mode one line is printed per caption block:

Colorize: text #F5E9C8, box #1A140B (72% opaque), contrast 9.3:1   # global
  block 1: text #F5E9C8, box #1A140B (72% opaque), contrast 9.3:1  # per-chunk

Options

| Flag | Default | Description |
| --- | --- | --- |
| `-o, --output PATH` | `<input>-captioned.mp4` | Output file path |
| `--model PATH\|NAME` | `~/.cache/nice-ass-captions/ggml-large-v3-turbo.bin` | Whisper model path or short name (e.g. `small.en`) |
| `--prompt TEXT` | none | Initial prompt for whisper; improves accuracy for domain-specific proper nouns |
| `--transcript PATH` | none | Raw transcript text file to use as a whisper prompt |
| `--words N` | `5` | Words per caption chunk |
| `--position top\|center\|bottom` | `bottom` | Caption position |
| `--colorize global\|per-chunk` | off | Derive text and background colors from the video imagery (whole-video pair or per-block) |
| `--keep-tmp` | off | Keep intermediate `.wav` and `.wts` files |

Models

Models live in ~/.cache/nice-ass-captions/. Download with:

mkdir -p ~/.cache/nice-ass-captions

# large-v3-turbo — recommended/default (best accuracy/speed tradeoff)
curl -L -o ~/.cache/nice-ass-captions/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

# small.en — faster, slightly less accurate
curl -L -o ~/.cache/nice-ass-captions/ggml-small.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.en.bin

| Model | Size | Notes |
| --- | --- | --- |
| `tiny.en` | 75 MB | Fastest, basic accuracy |
| `base.en` | 142 MB | Fast, decent accuracy |
| `small.en` | 466 MB | Good balance |
| `medium.en` | 1.5 GB | High accuracy, English-only |
| `large-v3-turbo` | 1.6 GB | Recommended/default, multilingual, fast on Apple Silicon |
| `large-v3` | 3.1 GB | Highest accuracy, multilingual, slower |

On Apple Silicon, whisper.cpp runs via Metal (GPU), so even the large models transcribe a short video in seconds.

Style

All visual parameters are constants at the top of caption.py:

| Constant | Default | Description |
| --- | --- | --- |
| `FONT_NAME` | `Arial` | Caption font family |
| `FONT_SIZE` | `72` | Font size in script pixels |
| `PAD_X` | `24` | Horizontal padding inside background box |
| `PAD_Y` | `28` | Vertical padding inside background box |
| `CORNER_R` | `18` | Background box corner radius |
| `MARGIN_BOTTOM` | `80` | Distance from bottom of frame to text |
| `CAPTION_POSITION` | `bottom` | Default caption position: `top`, `center`, or `bottom` |
| `ALPHA_DIM` | `&H99&` | Inactive word alpha (~60% transparent, i.e. ~40% opaque) |
| `ALPHA_BRIGHT` | `&H00&` | Active word alpha (fully opaque) |
| `COL_TEXT` | `&H00FFFFFF` | Caption text color |
| `COL_BOX` | `&H48000000` | Background box color and alpha (~72% opaque) |
| `MIN_CONTRAST` | `4.5` | Minimum contrast ratio for `--colorize` colors |
| `WORDS_PER_CHUNK` | `5` | Words per caption group |

Colors use ASS format: &HAABBGGRR where AA is alpha (00 = opaque, FF = transparent).
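
Because the byte order is reversed relative to the usual #RRGGBB notation and the alpha scale is inverted, a small converter can help when editing these constants. The sketch below is illustrative only. The function names ass_color and opacity_percent are hypothetical and not part of caption.py.

```python
def ass_color(hex_rgb: str, alpha: int = 0x00) -> str:
    """Convert '#RRGGBB' plus an ASS alpha byte into '&HAABBGGRR'.

    alpha follows the ASS convention: 0x00 is opaque, 0xFF is transparent.
    """
    value = hex_rgb.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected #RRGGBB, got {hex_rgb!r}")
    if not 0 <= alpha <= 0xFF:
        raise ValueError(f"alpha must be 0-255, got {alpha}")
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return f"&H{alpha:02X}{b:02X}{g:02X}{r:02X}"


def opacity_percent(alpha: int) -> int:
    """Translate an ASS alpha byte into a conventional opacity percentage."""
    return round((255 - alpha) / 255 * 100)


if __name__ == "__main__":
    print(ass_color("#FFFFFF"))          # white, fully opaque
    print(ass_color("#000000", 0x48))    # black box with alpha byte 0x48
    print(opacity_percent(0x48))         # opacity of that box in percent
```

With these helpers, the default COL_BOX of &H48000000 reads as black at roughly 72% opacity, and an alpha byte of 0x99 reads as roughly 40% opacity.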

Compound word merges

whisper.cpp occasionally splits compound words at subword boundaries (e.g. "Cloudflare" → Cloud + fl + are). The COMPOUND_MERGES list in caption.py handles known cases. Add entries for proper nouns specific to your content:

COMPOUND_MERGES = [
    ["cloud", "fl", "are"],   # Cloudflare
    ["my", "company"],        # MyCompany — add your own
]

The --prompt flag is also effective for proper nouns — passing --prompt "Cloudflare, WebAssembly" biases the model toward recognizing them correctly before the merge step is needed.
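
As a rough illustration of the merge step, the sketch below joins consecutive tokens whose lowercased text matches an entry in COMPOUND_MERGES and keeps the first token's start time and the last token's end time. The function name merge_compounds and the (text, start, end) tuple shape are assumptions, and tokens are assumed to be already stripped of surrounding whitespace. The real logic in caption.py may differ.

```python
COMPOUND_MERGES = [
    ["cloud", "fl", "are"],   # Cloudflare
]


def merge_compounds(words, merges=COMPOUND_MERGES):
    """Merge known split compounds in a list of (text, start, end) tuples."""
    merged = []
    i = 0
    while i < len(words):
        for parts in merges:
            window = words[i:i + len(parts)]
            if [text.lower() for text, _, _ in window] == parts:
                joined = "".join(text for text, _, _ in window)
                # Span from the first fragment's start to the last one's end.
                merged.append((joined, window[0][1], window[-1][2]))
                i += len(parts)
                break
        else:
            # No merge pattern matched at this position; keep the word as-is.
            merged.append(words[i])
            i += 1
    return merged


if __name__ == "__main__":
    tokens = [("Cloud", 0.0, 0.3), ("fl", 0.3, 0.4),
              ("are", 0.4, 0.6), ("rocks", 0.6, 1.0)]
    print(merge_compounds(tokens))
```

Matching is case-insensitive while the original casing of each fragment is preserved in the joined word, which is why the entries in the list are lowercase.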
