Like cloc, but counts Claude tokens instead of lines.
```
$ ctoc src/
---------------------
Ext    files   tokens
---------------------
.rs       17   52,000
.py        5   12,340
.ts        3    4,200
---------------------
SUM       25   68,540
---------------------
```
- Self-contained C++17 binary — no runtime dependencies.
- Greedy longest-match tokenizer built from 36,495 reverse-engineered Claude tokens.
- ~4% error vs the Anthropic `count_tokens` API across 30 tested files (tiktoken and bytes/4 undercount by 20%+).
- Processes ~1M tokens/sec including file I/O.
```
bazel build //:ctoc
cp bazel-bin/ctoc /usr/local/bin/
```

Cross-compile with `--config={linux_amd64,linux_arm64,macos_amd64,macos_arm64}` via hermetic `zig cc`.
```
ctoc .                                     # tokenize current project
ctoc --by-file src/                        # per-file breakdown
ctoc --include-ext .py --include-ext .js   # only Python and JS
ctoc --exclude-dir vendor .                # skip vendor/
```

- At build time, `gen_vocab.py` converts `vocab.json` into a C++ array
- At runtime, tokens are inserted into a trie and files are tokenized via greedy longest-match
- Vocabulary was extracted by probing Anthropic's `count_tokens` API ~276K times — see REPORT.md