Skip to content

feat: semantic icon search with VLM descriptions and embeddings#117

Draft
mmacpherson wants to merge 4 commits intomainfrom
feat/semantic-search
Draft

feat: semantic icon search with VLM descriptions and embeddings#117
mmacpherson wants to merge 4 commits intomainfrom
feat/semantic-search

Conversation

@mmacpherson
Copy link
Copy Markdown
Owner

Summary

Adds embedding-based semantic search so users can find Lucide icons by natural language queries instead of exact name matching.

  • search_icons("payment") → dollar-sign, banknote, receipt, credit-card...
  • search_icons("a cozy cabin in the woods") → tent-tree, armchair, tree-deciduous...
  • search_icons("ennui") → annoyed, meh, frown...

How it works

  1. Build time: Gemini 2.5 Flash Lite generates text descriptions from rendered icon PNGs + Lucide metadata (tags, categories)
  2. Build time: nomic-embed-text-v1.5-Q computes 768d embeddings with asymmetric search_query:/search_document: prefixes
  3. Query time: User's query is embedded locally via fastembed (ONNX, no GPU), cosine similarity against pre-computed vectors

What's included

  • search.py: Public API — search_icons(), search_available(), SearchResult
  • build_search.py: VLM + embedding pipeline with JSONL as durable intermediate
  • build_clusters.py: HDBSCAN discovers 88 semantic themes, Gemini Flash names them
  • lucide CLI: Unified subcommands — db, describe, build-search, search, cluster, version
  • Relational metadata: icon_tags (12,619), icon_categories (3,309), icon_aliases (248) in main DB
  • Pre-built data: 1,703 icon descriptions + embeddings + cluster assignments for Lucide v0.577.0
  • Inline icon rendering: Kitty graphics protocol for Ghostty/kitty/WezTerm
  • UMAP + HDBSCAN visualizations: Interactive Plotly HTML maps of the embedding space
  • 62 tests, ruff + mypy clean

Install & try

pip install 'python-lucide[search]'
lucide search "love"
lucide search "hard work and determination" -v

Optional extra keeps base package lightweight

pip install python-lucide          # zero deps, 796KB — unchanged
pip install python-lucide[search]  # adds fastembed (ONNX Runtime)

Search DB (~8MB) downloads on first use and is cached in ~/.cache/python-lucide/.

Test plan

  • 62 unit tests covering search API, build pipeline, CLI subcommands, clustering
  • Quality smoke test with 10-icon subset (positive/negative controls)
  • Full 1,703-icon index verified with diverse queries
  • All pre-commit hooks pass (ruff, mypy, pytest)
  • Test on clean install without dev dependencies
  • Verify search DB auto-download from GitHub release

🤖 Generated with Claude Code

mmacpherson and others added 4 commits March 26, 2026 21:48
Add embedding-based search so users can find Lucide icons by natural
language queries ("payment", "hard work", "owl") instead of exact name
matching.

Architecture:
- Gemini 2.5 Flash Lite generates rich text descriptions from rendered
  icon PNGs + Lucide metadata (tags, categories) at build time
- nomic-embed-text-v1.5-Q computes embeddings with asymmetric
  search_query/search_document prefixes
- Descriptions saved as JSONL (durable source of truth), embeddings
  stored in a separate SQLite search DB
- Search DB auto-downloaded on first use, cached locally
- Lucide repo auto-cloned for metadata during description generation

New modules:
- search.py: public API (search_icons, search_available, SearchResult)
- build_search.py: VLM + embedding build pipeline with JSONL intermediate
- build_clusters.py: HDBSCAN clustering + Gemini Flash theme naming
- cli.py: unified `lucide` CLI with subcommands (db, describe,
  build-search, search, cluster, version)

Main DB now includes relational metadata:
- icon_tags (12,619 rows), icon_categories (3,309), icon_aliases (248)
- All indexed for fast lookup

CLI search with inline icon rendering via Kitty graphics protocol
(Ghostty, kitty, WezTerm) with white background for visibility.

Optional extra: pip install python-lucide[search]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-built artifacts for semantic search:
- gemini-icon-descriptions.jsonl: VLM descriptions for all 1,703 icons
  (Gemini 2.5 Flash Lite, prompt hash 30eb8a53d63e)
- lucide-search.db: embeddings (nomic-embed-text-v1.5-Q, 768d) +
  descriptions + 88 HDBSCAN cluster assignments with Gemini-named themes
- lucide-icons.db: rebuilt with relational metadata tables
  (icon_tags, icon_categories, icon_aliases)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dev scripts and artifacts for exploring the icon embedding space:
- UMAP + HDBSCAN clustering discovers 88 semantic themes
- Gemini Flash names clusters from icon names alone
- Interactive Plotly HTML visualizations (category map + cluster map)
- Quality test script for validating search results
- beads issue tracker initialized with follow-up items

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Test `lucide search` with a mock search DB
- Test `lucide search` with nonexistent DB returns error
- Verify cluster data is loaded into search DB by build_search_db
- 62 tests total, all passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant