clew

Semantic code discovery for AI agents -- find code by meaning, not just name.

Ask natural-language questions about your codebase. clewdex indexes your code with AST-aware chunking, embeds it with Voyage AI, stores the vectors in Qdrant, and serves results through both a CLI (the clew command) and an MCP server that Claude Code can call directly.

Quick start

1. Install clewdex

pip install clewdex

Or with pipx, Homebrew, or npx:

pipx install clewdex                          # isolated install
brew install ruminaider/tap/clewdex           # macOS
npx clewdex                                   # run without installing

2. Start Qdrant

docker run -d -p 6333:6333 qdrant/qdrant:v1.16.1

3. Set your API key

export VOYAGE_API_KEY=pa-xxxxxxxxxxxxxxxxxxxx

Get a key at dash.voyageai.com.

4. Index and search

clew index /path/to/your/project --full
clew search "how do we handle authentication"

When to use clew vs grep

clew and grep are complementary: each handles a different kind of code discovery.

Use clew when you:

  • Need to find code by concept, not by identifier name ("where is error handling for the pharmacy API")
  • Are exploring an unfamiliar codebase and don't know what to search for
  • Want to trace structural relationships (call chains, inheritance, imports)
  • Need vocabulary bridging -- business language to code identifiers
  • Want to understand how a feature is implemented across multiple files

Use grep when you:

  • Know the exact pattern you're looking for (raise ValidationError, @celery_app.task)
  • Need exhaustive enumeration -- every instance of a pattern, guaranteed complete
  • Are matching literal text in comments, strings, or config files
  • Need to search text that lives outside the index (grep reads the raw files, so it reaches places the BM25 index cannot)

Use both (via agent skills) when you:

  • Need to discover a concept AND find all its instances
  • Are debugging and need both semantic context and exact pattern locations
  • Want to verify that clew found everything relevant

Features

  • Hybrid search -- Dense embeddings (Voyage voyage-code-3) + BM25 keyword matching fused with Reciprocal Rank Fusion, optionally re-ranked with Voyage rerank-2.5
  • Multi-vector architecture -- Three named vectors (signature, semantic, body) with intent-adaptive routing for precise retrieval
  • AST-aware chunking -- tree-sitter parses Python, TypeScript, and JavaScript into semantic units (functions, classes, components) with token-aware fallback splitting
  • Code relationship tracing -- Extracts imports, calls, inheritance, decorators, JSX renders, test mappings, and API boundaries; traversable via BFS graph queries
  • Incremental indexing -- Git-aware change detection (with file-hash fallback) so re-indexing only touches what changed
  • NL descriptions -- LLM-generated descriptions for undocumented code, prepended before embedding to improve search quality
  • Compact MCP responses -- ~20x token reduction by default; returns signatures + docstring previews instead of full source
  • Multi-collection -- Separate code and docs collections with intent-driven routing
  • Confidence self-assessment -- Z-score based confidence scoring included in results as informational metadata
  • Explicit exhaustive mode -- --mode exhaustive runs grep alongside semantic search for completeness when needed
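To make the hybrid search bullet concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is not clew's implementation: the function name, the sample chunk ids, and the constant k=60 (a commonly used default for RRF) are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[Hashable]], k: int = 60
) -> list[tuple[Hashable, float]]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))


# Hypothetical result lists from a dense retriever and a BM25 retriever.
dense = ["auth.py::login", "auth.py::verify_token", "views.py::index"]
bm25 = ["auth.py::verify_token", "utils.py::hash_password", "auth.py::login"]

fused = reciprocal_rank_fusion([dense, bm25])
for doc_id, score in fused:
    print(f"{score:.5f}  {doc_id}")
```

Because RRF only looks at ranks, it can combine dense similarity scores and BM25 scores without normalizing them onto a common scale. A chunk that appears near the top of both lists outranks one that tops only a single list.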

Competitive comparison

| Capability                      | clew    | grepai  | CodeSight | CodeGrok | Cursor |
|---------------------------------|---------|---------|-----------|----------|--------|
| Multi-vector search (3 vectors) | Yes     | No      | No        | No       | No     |
| BM25 hybrid + RRF fusion        | Yes     | No      | Yes       | No       | No     |
| Reranking (calibrated scores)   | Yes     | No      | No        | No       | ?      |
| Intent-adaptive routing         | Yes     | No      | No        | No       | No     |
| Relationship graph + trace      | 7 types | 2 types | No        | No       | No     |
| NL descriptions for code        | Yes     | No      | No        | No       | No     |
| Confidence self-assessment      | Yes     | No      | No        | No       | No     |
| MCP server                      | 5 tools | 3 tools | Yes       | 4 tools  | N/A    |
| Agent skills / cookbooks        | Yes     | No      | No        | No       | N/A    |
| Compact responses (token-aware) | 20x     | Yes     | ?         | Yes      | N/A    |
| Fully offline                   | Yes*    | Yes     | Yes       | Yes      | N/A    |
| Open source                     | Yes     | Yes     | Yes       | Yes      | No     |

* Qdrant runs locally, but embeddings and reranking require the Voyage AI API, so indexing and search still need network access.

CLI usage

clew index

Index a codebase for search.

# Incremental -- only re-index changed files
clew index /path/to/project

# Full reindex
clew index /path/to/project --full

# Generate NL descriptions for undocumented code (requires ANTHROPIC_API_KEY)
clew index /path/to/project --nl-descriptions

# Index specific files
clew index --files src/auth.py --files src/models.py
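The incremental mode depends on knowing which files changed since the last run. As a rough illustration of the file-hash fallback mentioned under Features, the sketch below compares SHA-256 digests against a previous snapshot. The function names and the snapshot format are hypothetical, and clew's actual cache layout may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Content hash of a file; any change to the bytes changes the digest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def detect_changes(paths, previous):
    """Compare files against a previous {path: digest} snapshot.

    Returns (changed, deleted, current), where changed covers new and
    modified files and current is the snapshot to store for the next run.
    """
    current = {str(p): file_digest(p) for p in paths}
    changed = sorted(p for p, digest in current.items() if previous.get(p) != digest)
    deleted = sorted(p for p in previous if p not in current)
    return changed, deleted, current


# Demo in a throwaway directory: index two files, then edit one and drop the other.
with tempfile.TemporaryDirectory() as tmp:
    a, b = Path(tmp) / "a.py", Path(tmp) / "b.py"
    a.write_text("x = 1\n")
    b.write_text("y = 2\n")
    _, _, snapshot = detect_changes([a, b], previous={})

    a.write_text("x = 42\n")
    changed, deleted, _ = detect_changes([a], previous=snapshot)
    changed_names = [Path(p).name for p in changed]
    deleted_names = [Path(p).name for p in deleted]

print(changed_names, deleted_names)
```

Only the files in changed would need re-chunking and re-embedding, and the entries in deleted would be removed from the index.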

clew search

Search the indexed codebase.

# Natural language query
clew search "where is the rate limiter configured"

# Explicit exhaustive mode -- runs grep alongside semantic search
clew search "all error handlers" --mode exhaustive

# Filter by language
clew search "database models" --language python

# Filter by chunk type
clew search "API endpoints" --chunk-type function

# Set intent explicitly (code, docs, debug, location)
clew search "why does login fail" --intent debug

# JSON output
clew search "user authentication" --raw

clew trace

Trace code relationships via BFS graph traversal.

# Show all relationships for an entity
clew trace "src/auth/models.py::User"

# Only inbound (what depends on this)
clew trace "src/auth/models.py::User" --direction inbound

# Limit depth and filter types
clew trace "src/api/views.py::handle_request" --depth 3 --type calls --type imports

# JSON output
clew trace "src/auth/models.py::User" --raw

Relationship types: imports, calls, inherits, decorates, renders, tests, calls_api
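For readers unfamiliar with BFS over a relationship graph, the following sketch shows the general idea behind the depth, direction, and type options. It is an illustration under stated assumptions, not clew's code: the in-memory edge list, the entity names other than those shown above, and the "outbound" direction value are all hypothetical.

```python
from collections import deque

# Hypothetical edges: (source, relationship_type, target).
EDGES = [
    ("src/api/views.py::handle_request", "calls", "src/auth/service.py::authenticate"),
    ("src/auth/service.py::authenticate", "calls", "src/auth/models.py::User"),
    ("src/api/views.py::handle_request", "imports", "src/auth/models.py::User"),
    ("tests/test_auth.py::test_login", "tests", "src/auth/service.py::authenticate"),
]


def trace(entity, direction="both", max_depth=2, relationship_types=None):
    """Breadth-first traversal; returns (depth, source, type, target) tuples."""
    seen_nodes = {entity}
    seen_edges = set()
    queue = deque([(entity, 0)])
    results = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for edge in EDGES:
            src, rel, dst = edge
            if relationship_types and rel not in relationship_types:
                continue
            if direction in ("outbound", "both") and src == node:
                neighbor = dst
            elif direction in ("inbound", "both") and dst == node:
                neighbor = src
            else:
                continue
            if edge in seen_edges:
                continue  # an edge can be reached from either endpoint
            seen_edges.add(edge)
            results.append((depth + 1, src, rel, dst))
            if neighbor not in seen_nodes:
                seen_nodes.add(neighbor)
                queue.append((neighbor, depth + 1))
    return results


inbound = trace("src/auth/models.py::User", direction="inbound", max_depth=1)
callers = trace(
    "src/auth/models.py::User",
    direction="inbound",
    max_depth=2,
    relationship_types=["calls"],
)
for depth, src, rel, dst in callers:
    print(depth, src, rel, dst)
```

BFS visits relationships in order of distance from the starting entity, so a depth limit cleanly bounds how far the result set spreads.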

clew status

Show system health and index statistics.

clew status

clew serve

Start the MCP server (stdio transport) for Claude Code integration.

clew serve

MCP integration

Add clewdex to Claude Code's .mcp.json:

{
  "mcpServers": {
    "clew": {
      "command": "clew",
      "args": ["serve"],
      "env": {
        "VOYAGE_API_KEY": "pa-xxxxxxxxxxxxxxxxxxxx",
        "QDRANT_URL": "http://localhost:6333"
      }
    }
  }
}

MCP tools

search

Semantic search over the indexed codebase.

search(query, limit=5, collection="code", active_file=None,
       intent=None, filters=None, detail="compact", mode=None)
  • detail="compact" (default) -- returns signature + docstring snippet
  • detail="full" -- returns complete source content
  • mode="exhaustive" -- runs grep alongside semantic search for completeness
  • filters -- metadata filters: language, chunk_type, app_name, layer, is_test
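To give a feel for what a compact result carries compared with a full one, here is a small sketch that reduces a Python function to its signature and the first line of its docstring. It uses the standard library ast module purely as a stand-in (clew itself parses with tree-sitter), and the function name, field names, and preview length are assumptions.

```python
import ast


def compact_view(source: str, preview_chars: int = 80) -> list[dict]:
    """Return a signature and docstring preview for each top-level function."""
    items = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            signature = f"def {node.name}({ast.unparse(node.args)})"
            doc = ast.get_docstring(node) or ""
            first_line = doc.splitlines()[0] if doc else ""
            items.append({"signature": signature, "preview": first_line[:preview_chars]})
    return items


# Hypothetical chunk content.
SOURCE = '''
def login(user, password):
    """Authenticate a user and start a session.

    Long details that a compact response would omit.
    """
    token = issue_token(user, password)
    return token
'''

compact = compact_view(SOURCE)
print(compact)
```

An agent can scan many compact entries cheaply and then request detail="full" only for the few chunks it actually needs to read.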

get_context

Read file content with optional related code chunks.

get_context(file_path, line_start=None, line_end=None, include_related=False)

explain

Search for context about a symbol or question in a file.

explain(file_path, symbol=None, question=None, detail="compact")

trace

Traverse code relationships (imports, calls, inheritance, etc.).

trace(entity, direction="both", max_depth=2, relationship_types=None)

index_status

Check health or trigger re-indexing.

index_status(action="status", project_root=None)

Configuration

Environment variables

| Variable          | Required | Default                     | Description                                     |
|-------------------|----------|-----------------------------|-------------------------------------------------|
| VOYAGE_API_KEY    | Yes      | --                          | Voyage AI API key for embeddings and re-ranking |
| QDRANT_URL        | No       | http://localhost:6333       | Qdrant server endpoint                          |
| QDRANT_API_KEY    | No       | --                          | Qdrant API key (if auth is enabled)             |
| CLEW_CACHE_DIR    | No       | Auto-detected from git root | SQLite cache directory (.clew/)                 |
| CLEW_LOG_LEVEL    | No       | INFO                        | Logging verbosity                               |
| ANTHROPIC_API_KEY | No       | --                          | Required for NL description generation          |

The cache directory resolves in order: CLEW_CACHE_DIR env var, then {git_root}/.clew/, then .clew/ relative to the working directory. This ensures the MCP server and CLI share the same cache.
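The resolution order described above can be sketched in a few lines of Python. This is an approximation, not clew's code: the function name is hypothetical, and using git rev-parse --show-toplevel to find the git root is an assumption about how the lookup might be done.

```python
import os
import subprocess
from pathlib import Path
from typing import Optional


def resolve_cache_dir(cwd: Optional[Path] = None) -> Path:
    """Resolve the cache directory: env var, then git root, then cwd."""
    cwd = cwd or Path.cwd()

    # 1. An explicit CLEW_CACHE_DIR always wins.
    override = os.environ.get("CLEW_CACHE_DIR")
    if override:
        return Path(override).expanduser()

    # 2. {git_root}/.clew/ when cwd is inside a git work tree.
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--show-toplevel"],
            cwd=cwd,
            capture_output=True,
            text=True,
            check=True,
        )
        return Path(result.stdout.strip()) / ".clew"
    except (subprocess.CalledProcessError, OSError):
        pass  # not a git repo, or git is not installed

    # 3. Fall back to .clew/ relative to the working directory.
    return cwd / ".clew"


print(resolve_cache_dir())
```

Steps 1 and 2 do not depend on the exact directory a process was launched from, which is what lets an MCP server started by an editor and a CLI run from a subdirectory land on the same cache.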

Project configuration (optional)

Create a config.yaml in your project root for fine-grained control:

project:
  name: "my-project"
  root: "."

collections:
  code:
    include:
      - "src/**/*.py"
      - "frontend/**/*.tsx"
    exclude:
      - "**/migrations/*.py"
      - "**/__pycache__/**"
  docs:
    include:
      - "**/*.md"

chunking:
  default_max_tokens: 3000
  overlap_tokens: 200

terminology_file: indexer/terminology.yaml

Architecture

                    +==============+
                    | Claude Code  |
                    | (MCP client) |
                    +------+-------+
                           | stdio
                    +------v-------+
                    |  MCP Server  |  search, get_context, explain, trace, index_status
                    +------+-------+
                           |
              +------------v------------+
              |     Search Pipeline     |
              |  enhance -> classify -> |
              |  hybrid search -> rerank|
              +------------+------------+
                           |
              +------------v------------+
              |    Qdrant Collections   |
              |  code: py/ts/tsx/js/jsx |
              |  docs: markdown         |
              +------------+------------+
                           |
              +------------v------------+
              |   Indexing Pipeline     |
              |  discover -> chunk ->   |
              |  enrich -> embed ->     |
              |  upsert + relationships |
              +------------+------------+
                           |
        +----------+-------+-------+----------+
        v          v       v       v          v
   tree-sitter  Voyage   SQLite   git     Anthropic
   (AST parse)  (embed)  (cache)  (diff)  (NL desc)

Search pipeline

  1. Query enhancement -- Terminology expansion via YAML (abbreviations, synonyms)
  2. Intent classification -- Heuristic routing: CODE, DOCS, DEBUG, LOCATION
  3. Hybrid search -- Dense + BM25 multi-prefetch with structural boosting (same-module, test files for debug intent)
  4. Re-ranking -- Voyage rerank-2.5 for final ordering
  5. Confidence assessment -- Z-score based, informational only (included in result metadata)
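As an illustration of step 5, one plausible way to compute a z-score based confidence signal is to measure how far the top score stands out from the rest of the result set. The formula and function name below are assumptions; the README does not specify how clew computes its z-score.

```python
from statistics import mean, pstdev


def top_result_z_score(scores: list[float]) -> float:
    """Standard deviations between the best score and the mean score."""
    if len(scores) < 2:
        return 0.0
    spread = pstdev(scores)
    if spread == 0:
        return 0.0  # all scores identical: nothing stands out
    return (max(scores) - mean(scores)) / spread


# Hypothetical reranker scores for two queries.
clear_winner = [0.9, 0.5, 0.5, 0.5, 0.5]
flat = [0.5, 0.5, 0.5]

print(round(top_result_z_score(clear_winner), 3))
print(top_result_z_score(flat))
```

A high value suggests one result clearly dominates, while a value near zero suggests the pipeline found nothing distinctive. Since the score is informational only, it would be attached to result metadata rather than used to filter results.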

When --mode exhaustive is specified, grep runs in parallel with the semantic pipeline and results are merged and deduplicated before returning.
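The exhaustive flow can be pictured as two searches run concurrently followed by a merge. The sketch below uses stub search functions and deduplicates on (file, line). The stubs, the hit dictionary shape, and the deduplication key are all hypothetical choices made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def semantic_search(query):
    # Stand-in for the semantic pipeline; returns canned hits.
    return [
        {"file": "api/errors.py", "line": 10, "source": "semantic"},
        {"file": "api/views.py", "line": 42, "source": "semantic"},
    ]


def grep_search(pattern):
    # Stand-in for a grep run; returns canned hits.
    return [
        {"file": "api/errors.py", "line": 10, "source": "grep"},
        {"file": "tasks/retry.py", "line": 7, "source": "grep"},
    ]


def merge_results(semantic, grep_hits):
    """Keep the first hit per (file, line); semantic hits take precedence."""
    merged, seen = [], set()
    for hit in [*semantic, *grep_hits]:
        key = (hit["file"], hit["line"])
        if key in seen:
            continue
        seen.add(key)
        merged.append(hit)
    return merged


def exhaustive_search(query, pattern):
    # Run both searches concurrently, then merge once both have finished.
    with ThreadPoolExecutor(max_workers=2) as pool:
        semantic_future = pool.submit(semantic_search, query)
        grep_future = pool.submit(grep_search, pattern)
        return merge_results(semantic_future.result(), grep_future.result())


merged = exhaustive_search("all error handlers", "except ")
for hit in merged:
    print(hit["source"], hit["file"], hit["line"])
```

Listing semantic hits first means a location found by both searches keeps its semantic metadata, while grep contributes only the locations the semantic pipeline missed.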

Chunking strategy

| File pattern | Strategy                           | Token range   |
|--------------|------------------------------------|---------------|
| models.py    | Class + fields as unit             | 1,500 - 3,000 |
| views.py     | Class as unit; split large actions | 2,000 - 4,000 |
| tasks.py     | Function with decorators           | 1,000 - 2,000 |
| *.tsx, *.jsx | Component boundaries               | 1,500 - 3,000 |
| *.md         | Section-level by headers           | 1,000 - 2,000 |
| Migrations   | Skipped                            | --            |

Fallback chain: tree-sitter AST -> token-recursive splitting -> line-based splitting.
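The fallback chain amounts to trying each strategy in order and taking the first one that succeeds. The sketch below shows that control flow with simplified stand-ins: Python's ast module replaces tree-sitter, whitespace-separated words replace real token counting, and every function name is hypothetical.

```python
import ast
from typing import Optional


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def ast_chunks(source: str, max_tokens: int) -> Optional[list[str]]:
    """One chunk per top-level statement; None if parsing or sizing fails."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None
    chunks = [ast.get_source_segment(source, node) for node in tree.body]
    if not chunks or any(c is None or count_tokens(c) > max_tokens for c in chunks):
        return None
    return chunks


def token_recursive_chunks(source: str, max_tokens: int) -> Optional[list[str]]:
    """Split on blank lines, recursing into oversized parts; None if stuck."""
    if count_tokens(source) <= max_tokens:
        return [source] if source.strip() else []
    parts = [p for p in source.split("\n\n") if p.strip()]
    if len(parts) < 2:
        return None  # no paragraph boundary left to split on
    out = []
    for part in parts:
        sub = token_recursive_chunks(part, max_tokens)
        if sub is None:
            return None
        out.extend(sub)
    return out


def line_chunks(source: str, max_tokens: int) -> list[str]:
    """Last resort: pack whole lines greedily; always succeeds."""
    chunks, current = [], []
    for line in source.splitlines():
        if current and count_tokens("\n".join(current + [line])) > max_tokens:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def chunk(source: str, max_tokens: int = 3000):
    """Return (strategy_name, chunks) from the first strategy that succeeds."""
    for strategy in (ast_chunks, token_recursive_chunks):
        result = strategy(source, max_tokens)
        if result is not None:
            return strategy.__name__, result
    return "line_chunks", line_chunks(source, max_tokens)


valid = "def a():\n    return 1\n\ndef b():\n    return 2\n"
broken = "def broken(:\n    pass\n\nx = = 1\n"

print(chunk(valid)[0])
print(chunk(broken, max_tokens=4)[0])
print(chunk(broken, max_tokens=3)[0])
```

Each stage trades structural fidelity for robustness: the first preserves semantic units, the second preserves paragraph boundaries, and the last guarantees that even unparseable files still get indexed.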

Development

Setup

git clone https://github.com/ruminaider/clew.git
cd clew
pip install -e ".[dev]"

Tests

# All tests with coverage
pytest --cov=clew -v

# Integration tests (requires running Qdrant)
pytest -m integration

# Single test file
pytest tests/search/test_hybrid.py -v

Linting and type checking

ruff format .           # Format
ruff check .            # Lint
mypy clew/              # Type check (strict mode)

Project structure

clew/
+-- chunker/             # AST parsing, language strategies, token counting
+-- clients/             # External service wrappers (Voyage, Qdrant, Anthropic)
+-- indexer/             # Pipeline, caching, change detection, relationship extraction
|   +-- extractors/      # Pluggable per-language relationship extractors
+-- search/              # Engine, hybrid retrieval, intent classification, re-ranking
+-- cli.py               # Typer CLI
+-- mcp_server.py        # FastMCP server (5 tools)
+-- config.py            # Environment variable loading
+-- factory.py           # Component wiring (no global state)
+-- models.py            # Pydantic v2 config models
+-- exceptions.py        # Error hierarchy with fix hints
+-- discovery.py         # File discovery with ignore patterns and safety checks
+-- safety.py            # File size, chunk count, collection limits

Troubleshooting

| Problem                          | Fix                                                                                     |
|----------------------------------|-----------------------------------------------------------------------------------------|
| Qdrant not running               | docker compose up -d qdrant or docker run -d -p 6333:6333 qdrant/qdrant:v1.16.1         |
| VOYAGE_API_KEY not set           | export VOYAGE_API_KEY=pa-...                                                            |
| No search results                | Run clew index --full to reindex                                                        |
| MCP server can't find cache      | Set CLEW_CACHE_DIR to an absolute path, or run from within the git repo                 |
| Stale results after code changes | Run clew index (incremental) to pick up changes                                         |

License

MIT
