KeyHog

The secret scanner that finds what others miss.

887 detectors · ML-scored confidence · decode-through scanning · live verification
Finds base64-encoded, hex-wrapped, and nested secrets that regex-only scanners miss entirely.


$ keyhog scan --path .

  ██   ██ ████████ ██    ██ ██   ██  ██████   ██████
  ██  ██  ██        ██  ██  ██   ██ ██    ██ ██
  █████   █████      ████   ███████ ██    ██ ██   ███
  ██  ██  ██          ██    ██   ██ ██    ██ ██    ██
  ██   ██ ████████    ██    ██   ██  ██████   ██████
   v0.2.0 · Secret Scanner · 887 detectors
  by SanthSecurity

  critical  82%  ██████░░  GitHub Classic PAT
                 ghp_...7890  src/config.py:42
  critical  78%  █████░░░  Stripe Secret Key
                 sk_l...ab12  .env:7
  critical  78%  █████░░░  GitHub PAT (decoded from base64)
                 ghp_...7890  k8s/secret.yaml:12

  3 secrets found · 2 unique credentials · 0 false positives

Why KeyHog

Most secret scanners run regex against plaintext. They miss anything encoded, embedded, or obfuscated. KeyHog doesn't.

Decode-through scanning recursively unwraps base64, hex, URL encoding, quoted-printable, and Unicode escapes before pattern matching — catching secrets buried in Kubernetes manifests, CI configs, Docker layers, and compiled artifacts that other tools never see.

ML confidence scoring uses a 3,969-parameter neural network trained on 200K real credentials to separate secrets from hashes, test fixtures, and documentation strings. Every finding comes with a 0–100% score. Zero false positives at the default 70% threshold.

Live verification hits real APIs (AWS, GitHub, Stripe, Slack, OpenAI, and more) to confirm whether a leaked credential is actually active.

Feature Comparison

Feature KeyHog TruffleHog Gitleaks Semgrep
Detectors 887+ 800+ 150+ Rules
Recall (blind test) 98% 32% ~30% ~40%
False positives Zero Moderate Low High
Base64 decode
Hex decode
ML scoring ✓ (99.5%) Partial
Live verify
Throughput (MB/s) ~50 ~10–30 ~5–15 ~20
License MIT AGPL MIT LGPL

KeyHog finds 74 credentials that TruffleHog misses. TruffleHog finds 0 that KeyHog misses.

Choosing Between Alternatives

  • Use KeyHog when you need high recall on encoded secrets, embeddable Rust crates, and optional live verification.
  • Use TruffleHog when you prioritize its existing verification workflows over a lightweight Rust-native integration story.
  • Use Gitleaks when plaintext regex scanning is enough and you want a simpler rule engine.
  • Use Semgrep when your main goal is broad static analysis rather than secret-specific recall.

Quick Start

# Install
cargo install keyhog

# Scan a directory
keyhog scan --path .

# Scan with verification
keyhog scan --path . --verify

# Scan a git repo's full history
keyhog scan --git ./repo

# CI mode: only changed files, SARIF output
keyhog scan --git-diff origin/main --format sarif --fail-on-findings

Install

# Install the published CLI
cargo install keyhog

# Or build from source
git clone https://github.com/santhsecurity/keyhog.git
cd keyhog
cargo install --path crates/cli

Standalone Crates

[dependencies]
keyhog-core = "0.2"
keyhog-scanner = "0.2"
keyhog-sources = "0.2"
keyhog-verifier = "0.2"

  • keyhog-core provides detector specs, findings, reporting, and allowlists.
  • keyhog-scanner compiles detectors and scans Chunk values.
  • keyhog-sources provides filesystem, stdin, git, Docker, S3, and binary inputs.
  • keyhog-verifier verifies deduplicated findings asynchronously.
  • keyhog is the end-user binary package.

Library Quick Start

use keyhog_core::{Chunk, ChunkMetadata, DetectorSpec, PatternSpec, Severity};
use keyhog_scanner::CompiledScanner;

fn main() -> Result<(), keyhog_scanner::ScanError> {
    // Compile one detector that matches `demo_` followed by 8 uppercase letters or digits.
    let scanner = CompiledScanner::compile(vec![DetectorSpec {
        id: "demo-token".into(),
        name: "Demo Token".into(),
        service: "demo".into(),
        severity: Severity::High,
        patterns: vec![PatternSpec {
            regex: "demo_[A-Z0-9]{8}".into(),
            description: None,
            group: None,
        }],
        companion: None,
        verify: None,
        keywords: vec!["demo_".into()],
    }])?;

    // Scan a single in-memory chunk, labeled as if it came from a `.env` file.
    let findings = scanner.scan(&Chunk {
        data: "TOKEN=demo_ABC12345".into(),
        metadata: ChunkMetadata {
            source_type: "filesystem".into(),
            path: Some(".env".into()),
            commit: None,
            author: None,
            date: None,
        },
    });

    assert_eq!(findings.len(), 1);
    Ok(())
}

Docker

docker run --rm -v "$(pwd)":/scan ghcr.io/keyhog/keyhog:latest scan --path /scan

GitHub Actions

- uses: keyhog/keyhog-action@v1
  with:
    path: .
    min-confidence: 0.7
    format: sarif

Pre-commit

repos:
  - repo: https://github.com/santhsecurity/keyhog
    rev: v0.2.0
    hooks:
      - id: keyhog

Usage

# Scan directory
keyhog scan --path ./src

# JSON output
keyhog scan --path . --format json

# Only high-severity findings
keyhog scan --path . --severity high

# Scan changes from the last 5 commits
keyhog scan --git-diff HEAD~5

# Staged files only (for pre-commit)
keyhog scan --git-diff --staged

# Custom confidence threshold
keyhog scan --path . --min-confidence 0.8

# Fail CI on any finding
keyhog scan --path . --fail-on-findings

Output Formats

Format Flag Use for
Text --format text Human reading (default)
JSON --format json Programmatic use
JSONL --format jsonl Streaming / log ingestion
SARIF --format sarif GitHub code scanning

Architecture

KeyHog uses a two-phase architecture built on Aho-Corasick automata:

Input          Phase 1: Prefilter           Phase 2: Confirm          Score & Verify
─────          ──────────────────           ────────────────          ──────────────

              ┌───────────────────┐     ┌──────────────────┐     ┌────────────────┐
 file         │  Decode-Through   │     │  Regex Confirm   │     │  ML Classifier │
 stdin  ────▶ │  Aho-Corasick     │────▶│  Match regions   │────▶│  3,969 params  │
 git          │  O(n) single-pass │     │  per candidate   │     │  99.5% acc     │
              └───────────────────┘     └──────────────────┘     └───────┬────────┘
                                                                         │
                                                                         ▼
                                                                 ┌────────────────┐
                                                                 │  Live Verify   │
                                                                 │  (optional)    │
                                                                 │  async tokio   │
                                                                 └────────────────┘
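
The prefilter-then-confirm flow in the diagram can be sketched with the standard library alone. This is a simplified illustration, not KeyHog's implementation: `str::match_indices` stands in for the Aho-Corasick prefilter, a hand-written check for the `demo_[A-Z0-9]{8}` shape stands in for the regex confirm step, and all function names are hypothetical.

```rust
/// Phase 1 (prefilter): report every byte offset where `keyword` occurs.
/// A real prefilter would search all keywords at once with one automaton;
/// `match_indices` keeps this sketch dependency-free.
fn prefilter(haystack: &str, keyword: &str) -> Vec<usize> {
    haystack.match_indices(keyword).map(|(pos, _)| pos).collect()
}

/// Phase 2 (confirm): inspect only the candidate region and check it
/// against the shape `demo_[A-Z0-9]{8}`, written out by hand.
fn confirm(haystack: &str, start: usize) -> Option<&str> {
    const PREFIX: &str = "demo_";
    const BODY_LEN: usize = 8;
    // `get` returns None if the region runs past the end of the input.
    let candidate = haystack.get(start..start + PREFIX.len() + BODY_LEN)?;
    let body = candidate.strip_prefix(PREFIX)?;
    let ok = body
        .bytes()
        .all(|b| b.is_ascii_uppercase() || b.is_ascii_digit());
    ok.then_some(candidate)
}

/// Run both phases and return the confirmed matches in input order.
fn scan(haystack: &str) -> Vec<&str> {
    prefilter(haystack, "demo_")
        .into_iter()
        .filter_map(|pos| confirm(haystack, pos))
        .collect()
}
```

The point of the split is that the expensive check runs only where a keyword already matched, so most of the input is touched once by the cheap pass.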

Decode-Through Scanning

Before pattern matching, KeyHog recursively decodes:

  • Base64 (standard + URL-safe)
  • Hexadecimal
  • URL encoding
  • Quoted-printable
  • Unicode escapes

# KeyHog catches this. Other scanners don't.
encoded = "Z2hwX3h4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4"  # base64(ghp_...)
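
As a rough sketch of the idea, the following decodes base64 tokens (standard and URL-safe alphabets) and re-checks the decoded text for a known prefix, recursing up to a fixed depth. It is an illustration under simplifying assumptions, not KeyHog's decoder: it handles base64 only, splits tokens on whitespace, and uses hypothetical function names.

```rust
/// Map one base64 character (standard or URL-safe alphabet) to its 6-bit value.
fn b64_value(c: u8) -> Option<u32> {
    match c {
        b'A'..=b'Z' => Some((c - b'A') as u32),
        b'a'..=b'z' => Some((c - b'a') as u32 + 26),
        b'0'..=b'9' => Some((c - b'0') as u32 + 52),
        b'+' | b'-' => Some(62),
        b'/' | b'_' => Some(63),
        _ => None,
    }
}

/// Decode base64 text, ignoring trailing `=` padding.
/// Returns None if any character is outside the alphabet.
fn decode_base64(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let (mut acc, mut bits) = (0u32, 0u32);
    for &c in input.trim_end_matches('=').as_bytes() {
        acc = (acc << 6) | b64_value(c)?;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
    }
    Some(out)
}

/// True if `text` contains `needle`, either directly or after base64-decoding
/// one of its whitespace-separated tokens, up to `depth` nested layers.
fn contains_after_decoding(text: &str, needle: &str, depth: u8) -> bool {
    if text.contains(needle) {
        return true;
    }
    if depth == 0 {
        return false;
    }
    text.split_whitespace()
        .filter(|tok| tok.len() >= 8) // skip tokens too short to hide a secret
        .filter_map(decode_base64)
        .filter_map(|bytes| String::from_utf8(bytes).ok())
        .any(|decoded| contains_after_decoding(&decoded, needle, depth - 1))
}
```

A plaintext-only scan corresponds to `depth = 0`, which is why the encoded value above slips past it while `depth = 1` finds the `ghp_` prefix.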

Structural Context

Same credential, different context, different confidence:

# 82% — production config
production_config = "ghp_xxxxxxxxxxxxxxxxxxxx"

# 25% — test fixture (auto-detected via AST context)
def test_auth():
    token = "ghp_xxxxxxxxxxxxxxxxxxxx"

Adding Detectors

Detectors are TOML — no code changes needed:

# detectors/my-service.toml
[detector]
id = "my-service-api-key"
name = "My Service API Key"
severity = "critical"
keywords = ["ms_live_", "ms_test_"]

[[detector.patterns]]
regex = 'ms_(live|test)_[a-zA-Z0-9]{32}'

[detector.verify]
method = "GET"
url = "https://api.myservice.com/v1/status"
[detector.verify.auth]
type = "bearer"
field = "match"
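
Before committing a detector, it can help to check sample strings against the intended shape. The sketch below is a hand-written, dependency-free equivalent of the example pattern `ms_(live|test)_[a-zA-Z0-9]{32}` and its keywords. The function names are hypothetical and this is not how KeyHog evaluates detectors; it only mirrors what the TOML above is meant to accept.

```rust
/// Hand-written equivalent of `ms_(live|test)_[a-zA-Z0-9]{32}`, anchored at
/// the start of `s`. Returns the matched prefix, if any.
fn match_my_service_key(s: &str) -> Option<&str> {
    let rest = s.strip_prefix("ms_")?;
    let rest = rest
        .strip_prefix("live_")
        .or_else(|| rest.strip_prefix("test_"))?;
    // The body must be exactly 32 ASCII letters or digits.
    let body = rest.get(..32)?;
    if !body.bytes().all(|b| b.is_ascii_alphanumeric()) {
        return None;
    }
    let matched_len = s.len() - rest.len() + 32;
    s.get(..matched_len)
}

/// Keyword check mirroring the detector's `keywords` list.
fn has_keyword(s: &str) -> bool {
    ["ms_live_", "ms_test_"].iter().any(|k| s.contains(*k))
}
```

Every string the pattern accepts should also pass the keyword check; if it does not, the keywords are too narrow for the regex.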

Configuration

.keyhog.toml

detectors = "detectors"       # Path to detector TOML files
severity = "medium"            # Minimum: info | low | medium | high | critical
format = "text"                # Output: text | json | jsonl | sarif
min_confidence = 0.7           # ML confidence threshold (0.0–1.0)
threads = 8                    # Parallel scan threads
dedup = "credential"           # Dedup: credential | file | none
deep = true                    # Enable decode-through + entropy + multiline
timeout = 10                   # Verification timeout (seconds)
show_secrets = false            # Show raw credentials in output (false = redacted)
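
To make the interaction of `severity`, `min_confidence`, and `dedup = "credential"` concrete, here is a minimal sketch of how such settings could be applied to a list of findings. The `Finding` struct and `apply_config` function are hypothetical and exist only for this illustration; KeyHog's own types may differ.

```rust
use std::collections::HashSet;

/// Severity levels in ascending order, so the derived `Ord` follows
/// `info | low | medium | high | critical`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity {
    Info,
    Low,
    Medium,
    High,
    Critical,
}

/// Hypothetical finding record used only in this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Finding {
    credential: String,
    path: String,
    severity: Severity,
    confidence: f64, // 0.0 to 1.0, same scale as `min_confidence`
}

/// Keep findings at or above both thresholds, then drop repeats of the same
/// credential, keeping the first occurrence of each.
fn apply_config(
    findings: Vec<Finding>,
    min_severity: Severity,
    min_confidence: f64,
) -> Vec<Finding> {
    let mut seen = HashSet::new();
    findings
        .into_iter()
        .filter(|f| f.severity >= min_severity && f.confidence >= min_confidence)
        .filter(|f| seen.insert(f.credential.clone()))
        .collect()
}
```

With this ordering, a credential that appears in two files is reported once, and a low-confidence match never reaches the dedup step.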

.keyhogignore

# Paths
path:tests/**
path:**/*.md

# Detectors
detector:entropy
detector:generic-api-key

# Specific findings by hash
hash:abc123def456
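
The file format above is simple enough to parse in a few lines. The following is a sketch that assumes only the three prefixes shown (`path:`, `detector:`, `hash:`) plus `#` comments; the `IgnoreRule` type and the choice to reject unknown prefixes are assumptions of this example, not documented KeyHog behavior.

```rust
/// One rule from a `.keyhogignore` file.
#[derive(Debug, PartialEq)]
enum IgnoreRule {
    Path(String),     // glob such as `tests/**`
    Detector(String), // detector id such as `entropy`
    Hash(String),     // hash of a specific finding
}

/// Parse the file contents, skipping blank lines and `#` comments.
/// A line with an unknown prefix is reported as an error with its line number.
fn parse_ignore_file(contents: &str) -> Result<Vec<IgnoreRule>, String> {
    let mut rules = Vec::new();
    for (idx, raw) in contents.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let rule = match line.split_once(':') {
            Some(("path", v)) => IgnoreRule::Path(v.to_string()),
            Some(("detector", v)) => IgnoreRule::Detector(v.to_string()),
            Some(("hash", v)) => IgnoreRule::Hash(v.to_string()),
            _ => return Err(format!("line {}: unrecognized rule `{}`", idx + 1, line)),
        };
        rules.push(rule);
    }
    Ok(rules)
}
```

Failing loudly on an unknown prefix is a deliberate choice here: a silently skipped typo in an ignore file would leave a suppression inactive without anyone noticing.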

Inline suppression

# keyhog:ignore
GITHUB_TOKEN = "ghp_xxxxxxxxxxxxxxxxxxxx"

# keyhog:ignore detector=github-token
api_key = "ghp_yyyyyyyyyyyyyyyyyyyy"

# keyhog:ignore reason="public CI token"
TOKEN = "ghp_zzzzzzzzzzzzzzzzzzzz"
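
The three directive forms above can be recognized with a small parser. This sketch assumes the syntax is exactly as shown, with at most one argument per directive; the `Suppression` type and `parse_suppression` function are hypothetical, and which source line a directive applies to is not modeled.

```rust
/// Parsed form of a `keyhog:ignore` comment.
#[derive(Debug, Default, PartialEq)]
struct Suppression {
    detector: Option<String>,
    reason: Option<String>,
}

/// Find `keyhog:ignore` in a line and parse the optional
/// `detector=<id>` or `reason="<text>"` argument that follows it.
/// Returns None when the line carries no directive.
fn parse_suppression(line: &str) -> Option<Suppression> {
    const MARKER: &str = "keyhog:ignore";
    let start = line.find(MARKER)? + MARKER.len();
    let args = line[start..].trim();
    let mut out = Suppression::default();
    if let Some(rest) = args.strip_prefix("detector=") {
        // The detector id runs up to the next whitespace.
        out.detector = rest.split_whitespace().next().map(str::to_string);
    } else if let Some(rest) = args.strip_prefix("reason=\"") {
        // The reason runs up to the closing double quote.
        out.reason = rest.split('"').next().map(str::to_string);
    }
    Some(out)
}
```

A bare directive yields a `Suppression` with both fields unset, which a caller could treat as "ignore any detector on this line".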

Modular Builds

# Full build (default)
cargo build --release

# Fast mode: regex-only, no ML/decode/multiline — for pre-commit hooks
cargo build --release --no-default-features --features fast

# With live verification
cargo build --release --features verify

Performance

All benchmarks: AMD Ryzen 9 5900X, 32 GB RAM, NVMe SSD.

Throughput

Detectors 1 MB 10 MB 100 MB
100 55 MB/s 58 MB/s 62 MB/s
500 48 MB/s 52 MB/s 56 MB/s
887 42 MB/s 46 MB/s 50 MB/s

Real-World Repos

Repository Size KeyHog TruffleHog Gitleaks
facebook/react 350 MB 8s 25s 45s
denoland/deno 900 MB 18s 55s 95s
rust-lang/rust 2.1 GB 42s 120s 200s

Verification Latency

Service Status Latency
AWS ~200ms
GitHub ~150ms
Slack ~180ms
Stripe ~220ms
OpenAI ~250ms

License

MIT — see LICENSE.


KeyHog by Santh
Built with Rust · Zero dependencies in core · keyhog.santh.io
