huan

A command-line tool that converts web pages to Markdown files, preserving the site's URL structure as a local folder hierarchy.

The name "huan" (换) means "convert" in Chinese.

Features

  • Web Page Conversion - Traverses a website and converts every page to clean Markdown
  • Readability Extraction - Uses Mozilla's Readability algorithm for high-quality content extraction (via readability-lxml)
  • Rich Metadata - Extracts title, author, date, Open Graph, Schema.org etc. as YAML front matter
  • Multiple HTTP Backends - Choose from requests, curl_cffi, DrissionPage (system browser), or Playwright
  • Infinite Scroll Support - Automatic scrolling for lazy-loaded content
  • Math Formula Conversion - MathML, MathJax, and KaTeX converted to LaTeX notation
  • Image Downloading - Downloads images locally with relative path rewriting in Markdown
  • Code Block Language Detection - Preserves language hints from HTML for proper fenced code blocks
  • Table Preprocessing - Handles complex tables with colspan/rowspan for cleaner Markdown output
  • Token Estimation - Reports word count and estimated token count for LLM usage planning
  • Incremental Mode - Skip existing files for efficient re-runs
  • Proxy Support - Manual proxy or system environment variables
  • Save Raw HTML - Optionally save original HTML alongside Markdown

Installation

Install from PyPI:

pip install huan

To install the latest development version from source:

git clone https://github.com/cycleuser/Huan.git
cd Huan
pip install -e .

Optional Dependencies

For better content extraction quality (recommended):

pip install -e ".[readability]"

For better site compatibility (adds the curl_cffi backend):

pip install -e ".[curl]"

For JavaScript-heavy sites:

pip install -e ".[browser]"   # Uses system Chrome/Edge via DrissionPage

Or install all optional dependencies:

pip install -e ".[all]"

Usage

Basic Usage

# Convert an entire site
huan https://geopytool.com

# Limit to first 100 pages
huan https://geopytool.com -m 100

# Specify output directory
huan https://geopytool.com -o ./my-archive

Content Extraction

# Default: readability extraction (best quality, requires readability-lxml)
huan https://geopytool.com

# Heuristic extraction (tag-based, no extra dependency needed)
huan https://geopytool.com --extractor heuristic

# Full page content (no extraction filtering)
huan https://geopytool.com --extractor full

Metadata

Each Markdown file includes YAML front matter with extracted metadata:

---
title: "Article Title"
author: "Author Name"
published: 2024-01-15
url: "https://geopytool.com/article"
language: en
word_count: 2847
estimated_tokens: 3701
---

To disable metadata extraction:

huan https://geopytool.com --no-metadata
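The `word_count` and `estimated_tokens` fields in the sample front matter differ by a factor of roughly 1.3. The following is a minimal sketch of that kind of estimate, assuming whitespace-separated words and a fixed 1.3 tokens-per-word ratio. Both the ratio and the function name `estimate_tokens` are assumptions inferred from the sample above; huan's actual estimator may work differently.

```python
# Assumption: ratio inferred from the sample front matter (3701 / 2847),
# not a documented huan constant.
TOKENS_PER_WORD = 1.3


def estimate_tokens(text: str, tokens_per_word: float = TOKENS_PER_WORD) -> tuple[int, int]:
    """Return (word_count, estimated_tokens) for a block of text.

    Words are counted by splitting on whitespace, so languages written
    without spaces (for example Chinese) would be undercounted.
    """
    word_count = len(text.split())
    return word_count, round(word_count * tokens_per_word)


if __name__ == "__main__":
    words, tokens = estimate_tokens("A command-line tool that converts web pages to Markdown files")
    print(f"word_count: {words}")
    print(f"estimated_tokens: {tokens}")
```

A word-based ratio like this is only a planning figure; an exact count requires the tokenizer of the specific model you intend to use.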

With Proxy

# Manual proxy
huan https://geopytool.com --proxy http://127.0.0.1:7890

# System proxy (from HTTP_PROXY/HTTPS_PROXY env vars)
huan https://geopytool.com --system-proxy
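For readers driving their own requests alongside huan, the two proxy modes can be approximated in Python. This is a sketch under stated assumptions, not huan's implementation: the helper name `resolve_proxies` is hypothetical, and giving the manual proxy precedence over the environment is an assumption. It relies only on the standard library's `urllib.request.getproxies()`, which reads variables such as `HTTP_PROXY` and `HTTPS_PROXY`.

```python
import urllib.request
from typing import Optional


def resolve_proxies(manual_proxy: Optional[str] = None, use_system: bool = False) -> dict:
    """Build a requests-style proxies mapping ({"http": ..., "https": ...}).

    Hypothetical helper: a manual proxy wins over the environment, and an
    empty dict means "connect directly".
    """
    if manual_proxy:
        # Same proxy for both schemes, mirroring a single --proxy value.
        return {"http": manual_proxy, "https": manual_proxy}
    if use_system:
        # getproxies() reads <scheme>_proxy environment variables
        # (and OS settings on some platforms when none are set).
        detected = urllib.request.getproxies()
        return {scheme: url for scheme, url in detected.items() if scheme in ("http", "https")}
    return {}


if __name__ == "__main__":
    print(resolve_proxies("http://127.0.0.1:7890"))
    print(resolve_proxies(use_system=True))
```

The returned mapping has the shape accepted by the `proxies=` argument of `requests.get`.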

Different Fetcher Backends

Some sites use JavaScript to render content, which the default requests backend cannot handle. If the tool returns 0 links or incomplete content, try switching to a different backend:

# Default: standard requests (fast, works for static sites)
huan https://geopytool.com

# curl_cffi backend (better compatibility with more sites)
huan https://geopytool.com --fetcher curl

# System browser (recommended for JS-rendered sites)
huan https://geopytool.com --fetcher browser

# Playwright (requires: playwright install chromium)
huan https://geopytool.com --fetcher playwright

Tip: If the default requests backend finds 0 links on a page, the tool will print a warning suggesting you try --fetcher curl or --fetcher browser.
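The zero-links check described in the tip can be illustrated with the standard library alone. This is a rough sketch, not huan's code: the names `LinkCounter` and `suggest_fetcher` and the warning text are hypothetical. It shows why a JavaScript-rendered page looks empty to a plain HTTP client: the fetched HTML contains a script tag but no `<a href>` elements yet.

```python
from html.parser import HTMLParser
from typing import Optional


class LinkCounter(HTMLParser):
    """Collect href values from <a> tags in static HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def suggest_fetcher(html: str) -> Optional[str]:
    """Return a hint when the static HTML has no links, else None."""
    counter = LinkCounter()
    counter.feed(html)
    if counter.links:
        return None
    # Hypothetical message; the real tool prints its own wording.
    return "0 links found; try --fetcher curl or --fetcher browser"


if __name__ == "__main__":
    js_shell = '<div id="root"></div><script src="/app.js"></script>'
    print(suggest_fetcher(js_shell))
```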

For Sites with Infinite Scroll

# Newsletter sites: use /archive endpoint + browser fetcher
huan https://geopytool.com/archive --fetcher browser --scroll 50

# Blog platforms with infinite scroll
huan https://geopytool.com/ --fetcher browser --scroll 30

Additional Options

# Save raw HTML alongside Markdown
huan https://geopytool.com --save-html

# Disable image downloading
huan https://geopytool.com --no-download-images

# Overwrite existing files (disable incremental mode)
huan https://geopytool.com --overwrite

# Only convert pages under /docs
huan https://geopytool.com --prefix /docs

# Verbose output for debugging
huan https://geopytool.com -v
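The effect of `--prefix` can be pictured as a filter applied to every discovered link. The sketch below is an assumption about the general shape of such a filter, not huan's actual logic; the function name `in_scope` and the same-host rule are hypothetical.

```python
from urllib.parse import urlparse


def in_scope(url: str, start_url: str, prefix: str = "") -> bool:
    """True if url is on the start URL's host and its path begins with prefix.

    Assumption: a plain string-prefix match, so "/docs" would also match
    "/docs-old"; a real crawler may be stricter about path boundaries.
    """
    target, start = urlparse(url), urlparse(start_url)
    if target.netloc != start.netloc:
        return False
    return target.path.startswith(prefix)


if __name__ == "__main__":
    start = "https://geopytool.com"
    for link in ("https://geopytool.com/docs/intro", "https://geopytool.com/blog/post"):
        print(link, in_scope(link, start, prefix="/docs"))
```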

Command-Line Options

| Option | Description |
| --- | --- |
| url | Starting URL (required) |
| -o, --output | Output directory (default: domain name) |
| -d, --delay | Seconds between requests (default: 0.5) |
| -m, --max-pages | Limit number of pages (default: no limit) |
| --prefix | Only convert URLs with this path prefix |
| --extractor | Content extraction: readability, heuristic, full |
| --full | Alias for --extractor full |
| --no-metadata | Disable YAML front matter metadata |
| --no-verify-ssl | Disable SSL certificate verification |
| --proxy | HTTP/HTTPS proxy URL |
| --system-proxy | Use system proxy from environment |
| --fetcher | Backend: requests, curl, browser, playwright |
| --scroll | Scroll count for lazy-loaded content (default: 20) |
| --overwrite | Overwrite existing files |
| -v, --verbose | Verbose output |
| --no-download-images | Skip image downloading |
| --save-html | Save raw HTML alongside Markdown |
| --version | Show version |

Output Structure

example.com/
├── index.md
├── about.md
├── blog/
│   ├── index.md
│   ├── post-1.md
│   └── post-2.md
├── images/
│   ├── logo.png
│   └── hero.jpg
└── _external/
    └── cdn.example.com/
        └── assets/
            └── image.webp

  • Markdown files mirror the site's URL structure
  • Same-domain images are saved preserving their path
  • External CDN images go under _external/{domain}/
  • All image references in Markdown use relative paths
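The layout rules above can be sketched as three small path functions. This is an illustration reconstructed from the example tree, not huan's source: the function names are hypothetical, and details such as stripping `.html` or mapping a trailing slash to `index.md` are assumptions. All paths are relative to the output directory.

```python
import posixpath
from urllib.parse import urlparse


def page_path(url: str) -> str:
    """Map a page URL to a Markdown path, e.g. /blog/ -> blog/index.md."""
    path = urlparse(url).path
    if path == "" or path.endswith("/"):
        path += "index"  # directory-style URLs become index.md (assumption)
    for ext in (".html", ".htm"):
        if path.endswith(ext):
            path = path[: -len(ext)]
            break
    return path.lstrip("/") + ".md"


def image_path(image_url: str, site_host: str) -> str:
    """Same-domain images keep their path; others go under _external/{domain}/."""
    parsed = urlparse(image_url)
    relative = parsed.path.lstrip("/")
    if parsed.netloc == site_host:
        return relative
    return posixpath.join("_external", parsed.netloc, relative)


def markdown_image_ref(page: str, image: str) -> str:
    """Relative reference from a Markdown file to a downloaded image."""
    return posixpath.relpath(image, start=posixpath.dirname(page) or ".")


if __name__ == "__main__":
    page = page_path("https://example.com/blog/post-1")
    image = image_path("https://cdn.example.com/assets/image.webp", "example.com")
    print(page, image, markdown_image_ref(page, image))
```

Whether a URL such as `/blog` (no trailing slash) should become `blog.md` or `blog/index.md` is ambiguous from the tree alone; the sketch picks `blog.md`.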

Python API

from huan import SiteCrawler

converter = SiteCrawler(
    start_url="https://geopytool.com",
    output_dir="./archive",
    max_pages=50,
    fetcher_type="browser",
    download_images=True,
    extractor="readability",
)
converter.crawl()
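After a crawl finishes, the front matter written to each file can be read back to total the token estimate for the whole archive. The sketch below uses only the standard library and assumes the flat `key: value` front matter shown in the Metadata section; the helper names `read_front_matter` and `total_estimated_tokens` are hypothetical and not part of the huan API.

```python
from pathlib import Path


def read_front_matter(markdown: str) -> dict:
    """Parse flat `key: value` pairs from a leading --- block (no nested YAML)."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        # Split on the first colon only, so URL values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return {}  # no closing delimiter: treat the file as having no front matter


def total_estimated_tokens(archive_dir: str) -> int:
    """Sum the estimated_tokens field over every .md file under archive_dir."""
    total = 0
    for md_file in Path(archive_dir).rglob("*.md"):
        fields = read_front_matter(md_file.read_text(encoding="utf-8"))
        value = fields.get("estimated_tokens", "")
        if value.isdigit():
            total += int(value)
    return total


if __name__ == "__main__":
    print(total_estimated_tokens("./archive"))
```

For front matter with nested or multi-line values, a full YAML parser would be the safer choice.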

Requirements

  • Python 3.10+
  • requests
  • beautifulsoup4
  • html2text

Optional:

  • readability-lxml (better content extraction)
  • curl-cffi (better compatibility)
  • DrissionPage (for system browser)
  • playwright (for headless Chromium)

Screenshots

Running Screenshot

License

MIT License - see LICENSE for details.
