valyuAI/rust-contentextractor

readability-rs

A Rust library for extracting the main readable content from HTML pages and converting it to plain text or markdown.

What it does

Given a raw HTML string, readability-rs strips away noise (navigation, ads, scripts, footers, hidden elements) and returns just the article content. It also handles Next.js documentation sites by extracting markdown directly from embedded React Server Component (RSC) payloads.

Installation

Add to your Cargo.toml:

[dependencies]
readability-rs = { git = "https://github.com/gayacodes/readability-rs" }

Quick start

use readability_rs::{Readability, OutputFormat};

let html = r#"
<html>
<body>
  <nav><a href="/">Home</a></nav>
  <article>
    <h1>My Article</h1>
    <p>This is the <strong>main content</strong> of the page.</p>
  </article>
  <footer>Copyright 2025</footer>
</body>
</html>
"#;

let reader = Readability::new(html);

// Extract as markdown
let markdown = reader.extract(OutputFormat::Markdown).unwrap();
// => "# My Article\n\nThis is the **main content** of the page."

// Extract as plain text
let text = reader.extract(OutputFormat::PlainText).unwrap();
// => "My Article\n\nThis is the main content of the page."

How it works

The extraction pipeline has two strategies, tried in order:

Strategy 0: RSC extraction (Next.js sites)

Runs first on the raw HTML string. Next.js App Router sites embed page content as React Server Component payloads inside script tags of the form self.__next_f.push([1,"..."]). These payloads include the content behind interactive components (accordions, tabs, collapsible sections), which is never rendered to the visible DOM.

The RSC pipeline: find chunks -> unescape -> select best chunk -> strip frontmatter -> convert MDX components to markdown -> remove SVGs -> resolve URLs -> normalize.
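
The first two steps of that pipeline (find chunks, unescape) can be sketched with the standard library alone. This is a minimal illustration, not the code in rsc.rs: the function names find_rsc_chunks and unescape_js are hypothetical, only the common escapes are handled, and \uXXXX surrogate pairs are left unresolved.

```rust
/// Collect the raw (still escaped) string payloads from every
/// `self.__next_f.push([1,"..."])` call found in the HTML.
/// Hypothetical helper, not part of the crate's public API.
fn find_rsc_chunks(html: &str) -> Vec<&str> {
    const MARKER: &str = "self.__next_f.push([1,\"";
    let mut chunks = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find(MARKER) {
        let body = &rest[start + MARKER.len()..];
        // Scan for the closing quote, skipping any escaped character.
        let mut end = None;
        let mut escaped = false;
        for (i, c) in body.char_indices() {
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                end = Some(i);
                break;
            }
        }
        match end {
            Some(i) => {
                chunks.push(&body[..i]);
                rest = &body[i + 1..];
            }
            None => break, // unterminated payload: stop scanning
        }
    }
    chunks
}

/// Undo JavaScript string escaping for the common cases
/// (\n, \t, \r, \", \\, \/ and \uXXXX).
fn unescape_js(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('r') => out.push('\r'),
            Some('u') => {
                // Take the next four characters as hex digits.
                let hex: String = chars.by_ref().take(4).collect();
                match u32::from_str_radix(&hex, 16).ok().and_then(char::from_u32) {
                    Some(ch) => out.push(ch),
                    None => {
                        // Not a valid scalar value: keep the escape as written.
                        out.push_str("\\u");
                        out.push_str(&hex);
                    }
                }
            }
            Some(other) => out.push(other), // covers \" \\ and \/
            None => out.push('\\'),         // trailing backslash
        }
    }
    out
}
```

Calling unescape_js on each element returned by find_rsc_chunks yields candidate text for the later "select best chunk" step, which this sketch does not cover.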

Strategy 1: DOM-based extraction

If no RSC payload is found, the DOM pipeline runs:

raw HTML
  |
  v
Parse (scraper crate)
  |
  v
Clean - remove noise elements:
  - Tags: script, style, nav, footer, aside, form, iframe, svg, etc.
  - Hidden: display:none, visibility:hidden, aria-hidden, hidden attr
  - Ads: class/id matching (ad-banner, sponsored, cookie-consent, etc.)
  - Site headers: <header> with <nav> (article headers are kept)
  |
  v
Locate - find content root via selector cascade:
  main > article > [role="main"] > #content > body (fallback)
  |
  v
Render - convert to plain text or markdown:
  - Headings, bold, italic, links, images
  - Strikethrough, subscript, superscript, highlight
  - Ordered/unordered lists, tables
  - Blockquotes, code blocks (with language detection), inline code
  - Details/summary, figure/figcaption
  - Horizontal rules, paragraph separation
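
The Clean and Locate stages above can be approximated without a DOM library. The sketch below is illustrative only: Element is a hypothetical stand-in for a parsed node (the crate itself works on scraper's DOM), the tag and ad-hint lists contain only the entries named in the diagram, and the site-header rule (<header> containing <nav>) is left out.

```rust
/// Hypothetical, simplified view of an element's relevant attributes.
struct Element<'a> {
    tag: &'a str,
    id: &'a str,
    class: &'a str,
    style: &'a str,
    hidden: bool,      // `hidden` attribute present
    aria_hidden: bool, // aria-hidden="true"
}

// Only the entries listed in the diagram; the real lists are longer ("etc.").
const NOISE_TAGS: &[&str] = &[
    "script", "style", "nav", "footer", "aside", "form", "iframe", "svg",
];
const AD_HINTS: &[&str] = &["ad-banner", "sponsored", "cookie-consent"];

/// Clean stage: decide whether an element should be dropped.
fn is_noise(el: &Element) -> bool {
    // Normalize the inline style so "display: none" and "DISPLAY:NONE" both match.
    let style = el
        .style
        .chars()
        .filter(|c| !c.is_whitespace())
        .collect::<String>()
        .to_lowercase();
    NOISE_TAGS.contains(&el.tag)
        || el.hidden
        || el.aria_hidden
        || style.contains("display:none")
        || style.contains("visibility:hidden")
        || AD_HINTS
            .iter()
            .any(|hint| el.class.contains(*hint) || el.id.contains(*hint))
}

// Locate stage: selectors in priority order, with body as the fallback.
const CASCADE: &[&str] = &["main", "article", "[role=\"main\"]", "#content"];

/// Return the first selector in the cascade for which `matches` is true,
/// or "body" when none of them match.
fn pick_root(matches: impl Fn(&str) -> bool) -> &'static str {
    CASCADE
        .iter()
        .copied()
        .find(|sel| matches(*sel))
        .unwrap_or("body")
}
```

pick_root takes a closure so the sketch stays independent of any particular parser; in practice that closure would ask the parsed document whether a selector has at least one match.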

API

Readability

let reader = Readability::new(html_str);
let content: Option<String> = reader.extract(OutputFormat::Markdown);

OutputFormat

Variant    Description
PlainText  Strip all markup, preserve paragraph breaks
Markdown   Convert HTML to markdown syntax

Individual functions

Each pipeline stage is also exported for standalone use:

use readability_rs::{clean, locate_content, render, OutputFormat};
use scraper::Html;

let doc = Html::parse_document(html_str);
let cleaned = clean(&doc);
let root = locate_content(&cleaned).unwrap();
let markdown = render(&root, OutputFormat::Markdown);

RSC extraction

use readability_rs::rsc::extract_rsc_content;

// Returns Some(markdown) if RSC payload found, None otherwise
let result = extract_rsc_content(html_str, Some("https://docs.example.com"));
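
The second argument is a base URL, which the pipeline's "resolve URLs" step presumably uses to turn relative links into absolute ones. A rough sketch of that idea follows. resolve_url is a hypothetical helper, not the crate's function, and it deliberately simplifies RFC 3986: relative paths are appended to the base as if it were a directory, and "../" segments are not collapsed.

```rust
/// Resolve `href` against `base` using simplified rules (hypothetical helper).
fn resolve_url(base: &str, href: &str) -> String {
    // Absolute URLs, fragments and mailto links are left untouched.
    if href.contains("://") || href.starts_with('#') || href.starts_with("mailto:") {
        return href.to_string();
    }
    // The origin is everything up to the first '/' after "scheme://".
    let origin_end = base
        .find("://")
        .map(|i| i + 3)
        .and_then(|start| base[start..].find('/').map(|j| start + j))
        .unwrap_or(base.len());
    if href.starts_with('/') {
        // Root-relative: join with the origin only.
        format!("{}{}", &base[..origin_end], href)
    } else {
        // Path-relative: treat the base as a directory.
        format!("{}/{}", base.trim_end_matches('/'), href)
    }
}
```

For production use, a dedicated URL parser is a safer choice than string slicing like this.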

Markdown conversion reference

HTML                             Markdown
<h1>-<h6>                        #-######
<strong>, <b>                    **bold**
<em>, <i>                        *italic*
<s>, <del>, <strike>             ~~strikethrough~~
<sub>                            ~subscript~
<sup>                            ^superscript^
<mark>                           ==highlight==
<a href="url">                   [text](url)
<img src="" alt="">              ![alt](src)
<ul> + <li>                      - item
<ol> + <li>                      1. item
<blockquote>                     > text
<code>                           `code`
<pre>                            fenced code block
<pre><code class="language-X">   ```X fenced block
<table>                          pipe table with | --- | separator
<details> + <summary>            bold title + body content
<figure> + <figcaption>          image + italic caption
<hr>                             ---

Feature examples

Each example below shows the HTML input and the markdown output the renderer produces.

Tables

<table>
  <thead>
    <tr><th>Name</th><th>Age</th></tr>
  </thead>
  <tbody>
    <tr><td>Alice</td><td>30</td></tr>
    <tr><td>Bob</td><td>25</td></tr>
  </tbody>
</table>

Output:

| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
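
The table output has a simple shape: one header row, one separator row with a --- cell per column, then one row per body row. A minimal sketch of producing it is shown here. render_table and row are hypothetical names, the cells are assumed to be already-rendered text, and escaping of literal | characters inside cells is not handled.

```rust
/// Format one pipe-table row from already-rendered cell text.
fn row(cells: &[&str]) -> String {
    format!("| {} |", cells.join(" | "))
}

/// Build a pipe table: header, `---` separator, then the body rows.
fn render_table(header: &[&str], rows: &[Vec<&str>]) -> String {
    let mut lines = Vec::with_capacity(rows.len() + 2);
    lines.push(row(header));
    // One `---` cell per header column.
    lines.push(row(&vec!["---"; header.len()]));
    for r in rows {
        lines.push(row(r));
    }
    lines.join("\n")
}
```

Passing the header ["Name", "Age"] and the two body rows from the example reproduces the four output lines shown above.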

Strikethrough, subscript, superscript, highlight

<p><del>old price</del> <mark>new price</mark></p>
<p>H<sub>2</sub>O and x<sup>2</sup></p>

Output:

~~old price~~ ==new price==

H~2~O and x^2^
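
Every inline format in the conversion reference wraps the text in the same marker on both sides, so the mapping can be expressed as a single lookup. The sketch below assumes that model. inline_marker and wrap_inline are hypothetical names, and the actual renderer is recursive and handles nesting, which this does not.

```rust
/// Map an inline HTML tag name to its markdown marker, if it has one.
fn inline_marker(tag: &str) -> Option<&'static str> {
    match tag {
        "strong" | "b" => Some("**"),
        "em" | "i" => Some("*"),
        "s" | "del" | "strike" => Some("~~"),
        "sub" => Some("~"),
        "sup" => Some("^"),
        "mark" => Some("=="),
        "code" => Some("`"),
        _ => None,
    }
}

/// Wrap `text` in the marker for `tag`; unknown tags pass the text through.
fn wrap_inline(tag: &str, text: &str) -> String {
    match inline_marker(tag) {
        Some(m) => format!("{}{}{}", m, text, m),
        None => text.to_string(),
    }
}
```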

Details / Summary

<details>
  <summary>Click to expand</summary>
  <p>This content was hidden behind a toggle.</p>
</details>

Output:

**Click to expand**

This content was hidden behind a toggle.

Figure with caption

<figure>
  <img src="photo.jpg" alt="Sunset over the ocean">
  <figcaption>A beautiful sunset at the beach</figcaption>
</figure>

Output:

![Sunset over the ocean](photo.jpg)
*A beautiful sunset at the beach*

Code blocks with language detection

<pre><code class="language-rust">fn main() {
    println!("Hello, world!");
}</code></pre>

Output:

```rust
fn main() {
    println!("Hello, world!");
}
```

Without a language-* class, the fence opens with plain ```.
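
Language detection of this kind amounts to looking for a language-* token in the class attribute. A small sketch under that assumption follows; fence_language and fenced_block are hypothetical names, and the crate may recognise other class conventions that are not covered here.

```rust
/// Extract the language from a class attribute such as "hljs language-rust".
/// Returns None when no non-empty `language-*` token is present.
fn fence_language(class_attr: &str) -> Option<&str> {
    class_attr
        .split_whitespace()
        .find_map(|token| token.strip_prefix("language-"))
        .filter(|lang| !lang.is_empty())
}

/// Wrap code in a fence, adding the language tag when one was detected.
fn fenced_block(class_attr: &str, code: &str) -> String {
    // Build the three-backtick fence programmatically to keep this listing readable.
    let fence = "`".repeat(3);
    let lang = fence_language(class_attr).unwrap_or("");
    format!("{}{}\n{}\n{}", fence, lang, code, fence)
}
```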

Horizontal rule

<p>Section one</p>
<hr>
<p>Section two</p>

Output:

Section one

---

Section two

Project structure

src/
  lib.rs       - Public API: Readability struct, extract() pipeline
  cleaner.rs   - Strip noise elements (pre-compiled selectors via OnceLock)
  locator.rs   - Find content root via CSS selector cascade
  renderer.rs  - Recursive HTML-to-markdown/plaintext conversion
  rsc.rs       - RSC/MDX extraction for Next.js documentation sites
tests/
  integration.rs - End-to-end pipeline tests
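
The OnceLock pattern mentioned for cleaner.rs initializes a value on first use and hands out a shared reference afterwards, so expensive setup runs only once. The sketch below shows the pattern with a plain set of tag names rather than compiled selectors, since it only uses the standard library. noise_tags is a hypothetical function and the tag list is just the one from the pipeline description.

```rust
use std::collections::HashSet;
use std::sync::OnceLock;

/// Return the set of noise tag names, building it on the first call only.
fn noise_tags() -> &'static HashSet<&'static str> {
    // The static lives for the whole program; get_or_init runs the closure
    // at most once, even when called from several threads.
    static TAGS: OnceLock<HashSet<&'static str>> = OnceLock::new();
    TAGS.get_or_init(|| {
        ["script", "style", "nav", "footer", "aside", "form", "iframe", "svg"]
            .into_iter()
            .collect()
    })
}
```

Every later call returns the same reference without rebuilding the set.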

Tests

cargo test

The suite has 130+ tests covering all modules: cleaner (noise removal, hidden elements, ads, site headers), locator (selector cascade, fallbacks), renderer (headings, lists, tables, inline formatting, code language detection, details/summary, figure/figcaption), RSC (chunk finding, unescaping, MDX stripping, URL resolution), and integration (full pipeline).

License

MIT
