A Rust library for extracting the main readable content from HTML pages and converting it to plain text or markdown.
Given a raw HTML string, readability-rs strips away noise (navigation, ads, scripts, footers, hidden elements) and returns just the article content. It also handles Next.js documentation sites by extracting markdown directly from embedded RSC payloads.
Add to your Cargo.toml:

```toml
[dependencies]
readability-rs = { git = "https://github.com/gayacodes/readability-rs" }
```

```rust
use readability_rs::{Readability, OutputFormat};

let html = r#"
<html>
  <body>
    <nav><a href="/">Home</a></nav>
    <article>
      <h1>My Article</h1>
      <p>This is the <strong>main content</strong> of the page.</p>
    </article>
    <footer>Copyright 2025</footer>
  </body>
</html>
"#;

let reader = Readability::new(html);

// Extract as markdown
let markdown = reader.extract(OutputFormat::Markdown).unwrap();
// => "# My Article\n\nThis is the **main content** of the page."

// Extract as plain text
let text = reader.extract(OutputFormat::PlainText).unwrap();
// => "My Article\n\nThis is the main content of the page."
```

The extraction pipeline has two strategies:
RSC extraction runs first on the raw HTML string. Next.js App Router sites embed page content inside `self.__next_f.push([1,"..."])` script tags as React Server Component (RSC) payloads. This content includes everything behind interactive components (accordions, tabs, collapsible sections) that is never rendered to the visible DOM.

The RSC pipeline: find chunks -> unescape -> select best chunk -> strip frontmatter -> convert MDX components to markdown -> remove SVGs -> resolve URLs -> normalize.
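
To make the first two stages (find chunks, unescape) concrete, here is a minimal standard-library sketch. The names `find_rsc_chunks` and `unescape_js` are hypothetical and are not the crate's internals; the sketch assumes each payload is a double-quoted JavaScript string literal and only decodes the common single-character escapes.

```rust
/// Hypothetical helper: collect the decoded string payload of every
/// `self.__next_f.push([1,"..."])` call found in the raw HTML.
fn find_rsc_chunks(html: &str) -> Vec<String> {
    const MARKER: &str = "self.__next_f.push([1,\"";
    let mut chunks = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find(MARKER) {
        let body = &rest[start + MARKER.len()..];
        // Scan to the closing quote, skipping over escaped characters.
        let mut end = None;
        let mut escaped = false;
        for (i, c) in body.char_indices() {
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                end = Some(i);
                break;
            }
        }
        match end {
            Some(i) => {
                chunks.push(unescape_js(&body[..i]));
                rest = &body[i + 1..];
            }
            None => break, // unterminated literal: stop scanning
        }
    }
    chunks
}

/// Hypothetical helper: decode \n, \t, \r and pass any other escaped
/// character through unchanged (covers \" \\ \/).
/// Assumption: \uXXXX escapes are NOT handled in this sketch.
fn unescape_js(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('r') => out.push('\r'),
            Some(other) => out.push(other),
            None => out.push('\\'), // trailing backslash kept as-is
        }
    }
    out
}
```

Calling `find_rsc_chunks` on a page without any `self.__next_f.push` script returns an empty vector, which corresponds to the case where the DOM pipeline takes over.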
If no RSC payload is found, the DOM pipeline runs:
```
raw HTML
    |
    v
Parse (scraper crate)
    |
    v
Clean - remove noise elements:
    - Tags: script, style, nav, footer, aside, form, iframe, svg, etc.
    - Hidden: display:none, visibility:hidden, aria-hidden, hidden attr
    - Ads: class/id matching (ad-banner, sponsored, cookie-consent, etc.)
    - Site headers: <header> with <nav> (article headers are kept)
    |
    v
Locate - find content root via selector cascade:
    main > article > [role="main"] > #content > body (fallback)
    |
    v
Render - convert to plain text or markdown:
    - Headings, bold, italic, links, images
    - Strikethrough, subscript, superscript, highlight
    - Ordered/unordered lists, tables
    - Blockquotes, code blocks (with language detection), inline code
    - Details/summary, figure/figcaption
    - Horizontal rules, paragraph separation
```
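
The Clean stage's hidden-element and ad checks can be pictured as two small predicates. This is a sketch under stated assumptions, not the crate's code: `is_hidden_style` and `looks_like_noise` are hypothetical names, and the pattern list contains only the three examples named in the diagram (the real list is longer, per the "etc.").

```rust
/// Hypothetical predicate: does an inline `style` attribute hide the element?
fn is_hidden_style(style: &str) -> bool {
    // Drop whitespace and lowercase so "display: None" matches too.
    let compact: String = style
        .chars()
        .filter(|c| !c.is_whitespace())
        .collect::<String>()
        .to_ascii_lowercase();
    compact
        .split(';')
        .any(|decl| decl == "display:none" || decl == "visibility:hidden")
}

/// Hypothetical predicate: does a class or id value look like ad or
/// consent-banner noise? Assumption: plain substring matching.
fn looks_like_noise(class_or_id: &str) -> bool {
    const PATTERNS: [&str; 3] = ["ad-banner", "sponsored", "cookie-consent"];
    let value = class_or_id.to_ascii_lowercase();
    PATTERNS.iter().any(|p| value.contains(*p))
}
```

Substring matching is simple but can over-match (for example, `head-banner` contains `ad-banner`), so a production check would likely match on whole class tokens instead.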
```rust
let reader = Readability::new(html_str);
let content: Option<String> = reader.extract(OutputFormat::Markdown);
```

| Variant | Description |
|---|---|
| `PlainText` | Strip all markup, preserve paragraph breaks |
| `Markdown` | Convert HTML to markdown syntax |
Each pipeline stage is also exported for standalone use:

```rust
use readability_rs::{clean, locate_content, render, OutputFormat};
use scraper::Html;

let doc = Html::parse_document(html_str);
let cleaned = clean(&doc);
let root = locate_content(&cleaned).unwrap();
let markdown = render(&root, OutputFormat::Markdown);
```

```rust
use readability_rs::rsc::extract_rsc_content;

// Returns Some(markdown) if RSC payload found, None otherwise
let result = extract_rsc_content(html_str, Some("https://docs.example.com"));
```

| HTML | Markdown |
|---|---|
| `<h1>`-`<h6>` | `#`-`######` |
| `<strong>`, `<b>` | `**bold**` |
| `<em>`, `<i>` | `*italic*` |
| `<s>`, `<del>`, `<strike>` | `~~strikethrough~~` |
| `<sub>` | `~subscript~` |
| `<sup>` | `^superscript^` |
| `<mark>` | `==highlight==` |
| `<a href="url">` | `[text](url)` |
| `<img src="" alt="">` | `![alt](src)` |
| `<ul>` + `<li>` | `- item` |
| `<ol>` + `<li>` | `1. item` |
| `<blockquote>` | `> text` |
| `<code>` | `` `code` `` |
| `<pre>` | fenced code block |
| `<pre><code class="language-X">` | `` ```X `` fenced block |
| `<table>` | pipe table with `\| --- \|` separator |
| `<details>` + `<summary>` | bold title + body content |
| `<figure>` + `<figcaption>` | image + italic caption |
| `<hr>` | `---` |
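
The symmetric inline rows of the mapping above (the ones that wrap text in the same delimiter on both sides) can be expressed as a lookup. This is an illustrative sketch only: `inline_delimiter` and `wrap_inline` are hypothetical names, and the crate's renderer may be structured quite differently.

```rust
/// Hypothetical lookup: the markdown delimiter for an inline HTML tag,
/// following the HTML-to-markdown mapping documented above.
fn inline_delimiter(tag: &str) -> Option<&'static str> {
    match tag {
        "strong" | "b" => Some("**"),
        "em" | "i" => Some("*"),
        "s" | "del" | "strike" => Some("~~"),
        "sub" => Some("~"),
        "sup" => Some("^"),
        "mark" => Some("=="),
        "code" => Some("`"),
        _ => None,
    }
}

/// Wrap already-rendered child text in the tag's delimiter; tags without
/// a mapping pass their text through unchanged.
fn wrap_inline(tag: &str, text: &str) -> String {
    match inline_delimiter(tag) {
        Some(d) => format!("{d}{text}{d}"),
        None => text.to_string(),
    }
}
```

Links, images, lists, and block elements need more context than a single delimiter (URLs, nesting depth, numbering), so they are left out of this sketch.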
Each example below shows the HTML input and the markdown output the renderer produces.
```html
<table>
  <thead>
    <tr><th>Name</th><th>Age</th></tr>
  </thead>
  <tbody>
    <tr><td>Alice</td><td>30</td></tr>
    <tr><td>Bob</td><td>25</td></tr>
  </tbody>
</table>
```

Output:

```markdown
| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
```
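
For readers curious how such a pipe table can be assembled once the cell text has been collected, here is a small sketch. `render_pipe_table` and `row_line` are hypothetical helpers, not the crate's API, and the sketch assumes cell text contains no `|` characters (it does no escaping).

```rust
/// Hypothetical helper: format one row of cells as `| a | b |`.
fn row_line(cells: &[&str]) -> String {
    format!("| {} |", cells.join(" | "))
}

/// Hypothetical helper: render a header row and body rows as a pipe table
/// with a `| --- |` separator row under the header.
fn render_pipe_table(header: &[&str], rows: &[Vec<&str>]) -> String {
    let mut lines = vec![row_line(header)];
    // Separator row: one `---` per header column.
    lines.push(row_line(&vec!["---"; header.len()]));
    for row in rows {
        lines.push(row_line(row));
    }
    lines.join("\n")
}
```

Calling `render_pipe_table(&["Name", "Age"], &[vec!["Alice", "30"], vec!["Bob", "25"]])` reproduces the four output lines shown above.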
```html
<p><del>old price</del> <mark>new price</mark></p>
<p>H<sub>2</sub>O and x<sup>2</sup></p>
```

Output:

```markdown
~~old price~~ ==new price==
H~2~O and x^2^
```
```html
<details>
  <summary>Click to expand</summary>
  <p>This content was hidden behind a toggle.</p>
</details>
```

Output:

```markdown
**Click to expand**
This content was hidden behind a toggle.
```
```html
<figure>
  <img src="photo.jpg" alt="Sunset over the ocean">
  <figcaption>A beautiful sunset at the beach</figcaption>
</figure>
```

Output:

```markdown
![Sunset over the ocean](photo.jpg)
*A beautiful sunset at the beach*
```
```html
<pre><code class="language-rust">fn main() {
    println!("Hello, world!");
}</code></pre>
```

Output:

````markdown
```rust
fn main() {
    println!("Hello, world!");
}
```
````

Without a `language-*` class, the fence opens with a plain `` ``` ``.
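
The language detection described here amounts to reading a `language-*` token out of the `class` attribute. The sketch below shows one way to do that; `detect_language` and `fence` are hypothetical names, and the assumption that the first `language-` token wins is mine, not something the crate documents.

```rust
/// Hypothetical helper: pull the language out of a `class` attribute value
/// such as "hljs language-rust". Returns None when no token matches.
fn detect_language(class_attr: &str) -> Option<&str> {
    class_attr
        .split_whitespace()
        .find_map(|token| token.strip_prefix("language-"))
        .filter(|lang| !lang.is_empty())
}

/// Hypothetical helper: wrap code in a fenced block, adding the language
/// after the opening fence only when one was detected.
fn fence(code: &str, class_attr: Option<&str>) -> String {
    let ticks = "`".repeat(3); // the three-backtick fence marker
    let lang = class_attr.and_then(detect_language).unwrap_or("");
    format!("{ticks}{lang}\n{code}\n{ticks}")
}
```

With `Some("language-rust")` the opening fence carries `rust`; with `None`, or a class list that has no `language-*` token, the fence is left plain.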
```html
<p>Section one</p>
<hr>
<p>Section two</p>
```

Output:

```markdown
Section one
---
Section two
```
```
src/
  lib.rs          - Public API: Readability struct, extract() pipeline
  cleaner.rs      - Strip noise elements (pre-compiled selectors via OnceLock)
  locator.rs      - Find content root via CSS selector cascade
  renderer.rs     - Recursive HTML-to-markdown/plaintext conversion
  rsc.rs          - RSC/MDX extraction for Next.js documentation sites
tests/
  integration.rs  - End-to-end pipeline tests
```
```sh
cargo test
```

130+ tests covering all modules: cleaner (noise removal, hidden elements, ads, site headers), locator (selector cascade, fallbacks), renderer (headings, lists, tables, inline formatting, code language detection, details/summary, figure/figcaption), RSC (chunk finding, unescaping, MDX stripping, URL resolution), and integration (full pipeline).
MIT