readability-rs

A Rust library for extracting the main readable content from HTML pages and converting it to plain text or markdown.

What it does

Given a raw HTML string, readability-rs strips away noise (navigation, ads, scripts, footers, hidden elements) and returns just the article content. It also handles Next.js documentation sites by extracting markdown directly from embedded React Server Component (RSC) payloads.

Installation

Add to your Cargo.toml:

[dependencies]
readability-rs = { git = "https://github.com/gayacodes/readability-rs" }

Quick start

use readability_rs::{Readability, OutputFormat};

let html = r#"
<html>
<body>
  <nav><a href="/">Home</a></nav>
  <article>
    <h1>My Article</h1>
    <p>This is the <strong>main content</strong> of the page.</p>
  </article>
  <footer>Copyright 2025</footer>
</body>
</html>
"#;

let reader = Readability::new(html);

// Extract as markdown
let markdown = reader.extract(OutputFormat::Markdown).unwrap();
// => "# My Article\n\nThis is the **main content** of the page."

// Extract as plain text
let text = reader.extract(OutputFormat::PlainText).unwrap();
// => "My Article\n\nThis is the main content of the page."

How it works

The extraction pipeline has two strategies:

Strategy 0: RSC extraction (Next.js sites)

Runs first on the raw HTML string. Next.js App Router sites embed page content as React Server Component (RSC) payloads inside script tags that call self.__next_f.push([1,"..."]). The payload includes content behind interactive components (accordions, tabs, collapsible sections) that is never rendered into the visible DOM.

The RSC pipeline: find chunks -> unescape -> select best chunk -> strip frontmatter -> convert MDX components to markdown -> remove SVGs -> resolve URLs -> normalize.
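
The first two stages (find chunks, unescape) can be sketched with the standard library alone. This is a minimal illustration, not the crate's code: the helper names find_rsc_chunks and unescape_chunk are hypothetical, and the sketch assumes each payload is a double-quoted, backslash-escaped string literal that directly follows self.__next_f.push([1, in the page source. Unicode escapes (\uXXXX) are passed through untouched.

```rust
/// Marker that opens an RSC chunk in a Next.js App Router page.
const PUSH_PREFIX: &str = "self.__next_f.push([1,\"";

/// Hypothetical helper: collect the raw (still escaped) string payloads.
fn find_rsc_chunks(html: &str) -> Vec<&str> {
    let mut chunks = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find(PUSH_PREFIX) {
        let body = &rest[start + PUSH_PREFIX.len()..];
        // Scan for the closing quote, skipping any escaped character.
        let mut end = None;
        let mut escaped = false;
        for (i, c) in body.char_indices() {
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                end = Some(i);
                break;
            }
        }
        match end {
            Some(i) => {
                chunks.push(&body[..i]);
                rest = &body[i + 1..];
            }
            None => break, // unterminated literal: stop scanning
        }
    }
    chunks
}

/// Hypothetical helper: undo the common backslash escapes in one chunk.
fn unescape_chunk(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('r') => out.push('\r'),
            Some('u') => out.push_str("\\u"), // left as-is in this sketch
            Some(other) => out.push(other),   // covers \" \\ and \/
            None => out.push('\\'),           // trailing lone backslash
        }
    }
    out
}
```

Calling unescape_chunk on each element of find_rsc_chunks(html) yields candidate strings for the later "select best chunk" stage, which is not sketched here.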

Strategy 1: DOM-based extraction

If no RSC payload is found, the DOM pipeline runs:

raw HTML
  |
  v
Parse (scraper crate)
  |
  v
Clean - remove noise elements:
  - Tags: script, style, nav, footer, aside, form, iframe, svg, etc.
  - Hidden: display:none, visibility:hidden, aria-hidden, hidden attr
  - Ads: class/id matching (ad-banner, sponsored, cookie-consent, etc.)
  - Site headers: <header> with <nav> (article headers are kept)
  |
  v
Locate - find content root via selector cascade (first match wins):
  main > article > [role="main"] > #content > body (fallback)
  |
  v
Render - convert to plain text or markdown:
  - Headings, bold, italic, links, images
  - Strikethrough, subscript, superscript, highlight
  - Ordered/unordered lists, tables
  - Blockquotes, code blocks (with language detection), inline code
  - Details/summary, figure/figcaption
  - Horizontal rules, paragraph separation
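
The hidden-element check and the selector cascade above can be illustrated on a toy data model. This is a sketch under stated assumptions, not the crate's implementation: the Element struct and the functions is_hidden_style, cascade_rank, and locate are hypothetical, and the cascade is read as a priority order in which the earliest selector that matches wins and body is the last resort.

```rust
/// Toy element record used only for this sketch; the crate works on a parsed DOM.
struct Element<'a> {
    tag: &'a str,
    id: Option<&'a str>,
    role: Option<&'a str>,
}

/// True when an inline style attribute hides the element.
fn is_hidden_style(style: &str) -> bool {
    // Drop whitespace and lowercase so "Display : None" still matches.
    let compact = style
        .chars()
        .filter(|c| !c.is_whitespace())
        .collect::<String>()
        .to_lowercase();
    compact.contains("display:none") || compact.contains("visibility:hidden")
}

/// Position of an element in the cascade; a lower rank has higher priority.
fn cascade_rank(el: &Element) -> Option<usize> {
    if el.tag == "main" {
        Some(0)
    } else if el.tag == "article" {
        Some(1)
    } else if el.role == Some("main") {
        Some(2)
    } else if el.id == Some("content") {
        Some(3)
    } else if el.tag == "body" {
        Some(4) // fallback
    } else {
        None
    }
}

/// Pick the best-ranked candidate; ties go to the first one in document order.
fn locate<'a, 'b>(elements: &'b [Element<'a>]) -> Option<&'b Element<'a>> {
    elements
        .iter()
        .filter_map(|el| cascade_rank(el).map(|rank| (rank, el)))
        .min_by_key(|(rank, _)| *rank)
        .map(|(_, el)| el)
}
```

With a body, a div#content, and an article in the list, locate returns the article, because article outranks both #content and the body fallback.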

API

`Readability`

let reader = Readability::new(html_str);
let content: Option<String> = reader.extract(OutputFormat::Markdown);

`OutputFormat`

Variant	Description
`PlainText`	Strip all markup, preserve paragraph breaks
`Markdown`	Convert HTML to markdown syntax

Individual functions

Each pipeline stage is also exported for standalone use:

use readability_rs::{clean, locate_content, render, OutputFormat};
use scraper::Html;

let doc = Html::parse_document(html_str);
let cleaned = clean(&doc);
let root = locate_content(&cleaned).unwrap();
let markdown = render(&root, OutputFormat::Markdown);

RSC extraction

use readability_rs::rsc::extract_rsc_content;

// Returns Some(markdown) if RSC payload found, None otherwise
let result = extract_rsc_content(html_str, Some("https://docs.example.com"));
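
The README does not spell out how the second argument is used. Assuming it is a base URL against which relative links are resolved, the "resolve URLs" stage might look roughly like the sketch below. The function resolve_url is hypothetical and deliberately simple: it does not handle ".." segments or query strings.

```rust
/// Hypothetical helper: make a link target absolute relative to `base`.
fn resolve_url(base: &str, target: &str) -> String {
    // Absolute, protocol-relative, fragment-only, and mailto links are kept as-is.
    if target.contains("://")
        || target.starts_with("//")
        || target.starts_with('#')
        || target.starts_with("mailto:")
    {
        return target.to_string();
    }
    let base = base.trim_end_matches('/');
    if target.starts_with('/') {
        // Root-relative: keep only the scheme and host of the base.
        let origin_end = base
            .find("://")
            .and_then(|i| base[i + 3..].find('/').map(|j| i + 3 + j))
            .unwrap_or(base.len());
        format!("{}{}", &base[..origin_end], target)
    } else {
        // Path-relative: append to the base (no ".." handling in this sketch).
        format!("{}/{}", base, target)
    }
}
```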

Markdown conversion reference

HTML	Markdown
`<h1>`-`<h6>`	`#`-`######`
`<strong>`, `<b>`	`**bold**`
`<em>`, `<i>`	`*italic*`
`<s>`, `<del>`, `<strike>`	`~~strikethrough~~`
`<sub>`	`~subscript~`
`<sup>`	`^superscript^`
`<mark>`	`==highlight==`
`<a href="url">`	`[text](url)`
`<img src="" alt="">`	`![alt](src)`
`<ul>` + `<li>`	`- item`
`<ol>` + `<li>`	`1. item`
`<blockquote>`	`> text`
`<code>`	`` `code` ``
`<pre>`	fenced code block
`<pre><code class="language-X">`	fenced code block opening with ```X
`<table>`	pipe table with `| --- |` separator
`<details>` + `<summary>`	bold title + body content
`<figure>` + `<figcaption>`	image + italic caption
`<hr>`	`---`
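
The inline rows of this table amount to a lookup from tag name to a pair of markers. The sketch below shows one way to express that mapping. The functions inline_markers, wrap_inline, and heading are hypothetical names chosen for illustration; the renderer in the crate may be organized differently.

```rust
/// Opening and closing markdown markers for an inline tag, if it has any.
fn inline_markers(tag: &str) -> Option<(&'static str, &'static str)> {
    match tag {
        "strong" | "b" => Some(("**", "**")),
        "em" | "i" => Some(("*", "*")),
        "s" | "del" | "strike" => Some(("~~", "~~")),
        "sub" => Some(("~", "~")),
        "sup" => Some(("^", "^")),
        "mark" => Some(("==", "==")),
        "code" => Some(("`", "`")),
        _ => None,
    }
}

/// Wrap already-rendered child text in the markers for `tag`.
fn wrap_inline(tag: &str, text: &str) -> String {
    match inline_markers(tag) {
        Some((open, close)) => format!("{open}{text}{close}"),
        None => text.to_string(), // unknown tags contribute text only
    }
}

/// Render a heading; the level is clamped to the h1-h6 range.
fn heading(level: usize, text: &str) -> String {
    format!("{} {}", "#".repeat(level.clamp(1, 6)), text)
}
```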

Feature examples

Each example below shows the HTML input and the markdown output the renderer produces.

Tables

<table>
  <thead>
    <tr><th>Name</th><th>Age</th></tr>
  </thead>
  <tbody>
    <tr><td>Alice</td><td>30</td></tr>
    <tr><td>Bob</td><td>25</td></tr>
  </tbody>
</table>

Output:

| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
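
A pipe table like the one above is a header row, a separator row with one --- cell per column, and one line per body row. A minimal sketch, assuming the cell text has already been extracted; render_table is a hypothetical helper and does not escape pipe characters inside cells:

```rust
/// Render a header and body rows as a markdown pipe table.
fn render_table(header: &[&str], rows: &[Vec<&str>]) -> String {
    let mut lines = Vec::with_capacity(rows.len() + 2);
    lines.push(format!("| {} |", header.join(" | ")));
    // One "---" cell per header column.
    lines.push(format!("| {} |", vec!["---"; header.len()].join(" | ")));
    for row in rows {
        lines.push(format!("| {} |", row.join(" | ")));
    }
    lines.join("\n")
}
```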

Strikethrough, subscript, superscript, highlight

<p><del>old price</del> <mark>new price</mark></p>
<p>H<sub>2</sub>O and x<sup>2</sup></p>

Output:

~~old price~~ ==new price==

H~2~O and x^2^

Details / Summary

<details>
  <summary>Click to expand</summary>
  <p>This content was hidden behind a toggle.</p>
</details>

Output:

**Click to expand**

This content was hidden behind a toggle.

Figure with caption

<figure>
  <img src="photo.jpg" alt="Sunset over the ocean">
  <figcaption>A beautiful sunset at the beach</figcaption>
</figure>

Output:

![Sunset over the ocean](photo.jpg)
*A beautiful sunset at the beach*
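
The figure output is an image line followed by the caption in italics. A small sketch of that rule, using a hypothetical render_figure helper and assuming a missing or blank caption simply yields the image alone:

```rust
/// Render an image with an optional italic caption on the next line.
fn render_figure(src: &str, alt: &str, caption: Option<&str>) -> String {
    let image = format!("![{alt}]({src})");
    // Treat a whitespace-only caption the same as no caption.
    match caption.map(str::trim).filter(|c| !c.is_empty()) {
        Some(c) => format!("{image}\n*{c}*"),
        None => image,
    }
}
```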

Code blocks with language detection

<pre><code class="language-rust">fn main() {
    println!("Hello, world!");
}</code></pre>

Output:

```rust
fn main() {
    println!("Hello, world!");
}
```

Without a language-* class, the fence opens with plain ```.
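
Language detection of this kind comes down to scanning the class attribute for a language- prefix. The sketch below uses two hypothetical helpers, detect_language and fence, and is not taken from the crate. The fence marker is built with repeat so that the example does not contain a literal triple backtick.

```rust
/// Return the language named by the first `language-*` class, if any.
fn detect_language(class_attr: &str) -> Option<&str> {
    class_attr
        .split_whitespace()
        .find_map(|class| class.strip_prefix("language-"))
        .filter(|lang| !lang.is_empty())
}

/// Wrap code in a fenced block, tagging the opening fence when a language is known.
fn fence(code: &str, class_attr: &str) -> String {
    let ticks = "`".repeat(3);
    let lang = detect_language(class_attr).unwrap_or("");
    // Trim trailing newlines so the closing fence sits directly under the code.
    format!("{ticks}{lang}\n{}\n{ticks}", code.trim_end_matches('\n'))
}
```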

Horizontal rule

<p>Section one</p>
<hr>
<p>Section two</p>

Output:

Section one

---

Section two

Project structure

src/
  lib.rs       - Public API: Readability struct, extract() pipeline
  cleaner.rs   - Strip noise elements (pre-compiled selectors via OnceLock)
  locator.rs   - Find content root via CSS selector cascade
  renderer.rs  - Recursive HTML-to-markdown/plaintext conversion
  rsc.rs       - RSC/MDX extraction for Next.js documentation sites
tests/
  integration.rs - End-to-end pipeline tests
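
The "pre-compiled via OnceLock" pattern mentioned for cleaner.rs builds a value on first use and reuses it on every later call. The sketch below shows the pattern with a plain list of tag names standing in for compiled selectors. The function names are hypothetical, and the tag list is limited to the tags named in this README.

```rust
use std::sync::OnceLock;

/// Lazily built, process-wide list of noise tags (stand-in for compiled selectors).
fn noise_tags() -> &'static Vec<&'static str> {
    static TAGS: OnceLock<Vec<&'static str>> = OnceLock::new();
    // The closure runs at most once, even with concurrent callers.
    TAGS.get_or_init(|| {
        vec!["script", "style", "nav", "footer", "aside", "form", "iframe", "svg"]
    })
}

/// True when `tag` should be dropped during cleaning.
fn is_noise_tag(tag: &str) -> bool {
    noise_tags().iter().any(|t| t.eq_ignore_ascii_case(tag))
}
```

Every call to noise_tags returns a reference to the same allocation, which is what makes the pattern useful for values that are expensive to build.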

Tests

cargo test

130+ tests covering all modules: cleaner (noise removal, hidden elements, ads, site headers), locator (selector cascade, fallbacks), renderer (headings, lists, tables, inline formatting, code language detection, details/summary, figure/figcaption), RSC (chunk finding, unescaping, MDX stripping, URL resolution), and integration (full pipeline).

License

MIT
