Handwriting OCR

A web application that converts handwritten documents to text using OCR, then learns your handwriting over time through corrections. Upload or photograph handwritten pages (or import from Google Photos), get instant transcriptions via Gemini Flash or TrOCR, and correct mistakes through a gamified "Play" mode. Corrections feed back into per-user LoRA fine-tuning, improving accuracy over time.

Features

Dual OCR Engine -- Gemini 2.5 Flash API (primary, high-quality) with TrOCR local fallback
Auto-rotation -- Detects and corrects image orientation (Gemini: single-prompt detection; TrOCR: tries all 4 orientations)
Perspective Warp -- Detects notebook page corners in camera photos and applies perspective transform to remove desk/background, producing a clean rectangular page image
Deskew -- Straightens small text skew via Hough line detection so bounding boxes align with horizontal text
Ink-Aware Bbox Alignment -- Detects actual ink line positions using Otsu binarization + horizontal projection, then snaps Gemini's bounding boxes to real text positions (corrects spacing drift on long pages)
Auto-crop -- Detects content bounds to focus OCR on the writing area
Custom Bounding Boxes -- Draw boxes on the page to OCR specific regions
Bbox Training Mode -- Manually redraw bounding boxes for any OCR result to correct alignment
Play Mode -- Gamified correction interface that prioritizes low-confidence results
Personalized Fine-tuning -- Per-user LoRA adapters trained on your corrections
Calibration -- Bootstrap training with a single handwriting sample
Google Photos Import -- Import photos directly via the Google Photos Picker API
Full-text Search -- Whoosh-indexed search across all your transcribed documents
Model Export -- Download your personalized LoRA weights

Quick Start

Prerequisites

Python 3.10+
Node.js 18+
A Google Cloud project with OAuth 2.0 credentials
(Optional) A Gemini API key for high-quality OCR

Setup

# Clone
git clone https://github.com/lukeiseman/handwriting-ocr.git
cd handwriting-ocr

# Configure
cp .env.example .env
# Edit .env with your Google OAuth credentials and (optional) Gemini API key

# Backend
cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Frontend (in a separate terminal)
cd frontend
npm install

Run

# Backend (from backend/)
source .venv/bin/activate
uvicorn app.main:app --reload --port 8000

# Frontend (from frontend/)
npm run dev

Open http://localhost:5173 in your browser. The frontend dev server proxies API requests to the backend on port 8000.

Docker

docker compose up --build
# App available at http://localhost:8000

Architecture

handwriting-ocr/
├── backend/
│   ├── app/
│   │   ├── main.py          # FastAPI app, CORS, lifespan, DB migration
│   │   ├── auth.py          # JWT token verification
│   │   ├── config.py        # Pydantic settings
│   │   ├── database.py      # SQLAlchemy async engine (SQLite w/ 30s busy timeout)
│   │   ├── models.py        # ORM models (User, Document, Page, OcrResult, Correction, UserModel)
│   │   ├── schemas.py       # Pydantic request/response schemas
│   │   ├── ocr.py           # OCR engines, image processing, bbox alignment
│   │   ├── finetune.py      # LoRA fine-tuning pipeline
│   │   └── routes/
│   │       ├── auth.py          # Google OAuth login/callback, JWT
│   │       ├── documents.py     # Upload, camera, CRUD, rotate, crop
│   │       ├── ocr.py           # Trigger OCR, get results, process bbox, train bbox
│   │       ├── corrections.py   # Submit corrections, Play mode batches
│   │       ├── search.py        # Full-text search (Whoosh)
│   │       ├── photos.py        # Google Photos Picker import
│   │       └── model.py         # Training, calibration, export
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── App.jsx
│   │   ├── api.js               # Axios client + all API functions
│   │   ├── hooks/useAuth.jsx    # Auth context + token management
│   │   ├── components/
│   │   │   ├── PageViewer.jsx           # Image viewer with bbox overlays, draw/crop/train modes
│   │   │   └── BboxHighlightViewer.jsx  # Bbox drawing component (Play, Calibrate)
│   │   └── pages/
│   │       ├── Login.jsx         # Google OAuth sign-in
│   │       ├── Dashboard.jsx     # Document list, upload, navigation
│   │       ├── Upload.jsx        # Multi-source upload (file, camera, Google Photos)
│   │       ├── DocumentView.jsx  # Page viewer + OCR results + train mode
│   │       ├── Play.jsx          # Correction game with speech-to-text
│   │       ├── Search.jsx        # Full-text search
│   │       ├── Model.jsx         # Training controls + export
│   │       └── Calibrate.jsx     # Bootstrap training with sample text
│   └── package.json
├── Dockerfile              # Multi-stage (Node build + Python runtime)
├── docker-compose.yml
├── .env.example
└── CLAUDE.md               # AI assistant instructions

How It Works

Data Flow

Upload -- Images saved to data/uploads/user_{id}/, Document + Page rows created
Pre-process -- Auto-rotation detects orientation and bakes into the image file. Perspective warp detects notebook page corners and removes background. Deskew straightens small text skew.
OCR -- Gemini (or TrOCR) transcribes the page, returning text with bounding boxes. Ink line detection snaps bounding boxes to actual text positions.
Correct -- Play mode surfaces lowest-confidence results first, user approves or corrects
Train -- Corrections crop original images to bounding boxes, LoRA fine-tunes TrOCR decoder attention layers (q_proj, v_proj), new adapter version saved
Improve -- Subsequent OCR loads user's LoRA adapter for better accuracy

OCR Pipeline

The full Gemini OCR pipeline for a page:

Camera Photo
    │
    ▼
┌─────────────────┐
│  Auto-Rotation   │  Gemini detects orientation → bake rotation into file
└────────┬────────┘
         ▼
┌─────────────────┐
│ Perspective Warp │  Gemini detects 4 page corners → OpenCV warpPerspective
└────────┬────────┘  removes desk/hands/background → clean page rectangle
         ▼
┌─────────────────┐
│     Deskew       │  Hough line detection → rotate to straighten text
└────────┬────────┘
         ▼
┌─────────────────┐
│   Gemini OCR     │  Full-page prompt → returns JSON array of
└────────┬────────┘  {text, box: [y1,x1,y2,x2]} entries (0-1000 normalized)
         ▼
┌─────────────────┐
│  Ink Detection   │  Otsu binarization → horizontal projection →
└────────┬────────┘  find actual text line positions in the image
         ▼
┌─────────────────┐
│  Bbox Alignment  │  Sequential matching: snap each Gemini text line
└────────┬────────┘  to nearest detected ink line (fixes spacing drift)
         ▼
    OCR Results
    (text + aligned bboxes)

Gemini Flash (when GEMINI_API_KEY is set):

Sends full page image to Gemini 2.5 Flash for transcription with bounding boxes
Auto-rotation via single-prompt orientation detection (0/90/180/270)
Perspective warp uses Gemini to detect page corners, with 2% outward padding to avoid trimming content
Bounding box alignment corrects Gemini's uniform y-spacing grid (which drifts from actual ruled-line spacing on notebook pages) by detecting real ink positions via Otsu binarization + horizontal projection
API calls include automatic retry (3 attempts with exponential backoff) for transient network errors

TrOCR (local fallback):

microsoft/trocr-large-handwritten via HuggingFace Transformers
Line segmentation via horizontal projection profiles
Auto-rotation by trying all 4 orientations, keeping highest confidence
Confidence from average token log-probabilities

Bounding Box Alignment (_build_direct_segments):

Gemini returns text with bounding boxes that use a uniform vertical grid. On notebook pages with ruled lines, this grid spacing (~121px) often differs from the actual line spacing (~109px), causing progressive drift -- by line 25+, boxes can be 300px off from the actual text. The alignment pipeline fixes this:

Ink line detection (_detect_ink_lines): Otsu binarization on the middle 75% of the image width, horizontal projection to find rows with ink, contiguous run detection with minimum height filtering (15px), and automatic splitting of oversized blobs using projection valleys
Body-start detection: Identifies header/date lines (which have non-standard spacing to the next line) and excludes them from ink matching
Sequential matching: Each Gemini text line (in reading order) is matched to the next available ink line, skipping oversized blobs. This avoids the problem of distance-based matching where Gemini's drifted y-coordinates would match to the wrong ink line
Extrapolation: Unmatched lines at the end of the page are spaced using the median ink line spacing from the last matched position
Height from ink: Each box height comes from the actual ink line height, not a uniform value

Fine-tuning

Per-user LoRA adapters via the PEFT library:

Rank: 8, Alpha: 16
Target: Decoder attention (q_proj, v_proj)
Storage: data/models/user_{id}/v{N}/
Training: CPU-compatible, small batches from correction pairs

Configuration

Copy .env.example to .env:

Variable	Required	Description
`GOOGLE_CLIENT_ID`	Yes	Google Cloud OAuth 2.0 client ID
`GOOGLE_CLIENT_SECRET`	Yes	Google Cloud OAuth 2.0 client secret
`SECRET_KEY`	Yes	Random string for JWT signing
`GEMINI_API_KEY`	No	Gemini API key for high-quality OCR
`DATABASE_URL`	No	SQLite path (default: `sqlite:///./data/app.db`)
`UPLOAD_DIR`	No	Upload directory (default: `./data/uploads`)
`MODEL_DIR`	No	Model directory (default: `./data/models`)

Google Cloud Setup

Create a project in Google Cloud Console
Enable the Google Photos Picker API
Create OAuth 2.0 credentials (Web application)
Set authorized redirect URI to http://localhost:8000/auth/callback
Copy client ID and secret to .env

API Endpoints

Authentication

Method	Path	Description
GET	`/auth/login`	Redirect to Google OAuth
GET	`/auth/callback`	OAuth callback, mints JWT
GET	`/auth/me`	Current user profile
POST	`/auth/logout`	Clear session

Documents

Method	Path	Description
GET	`/documents`	List all documents
POST	`/documents/upload`	Upload image files
POST	`/documents/camera`	Camera capture (base64)
GET	`/documents/{id}`	Document with pages + results
DELETE	`/documents/{id}`	Delete document and files
POST	`/documents/pages/{id}/rotate`	Rotate page image
POST	`/documents/pages/{id}/crop`	Set crop region
POST	`/documents/pages/{id}/crop/clear`	Clear crop
POST	`/documents/pages/{id}/crop/auto`	Auto-detect crop

OCR

Method	Path	Description
POST	`/ocr/process/{page_id}`	Trigger OCR (background)
POST	`/ocr/process-document/{doc_id}`	OCR all pages
POST	`/ocr/process-bbox/{page_id}`	OCR drawn region (sync)
GET	`/ocr/results/{page_id}`	Get OCR results
GET	`/ocr/processing-status`	Poll processing state
PUT	`/ocr/result/{result_id}/bbox`	Update result bounding box (train mode)

Corrections

Method	Path	Description
POST	`/corrections`	Submit correction
GET	`/corrections`	List corrections
GET	`/corrections/play`	Get Play mode batch
POST	`/corrections/play/submit`	Submit Play correction

Search, Photos, Model

Method	Path	Description
GET	`/search`	Full-text search
POST	`/photos/picker/session`	Create Photos Picker session
GET	`/photos/picker/session/{id}`	Poll picker status
POST	`/photos/picker/import`	Import selected photos
GET	`/model/status`	Model version + stats
POST	`/model/train`	Start fine-tuning
POST	`/model/calibrate`	Calibrate with sample
GET	`/model/export`	Download LoRA weights

Tech Stack

Backend: FastAPI, SQLAlchemy 2.0 (async), aiosqlite, python-jose (JWT), authlib (OAuth), OpenCV

Frontend: React 19, Vite 6, Tailwind CSS 4, React Router 6, TanStack Query 5, Axios

ML/Vision: Google GenAI (Gemini 2.5 Flash), Transformers (TrOCR), PEFT (LoRA), PyTorch, OpenCV (perspective warp, deskew, ink detection)

Search: Whoosh

Deployment

Target: Google Cloud Run with persistent volume for data/.

The Dockerfile is multi-stage: Node builds the frontend, Python serves both the API and static assets. Mount data/ as a persistent volume or use GCS for production storage.

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Handwriting OCR

Features

Quick Start

Prerequisites

Setup

Run

Docker

Architecture

How It Works

Data Flow

OCR Pipeline

Fine-tuning

Configuration

Google Cloud Setup

API Endpoints

Authentication

Documents

OCR

Corrections

Search, Photos, Model

Tech Stack

Deployment

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
backend		backend
frontend		frontend
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml		docker-compose.yml

Folders and files

Latest commit

History

Repository files navigation

Handwriting OCR

Features

Quick Start

Prerequisites

Setup

Run

Docker

Architecture

How It Works

Data Flow

OCR Pipeline

Fine-tuning

Configuration

Google Cloud Setup

API Endpoints

Authentication

Documents

OCR

Corrections

Search, Photos, Model

Tech Stack

Deployment

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages