# Enterprise Data Pipeline

An automated, LLM-powered data-extraction and web-automation pipeline that transforms unstructured files into actionable insights and executes complex browser tasks.
## Table of Contents

- Overview
- Features
- Architecture
- Quick Start
- Usage & Examples
- Configuration
- API Reference
- Troubleshooting
- Performance & Security
- Development
- Contributing
- Roadmap & Known Issues
- License & Credits
## Overview

The Enterprise Data Pipeline is a robust framework built to bridge the gap between static data extraction and dynamic web action. Organizations often struggle with "siloed data" trapped in PDFs, spreadsheets, and images that requires manual entry into web portals or internal tools. This repository provides a solution by monitoring local directories for various file formats, extracting their content using specialized engines (including OCR), and passing that data to Large Language Models (LLMs) to drive automated browser agents.
By integrating Browser-Use, the system goes beyond traditional ETL (Extract, Transform, Load). It acts as a digital worker that can navigate complex websites, perform price comparisons, or fill out web forms based on the processed information. Whether you are using Google Gemini for cloud-scale reasoning or Ollama for local-first privacy, the pipeline offers a flexible, event-driven architecture.
### Who is this for?
- Data Engineers: Building automated document processing and ingestion workflows.
- RPA Developers: Transitioning from brittle, selector-based automation to LLM-native agents.
- Operations Teams: Needing to automate repetitive data-gathering or form-filling tasks.
- Product Teams: Prototyping AI agents that require real-time web interaction.
## Features

- **Universal File Support**: Native extractors for `.txt`, `.md`, `.json`, `.csv`, `.pdf`, and `.xlsx`.
- **OCR Integration**: Built-in `easyocr` support for processing scanned images and complex PDF layouts.
- **Async Processing**: Concurrent file handling using `aiofiles` and a multi-worker architecture.
- **Browser-Use Integration**: Control Chromium-based browsers via natural-language instructions.
- **Dual Provider Support**: Use Google Gemini (native or LangChain) or Ollama for local execution.
- **Workflow Orchestration**: Define multi-step tasks in YAML (e.g., code review, data analysis).
- **Real-time Dashboard**: Built-in FastAPI/WebSocket dashboard for monitoring agent health.
- **Vector Memory**: Integrated Pinecone service for long-term document context and RAG.
- **File Watcher**: Robust `watchdog` implementation to trigger events on file updates.
## Architecture

The system follows a modular architecture in which the `PipelineProcessor` acts as the central hub, coordinating filesystem events, data extractors, and LLM generators.
```mermaid
graph TD
    subgraph "Ingestion Layer"
        FW[File Watcher]
        CLI[CLI Interface]
        Data[(Local Data Dir)]
    end

    subgraph "Processing Core"
        PP[Pipeline Processor]
        EXT{Extractor Factory}
        OCR[EasyOCR Engine]
    end

    subgraph "Intelligence & Action"
        GEN[LLM Generator]
        BUA[Browser-Use Agent]
        PC[(Pinecone Vector DB)]
    end

    Data --> FW
    FW --> PP
    CLI --> PP
    PP --> EXT
    EXT --> OCR
    EXT --> GEN
    GEN --> BUA
    BUA --> PC
    PP --> DB[FastAPI Dashboard]
```
```mermaid
sequenceDiagram
    participant U as File System
    participant W as Watcher
    participant P as Processor
    participant E as Extractor
    participant L as LLM (Gemini)
    participant B as BrowserAgent
    U->>W: New File (e.g., invoice.pdf)
    W->>P: process_file(path)
    P->>E: get_content()
    E-->>P: Structured Text
    P->>L: Generate Plan(text)
    L-->>P: JSON: {action: "pay_invoice"}
    P->>B: Execute Web Action
    B-->>P: Success/Failure
    P->>U: Save Log (output.md)
```
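The sequence above can be sketched in plain Python. Every function body here is a hypothetical stand-in (the real pipeline wires in actual extractors, an LLM client, and a Browser-Use agent), so treat this as a shape, not the implementation:

```python
import json

def extract(path: str) -> str:
    """Stand-in for Extractor.get_content()."""
    return f"Invoice text from {path}"

def generate_plan(text: str) -> dict:
    """Stand-in for the LLM: returns an action plan as JSON."""
    return json.loads('{"action": "pay_invoice"}')

def execute_web_action(plan: dict) -> str:
    """Stand-in for the Browser-Use agent executing the plan."""
    return "success" if plan.get("action") else "failure"

def process_file(path: str) -> str:
    text = extract(path)              # Extractor: get_content()
    plan = generate_plan(text)        # LLM: generate plan from text
    return execute_web_action(plan)   # BrowserAgent: execute web action

print(process_file("invoice.pdf"))  # success
```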
### Technology Stack

| Layer | Technology | Purpose |
|---|---|---|
| Language | Python 3.10+ | Core logic and scripting |
| LLM Interface | LangChain / ChatGoogle | Abstracted LLM communication |
| Automation | Playwright / Browser-Use | Headless browser control |
| Database | Pinecone | Vector storage for RAG workflows |
| UI | FastAPI / Tailwind | Dashboard and monitoring |
## Quick Start

### Prerequisites

- Python 3.10 or 3.11 (3.12 support is experimental)
- Playwright browser dependencies (installed with `playwright install`)
- A Google Gemini API key or a local Ollama instance
### Installation

1. Clone the repository:

   ```sh
   git clone https://github.com/WomB0ComB0/browser-use.git
   cd browser-use
   ```

2. Install dependencies:

   ```sh
   pip install -r requirements.txt
   playwright install chromium
   ```

3. Configure the environment:

   ```sh
   cp .env.example .env
   # Edit .env and add your GEMINI_API_KEY
   ```
### Verify the Setup

Run the Gemini native demo to verify your setup:

```sh
python demo_gemini_native.py
```

Expected output:

```text
[Browser-Use] Initializing session...
[Agent] Task: Search for latest news on autonomous agents.
[Browser] Navigating to google.com...
[Result] Found 5 articles. Summary: Agents are becoming more autonomous...
```
## Usage & Examples

### Watch Mode

Monitor the `data/` directory for new files and process them automatically:

```sh
python run_pipeline.py start --config config.yaml
```

### Single-File Processing

Force the pipeline to process a specific file with a dedicated workflow:

```sh
python run_pipeline.py process ./data/sample_users.txt --workflow data_analysis
```

### Dashboard

Launch the web-based monitoring tool:

```sh
python -m pipeline.dashboard.app
# Open http://localhost:8000 in your browser
```

### Advanced: Parallel Browser Execution

To run multiple browser agents in parallel (e.g., for price scraping), use the `demo_parallel.yaml` workflow:

```sh
python run_pipeline.py process ./data/items.csv --workflow demo_parallel
```

This leverages Python's `asyncio.gather` within the orchestrator to handle multiple browser instances concurrently.
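The `asyncio.gather` pattern can be illustrated with a toy example; `run_agent` is a hypothetical stand-in for driving one browser instance, not a function the pipeline exports:

```python
import asyncio

async def run_agent(item: str) -> str:
    # In the real pipeline this would drive a Browser-Use agent.
    await asyncio.sleep(0.01)  # simulate browser work
    return f"price for {item}"

async def main() -> list:
    items = ["widget-a", "widget-b", "widget-c"]
    # gather() schedules all coroutines concurrently and
    # returns their results in the original input order.
    return await asyncio.gather(*(run_agent(i) for i in items))

results = asyncio.run(main())
print(results)
```

Because the agents run concurrently, total wall time is roughly that of the slowest agent rather than the sum of all of them.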
## Configuration

The behavior of the processor and extractors is defined in `config.yaml`.
| Variable | Default | Description |
|---|---|---|
| `directories.data` | `./data` | Where the watcher looks for files |
| `processing.workers` | `4` | Number of files extracted concurrently |
| `generator.provider` | `gemini` | `gemini` or `ollama` |
| `browser.headless` | `true` | Set to `false` to see the browser window |
| `memory.enabled` | `false` | Enable/disable Pinecone vector storage |
### Environment Variables

| Variable | Required | Description |
|---|---|---|
| `GEMINI_API_KEY` | Yes (if Gemini) | Google AI Studio API key |
| `PINECONE_API_KEY` | No | API key for vector memory |
| `OLLAMA_BASE_URL` | No | Defaults to `http://localhost:11434` |
| `LOG_LEVEL` | No | `DEBUG`, `INFO`, `WARNING`, `ERROR` |
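A hedged sketch of how these variables and their documented defaults might be resolved at startup; `resolve_env` is a hypothetical helper, not part of the pipeline's API:

```python
def resolve_env(env: dict) -> dict:
    """Apply the documented defaults for optional variables.

    `env` would normally be `os.environ`; a plain dict is used here
    so the sketch is self-contained.
    """
    if "GEMINI_API_KEY" not in env:
        raise KeyError("GEMINI_API_KEY is required when using Gemini")
    return {
        "gemini_api_key": env["GEMINI_API_KEY"],
        "ollama_base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }

cfg = resolve_env({"GEMINI_API_KEY": "dummy-key"})
print(cfg["ollama_base_url"])  # http://localhost:11434
```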
## API Reference

### PipelineProcessor

The core engine for handling file-to-action logic.
| Method | Parameters | Description |
|---|---|---|
| `initialize()` | None | Sets up extractors and LLM clients. |
| `process_file(path)` | `path: str` | Extracts content and runs the LLM agent. |
| `get_metrics()` | None | Returns stats on processed files/errors. |
### Extractor Factory

Utility to determine the correct parsing strategy.
| Function | Input | Returns |
|---|---|---|
| `get_extractor(ext)` | `.pdf`, `.csv`, etc. | `BaseExtractor` implementation |
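A factory like this is commonly implemented as a registry lookup. The sketch below is an assumption about the shape: only `BaseExtractor` appears in the table, so `PdfExtractor` and `CsvExtractor` are hypothetical names:

```python
class BaseExtractor:
    def get_content(self, path: str) -> str:
        raise NotImplementedError

class PdfExtractor(BaseExtractor):
    def get_content(self, path: str) -> str:
        return f"pdf text from {path}"

class CsvExtractor(BaseExtractor):
    def get_content(self, path: str) -> str:
        return f"csv rows from {path}"

# Extension-to-class registry; adding a format means adding one entry.
_REGISTRY = {".pdf": PdfExtractor, ".csv": CsvExtractor}

def get_extractor(ext: str) -> BaseExtractor:
    try:
        return _REGISTRY[ext]()
    except KeyError:
        raise ValueError(f"No extractor registered for {ext!r}")

print(type(get_extractor(".pdf")).__name__)  # PdfExtractor
```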
## Troubleshooting

| Error Message | Cause | Solution |
|---|---|---|
| `Executable not found` | Playwright not initialized | Run `playwright install chromium` |
| `401 Unauthorized` | Invalid API key | Check `GEMINI_API_KEY` in your `.env` |
| `OCR failed` | Missing system libraries | Install `tesseract-ocr` or `libGL` for EasyOCR |
| `TimeoutError` | Slow web response | Increase `browser.timeout` in `config.yaml` |
## Development

### Project Structure

- `pipeline/processor.py`: Orchestrates the movement of data.
- `pipeline/extractors/`: Individual logic for PDF, Excel, OCR, etc.
- `pipeline/generators/`: LLM provider implementations (Gemini, Ollama).
- `pipeline/utils/browser_executor.py`: Wrapper for Browser-Use actions.
### Running Tests

We use `pytest` for unit and integration testing:

```sh
pytest tests/test_orchestrator.py
pytest tests/test_ollama.py
```

## Roadmap & Known Issues

### Roadmap

- Support for local Llama 3 via Ollama tool-calling.
- Integration with LangGraph for more complex stateful loops.
- Dockerized deployment for headless worker nodes.
- Support for multi-modal Gemini (Pro Vision) for image-based browser navigation.
### Known Issues

- **Memory leaks**: Long-running sessions in non-headless mode may consume high RAM.
- **Shadow DOM**: Browser-Use may occasionally fail to find elements inside deep Shadow DOM structures.
## License & Credits

- **License**: MIT License. See `LICENSE` for details.
- **Inspirations**: Built using the `browser-use` library and LangChain.
- **Maintainer**: WomB0ComB0