document-ingestion

Here are 16 public repositories matching this topic...

dotfurther / OpenDiscoverPlatformCaseStudy

Case study using dotfurther's Open Discover Platform with the RavenDB document store to rapidly build a demonstration application capable of full-text search, eDiscovery, and information governance.

metadata text-extraction full-text full-text-search ravendb ediscovery indexing-engine file-format-detection data-breach file-deduplication pii information-governance-catalog personally-identifiable-information archive-extractor pii-detection file-identification full-text-extraction document-ingestion information-governance

Updated May 28, 2024

iamarunbrahma / rag-ingest

RAG-Ingest: A tool for converting PDFs to markdown and indexing them for enhanced Retrieval Augmented Generation (RAG) capabilities.

information-retrieval aws-s3 document-ingestion hybrid-search qdrant llamaindex retrieval-augmented-generation ollama pdf-to-markdown contextual-retrieval qdrant-rag

Updated Nov 22, 2024
Python

Self-hosted RAG engine for AI coding assistants. Ingests technical docs & code repositories locally with structure-aware chunking. Serves grounded context via MCP to prevent hallucinations in software development workflows.

Updated Feb 15, 2026
Go

hamittokay / context-window

A simple RAG toolkit.

typescript embeddings openai chunking knowledge-base pinecone ai-toolkit rag vector-search document-ingestion vector-database rag-pipeline source-citations rag-toolkit

Updated Nov 8, 2025
TypeScript

ankit123nag / pdf-rag-assistant

Production-grade RAG backend for document ingestion and semantic retrieval using embeddings and Pinecone.

nodejs docker redis express typescript embeddings semantic-search pinecone rag document-ingestion vector-database clerk langchain retrieval-augmented-generation

Updated Feb 8, 2026
JavaScript

brej-29 / rag-agent-workbench

Production-grade RAG chatbot with a FastAPI + LangGraph backend (Pinecone vector search + Groq LLM + Tavily web fallback) and a Streamlit chat UI, secured via API key and observable in LangSmith.

chatbot arxiv semantic-search observability pinecone rag fastapi groq streamlit document-ingestion vector-database openalex llm langchain retrieval-augmented-generation langsmith langgraph tavily docling

Updated Jan 17, 2026
Python

msmrexe / graphrag-query-summarization

An implementation of the GraphRAG pipeline (based on the 2024 paper "From Local to Global" by Edge et al.) for query-focused summarization of large text corpora.

university-project course-project knowledge-graph-construction rag hierarchial-language-model document-ingestion qfs query-focused-summarization leiden-algorithm system-2 retrieval-augmented-generation graphrag global-query llm-graph

Updated Nov 5, 2025
Python

SamD / selfhosted-rag-doc-chat-prototype

Self-hosted RAG prototype to ingest PDFs/HTML and chat with them via a local UI

redis open-source ocr self-hosted astro multi-processing rag fastapi vector-search huggingface pdf-processing document-ingestion llm qdrant-vector-database chromadb local-llm retrieval-augmented-generation

Updated Feb 14, 2026
Python

RAK0152 / doc-watch-rag

Async document watcher that keeps your RAG index hot. Automatically ingests new or changed documents into a live RAG pipeline with built-in observability.

asyncio sop observability rag document-ingestion vector-database sentence-transformers llm chromadb retrieval-augmented-generation watchfiles enterprise-ai

Updated Dec 21, 2025
Python

JoshPola96 / brainwonders-parent-rag-qa

AI-powered RAG assistant for parents to get instant, context-aware answers on Brainwonders’ career counseling programs, pricing, and services. Built with Streamlit, LangChain, ChromaDB, and Google Gemma LLM for fast, multi-document retrieval and conversational Q&A.

python nlp qa chatbot embeddings knowledge-base rag multi-document streamlit document-ingestion llm langchain chromadb educational-ai

Updated Jun 18, 2025
Python

ScientistSameer / History-of-Lab-Records

An AI analytics dashboard for research-lab analytics, collaboration, and email workflows, built with React and FastAPI.

Updated Jan 4, 2026
JavaScript

framerecall / FrameRecall

Store millions of text chunks inside ultra-compact MP4 files, index them with local embeddings, and retrieve answers instantly for fully offline RAG with any LLM.

python ai chatbot qr-code openai chunking semantic-search video-encoding memory-systems multimodal file-processing document-ingestion llm context-retrieval

Updated Jul 2, 2025
Python

JoshPola96 / agentic-rag-chatbot-mcp

Agentic RAG Chatbot using multi-agent architecture and Streamlit. Ingests PDFs, DOCX, PPTX, CSV, TXT, and Markdown files to provide contextually accurate answers with a persistent knowledge base. Supports multi-turn conversations, source citations, and dynamic document uploads.