feat: add BM25 hybrid search with RRF, update chunking for Notion ing…#31

Merged: emmanuelkb merged 2 commits into main from feature/RAG-pipeline-improvement-v2 on Mar 28, 2026

Conversation

@RubyRyn (Owner) commented Mar 23, 2026

  • Add bm25s dependency and BM25Manager (build/search/save/load with page title indexing)
  • Add HybridRetriever with RRF merge (k=60): vector(15) + BM25(15) -> top 15
  • Replace direct ChromaDB query with HybridRetriever in both conversation endpoints
  • Bump reranker top_k from 3 to 10, context cap from 4000 to 5000 chars
  • Remove MarkdownHeaderTextSplitter from Notion ingestion; use RecursiveCharacterTextSplitter only
  • Increase Notion chunk_size from 500 to 1000, overlap from 100 to 200
  • Build and save BM25 index after every Notion ingestion run
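
The RRF merge described above can be sketched as follows. This is a minimal illustration, not the PR's actual HybridRetriever: the function name `rrf_merge`, the chunk IDs, and the tie-breaking rule are assumptions made for the example. Only k=60 and the top-15 cut-off come from the description.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60, top_n=15):
    """Fuse several ranked ID lists with Reciprocal Rank Fusion (RRF).

    Each list contributes 1 / (k + rank) for every ID it contains,
    where rank starts at 1. IDs found by several retrievers accumulate
    a higher score than IDs found by only one.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output
    # is deterministic (an assumption of this sketch).
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


# Hypothetical chunk IDs standing in for vector and BM25 results.
vector_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]
merged = rrf_merge([vector_hits, bm25_hits])
top_two = rrf_merge([vector_hits, bm25_hits], top_n=2)
```

Because RRF uses only ranks, the vector distances and BM25 scores never need to be normalized onto a common scale before merging.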

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@emmanuelkb self-requested a review March 25, 2026 11:38

@emmanuelkb (Collaborator) commented Mar 25, 2026

@RubyRyn Conversation history removed: both send_message and stream_message no longer pass conversation_history to gemini.ask_workmate(), so the LLM loses multi-turn
context within a conversation. Was this intentional? If ask_workmate still accepts that parameter, this is a regression.

@emmanuelkb (Collaborator) commented

New env var VOYAGE_API_KEY required: .env.example was not updated, there is no documentation, and the app will crash at startup if the reranker is initialized without the key. It should be
documented, and the reranker should ideally fail gracefully or initialize lazily.

@RubyRyn (Owner, Author) commented Mar 28, 2026

  • Fixed the missing conversation history issue.
  • .env.example - Added GEMINI_API_KEY, VOYAGE_API_KEY and NOTION_TOKEN.
  • config.py - Added VOYAGE_API_KEY: str = "" so it's tracked through Pydantic settings like other keys.
  • voyage_reranker.py - No longer raises on a missing key. Sets self.client = None with a warning log, and rerank() returns chunks[:top_k] unranked when the client is absent. The voyageai import is also deferred until the key is present.
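
The fallback behaviour described for voyage_reranker.py might look roughly like the sketch below. It is an illustration under assumptions, not the merged code: the class name `VoyageReranker`, the constructor arguments, the `rerank-2` model name, and the use of plain strings as chunks are all hypothetical. Only the `self.client = None` fallback, the warning log, the deferred import, and the `chunks[:top_k]` return value come from the comment.

```python
import logging

logger = logging.getLogger(__name__)


class VoyageReranker:
    """Reranker that degrades to a no-op when no API key is configured."""

    def __init__(self, api_key: str = "", model: str = "rerank-2"):
        self.model = model  # model name is an assumption of this sketch
        self.client = None
        if not api_key:
            logger.warning("VOYAGE_API_KEY is not set; reranking is disabled.")
            return
        # Deferred import: voyageai is only needed when a key is present.
        import voyageai

        self.client = voyageai.Client(api_key=api_key)

    def rerank(self, query, chunks, top_k=10):
        """Return the top_k chunks, reranked when a client is available."""
        if self.client is None:
            # No client: keep the retriever's order and truncate.
            return chunks[:top_k]
        result = self.client.rerank(query, chunks, model=self.model, top_k=top_k)
        return [chunks[item.index] for item in result.results]


# With an empty key the app still starts and retrieval order is preserved.
reranker = VoyageReranker(api_key="")
fallback = reranker.rerank("what is RRF?", ["a", "b", "c"], top_k=2)
```

Returning the unranked prefix keeps both conversation endpoints working in environments without a Voyage key, at the cost of skipping the reranking step there.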

@emmanuelkb (Collaborator) left a comment

great

@emmanuelkb merged commit 13648ac into main Mar 28, 2026
@emmanuelkb deleted the feature/RAG-pipeline-improvement-v2 branch March 28, 2026 08:34