Add BM25SRetriever: pure-Python BM25 with no Java/Pyserini dependency#116

Draft
Copilot wants to merge 2 commits into main from copilot/add-bm25s-integration

Copilot AI commented Apr 23, 2026
Contributor

The existing BM25Retriever requires Pyserini (~7GB install) and a JVM. This PR adds BM25SRetriever, backed by bm25s, a pure-Python BM25 implementation that runs significantly faster and weighs ~479MB total.

New: BM25SRetriever

  • rankify/retrievers/bm25s_retriever.py — new retriever; builds a bm25s index from a JSONL or TSV corpus on first use, persists it to disk, and loads it on subsequent runs. No Java, no JVM, no Lucene.
  • Corpus formats supported: JSONL ({"id", "title", "text"}) and TSV (id\ttext\ttitle, same layout as psgs_w100.tsv)
  • Optional PyStemmer support via stemmer_lang parameter
  • Self-contained _has_answers — no pyserini import required
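
For reviewers who want a concrete picture of the two corpus layouts and of an answer check that needs no pyserini, here is a minimal standard-library sketch. The names `load_corpus` and `has_answer` are hypothetical and are not taken from `bm25s_retriever.py`; the matching rule (lowercase, strip punctuation, match whole tokens) is an assumption about what a self-contained check could look like, not a description of the merged code.

```python
import csv
import json
import re
from pathlib import Path


def load_corpus(path):
    """Read a JSONL or TSV corpus into a list of {"id", "title", "text"} dicts."""
    path = Path(path)
    docs = []
    if path.suffix == ".jsonl":
        # One JSON object per line: {"id", "title", "text"}
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    row = json.loads(line)
                    docs.append({"id": str(row["id"]),
                                 "title": row.get("title", ""),
                                 "text": row["text"]})
    elif path.suffix == ".tsv":
        # Column order is id, text, title (the psgs_w100.tsv layout)
        with path.open(encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if not row or row[0] == "id":  # skip blank lines and the header row
                    continue
                docs.append({"id": row[0], "title": row[2], "text": row[1]})
    else:
        raise ValueError(f"Unsupported corpus format: {path.suffix}")
    return docs


def _normalize(text):
    """Lowercase and collapse every run of non-word characters into one space."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def has_answer(answers, text):
    """Return True if any normalized answer occurs as a whole-token span in text."""
    haystack = f" {_normalize(text)} "
    return any(f" {_normalize(a)} " in haystack for a in answers if _normalize(a))
```

Padding both sides with spaces keeps a short answer such as "Par" from matching inside "Paris".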

Integration

  • retriever.py — adds "bm25s" to METHOD_MAP; all retriever imports wrapped in try/except so missing optional deps (pyserini, faiss, gensim, etc.) no longer block imports of unrelated retrievers
  • __init__.py — exports BM25SRetriever; same graceful import handling
  • bm25_retriever.py / diver_bm25_retriever.py — pyserini imports made lazy; raise a descriptive ImportError pointing to BM25SRetriever if pyserini is absent
  • pyproject.toml: bm25s>=0.2.0 added to [retriever] optional deps
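
The import handling described in the list above can be sketched roughly as follows. This is an illustration of the general pattern only: `optional_import`, `lazy_import`, and the two registry entries are hypothetical stand-ins (a standard-library class plays the role of an installed backend and a made-up package name plays the role of a missing one), and the real `METHOD_MAP` in `retriever.py` may be organised differently.

```python
import importlib


def optional_import(module_name, attr):
    """Return module.attr, or None when the module (or one of its deps) is missing."""
    try:
        return getattr(importlib.import_module(module_name), attr)
    except ImportError:
        return None


# Hypothetical registry: only backends whose dependencies import cleanly are kept,
# so one missing optional package does not break unrelated retrievers.
_CANDIDATES = {
    "json-decoder": optional_import("json", "JSONDecoder"),             # stand-in: installed
    "missing": optional_import("not_a_real_backend_pkg", "Retriever"),  # stand-in: absent
}
METHOD_MAP = {name: cls for name, cls in _CANDIDATES.items() if cls is not None}


def lazy_import(module_name, hint):
    """Import a heavy dependency at call time and fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(f"{module_name} is not installed. {hint}") from exc
```

A Pyserini-backed retriever would call something like `lazy_import("pyserini", "Use method='bm25s' (BM25SRetriever) instead.")` inside its constructor, so the error appears only when that retriever is actually requested.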

Usage

from rankify.retrievers import Retriever

# First run: builds and persists the index
retriever = Retriever(
    method="bm25s",
    n_docs=10,
    corpus_path="/path/to/corpus.jsonl",  # or .tsv
    index_folder="/path/to/index_dir",
)

# Subsequent runs: loads pre-built index, no corpus_path needed
retriever = Retriever(method="bm25s", n_docs=10, index_folder="/path/to/index_dir")
# `documents` is the list of rankify Document objects prepared by the caller
results = retriever.retrieve(documents)
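
The first-run versus subsequent-run behaviour shown in the usage snippet amounts to a small decision rule. The sketch below is one way to express it, assuming that a non-empty `index_folder` means an index has already been persisted; `plan_index` is a hypothetical helper and is not part of this PR.

```python
from pathlib import Path


def plan_index(index_folder, corpus_path=None):
    """Decide whether to load a persisted index or build one from the corpus."""
    folder = Path(index_folder)
    if folder.is_dir() and any(folder.iterdir()):
        return "load"  # index already on disk; corpus_path is not needed
    if corpus_path is None:
        raise ValueError(
            f"No index found in {folder} and no corpus_path given to build one."
        )
    if not Path(corpus_path).is_file():
        raise FileNotFoundError(corpus_path)
    return "build"  # first run: index the corpus, then persist to index_folder
```

Raising early when neither an index nor a corpus is available gives a clearer failure than an error from deep inside the indexing code.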
Copilot AI changed the title [WIP] Add BM25s as a replacement for Pyserini Apr 24, 2026
Copilot AI requested a review from abdoelsayed2016 April 24, 2026 00:04