BM25 · TF-IDF · Query Expansion · Pseudo-Relevance Feedback · NER
Hi, I'm Isaac. I'm studying Computer Science at the University of East Anglia in Norwich, and this is one of the projects I'm most proud of from my second year. It's a search engine I built for my Information Retrieval module — this page walks through how it works, what I tested, and what I found.
59 international soccer players. Each document contains a player's name, position, nationality, birthplace and national team — all indexed and searchable.
An inverted index built from scratch. Pre-computed document norms for fast cosine similarity, dict-of-dicts postings for O(1) term lookup.
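Roughly how that structure can be laid out, as a minimal sketch (class and method names here are mine, not the project's actual code): postings are a dict of dicts so looking up a term's posting list is O(1), and each document's tf-idf norm is computed once at index time.

```python
import math
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(dict)   # term -> {doc_id: term_frequency}
        self.doc_norms = {}                 # doc_id -> L2 norm of its tf-idf vector
        self.doc_count = 0

    def add(self, doc_id, tokens):
        self.doc_count += 1
        for tok in tokens:
            self.postings[tok][doc_id] = self.postings[tok].get(doc_id, 0) + 1

    def finalise(self):
        # Pre-compute each document's tf-idf norm once, at index time,
        # so cosine similarity at query time is a single division.
        sq = defaultdict(float)
        for term, plist in self.postings.items():
            idf = math.log(self.doc_count / len(plist))
            for doc_id, tf in plist.items():
                sq[doc_id] += (tf * idf) ** 2
        self.doc_norms = {d: math.sqrt(s) for d, s in sq.items()}

# Tiny illustrative corpus (made-up documents)
idx = InvertedIndex()
idx.add("d1", ["striker", "brazil", "santos"])
idx.add("d2", ["keeper", "brazil"])
idx.finalise()
```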
Eight benchmark queries with manual relevance judgements — proper academic evaluation. Five system configurations compared across five IR metrics. The best configuration reaches a perfect nDCG@10 of 1.000.
Classic vector space model. Document norms are pre-computed at index time so queries stay fast. L2 cosine normalisation is optional — you can toggle it off if you want raw scores.
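A sketch of how that scoring can work with the normalisation toggle, assuming a postings/norms layout like the one described for the index (function and argument names are illustrative, not the project's):

```python
import math

def tfidf_score(query_terms, postings, doc_norms, doc_count, normalise=True):
    """Rank documents against a query in the vector space model.
    `normalise=False` returns raw dot-product scores instead of cosine."""
    scores = {}
    for term in query_terms:
        plist = postings.get(term, {})
        if not plist:
            continue
        idf = math.log(doc_count / len(plist))
        for doc_id, tf in plist.items():
            # Query term weight is taken as idf, document weight as tf*idf
            scores[doc_id] = scores.get(doc_id, 0.0) + (tf * idf) * idf
    if normalise:
        # Divide by the pre-computed document norm (assumed present for
        # every scored document) to get cosine similarity
        scores = {d: s / doc_norms[d] for d, s in scores.items() if doc_norms[d] > 0}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```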
A probabilistic ranking model that penalises long documents relative to the corpus average. Outperforms TF-IDF across every metric in the evaluation.
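The length penalty comes from dividing a document's length by the corpus average inside the saturating tf component. A sketch with the common k1/b defaults (the project's exact parameter values aren't stated above, so these are assumptions):

```python
import math

def bm25_score(query_terms, postings, doc_lens, k1=1.5, b=0.75):
    N = len(doc_lens)
    avgdl = sum(doc_lens.values()) / N
    scores = {}
    for term in query_terms:
        plist = postings.get(term, {})
        if not plist:
            continue
        df = len(plist)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # non-negative idf variant
        for doc_id, tf in plist.items():
            dl = doc_lens[doc_id]
            # tf saturates with k1; docs longer than average are penalised via b
            tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * tf_part
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```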
A hand-built dictionary of demonym and role synonyms applied before retrieval. The biggest single performance gain in the evaluation — P@10 nearly doubles from 0.238 to 0.412.
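The mechanism is simple: before retrieval, each query term is looked up in the dictionary and any mapped synonyms are appended. A minimal sketch (the entries below are illustrative, not the project's actual dictionary):

```python
# Hand-rolled demonym and role synonym map, in the spirit described above
SYNONYMS = {
    "brazilian": ["brazil"],
    "english": ["england"],
    "keeper": ["goalkeeper"],
    "striker": ["forward"],
}

def expand_query(terms):
    """Append dictionary synonyms for each query term before retrieval."""
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded
```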
Two-pass retrieval with no human input needed — the system assumes the top results are relevant and expands the query from there. Achieves perfect recall, though precision takes a real hit (P@10 falls from 0.412 to 0.237) as the extra terms introduce noise.
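The two-pass structure can be sketched like this, where `search` and `doc_tokens` stand in for the project's retrieval function and tokenised document store (names and cutoffs are my assumptions):

```python
from collections import Counter

def prf_expand(query_terms, search, doc_tokens, k=3, n_terms=2):
    """Pseudo-relevance feedback: assume the top-k first-pass results are
    relevant, take their most frequent new terms, and expand the query."""
    first_pass = search(query_terms)                  # [(doc_id, score), ...]
    counts = Counter()
    for doc_id, _ in first_pass[:k]:
        counts.update(t for t in doc_tokens[doc_id] if t not in query_terms)
    extra = [t for t, _ in counts.most_common(n_terms)]
    return query_terms + extra

# Toy demonstration with a stubbed-out first pass
docs = {"d1": ["brazil", "forward", "forward"]}
fake_search = lambda q: [("d1", 1.0)]
expanded = prf_expand(["striker"], fake_search, docs)
```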
Requires explicit relevance labels — uses the qrels to move the query vector closer to relevant documents and away from irrelevant ones in TF-IDF space.
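That query-vector update is the classic Rocchio formula. A sketch with the textbook alpha/beta/gamma weights (the project's exact weights aren't stated above, so these are assumptions):

```python
from collections import defaultdict

def rocchio(query_vec, rel_vecs, nonrel_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward the centroid of relevant documents
    and away from the centroid of irrelevant ones, in tf-idf space."""
    new_q = defaultdict(float)
    for t, w in query_vec.items():
        new_q[t] += alpha * w
    for vec in rel_vecs:
        for t, w in vec.items():
            new_q[t] += beta * w / len(rel_vecs)
    for vec in nonrel_vecs:
        for t, w in vec.items():
            new_q[t] -= gamma * w / len(nonrel_vecs)
    # Negative weights are conventionally clipped to zero
    return {t: w for t, w in new_q.items() if w > 0}
```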
The most experimental part of the project. spaCy pulls named entities from both queries and documents — if they overlap, the document's score gets a proportional boost. Results are LRU-cached to avoid hammering the NLP pipeline on every query.
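The boost logic looks roughly like this. The real project uses spaCy's NER; here a toy gazetteer extractor stands in so the sketch stays self-contained, with `lru_cache` playing the same caching role described above (names and the boost weight are my assumptions):

```python
from functools import lru_cache

@lru_cache(maxsize=512)
def extract_entities(text):
    """Stand-in for spaCy NER: cached so repeated queries skip re-extraction."""
    KNOWN = {"Brazil", "England", "Santos"}        # toy gazetteer
    return frozenset(w for w in text.split() if w in KNOWN)

def boost_score(base_score, query, doc_text, weight=0.2):
    """Boost a document's score in proportion to how many named entities
    it shares with the query."""
    overlap = len(extract_entities(query) & extract_entities(doc_text))
    return base_score * (1 + weight * overlap)
```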
Results are simulated from real system output. Toggle the options to see how each configuration changes the ranking.
Metrics computed at k=10 across five retrieval configurations and eight benchmark queries.
| Configuration | P@10 | R@10 | MAP@10 | nDCG@10 | MRR@10 |
|---|---|---|---|---|---|
| TF-IDF baseline | 0.213 | 0.875 | 0.612 | 0.821 | 0.875 |
| BM25 baseline | 0.238 | 0.938 | 0.641 | 0.849 | 0.938 |
| BM25 + QE | 0.412 | 1.000 | 0.821 | 1.000 | 1.000 |
| BM25 + QE + PRF | 0.237 | 1.000 | 0.731 | 0.951 | 1.000 |
| BM25 + QE + Rocchio | 0.263 | 1.000 | 0.756 | 0.964 | 1.000 |