Information Retrieval · University Coursework

IR Search Engine

BM25 · TF-IDF · Query Expansion · Pseudo-Relevance Feedback · NER

Hi, I'm Isaac. I'm studying Computer Science at the University of East Anglia in Norwich, and this is one of the projects I'm most proud of from my second year. It's a search engine I built for my Information Retrieval module — this page walks through how it works, what I tested, and what I found.

Isaac Walsham · BSc Computer Science · University of East Anglia

Python 3 · NLTK · spaCy · pandas · pytest · ~3,500 LOC
59 documents · 6 algorithms · 100% Recall@10
System pipeline

soccer.csv (59 docs) → Preprocessing & tokenisation (NLTK) → Inverted index (312 terms) → Ranking: TF-IDF / BM25 (retrieval) → QE / PRF / Rocchio / NER (re-rank) → Evaluation metrics: P, R, MAP, nDCG

Dataset

59 international soccer players. Each document contains a player's name, position, nationality, birthplace and national team — all indexed and searchable.

Index

An inverted index built from scratch. Pre-computed document norms for fast cosine similarity, dict-of-dicts postings for O(1) term lookup.
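The dict-of-dicts postings structure described above can be sketched as follows. This is an illustrative reconstruction, not the project's code: `build_index`, its argument shape, and the variable names are assumptions, and the IDF formula is the smoothed variant from the TF-IDF card below.

```python
import math
from collections import defaultdict

def build_index(docs):
    """Build a dict-of-dicts inverted index: term -> {doc_id: tf}.

    `docs` maps doc_id -> list of tokens. Pre-computes L2 document
    norms over tf-idf weights so cosine similarity is cheap at query time.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for t in tokens:
            # O(1) postings lookup per term, tf accumulated per document
            index[t][doc_id] = index[t].get(doc_id, 0) + 1

    n_docs = len(docs)
    # Smoothed IDF: log((N+1) / (df+1)) + 1.0
    idf = {t: math.log((n_docs + 1) / (len(post) + 1)) + 1.0
           for t, post in index.items()}

    # Pre-computed document norms for fast cosine similarity
    norms = {}
    for doc_id, tokens in docs.items():
        norms[doc_id] = math.sqrt(sum(
            (index[t][doc_id] * idf[t]) ** 2 for t in set(tokens)))
    return index, idf, norms
```

At query time only the postings lists for the query terms are touched, and the pre-computed norms avoid re-scanning each candidate document.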

Evaluation

Eight benchmark queries with manual relevance judgements, evaluated in the standard academic fashion. Five system configurations are compared across five IR metrics; the best configuration achieves a perfect nDCG@10 of 1.000.


Implementation

Six algorithms, from scratch

Each algorithm is described below with its formula and details.

IDF(t) = log((N+1) / (df(t)+1)) + 1.0

score(d,q) = Σₜ tf(t,d) · IDF(t)² / (‖d‖ · ‖q‖)

Classic vector space model. Document norms are pre-computed at index time so queries stay fast. L2 cosine normalisation is optional — you can toggle it off if you want raw scores.

--ranker tfidf --tfidf-normalize
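The scoring loop behind the formula above might look like the sketch below, assuming the index, IDF table, and pre-computed norms from index time. Function and parameter names are hypothetical; query term frequencies are taken as 1 per distinct term, which the original does not specify.

```python
import math

def tfidf_score(query_tokens, index, idf, norms):
    """Cosine-normalised tf-idf ranking (sketch).

    index: term -> {doc_id: tf};  idf: term -> IDF(t);
    norms: doc_id -> pre-computed L2 norm of the document's tf-idf vector.
    """
    q_terms = set(query_tokens)
    # Query norm with tf(t,q) = 1, so weight is just IDF(t)
    q_norm = math.sqrt(sum(idf.get(t, 0.0) ** 2 for t in q_terms))
    scores = {}
    for t in q_terms:
        for doc_id, tf in index.get(t, {}).items():
            # Numerator term: tf(t,d) * IDF(t)^2
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf[t] ** 2
    # Divide by ||d|| * ||q|| (the optional L2 cosine normalisation)
    return sorted(((d, s / (norms[d] * q_norm)) for d, s in scores.items()),
                  key=lambda x: -x[1])
```

Skipping the final division corresponds to the raw-score mode behind the `--tfidf-normalize` toggle.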
BM25(d,q) = Σₜ IDF(t) · tf(t,d) · (k₁ + 1) / (tf(t,d) + k₁ · (1 − b + b · |d|/avgdl))

A probabilistic ranking model that penalises long documents relative to the corpus average. Outperforms TF-IDF across every metric in the evaluation.

--ranker bm25 k₁ = 1.2 b = 0.75
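A minimal sketch of the BM25 formula above, with the defaults k₁ = 1.2 and b = 0.75. The names and argument shapes are assumptions, and since the page only gives an IDF formula in the TF-IDF card, this sketch reuses that smoothed IDF; the project may use the classic BM25 IDF instead.

```python
import math

def bm25_score(query_tokens, index, doc_lens, k1=1.2, b=0.75):
    """BM25 ranking (sketch). index: term -> {doc_id: tf};
    doc_lens: doc_id -> document length in tokens."""
    n_docs = len(doc_lens)
    avgdl = sum(doc_lens.values()) / n_docs  # corpus-average length
    scores = {}
    for t in set(query_tokens):
        postings = index.get(t, {})
        if not postings:
            continue
        idf = math.log((n_docs + 1) / (len(postings) + 1)) + 1.0
        for doc_id, tf in postings.items():
            # Length penalty: documents longer than avgdl are damped by b
            denom = tf + k1 * (1 - b + b * doc_lens[doc_id] / avgdl)
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * tf * (k1 + 1) / denom
    return sorted(scores.items(), key=lambda x: -x[1])
```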
expand("english players") → "england english players"
mappings: english → england, midfielder → midfield, goalkeeper → keeper, spanish → spain, …

A hand-built dictionary of demonym and role synonyms applied before retrieval. The biggest single performance gain in the evaluation — P@10 nearly doubles from 0.238 to 0.412.

--qe --no-qe
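The hand-built synonym dictionary reduces to a very small lookup step before retrieval. A sketch, using only the four mappings shown above (the full dictionary is elided in the original) and hypothetical names:

```python
# Demonym and role synonyms (subset shown on the page; the real dictionary is larger)
SYNONYMS = {
    "english": "england",
    "midfielder": "midfield",
    "goalkeeper": "keeper",
    "spanish": "spain",
}

def expand(query_tokens):
    """Append each mapped synonym to the query before retrieval (sketch)."""
    expanded = list(query_tokens)
    for t in query_tokens:
        mapped = SYNONYMS.get(t)
        if mapped and mapped not in expanded:
            expanded.append(mapped)
    return expanded
```

Because expansion happens before retrieval, the added terms flow through the same BM25 scoring as the originals, which is why the gain shows up uniformly across metrics.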
pass 1 → retrieve top-K docs
pass 2 → extract top-M highest-weight terms from the feedback documents, append them to the query, re-rank

Two-pass retrieval with no human input needed — the system assumes the top results are relevant and expands the query from there. Achieves perfect recall, though precision drops slightly as the extra terms introduce some noise.

--prf --prf-docs 5 --prf-terms 5
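The two-pass loop can be sketched as below. This is an assumption-laden reconstruction: the function name and inputs are hypothetical, and it weighs feedback terms by raw frequency, whereas the project extracts the "highest-weight" terms (possibly tf-idf weighted).

```python
from collections import Counter

def prf_expand(query_tokens, ranked, doc_tokens, k=5, m=5):
    """Pseudo-relevance feedback (sketch).

    ranked: (doc_id, score) pairs from the first-pass retrieval;
    doc_tokens: doc_id -> token list. Assumes the top-k docs are
    relevant, then appends the m strongest non-query terms.
    """
    feedback = Counter()
    for doc_id, _ in ranked[:k]:
        feedback.update(doc_tokens[doc_id])
    # Take the m heaviest terms not already in the query
    extra = [t for t, _ in feedback.most_common() if t not in query_tokens][:m]
    return list(query_tokens) + extra
```

The expanded query then goes through the ranker a second time; no human input is needed, which is also where the noise terms that cost a little precision come from.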
q′ = α · q + β · centroid(Relevant documents) − γ · centroid(Non-relevant documents)

Requires explicit relevance labels — uses the qrels to move the query vector closer to relevant documents and away from irrelevant ones in TF-IDF space.

--rocchio α = 1.0 β = 0.75 γ = 0.15
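The Rocchio update above operates on sparse tf-idf vectors. A minimal sketch with the stated defaults (α = 1.0, β = 0.75, γ = 0.15); vectors are represented here as term → weight dicts, and dropping negative weights is standard practice rather than something the page states:

```python
def rocchio(query_vec, rel_vecs, nonrel_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*centroid(rel) - gamma*centroid(nonrel)  (sketch)."""
    terms = set(query_vec)
    for v in rel_vecs + nonrel_vecs:
        terms |= set(v)

    def centroid(vecs, t):
        # Mean weight of term t across a set of document vectors
        return sum(v.get(t, 0.0) for v in vecs) / len(vecs) if vecs else 0.0

    q_new = {}
    for t in terms:
        w = (alpha * query_vec.get(t, 0.0)
             + beta * centroid(rel_vecs, t)
             - gamma * centroid(nonrel_vecs, t))
        if w > 0:  # negative weights are conventionally clipped to zero
            q_new[t] = w
    return q_new
```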
entity types: PERSON, GPE, NORP, LOC, LANGUAGE
score′ = score × (1 + w · overlap_count), w = 0.25 (default, configurable)

The most experimental part of the project. spaCy pulls named entities from both queries and documents — if they overlap, the document's score gets a proportional boost. Results are LRU-cached to avoid hammering the NLP pipeline on every query.

--ner --ner-weight 0.25
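Once spaCy has extracted entity sets for the query and each document, the boost itself is a one-liner per document. This sketch takes pre-extracted entity sets as input (the spaCy call and the LRU cache are omitted), and all names are illustrative:

```python
def ner_rerank(scores, query_entities, doc_entities, weight=0.25):
    """Re-rank by entity overlap: score' = score * (1 + w * overlap)  (sketch).

    scores: doc_id -> retrieval score;
    query_entities: set of entity strings from the query;
    doc_entities: doc_id -> set of entity strings from the document.
    """
    reranked = {}
    for doc_id, s in scores.items():
        overlap = len(query_entities & doc_entities.get(doc_id, set()))
        reranked[doc_id] = s * (1 + weight * overlap)
    return sorted(reranked.items(), key=lambda x: -x[1])
```

Because the boost is multiplicative, documents with no entity overlap keep their original score rather than being penalised.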

Try it yourself

Interactive demo

Results are simulated from real system output. Toggle the options to see how each configuration changes the ranking.


Evaluation

Benchmark results

Metrics computed at k=10 across five retrieval configurations and eight benchmark queries.

Configuration         P@10   R@10   MAP@10  nDCG@10  MRR@10
TF-IDF baseline       0.213  0.875  0.612   0.821    0.875
BM25 baseline         0.238  0.938  0.641   0.849    0.938
BM25 + QE             0.412  1.000  0.821   1.000    1.000
BM25 + QE + PRF       0.237  1.000  0.731   0.951    1.000
BM25 + QE + Rocchio   0.263  1.000  0.756   0.964    1.000
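As an example of how the table's metrics are computed, here is an nDCG@k sketch with binary relevance, which the manual qrels suggest but the page does not confirm; the function name is illustrative:

```python
import math

def ndcg_at_k(ranked_ids, relevant, k=10):
    """nDCG@k with binary relevance judgements (sketch).

    ranked_ids: doc ids in ranked order; relevant: set of relevant doc ids.
    """
    # DCG: each relevant hit at rank i (0-based) contributes 1 / log2(i + 2)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]) if d in relevant)
    # Ideal DCG: all relevant docs packed into the top ranks
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

An nDCG@10 of 1.000, as BM25 + QE achieves, means every relevant document sits above every non-relevant one in the top ten.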