Information Retrieval · University Coursework

IR Search Engine

BM25 · TF-IDF · Query Expansion · Pseudo-Relevance Feedback · NER

Hi, I'm Isaac. I'm studying Computer Science at the University of East Anglia in Norwich, and this is one of the projects I'm most proud of from my second year. It's a search engine I built for my Information Retrieval module — this page walks through how it works, what I tested, and what I found.

Isaac Walsham · BSc Computer Science · University of East Anglia

Python 3 · NLTK · spaCy · pandas · pytest · ~3,500 LOC
59 documents · 6 algorithms · 100% Recall@10
System pipeline

soccer.csv (59 docs) → Preprocessing & tokenisation (NLTK) → Inverted index (312 terms) → Ranking: TF-IDF / BM25 (retrieval) → QE / PRF / Rocchio / NER (re-rank) → Evaluation metrics: P, R, MAP, nDCG

Dataset

59 international soccer players. Each document contains a player's name, position, nationality, birthplace and national team — all indexed and searchable.

Index

An inverted index built from scratch. Pre-computed document norms for fast cosine similarity, dict-of-dicts postings for O(1) term lookup.
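The dict-of-dicts postings structure described above can be sketched as follows. This is an illustrative reconstruction, not the project's code: `build_index`, its argument shape, and the variable names are assumptions, and the IDF formula is the smoothed variant from the TF-IDF card below.

```python
import math
from collections import defaultdict

def build_index(docs):
    """Build a dict-of-dicts inverted index: term -> {doc_id: tf}.

    `docs` maps doc_id -> list of tokens. Pre-computes L2 document
    norms over tf-idf weights so cosine similarity is cheap at query time.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for t in tokens:
            # O(1) postings lookup per term, tf accumulated per document
            index[t][doc_id] = index[t].get(doc_id, 0) + 1

    n_docs = len(docs)
    # Smoothed IDF: log((N+1) / (df+1)) + 1.0
    idf = {t: math.log((n_docs + 1) / (len(post) + 1)) + 1.0
           for t, post in index.items()}

    # Pre-computed document norms for fast cosine similarity
    norms = {}
    for doc_id, tokens in docs.items():
        norms[doc_id] = math.sqrt(sum(
            (index[t][doc_id] * idf[t]) ** 2 for t in set(tokens)))
    return index, idf, norms
```

At query time only the postings lists for the query terms are touched, and the pre-computed norms avoid re-scanning each candidate document.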

Evaluation

Eight benchmark queries with manual relevance judgements, evaluated in the standard academic fashion. Five system configurations are compared across five IR metrics; the best configuration achieves a perfect nDCG@10 of 1.000.


Implementation

Six algorithms, from scratch

Each algorithm is described below with its formula and details.

IDF(t) = log((N+1) / (df(t)+1)) + 1.0

score(d,q) = Σₜ tf(t,d) · IDF(t)² / (‖d‖ · ‖q‖)

Classic vector space model. Document norms are pre-computed at index time so queries stay fast. L2 cosine normalisation is optional — you can toggle it off if you want raw scores.

--ranker tfidf --tfidf-normalize
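The scoring loop behind the formula above might look like the sketch below, assuming the index, IDF table, and pre-computed norms from index time. Function and parameter names are hypothetical; query term frequencies are taken as 1 per distinct term, which the original does not specify.

```python
import math

def tfidf_score(query_tokens, index, idf, norms):
    """Cosine-normalised tf-idf ranking (sketch).

    index: term -> {doc_id: tf};  idf: term -> IDF(t);
    norms: doc_id -> pre-computed L2 norm of the document's tf-idf vector.
    """
    q_terms = set(query_tokens)
    # Query norm with tf(t,q) = 1, so weight is just IDF(t)
    q_norm = math.sqrt(sum(idf.get(t, 0.0) ** 2 for t in q_terms))
    scores = {}
    for t in q_terms:
        for doc_id, tf in index.get(t, {}).items():
            # Numerator term: tf(t,d) * IDF(t)^2
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf[t] ** 2
    # Divide by ||d|| * ||q|| (the optional L2 cosine normalisation)
    return sorted(((d, s / (norms[d] * q_norm)) for d, s in scores.items()),
                  key=lambda x: -x[1])
```

Skipping the final division corresponds to the raw-score mode behind the `--tfidf-normalize` toggle.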
BM25(d,q) = Σₜ IDF(t) · tf(t,d) · (k₁ + 1) / (tf(t,d) + k₁ · (1 − b + b · |d|/avgdl))

A probabilistic ranking model that penalises long documents relative to the corpus average. Outperforms TF-IDF across every metric in the evaluation.

--ranker bm25 k₁ = 1.2 b = 0.75
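A minimal sketch of the BM25 formula above, with the defaults k₁ = 1.2 and b = 0.75. The names and argument shapes are assumptions, and since the page only gives an IDF formula in the TF-IDF card, this sketch reuses that smoothed IDF; the project may use the classic BM25 IDF instead.

```python
import math

def bm25_score(query_tokens, index, doc_lens, k1=1.2, b=0.75):
    """BM25 ranking (sketch). index: term -> {doc_id: tf};
    doc_lens: doc_id -> document length in tokens."""
    n_docs = len(doc_lens)
    avgdl = sum(doc_lens.values()) / n_docs  # corpus-average length
    scores = {}
    for t in set(query_tokens):
        postings = index.get(t, {})
        if not postings:
            continue
        idf = math.log((n_docs + 1) / (len(postings) + 1)) + 1.0
        for doc_id, tf in postings.items():
            # Length penalty: documents longer than avgdl are damped by b
            denom = tf + k1 * (1 - b + b * doc_lens[doc_id] / avgdl)
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * tf * (k1 + 1) / denom
    return sorted(scores.items(), key=lambda x: -x[1])
```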
expand("english players") → "england english players"
mappings: english → england, midfielder → midfield, goalkeeper → keeper, spanish → spain, …

A hand-built dictionary of demonym and role synonyms applied before retrieval. The biggest single performance gain in the evaluation — P@10 nearly doubles from 0.238 to 0.412.

--qe --no-qe
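The hand-built synonym dictionary reduces to a very small lookup step before retrieval. A sketch, using only the four mappings shown above (the full dictionary is elided in the original) and hypothetical names:

```python
# Demonym and role synonyms (subset shown on the page; the real dictionary is larger)
SYNONYMS = {
    "english": "england",
    "midfielder": "midfield",
    "goalkeeper": "keeper",
    "spanish": "spain",
}

def expand(query_tokens):
    """Append each mapped synonym to the query before retrieval (sketch)."""
    expanded = list(query_tokens)
    for t in query_tokens:
        mapped = SYNONYMS.get(t)
        if mapped and mapped not in expanded:
            expanded.append(mapped)
    return expanded
```

Because expansion happens before retrieval, the added terms flow through the same BM25 scoring as the originals, which is why the gain shows up uniformly across metrics.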
pass 1 → retrieve top-K docs
pass 2 → extract top-M highest-weight terms from the feedback documents, append them to the query, re-rank

Two-pass retrieval with no human input needed — the system assumes the top results are relevant and expands the query from there. Achieves perfect recall, though precision drops slightly as the extra terms introduce some noise.

--prf --prf-docs 5 --prf-terms 5
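The two-pass loop can be sketched as below. This is an assumption-laden reconstruction: the function name and inputs are hypothetical, and it weighs feedback terms by raw frequency, whereas the project extracts the "highest-weight" terms (possibly tf-idf weighted).

```python
from collections import Counter

def prf_expand(query_tokens, ranked, doc_tokens, k=5, m=5):
    """Pseudo-relevance feedback (sketch).

    ranked: (doc_id, score) pairs from the first-pass retrieval;
    doc_tokens: doc_id -> token list. Assumes the top-k docs are
    relevant, then appends the m strongest non-query terms.
    """
    feedback = Counter()
    for doc_id, _ in ranked[:k]:
        feedback.update(doc_tokens[doc_id])
    # Take the m heaviest terms not already in the query
    extra = [t for t, _ in feedback.most_common() if t not in query_tokens][:m]
    return list(query_tokens) + extra
```

The expanded query then goes through the ranker a second time; no human input is needed, which is also where the noise terms that cost a little precision come from.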
q′ = α · q + β · centroid(Relevant documents) − γ · centroid(Non-relevant documents)

Requires explicit relevance labels — uses the qrels to move the query vector closer to relevant documents and away from irrelevant ones in TF-IDF space.

--rocchio α = 1.0 β = 0.75 γ = 0.15
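The Rocchio update above operates on sparse tf-idf vectors. A minimal sketch with the stated defaults (α = 1.0, β = 0.75, γ = 0.15); vectors are represented here as term → weight dicts, and dropping negative weights is standard practice rather than something the page states:

```python
def rocchio(query_vec, rel_vecs, nonrel_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*centroid(rel) - gamma*centroid(nonrel)  (sketch)."""
    terms = set(query_vec)
    for v in rel_vecs + nonrel_vecs:
        terms |= set(v)

    def centroid(vecs, t):
        # Mean weight of term t across a set of document vectors
        return sum(v.get(t, 0.0) for v in vecs) / len(vecs) if vecs else 0.0

    q_new = {}
    for t in terms:
        w = (alpha * query_vec.get(t, 0.0)
             + beta * centroid(rel_vecs, t)
             - gamma * centroid(nonrel_vecs, t))
        if w > 0:  # negative weights are conventionally clipped to zero
            q_new[t] = w
    return q_new
```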
entity types: PERSON, GPE, NORP, LOC, LANGUAGE
score′ = score × (1 + w · overlap_count), w = 0.25 (default, configurable)

The most experimental part of the project. spaCy pulls named entities from both queries and documents — if they overlap, the document's score gets a proportional boost. Results are LRU-cached to avoid hammering the NLP pipeline on every query.

--ner --ner-weight 0.25
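Once spaCy has extracted entity sets for the query and each document, the boost itself is a one-liner per document. This sketch takes pre-extracted entity sets as input (the spaCy call and the LRU cache are omitted), and all names are illustrative:

```python
def ner_rerank(scores, query_entities, doc_entities, weight=0.25):
    """Re-rank by entity overlap: score' = score * (1 + w * overlap)  (sketch).

    scores: doc_id -> retrieval score;
    query_entities: set of entity strings from the query;
    doc_entities: doc_id -> set of entity strings from the document.
    """
    reranked = {}
    for doc_id, s in scores.items():
        overlap = len(query_entities & doc_entities.get(doc_id, set()))
        reranked[doc_id] = s * (1 + weight * overlap)
    return sorted(reranked.items(), key=lambda x: -x[1])
```

Because the boost is multiplicative, documents with no entity overlap keep their original score rather than being penalised.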

Try it yourself

Interactive demo

Results are simulated from real system output. Toggle the options to see how each configuration changes the ranking.


Evaluation

Benchmark results

Metrics computed at k=10 across five retrieval configurations and eight benchmark queries.

Configuration         P@10   R@10   MAP@10  nDCG@10  MRR@10
TF-IDF baseline       0.213  0.875  0.612   0.821    0.875
BM25 baseline         0.238  0.938  0.641   0.849    0.938
BM25 + QE             0.412  1.000  0.821   1.000    1.000
BM25 + QE + PRF       0.237  1.000  0.731   0.951    1.000
BM25 + QE + Rocchio   0.263  1.000  0.756   0.964    1.000
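As an example of how the table's metrics are computed, here is an nDCG@k sketch with binary relevance, which the manual qrels suggest but the page does not confirm; the function name is illustrative:

```python
import math

def ndcg_at_k(ranked_ids, relevant, k=10):
    """nDCG@k with binary relevance judgements (sketch).

    ranked_ids: doc ids in ranked order; relevant: set of relevant doc ids.
    """
    # DCG: each relevant hit at rank i (0-based) contributes 1 / log2(i + 2)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]) if d in relevant)
    # Ideal DCG: all relevant docs packed into the top ranks
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

An nDCG@10 of 1.000, as BM25 + QE achieves, means every relevant document sits above every non-relevant one in the top ten.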