RAG Pipeline

RAG Pipeline started from a recurring issue I noticed while reading research papers:

Most retrieval systems can find words, but they struggle to understand technical context.

Traditional search pipelines often fail when:

queries use terminology different from the paper
important sections are buried behind irrelevant matches
retrieval systems prioritize figure captions or references
semantic search misses structurally important sections
generated answers hallucinate outside the source material

I wanted to build a retrieval system that treated research papers as structured technical documents instead of plain text blobs.

The problem

Research papers contain dense technical information distributed across:

introductions
methodology sections
experiments
ablation studies
appendices
equations and figures

Basic embedding search often struggles because semantic similarity alone is not enough.

For example:

a query about “training incentives” may never mention the word “reward”, even though that is the term the paper uses
a query about datasets may retrieve references instead of experimental setup sections
dense retrieval frequently ignores document structure entirely


What I built

Hybrid dense + sparse retrieval — combines semantic embeddings with BM25 keyword retrieval.
Query expansion engine — expands user queries using domain-specific synonyms and pseudo-relevance feedback.
Section-aware ranking — injects document structure into retrieval scoring.
Cross-encoder reranking — deep relevance scoring for final candidate selection.
Noise filtering pipeline — removes references, figure captions, equations, and low-signal chunks.
LLM-powered answer generation — retrieved context passed into Groq-hosted Llama models with anti-hallucination prompting.
Evaluation & failure analysis tooling — retrieval debugging pipeline for analyzing ranking failures.

The architecture

The system uses a multi-stage retrieval pipeline instead of a single retrieval pass.

User Query
    ↓
Query Expansion
    ↓
Dense Search (FAISS)
    +
Sparse Search (BM25)
    ↓
Pseudo-Relevance Feedback
    ↓
Re-retrieval
    ↓
Section Prior Injection
    ↓
Noise Filtering
    ↓
Cross-Encoder Reranking
    ↓
Context Builder
    ↓
LLM Answer Generation

Each stage exists to solve a specific retrieval weakness.

The retrieval pipeline

The core retrieval engine combines several ranking strategies.

Dense retrieval

Semantic embeddings generated using Sentence Transformers are indexed with FAISS for approximate nearest-neighbor search.

dense_index, embeddings = indexer.build_dense_index(chunks)

This handles semantic similarity well but can still miss structural intent.

Sparse retrieval

BM25 retrieval runs in parallel to preserve exact keyword matching.

bm25 = indexer.build_bm25_index(chunks)

This improves precision for technical terminology and formulas.
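Dense and sparse scores live on different scales (cosine similarities versus unbounded BM25 scores), so they need to be normalized before they can be combined. The following is a minimal sketch of one common approach, a weighted sum of min-max normalized scores. The function names, the `dense_weight` default, and the sample scores are assumptions for illustration, not the project's actual code or values.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tie; treat them as equally relevant.
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (s - lo) / (hi - lo) for chunk_id, s in scores.items()}


def fuse_scores(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    dense_weight: float = 0.6,  # hypothetical default fusion weight
) -> Dict[str, float]:
    """Weighted sum of normalized dense and sparse scores.

    A chunk returned by only one retriever contributes 0 for the other.
    """
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    fused = {}
    for chunk_id in set(dense_n) | set(sparse_n):
        fused[chunk_id] = (
            dense_weight * dense_n.get(chunk_id, 0.0)
            + (1.0 - dense_weight) * sparse_n.get(chunk_id, 0.0)
        )
    return fused


# Made-up scores: cosine similarities (dense) and BM25 scores (sparse).
dense_hits = {"c1": 0.82, "c2": 0.75, "c3": 0.40}
sparse_hits = {"c2": 11.3, "c4": 9.8, "c3": 2.1}

fused = fuse_scores(dense_hits, sparse_hits)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

In this toy example, `c2` ranks first because both retrievers agree on it, while `c1` and `c4` are each supported by only one retriever.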

Query expansion

The query expansion layer addresses vocabulary mismatch.

"training" → ["reward", "PPO", "policy learning", ...]

Pseudo-relevance feedback then extracts additional expansion terms dynamically from top-ranked chunks.
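The two expansion steps can be sketched as follows. This is a simplified illustration: the synonym table, the stopword list, and the function names are hypothetical, and the feedback step here simply picks the most frequent new terms from the top-ranked chunks, which may differ from how the project actually weights terms.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical synonym table; the project's real vocabulary is not shown here.
SYNONYMS: Dict[str, List[str]] = {
    "training": ["reward", "PPO", "policy learning"],
    "dataset": ["benchmark", "experimental setup"],
}

# Small illustrative stopword list.
STOPWORDS = {
    "the", "a", "an", "of", "is", "are", "and", "to",
    "in", "for", "with", "on", "what", "how",
}


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_with_synonyms(query: str) -> List[str]:
    """Return the query tokens plus any synonyms registered for them."""
    terms = tokenize(query)
    expanded = list(terms)
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def pseudo_relevance_terms(
    query: str, top_chunks: List[str], n_terms: int = 3
) -> List[str]:
    """Pick the most frequent non-stopword terms in the top-ranked chunks
    that do not already appear in the query."""
    query_terms = set(tokenize(query))
    counts = Counter(
        tok
        for chunk in top_chunks
        for tok in tokenize(chunk)
        if tok not in STOPWORDS and tok not in query_terms
    )
    return [term for term, _ in counts.most_common(n_terms)]


query = "training incentives"
top_chunks = [
    "The reward function penalizes collisions.",
    "The reward is shaped with a collision penalty.",
]
print(expand_with_synonyms(query))
print(pseudo_relevance_terms(query, top_chunks, n_terms=1))
```

The expanded terms would then be appended to the original query before the re-retrieval pass.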

Section-aware ranking

The system maps query intent to likely document sections.

For example:

Reward queries → "reward", "3.1"
Dataset queries → "dataset", "4.2"
Conclusion queries → "conclusion", "6."

Matching sections receive additional ranking boosts during final scoring.

This noticeably improved retrieval quality for academic papers, whose section structure is highly repetitive from one paper to the next.
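A minimal sketch of how such a section prior could be applied is shown below. The mapping mirrors the examples above, but the boost value, the function names, and the plain substring matching are assumptions. Substring matching is crude (the hint "3.1" would also match a section numbered "13.1"), so a real implementation would likely match section numbers more strictly.

```python
from typing import Dict, List, Tuple

# Intent keyword -> section hints, mirroring the examples in the text.
SECTION_PRIORS: Dict[str, List[str]] = {
    "reward": ["reward", "3.1"],
    "dataset": ["dataset", "4.2"],
    "conclusion": ["conclusion", "6."],
}


def section_boost(query: str, section_title: str, boost: float = 0.2) -> float:
    """Return `boost` if the query intent matches the chunk's section, else 0.

    The 0.2 default is a hypothetical value chosen for illustration.
    """
    query_l, title_l = query.lower(), section_title.lower()
    for intent, hints in SECTION_PRIORS.items():
        if intent in query_l and any(hint in title_l for hint in hints):
            return boost
    return 0.0


def apply_section_priors(
    query: str, candidates: List[Tuple[str, str, float]]
) -> List[Tuple[str, float]]:
    """candidates: (chunk_id, section_title, base_score) triples.

    Returns (chunk_id, boosted_score) pairs sorted best first.
    """
    rescored = [
        (chunk_id, score + section_boost(query, title))
        for chunk_id, title, score in candidates
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


candidates = [
    ("c1", "References", 0.55),
    ("c2", "4.2 Dataset and Experimental Setup", 0.50),
]
reranked = apply_section_priors("Which dataset was used for evaluation?", candidates)
print(reranked)
```

Here the reference-list chunk starts with the higher base score, but the boost moves the experimental setup chunk ahead of it.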

Decisions worth calling out

Hybrid retrieval over embedding-only search — dense retrieval alone was inconsistent for technical terminology.
Cross-encoder reranking only at final stages — reranking everything would have been computationally expensive.
Section priors as explicit heuristics — embeddings often fail to capture structural intent reliably.
Noise filtering before reranking — removing references and figure captions improved reranker quality significantly.
Low-temperature generation — the LLM layer prioritizes factual grounding over creativity.
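The noise-filtering decision above can be illustrated with a few heuristics. This is only a sketch: the regular expressions, the minimum word count, and the 0.6 letter-ratio threshold are assumptions, and the actual filter may use different rules.

```python
import re
from typing import List

# Heuristic patterns (assumptions): each one flags a kind of low-signal chunk.
FIGURE_CAPTION = re.compile(r"^\s*(figure|fig\.|table)\s*\d+", re.IGNORECASE)
REFERENCE_ENTRY = re.compile(r"^\s*\[\d+\]\s")  # e.g. "[12] A. Author ..."
MIN_WORDS = 8  # hypothetical threshold for "too short to be useful"
MIN_ALPHA_RATIO = 0.6  # hypothetical threshold for equation-heavy text


def is_noise(chunk: str) -> bool:
    """Return True for captions, reference entries, very short chunks,
    and chunks that are mostly symbols and digits."""
    text = chunk.strip()
    if FIGURE_CAPTION.match(text) or REFERENCE_ENTRY.match(text):
        return True
    if len(text.split()) < MIN_WORDS:
        return True
    # Share of characters that are letters or whitespace.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) < MIN_ALPHA_RATIO


def filter_noise(chunks: List[str]) -> List[str]:
    """Keep only the chunks that pass every heuristic."""
    return [chunk for chunk in chunks if not is_noise(chunk)]


chunks = [
    "Figure 3: Success rate across dynamic obstacle scenarios in simulation.",
    "[12] J. Schulman et al. Proximal policy optimization algorithms. 2017.",
    "L = -0.5 * (x_1 - y_1)^2 + 0.1 * ||w||^2 + 3.0 * z (7)",
    "We train the policy with PPO and a reward that penalizes collisions with moving obstacles.",
]
kept = filter_noise(chunks)
print(kept)
```

Running the filter before the cross-encoder means the reranker never spends compute on chunks that could not have answered the question anyway.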

Trade-offs I made

The biggest trade-off was latency versus retrieval quality.

Multi-stage retrieval pipelines improve answer accuracy substantially, but every stage adds compute and orchestration overhead.

Cross-encoder reranking is the clearest example: it improves relevance more than any other stage, but it is also the slowest part of the retrieval stack.
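One way to bound that cost is to rerank only the top few candidates and leave the rest in their first-stage order. The sketch below assumes a pluggable scoring function. `overlap_score` is a toy stand-in so the example runs without a model; in practice the scorer would wrap a cross-encoder. The function names and the `depth` parameter are illustrative.

```python
from typing import Callable, List

ScoreFn = Callable[[str, str], float]


def rerank_top_k(
    query: str, candidates: List[str], score_fn: ScoreFn, depth: int = 20
) -> List[str]:
    """Rescore only the first `depth` candidates.

    Candidates beyond `depth` keep their original order and are never
    scored, which caps the number of expensive scorer calls.
    """
    head, tail = list(candidates[:depth]), list(candidates[depth:])
    head.sort(key=lambda passage: score_fn(query, passage), reverse=True)
    return head + tail


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: share of query tokens in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


candidates = [
    "reward function definition",
    "the dynamic obstacle scenario uses moving pedestrians",
    "related work on navigation",
    "dynamic obstacle scenario appendix",
]
reranked = rerank_top_k(
    "dynamic obstacle scenario", candidates, overlap_score, depth=3
)
print(reranked)
```

The last candidate is a strong match but falls outside the reranking depth, so it stays at the bottom. That is the trade-off in miniature: a shallower depth is faster, but relevant chunks that the first stage ranked poorly are never rescued.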

I also intentionally constrained the generator with strict anti-hallucination prompting, which occasionally produces incomplete answers instead of speculative ones.

I preferred conservative generation over confident hallucination.
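A prompt in this spirit might look like the sketch below. The instruction wording, the character budget, and the `build_prompt` helper are assumptions rather than the project's actual prompt. The resulting string would be sent to the model with a low temperature setting.

```python
from typing import List, Tuple

# Hypothetical grounding instructions; the project's real prompt is not shown.
SYSTEM_INSTRUCTIONS = (
    "Answer using only the numbered context passages. "
    "Cite passage numbers like [1]. "
    "If the context does not contain the answer, reply exactly: "
    "'Not found in the provided context.'"
)


def build_prompt(
    question: str, passages: List[Tuple[str, str]], max_chars: int = 4000
) -> str:
    """Assemble a grounded prompt from (section_title, text) pairs.

    Passages are expected best first. Adding stops at the first passage
    that would exceed the character budget.
    """
    blocks, used = [], 0
    for i, (section, text) in enumerate(passages, start=1):
        block = f"[{i}] ({section}) {text.strip()}"
        if used + len(block) > max_chars:
            break
        blocks.append(block)
        used += len(block)
    context = "\n\n".join(blocks)
    return (
        f"{SYSTEM_INSTRUCTIONS}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How is the reward defined?",
    [
        ("3.1 Reward", "The reward penalizes collisions with obstacles."),
        ("Appendix A", "x" * 5000),  # too long for the budget, so it is dropped
    ],
)
print(prompt)
```

Giving the model an explicit fallback reply is what produces the occasional incomplete answer: when the retrieved context is thin, the model is told to say so instead of filling the gap.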

Evaluation & debugging

The system includes dedicated retrieval evaluation tooling.

Metrics include:

Recall@5
Mean Reciprocal Rank (MRR)
Hit Rate
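For reference, these three metrics can be computed as in the sketch below. The function names and the sample data are illustrative. Each query is represented by its ranked list of retrieved chunk IDs and the set of chunk IDs judged relevant.

```python
from typing import Dict, List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def hit_at_k(retrieved: List[str], relevant: Set[str], k: int = 5) -> float:
    """1 if at least one relevant chunk is in the top k, else 0."""
    return 1.0 if set(retrieved[:k]) & relevant else 0.0


def evaluate(runs: List[Tuple[List[str], Set[str]]], k: int = 5) -> Dict[str, float]:
    """Average each metric over all (retrieved, relevant) query runs."""
    if not runs:
        return {}
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        f"hit_rate@{k}": sum(hit_at_k(r, rel, k) for r, rel in runs) / n,
    }


# Two made-up queries: one partial success and one complete miss.
runs = [
    (["c3", "c1", "c9"], {"c1", "c2"}),
    (["c7", "c8"], {"c5"}),
]
metrics = evaluate(runs, k=5)
print(metrics)
```

Recall@k and hit rate differ only when a query has more than one relevant chunk: the first query above counts as a full hit but only half recall.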

I also built a failure analysis pipeline to inspect retrieval misses and ranking mistakes.

FAILED QUERY:
"What is the dynamic obstacle scenario?"
 
Expected:
Section 4.2
 
Retrieved:
Section 3.1 reward function

This made it significantly easier to improve ranking heuristics iteratively.
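A failure report like the one above can be produced by comparing the expected section with the ranked results for each test query. The following is a rough sketch; the `Failure` record and the `find_failures` helper are hypothetical names, and the second test case is made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Failure:
    query: str
    expected_section: str
    retrieved_sections: List[str]  # what the system returned in the top k
    expected_rank: Optional[int]   # 1-based rank of the expected section, None if absent


def find_failures(
    cases: List[Tuple[str, str, List[str]]], k: int = 1
) -> List[Failure]:
    """cases: (query, expected_section, ranked_sections) triples.

    A case fails when the expected section is not within the top k results.
    """
    failures = []
    for query, expected, ranked in cases:
        rank = ranked.index(expected) + 1 if expected in ranked else None
        if rank is None or rank > k:
            failures.append(Failure(query, expected, ranked[:k], rank))
    return failures


cases = [
    ("What is the dynamic obstacle scenario?", "4.2", ["3.1", "4.2", "5"]),
    ("How is the reward defined?", "3.1", ["3.1", "2"]),
]
for failure in find_failures(cases, k=1):
    print("FAILED QUERY:", failure.query)
    print("Expected:", failure.expected_section, "at rank", failure.expected_rank)
    print("Retrieved:", failure.retrieved_sections)
```

Recording the rank of the expected section separates two kinds of failure: a ranking mistake (the section was retrieved but placed too low) and a retrieval miss (it was never retrieved at all).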

Performance

The retrieval system supports:

FAISS approximate nearest-neighbor search
HNSW indexing for larger corpora
modular retrieval pipelines
configurable chunk sizes
adjustable reranking depth
tunable retrieval fusion weights

The architecture was designed to remain extensible rather than tightly coupled to one model provider or retrieval strategy.

What it taught me

This project changed how I think about retrieval systems.

A good RAG pipeline is rarely about a single model. Most of the performance gains come from orchestration — retrieval quality, ranking strategy, chunking logic, structural priors, and context construction.

It also taught me that academic document retrieval is fundamentally different from generic semantic search. Papers have structure, hierarchy, and recurring patterns that retrieval systems need to understand explicitly instead of treating everything as flat text.