RAG Pipeline
A multi-stage Retrieval-Augmented Generation pipeline for research papers combining hybrid retrieval, query expansion, section-aware ranking, and cross-encoder reranking for accurate technical Q&A.

RAG Pipeline started from a recurring issue I noticed while reading research papers:
Most retrieval systems can find words, but they struggle to understand technical context.
Traditional search pipelines often fail when:
- queries use terminology different from the paper
- important sections are buried behind irrelevant matches
- retrieval systems prioritize figure captions or references
- semantic search misses structurally important sections
- generated answers hallucinate outside the source material
I wanted to build a retrieval system that treated research papers as structured technical documents instead of plain text blobs.
The problem
Research papers contain dense technical information distributed across:
- introductions
- methodology sections
- experiments
- ablation studies
- appendices
- equations and figures
Basic embedding search often struggles because semantic similarity alone is not enough.
For example:
- a query about “training incentives” may never contain the word “reward” that the paper itself uses
- a query about datasets may retrieve references instead of experimental setup sections
- dense retrieval frequently ignores document structure entirely
What I built
- Hybrid dense + sparse retrieval — combines semantic embeddings with BM25 keyword retrieval.
- Query expansion engine — expands user queries using domain-specific synonyms and pseudo-relevance feedback.
- Section-aware ranking — injects document structure into retrieval scoring.
- Cross-encoder reranking — deep relevance scoring for final candidate selection.
- Noise filtering pipeline — removes references, figure captions, equations, and low-signal chunks.
- LLM-powered answer generation — retrieved context passed into Groq-hosted Llama models with anti-hallucination prompting.
- Evaluation & failure analysis tooling — retrieval debugging pipeline for analyzing ranking failures.
The architecture
The system uses a multi-stage retrieval pipeline instead of a single retrieval pass.
User Query
↓
Query Expansion
↓
Dense Search (FAISS)
+
Sparse Search (BM25)
↓
Pseudo-Relevance Feedback
↓
Re-retrieval
↓
Section Prior Injection
↓
Noise Filtering
↓
Cross-Encoder Reranking
↓
Context Builder
↓
LLM Answer Generation
Each stage exists to solve a specific retrieval weakness.
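The staged flow above can be pictured as a list of functions applied in order to a shared state. The following is only a minimal sketch of that idea, not the project's orchestration code: `run_pipeline`, the three placeholder stages, and the toy chunks are all hypothetical names introduced for illustration. It also records per-stage wall-clock time, since each added stage costs latency.

```python
import time

def run_pipeline(query, stages):
    """Run each (name, function) stage in order and record how long it took."""
    state = {"query": query}
    timings = []
    for name, stage in stages:
        start = time.perf_counter()
        state = stage(state)
        timings.append((name, time.perf_counter() - start))
    return state, timings

# Placeholder stages (hypothetical): each takes the state dict and returns an updated copy.
def expand_query(state):
    return {**state, "terms": state["query"].lower().split()}

def retrieve(state):
    toy_chunks = ["reward shaping details", "dataset description", "related work"]
    hits = [c for c in toy_chunks if any(t in c for t in state["terms"])]
    return {**state, "candidates": hits}

def keep_top(state, depth=1):
    # Stand-in for reranking: only truncates to the configured depth.
    return {**state, "candidates": state["candidates"][:depth]}

final_state, stage_timings = run_pipeline(
    "reward dataset",
    [("query_expansion", expand_query), ("retrieval", retrieve), ("rerank", keep_top)],
)
```

Keeping stages as plain callables makes it straightforward to drop one out and measure what it contributes to quality and latency.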
The retrieval pipeline
The core retrieval engine combines several ranking strategies together.
Dense retrieval
Semantic embeddings generated using Sentence Transformers are indexed with FAISS for approximate nearest-neighbor search.
dense_index, embeddings = indexer.build_dense_index(chunks)
This handles semantic similarity well but can still miss structural intent.
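The internals of `build_dense_index` are not shown here. To illustrate what a dense index computes, the sketch below does exact cosine-similarity search over hand-written toy vectors. It is a stand-in for the idea only: it uses neither FAISS nor Sentence Transformers, and `normalize`, `dense_search`, and the vectors are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dense_search(query_vec, chunk_vecs, top_k=3):
    """Exact (brute-force) nearest-neighbor search by cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(chunk_vecs):
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), idx))
    # Highest similarity first; ties broken by chunk index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, score) for score, idx in scored[:top_k]]

# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_chunk_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_hits = dense_search([1.0, 0.1, 0.0], toy_chunk_vecs, top_k=2)
```

An approximate index trades a little of this exactness for speed on larger corpora.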
Sparse retrieval
BM25 retrieval runs in parallel to preserve exact keyword matching.
bm25 = indexer.build_bm25_index(chunks)
This improves precision for technical terminology and formulas.
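For readers unfamiliar with BM25, here is a small pure-Python sketch of the scoring function. It is not the code behind `build_bm25_index`: the `TinyBM25` class, the whitespace tokenizer, the `k1` and `b` defaults, and the toy chunks are assumptions. The IDF uses the non-negative `log(1 + ...)` variant.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

class TinyBM25:
    """Minimal Okapi BM25 scorer over a small in-memory corpus."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avg_len = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(term for d in self.docs for term in set(d))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, idx):
        tf, length = self.tfs[idx], len(self.docs[idx])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            freq = tf[term]
            # Length normalization: long chunks need more occurrences to score the same.
            denom = freq + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * freq * (self.k1 + 1) / denom
        return total

    def search(self, query, top_k=3):
        scores = [(self.score(query, i), i) for i in range(len(self.docs))]
        scores.sort(key=lambda p: (-p[0], p[1]))
        return [(i, s) for s, i in scores[:top_k] if s > 0]

toy_docs = [
    "the reward function uses ppo clipping",
    "we evaluate on the dynamic obstacle dataset",
    "references and related work",
]
bm25_hits = TinyBM25(toy_docs).search("ppo reward")
```

Because scoring depends on exact token overlap, a rare term such as an algorithm name contributes a high IDF weight, which is the behavior the keyword side of a hybrid retriever is meant to preserve.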
Query expansion
The query expansion layer addresses vocabulary mismatch.
"training" → ["reward", "PPO", "policy learning", ...]
Pseudo-relevance feedback then extracts additional expansion terms dynamically from top-ranked chunks.
Section-aware ranking
The system maps query intent to likely document sections.
For example:
Reward queries → "reward", "3.1"
Dataset queries → "dataset", "4.2"
Conclusion queries → "conclusion", "6."
Matching sections receive additional ranking boosts during final scoring.
This dramatically improved retrieval quality for structurally repetitive academic papers.
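A minimal sketch of how such a section prior might be applied, assuming each candidate chunk carries a section title and a retrieval score. The intent-to-marker table copies the three examples above. The additive `boost` value, the candidate dictionary layout, and the substring matching are hypothetical.

```python
# Intent keyword -> markers expected in the title of a relevant section.
SECTION_PRIORS = {
    "reward": ["reward", "3.1"],
    "dataset": ["dataset", "4.2"],
    "conclusion": ["conclusion", "6."],
}

def apply_section_prior(query, candidates, boost=0.2):
    """Add a fixed bonus to candidates whose section title matches the query intent."""
    q = query.lower()
    markers = [m for intent, ms in SECTION_PRIORS.items() if intent in q for m in ms]
    rescored = []
    for cand in candidates:
        title = cand["section"].lower()
        bonus = boost if any(m in title for m in markers) else 0.0
        rescored.append({**cand, "score": cand["score"] + bonus})
    return sorted(rescored, key=lambda c: -c["score"])

toy_candidates = [
    {"id": "c1", "section": "3.1 Reward Function", "score": 0.55},
    {"id": "c2", "section": "4.2 Dataset and Scenarios", "score": 0.50},
]
reranked = apply_section_prior("Which dataset is used for evaluation?", toy_candidates)
```

Plain substring matching is fragile: a marker like "3.1" would also match a section numbered "13.1". Hard-coded section numbers also only hold for papers that share the same layout, so keyword markers are likely the more portable half of the heuristic.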
Decisions worth calling out
- Hybrid retrieval over embedding-only search — dense retrieval alone was inconsistent for technical terminology.
- Cross-encoder reranking only at final stages — reranking everything would have been computationally expensive.
- Section priors as explicit heuristics — embeddings often fail to capture structural intent reliably.
- Noise filtering before reranking — removing references and figure captions improved reranker quality significantly.
- Low-temperature generation — the LLM layer prioritizes factual grounding over creativity.
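The noise-filtering decision above could be approximated with a few cheap heuristics run before the reranker sees any text. The patterns and thresholds below are assumptions chosen for illustration (`is_noise`, `min_words`, and `max_symbol_ratio` are hypothetical), not the project's actual rules.

```python
import re

# Lines that look like numbered bibliography entries, e.g. "[12] A. Author ..."
REFERENCE_RE = re.compile(r"^\s*\[\d+\]\s+\S")
# Lines that look like captions, e.g. "Figure 3: ..." or "Table 2. ..."
CAPTION_RE = re.compile(r"^\s*(figure|fig\.|table)\s+\d+", re.IGNORECASE)

def is_noise(chunk, min_words=8, max_symbol_ratio=0.25):
    """Return True for chunks that look like references, captions, equations, or stubs."""
    text = chunk.strip()
    if REFERENCE_RE.match(text) or CAPTION_RE.match(text):
        return True
    if len(text.split()) < min_words:
        return True  # too short to carry much signal
    # Equation-like chunks have an unusually high share of non-alphanumeric characters.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) > max_symbol_ratio

toy_chunks = [
    "[12] A. Author and B. Author. A placeholder reference title, 2020.",
    "Figure 3: Success rate across obstacle densities.",
    "L = E[ r_t * A_t ] - c1 * (V - R)^2 + c2 * H",
    "See Appendix B.",
    "The reward function penalizes collisions and rewards progress toward the goal at every step.",
]
kept = [c for c in toy_chunks if not is_noise(c)]
```

Filtering before reranking also reduces the number of pairs the cross-encoder has to score, which helps the latency budget.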
Trade-offs I made
The biggest trade-off was latency versus retrieval quality.
Multi-stage retrieval improves answer accuracy substantially, but every stage adds compute and orchestration overhead.
Cross-encoder reranking in particular improves relevance, but it is also the slowest part of the retrieval stack.
I also intentionally constrained the generator with strict anti-hallucination prompting, which occasionally produces incomplete answers instead of speculative ones.
I preferred conservative generation over confident hallucination.
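One way to express that conservative behavior is in the prompt itself. The sketch below builds a grounded prompt from retrieved chunks. The instruction wording, the refusal sentence, the `build_prompt` name, the chunk layout, and the temperature value are all assumptions; the actual prompt and Groq client call are not shown in this write-up.

```python
def build_prompt(question, chunks, max_chunks=3):
    """Assemble numbered context plus instructions that forbid answering beyond it."""
    context = "\n\n".join(
        f"[{i}] ({c['section']}) {c['text']}"
        for i, c in enumerate(chunks[:max_chunks], start=1)
    )
    return (
        "Answer the question using only the numbered context below.\n"
        "Cite the context numbers you used. If the context does not contain "
        'the answer, reply exactly: "Not stated in the provided context."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical generation settings; a low temperature favors grounded, repeatable output.
GENERATION_SETTINGS = {"temperature": 0.1}

toy_context = [
    {"section": "4.2 Scenarios", "text": "Toy text describing an evaluation scenario."},
    {"section": "3.1 Reward", "text": "Toy text describing a reward term."},
    {"section": "5 Results", "text": "Toy text describing results."},
]
prompt = build_prompt("What is the dynamic obstacle scenario?", toy_context, max_chunks=2)
```

Giving the model an explicit refusal sentence makes incomplete answers easy to detect downstream, at the cost of occasionally refusing questions that the context answers only indirectly.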
Evaluation & debugging
The system includes dedicated retrieval evaluation tooling.
Metrics include:
- Recall@5
- Mean Reciprocal Rank (MRR)
- Hit Rate
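The three metrics above follow standard definitions and can be computed from ranked chunk IDs and a set of relevant IDs per query. The function names and toy runs below are hypothetical. Hit rate is taken here to mean "at least one relevant chunk in the top k", which is one common reading.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k=5):
    """Average the per-query metrics over (retrieved_ids, relevant_ids) pairs."""
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        "hit_rate": sum(1.0 for r, rel in runs if set(r[:k]) & set(rel)) / n,
    }

toy_runs = [
    (["c3", "c1", "c9"], {"c1"}),  # relevant chunk found at rank 2
    (["c4", "c5"], {"c7"}),        # relevant chunk never retrieved
]
metrics = evaluate(toy_runs, k=5)
```

For these two toy queries the result is recall@5 of 0.5, MRR of 0.25, and a hit rate of 0.5.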
I also built a failure analysis pipeline to inspect retrieval misses and ranking mistakes.
FAILED QUERY:
"What is the dynamic obstacle scenario?"
Expected:
Section 4.2
Retrieved:
Section 3.1 reward function
This made it significantly easier to improve ranking heuristics iteratively.
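A report in that shape could be produced by a small helper like the one below. It is a sketch only: `failure_report` is a hypothetical name, and treating any top-1 section mismatch as a failure is an assumption about how a miss is defined.

```python
def failure_report(query, expected_section, retrieved_sections):
    """Return a text report if the top result misses the expected section, else None."""
    if retrieved_sections and retrieved_sections[0] == expected_section:
        return None
    top = retrieved_sections[0] if retrieved_sections else "(nothing retrieved)"
    return "\n".join([
        "FAILED QUERY:",
        f'"{query}"',
        "Expected:",
        expected_section,
        "Retrieved:",
        top,
    ])

report = failure_report(
    "What is the dynamic obstacle scenario?",
    "Section 4.2",
    ["Section 3.1 reward function", "Section 4.2"],
)
no_report = failure_report("toy query", "Section 4.2", ["Section 4.2"])
```

Collecting these reports across an evaluation set makes it easier to see whether misses cluster around a particular section type or query phrasing.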
Performance
The retrieval system supports:
- FAISS approximate nearest-neighbor search
- HNSW indexing for larger corpora
- modular retrieval pipelines
- configurable chunk sizes
- adjustable reranking depth
- tunable retrieval fusion weights
The architecture was designed to remain extensible rather than tightly coupled to one model provider or retrieval strategy.
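As an illustration of what tunable fusion weights might control, the sketch below combines dense and sparse scores with min-max normalization followed by a weighted sum. The write-up does not say which fusion method is used, so the method, the default weights, and the names `min_max` and `fuse` are assumptions.

```python
def min_max(scores):
    """Rescale a {chunk_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse(dense, sparse, dense_weight=0.6, sparse_weight=0.4):
    """Weighted sum of normalized scores; a chunk missing from one list scores 0 there."""
    d, s = min_max(dense), min_max(sparse)
    fused = {
        cid: dense_weight * d.get(cid, 0.0) + sparse_weight * s.get(cid, 0.0)
        for cid in set(d) | set(s)
    }
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy scores on different scales: cosine similarity versus raw BM25.
toy_dense = {"c1": 0.82, "c2": 0.75, "c3": 0.40}
toy_sparse = {"c2": 7.1, "c4": 3.0}
ranked = fuse(toy_dense, toy_sparse)
```

Normalizing first matters because BM25 scores are unbounded while cosine similarities are not. Without it the sparse side would dominate regardless of the weights. Rank-based schemes such as reciprocal rank fusion are a common alternative when the score distributions are hard to compare.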
What it taught me
This project changed how I think about retrieval systems.
A good RAG pipeline is rarely about a single model. Most of the performance gains come from orchestration — retrieval quality, ranking strategy, chunking logic, structural priors, and context construction.
It also taught me that academic document retrieval is fundamentally different from generic semantic search. Papers have structure, hierarchy, and recurring patterns that retrieval systems need to understand explicitly instead of treating everything as flat text.