2024-05-17 • 7 min read
Can Retrieval-Augmented Generation (RAG) work reliably on real-world government PDFs in low-resource domains like public health in Ghana? That was the central question behind this project.
Ghana’s Ministry of Health publishes critical national policy documents as PDF files — spanning topics like antimicrobial resistance, malaria protocols, medical waste, and healthcare financing. These documents are comprehensive, but they’re not search-friendly. They lack structure, indexing, or semantic markup. Worse, many are lengthy, scanned, or inconsistently formatted.
This makes it difficult for health workers, researchers, or journalists to extract factual answers quickly — even for seemingly simple questions like:
While RAG has shown strong performance on benchmark datasets like Natural Questions or TriviaQA, these datasets are clean, structured, and mostly Western-centric. I wanted to explore how RAG behaves when applied to noisy, under-curated, and domain-specific documents in an African context.
I collected 36 PDF documents from moh.gov.gh, Ghana’s Ministry of Health portal. I filtered out unusable scans using word count heuristics and retained machine-readable documents for processing.
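A word count heuristic of this kind could be sketched as follows. The function name and both thresholds are illustrative assumptions, not the values used in this project, and the input is assumed to be the text already extracted from each page by whatever PDF library is in use.

```python
def is_machine_readable(page_texts, min_words_per_page=50, min_total_words=500):
    """Guess whether a PDF has a usable text layer.

    page_texts: list of strings, one per page, as returned by a PDF text extractor.
    Both thresholds are hypothetical and would need tuning on real documents.
    """
    if not page_texts:
        return False
    # Count whitespace-separated words on every page.
    word_counts = [len(text.split()) for text in page_texts]
    total_words = sum(word_counts)
    average_words = total_words / len(word_counts)
    # Image-only scans tend to yield little or no extractable text.
    return total_words >= min_total_words and average_words >= min_words_per_page
```

A document that fails the check could be set aside for OCR rather than discarded outright.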
I experimented with three chunking approaches:
All chunk logs were saved for retrieval analysis. Sentence-based and overlap chunking are reserved for follow-up evaluation.
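For readers unfamiliar with the terms, here is a minimal sketch of the two strategies named above: a word window with optional overlap, and sentence-based grouping. The function names, the default sizes, and the regex sentence splitter are assumptions made for illustration and may differ from what this project used.

```python
import re

def chunk_words(text, chunk_size=200, overlap=0):
    """Split text into windows of chunk_size words, sharing `overlap` words between neighbours."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def chunk_sentences(text, max_words=200):
    """Group whole sentences into chunks of at most max_words words where possible."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlap trades a larger index for a lower chance that an answer is cut in half at a chunk boundary. Sentence grouping keeps each chunk readable on its own, though a single sentence longer than `max_words` still becomes one oversized chunk.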
Embedded all chunks using sentence-transformers (MiniLM)
Indexed with FAISS
Used Hugging Face’s flan-t5-base for:
Truncated inputs to a max of 480 tokens for safety
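The retrieve-then-prompt flow above can be illustrated without any external libraries. This is a dependency-free stand-in, not the project's code: `cosine_top_k` does by brute force the kind of exhaustive similarity search a flat vector index performs, toy vectors take the place of MiniLM embeddings, the prompt wording is hypothetical, and a whitespace word count stands in for the model tokenizer. Note that it enforces the 480-token cap by dropping whole chunks rather than cutting text mid-sentence, which is a variation on plain truncation.

```python
import math

def cosine_top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunk vectors most similar to the query (cosine similarity)."""
    def unit(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else list(vec)

    q = unit(query_vec)
    scores = [(sum(a * b for a, b in zip(q, unit(v))), i) for i, v in enumerate(chunk_vecs)]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scores[:k]]

def whitespace_token_count(text):
    """Crude stand-in for a real tokenizer; real token counts are usually higher."""
    return len(text.split())

def build_prompt(question, chunks, max_tokens=480, count_tokens=whitespace_token_count):
    """Append retrieved chunks in rank order until the token budget would be exceeded."""
    # Hypothetical prompt template, not the one used in this project.
    header = f"Answer the question using the context.\nQuestion: {question}\nContext:"
    used = count_tokens(header)
    kept = []
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    if not kept:
        return header
    return header + "\n" + "\n".join(kept)
```

In a real pipeline, `count_tokens` would be replaced by a function that counts tokens with the generator's own tokenizer, since whitespace words undercount subword tokens.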
I manually annotated 10 gold QA pairs drawn directly from the PDFs. These pairs served as a stable benchmark to compare:
Each answer was rated on:
| Metric | Baseline | RAG | Gold |
| ------------- | -------- | ---- | ---- |
| Accuracy | 0.45 | 0.60 | 1.00 |
| Hallucination | 0.10 | 0.00 | 0.00 |
| Fluency | 1.00 | 1.00 | 1.00 |
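One plausible way to arrive at a table like this is to average per-answer ratings for each system. The sketch below assumes each answer receives a score between 0 and 1 on every metric and that the table reports the mean. The rating scale and the sample ratings are hypothetical placeholders, not the project's data.

```python
from statistics import mean

METRICS = ("accuracy", "hallucination", "fluency")

def summarize(ratings):
    """Average per-answer ratings for each system.

    ratings: {system_name: [{"accuracy": x, "hallucination": y, "fluency": z}, ...]}
    """
    return {
        system: {metric: round(mean(row[metric] for row in rows), 2) for metric in METRICS}
        for system, rows in ratings.items()
    }

# Hypothetical ratings for two answers from one made-up system.
example = {
    "demo_system": [
        {"accuracy": 1.0, "hallucination": 0.0, "fluency": 1.0},
        {"accuracy": 0.5, "hallucination": 0.0, "fluency": 1.0},
    ]
}
summary = summarize(example)
```

With only 10 questions, a single changed rating moves a mean by up to 0.10, so small gaps between systems should be read with caution.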
This wasn’t about pushing SOTA performance. It was about testing how modern QA tools behave outside benchmark environments, on real-world policy documents in a region and sector often ignored in NLP.