Testing RAG in the Wild: Ghana’s Health Policies as a Case Study

2024-05-17 · 7 min read

Can Retrieval-Augmented Generation (RAG) work reliably on real-world government PDFs in low-resource domains like public health in Ghana? That was the central question behind this project.

Context: The Problem with Policy Access in Low-Resource Domains

Ghana’s Ministry of Health publishes critical national policy documents as PDF files — spanning topics like antimicrobial resistance, malaria protocols, medical waste, and healthcare financing. These documents are comprehensive, but they’re not search-friendly. They lack structure, indexing, or semantic markup. Worse, many are lengthy, scanned, or inconsistently formatted.

This makes it difficult for health workers, researchers, or journalists to extract factual answers quickly, even for seemingly simple questions.

While RAG has shown strong performance in benchmark datasets like Natural Questions or TriviaQA, these datasets are clean, structured, and mostly Western-centric. I wanted to explore how RAG behaves when applied to noisy, under-curated, and domain-specific documents in an African context.


Approach: How I Built the QA Pipeline

Step 1: Real Document Scraping

I collected 36 PDF documents from moh.gov.gh, Ghana’s Ministry of Health portal. I filtered out unusable scans using word count heuristics and retained machine-readable documents for processing.
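The word-count heuristic for filtering can be sketched as a check over the text extracted from each page (e.g. with a library like pypdf). The function name and the 50-words-per-page threshold below are illustrative assumptions, not values stated in the post:

```python
def is_machine_readable(page_texts, min_words_per_page=50):
    """Return True if the average number of extracted words per page
    suggests the PDF contains real text rather than scanned images.
    Scanned pages typically yield empty or near-empty extractions."""
    if not page_texts:
        return False
    total_words = sum(len(text.split()) for text in page_texts)
    return total_words / len(page_texts) >= min_words_per_page
```

A document whose pages average well above the threshold is kept for processing; a scan that extracts to almost nothing is dropped.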

Step 2: Chunking Strategies

I experimented with three chunking approaches.

All chunk logs were saved for retrieval analysis. Sentence-based and overlap chunking are reserved for follow-up evaluation.
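One common strategy, fixed-size word chunking with overlap, can be sketched as follows; the chunk size and overlap values are illustrative defaults, not the exact settings used in this project:

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into fixed-size word chunks. Consecutive chunks share
    `overlap` words so that sentences cut at a chunk boundary still
    appear whole in at least one chunk."""
    assert 0 <= overlap < chunk_size
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap is what distinguishes this from naive fixed-size splitting: it trades some index redundancy for better recall on facts that straddle a boundary.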

Step 3: Retriever + Generator
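The retrieve-then-generate step can be illustrated with a toy sketch. The bag-of-words retriever below is a pure-Python stand-in for whatever dense retriever the pipeline actually used (e.g. sentence embeddings), and `build_prompt` shows how retrieved chunks would be handed to a generator model; all function names here are hypothetical:

```python
import math
from collections import Counter

def embed(text):
    """Term-frequency vector; a toy stand-in for dense sentence embeddings."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=3):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, chunks, k=3):
    """Assemble the context-grounded prompt passed to the generator."""
    context = "\n\n".join(retrieve(question, chunks, k))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

Grounding the generator in retrieved chunks, rather than letting it answer from parametric memory alone, is what the baseline-vs-RAG comparison in the results isolates.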

Step 4: Manual Evaluation

I manually annotated 10 gold QA pairs drawn directly from the PDFs. These pairs served as a stable benchmark for comparing a no-retrieval baseline against the full RAG pipeline.

Each answer was rated on accuracy, hallucination, and fluency.

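Aggregating the manual ratings into per-run scores is straightforward; the sketch below assumes each judgment is binary (0/1), and the dict keys are illustrative:

```python
def score_run(annotations):
    """Average per-answer binary judgments into run-level metrics.
    Each annotation is a dict with 0/1 values for the three criteria."""
    n = len(annotations)
    return {
        "accuracy": sum(a["correct"] for a in annotations) / n,
        "hallucination": sum(a["hallucinated"] for a in annotations) / n,
        "fluency": sum(a["fluent"] for a in annotations) / n,
    }
```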

Results: What Surprised Me


| Metric        | Baseline | RAG  | Gold |
| ------------- | -------- | ---- | ---- |
| Accuracy      | 0.45     | 0.60 | 1.00 |
| Hallucination | 0.10     | 0.00 | 0.00 |
| Fluency       | 1.00     | 1.00 | 1.00 |

Performance Chart


Research Reflections

This wasn’t about pushing SOTA performance. It was about testing how modern QA tools behave outside benchmark environments, on real-world policy documents in a region and sector often ignored in NLP.

Key insights:


Future Work