RAG Evaluation Pipeline

This project implements a Retrieval-Augmented Generation (RAG) evaluation pipeline that measures the quality of LLM responses, focusing on hallucination rate and answer faithfulness, using the Ragas framework and a ground-truth dataset.

Problem Statement

Large Language Models (LLMs) integrated into RAG systems can generate responses that are plausible but factually incorrect or inconsistent with the provided source context. This phenomenon, known as "hallucination," undermines the reliability of RAG applications in production. This pipeline provides a systematic, automated way to quantify and monitor hallucination rates against ground-truth data, alongside answer relevancy, context precision, and context recall. The goal is to verify the factual accuracy and faithfulness of LLM outputs so that developers can build more dependable RAG systems.

Architecture Diagram

+-------------------+      +---------------------+      +---------------------+
|    Data Layer     |      |    RAG Pipeline     |      |   Evaluation Layer  |
|-------------------|      |---------------------|      |---------------------|
| - Ground-truth    |      | - Document Chunking |      | - Ragas Dataset     |
|   Dataset Loader  |<-----| - Embedding Gen.    |<-----|   Conversion        |
|   (CSV/JSON)      |      | - Vector Store      |      | - Run Ragas Metrics |
| - Document Corpus |      |   Indexing          |      |   (Faithfulness,    |
|   Loader          |      | - Retriever         |      |   Answer Relevancy, |
| - Dataset Schema  |      | - LLM Response      |      |   Context Precision,|
|   Validation      |      |   Generation        |      |   Context Recall)   |
+-------------------+      +---------------------+      | - Hallucination     |
         |                                               |   Score (1 - Faith.)|
         |                                               +----------^----------+
         |                                                          |
         +----------------------------------------------------------+
                                     |
                                     v
                           +---------------------+
                           |      Reporting      |
                           |---------------------|
                           | - Aggregate Scores  |
                           | - Global Metrics    |
                           |   Summary           |
                           | - Export Results    |
                           |   (JSON + CSV)      |
                           +---------------------+

How Ragas Evaluates Hallucinations

Ragas defines Faithfulness as the degree to which the generated answer is grounded in the provided context. A high faithfulness score indicates that the LLM's response can be directly attributed to the retrieved documents, which implies a low level of hallucination.

The hallucination rate in this pipeline is explicitly defined and computed as: hallucination = 1 - faithfulness

For example, a faithfulness score of 0.9 gives a hallucination score of 0.1 (10%). Ragas computes faithfulness in three steps:

Fact Extraction: It first extracts atomic facts from the generated answer.
Context Verification: For each extracted fact, it checks if that fact is supported by the retrieved context.
Faithfulness Score: The faithfulness score is the ratio of supported facts to the total number of facts extracted from the answer.
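
As a worked illustration of the ratio above, the following minimal sketch computes faithfulness and the derived hallucination score from a list of per-fact verdicts. It does not call Ragas: the verdicts are supplied by hand, and the names faithfulness_score and hallucination_score are hypothetical. Returning NaN for an answer with no extracted facts is an assumption of this sketch, not documented pipeline behavior.

```python
import math
from typing import Sequence


def faithfulness_score(verdicts: Sequence[bool]) -> float:
    """Ratio of supported facts to all facts extracted from one answer."""
    if not verdicts:
        # Assumption: with no facts to check, the score is undefined.
        return math.nan
    return sum(verdicts) / len(verdicts)


def hallucination_score(faithfulness: float) -> float:
    """Hallucination as defined in this README: 1 - faithfulness."""
    return 1.0 - faithfulness


# Four facts extracted from an answer; three are supported by the context.
verdicts = [True, True, True, False]
faithfulness = faithfulness_score(verdicts)
hallucination = hallucination_score(faithfulness)
print(faithfulness, hallucination)  # 0.75 0.25
```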

Setup Instructions

Clone the repository:

git clone https://github.com/your-repo/rag-evaluation-pipeline.git
cd rag-evaluation-pipeline

Create a virtual environment (Python 3.10+ recommended):

python -m venv venv
# On Windows
.\venv\Scripts\activate
# On macOS/Linux
source venv/bin/activate

Install dependencies:
```
pip install -r requirements.txt
```

Configure environment variables: Copy the .env.example file to .env and fill in your API keys.

cp .env.example .env
# Open .env and add your OpenAI or Azure OpenAI keys

Example .env content:

OPENAI_API_KEY="sk-YOUR_OPENAI_API_KEY"

# Or for Azure OpenAI
# AZURE_OPENAI_API_KEY="your_azure_openai_api_key_here"
# AZURE_OPENAI_ENDPOINT="https://your-resource-name.openai.azure.com/"
# AZURE_OPENAI_API_VERSION="2023-05-15"
# AZURE_OPENAI_DEPLOYMENT_NAME="your-gpt-35-turbo-deployment-name"
# AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME="your-text-embedding-ada-002-deployment-name"

VECTOR_STORE_PATH="./vectorstore"
DOCUMENT_PATH="./data/documents"
GROUND_TRUTH_PATH="./data/ground_truth"
EVALUATION_OUTPUT_PATH="./evaluation_results"

LLM_MODEL_NAME="gpt-3.5-turbo"
EMBEDDING_MODEL_NAME="text-embedding-ada-002"
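
One way such variables could be read in Python is sketched below. This is an illustration, not the contents of src/config/settings.py: the Settings dataclass and load_settings function are hypothetical, the defaults are copied from the example .env above, and the sketch assumes the variables are already present in the process environment (a .env file is not read automatically by os.environ).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values in the example .env."""
    api_key: str
    vector_store_path: str
    document_path: str
    ground_truth_path: str
    evaluation_output_path: str
    llm_model_name: str
    embedding_model_name: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Accept either an OpenAI key or an Azure OpenAI key.
    api_key = env.get("OPENAI_API_KEY") or env.get("AZURE_OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError(
            "Set OPENAI_API_KEY or AZURE_OPENAI_API_KEY before running."
        )
    return Settings(
        api_key=api_key,
        vector_store_path=env.get("VECTOR_STORE_PATH", "./vectorstore"),
        document_path=env.get("DOCUMENT_PATH", "./data/documents"),
        ground_truth_path=env.get("GROUND_TRUTH_PATH", "./data/ground_truth"),
        evaluation_output_path=env.get(
            "EVALUATION_OUTPUT_PATH", "./evaluation_results"
        ),
        llm_model_name=env.get("LLM_MODEL_NAME", "gpt-3.5-turbo"),
        embedding_model_name=env.get(
            "EMBEDDING_MODEL_NAME", "text-embedding-ada-002"
        ),
    )


# Usage with an explicit mapping, so the example does not need real keys.
settings = load_settings({"OPENAI_API_KEY": "sk-YOUR_OPENAI_API_KEY"})
print(settings.llm_model_name, settings.vector_store_path)
```

Failing early on a missing key surfaces the most common setup problem before any documents are loaded.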

Dataset Format Examples

Ground-truth Dataset (`data/ground_truth/ground_truth.csv`)

The ground-truth dataset should be a CSV or JSON file containing at least the following columns:

question: The query posed to the RAG system.
ground_truth_answer: The expected correct answer.
ground_truth_context: A relevant context snippet that supports the ground_truth_answer.

Example ground_truth.csv:

question,ground_truth_answer,ground_truth_context
"What is the capital of France?","Paris is the capital of France.","Paris is a major European city and the capital of France. It is known for its art, fashion, gastronomy, and culture."
"What is the largest ocean on Earth?","The Pacific Ocean is the largest ocean on Earth.","The Pacific Ocean is the largest and deepest of Earth's five oceanic divisions. It extends from the Arctic in the north to the Southern Ocean in the south."
"Who painted the Mona Lisa?","Leonardo da Vinci painted the Mona Lisa.","The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. Considered an archetypal masterpiece of the Italian Renaissance, it has been described as 'the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world'."
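
The column requirements above can be checked before any LLM calls are made. The sketch below uses only the standard csv module; the function name load_ground_truth and its error messages are hypothetical and are not taken from the project's own schema validation.

```python
import csv
import io
from typing import Dict, List, TextIO

REQUIRED_COLUMNS = ("question", "ground_truth_answer", "ground_truth_context")


def load_ground_truth(handle: TextIO) -> List[Dict[str, str]]:
    """Read ground-truth rows and fail fast on schema problems."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"ground-truth file is missing columns: {missing}")

    rows = []
    for index, row in enumerate(reader, start=1):
        # DictReader fills absent trailing fields with None.
        empty = [col for col in REQUIRED_COLUMNS if not (row[col] or "").strip()]
        if empty:
            raise ValueError(f"data row {index} has empty values for: {empty}")
        rows.append({col: row[col].strip() for col in REQUIRED_COLUMNS})
    return rows


# In-memory sample using the first row of the example above (context shortened).
sample = io.StringIO(
    "question,ground_truth_answer,ground_truth_context\n"
    '"What is the capital of France?","Paris is the capital of France.",'
    '"Paris is a major European city and the capital of France."\n'
)
rows = load_ground_truth(sample)
print(len(rows), rows[0]["question"])
```

For a file on disk, the same function can be called with open(path, newline="", encoding="utf-8").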

Document Corpus (`data/documents/`)

The document corpus can consist of various file types (e.g., .txt, .pdf, .csv, .json). These documents will be chunked and indexed into the vector store.

Example data/documents/france.txt:

France is a country located in Western Europe. Its capital city is Paris. Paris is a major global center for art, fashion, gastronomy, and culture. The Eiffel Tower is a famous landmark in Paris.
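
To make the chunking step concrete, here is a minimal fixed-size chunker with overlap. It is a stand-in for illustration only: the splitter the project actually uses is not specified here, and the names chunk_text, chunk_size, and chunk_overlap are assumptions of this sketch.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 80, chunk_overlap: int = 20) -> List[str]:
    """Split text into character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


document = (
    "France is a country located in Western Europe. Its capital city is Paris. "
    "Paris is a major global center for art, fashion, gastronomy, and culture. "
    "The Eiffel Tower is a famous landmark in Paris."
)
for chunk in chunk_text(document):
    print(repr(chunk))
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of indexing some text twice.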

How to Run Ingestion and Evaluation

To run the entire pipeline, execute the main.py script:

python src/main.py

This script will perform the following steps:

Load documents from data/documents/.
Chunk the documents and create/load a FAISS vector store in vectorstore/.
Initialize the RAG pipeline using the configured LLM and embeddings.
Load the ground-truth dataset from data/ground_truth/ground_truth.csv.
Generate RAG responses for each question in the ground-truth dataset.
Run Ragas evaluation metrics (faithfulness, answer relevancy, context precision, context recall).
Calculate the explicit hallucination score (1 - faithfulness).
Aggregate and export the detailed and summary results to evaluation_results/.

Example Output

After running python src/main.py, you will find two files in the evaluation_results/ directory:

evaluation_results_detailed.csv: Contains individual scores for each question.

question	answer	contexts	ground_truth_answer	ground_truth_context	faithfulness	answer_relevancy	context_precision	context_recall	hallucination_score
What is the capital of France?	Paris is the capital of France.	[Document(page_content='France is a country located in Western Europe. Its capital city is Paris...')]	Paris is the capital of France.	[Document(page_content='Paris is a major European city and the capital of France. It is known for its art, fashion, gastronomy, and culture.')]	1.0	1.0	1.0	1.0	0.0
What is the largest ocean on Earth?	The Pacific Ocean is the largest ocean on Earth.	[Document(page_content='The Pacific Ocean is the largest and deepest of Earth's oceanic divisions...')]	The Pacific Ocean is the largest ocean on Earth.	[Document(page_content='The Pacific Ocean is the largest and deepest of Earth's five oceanic divisions. It extends from the Arctic in the north to the Southern Ocean in the south.')]	1.0	1.0	1.0	1.0	0.0

evaluation_results_aggregated.json: Contains the mean scores for all metrics.

{
    "faithfulness": 0.95,
    "answer_relevancy": 0.98,
    "context_precision": 0.92,
    "context_recall": 0.90,
    "hallucination_score": 0.05
}

(Note: The example output above is illustrative. Actual scores will vary based on the LLM, embedding model, documents, and ground-truth data.)
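
The relationship between the detailed and aggregated files can be sketched as follows. The aggregate function is hypothetical and the per-question scores are made up for illustration; the sketch assumes every row carries a numeric value for each metric.

```python
import json
from statistics import mean
from typing import Dict, List

METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def aggregate(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all questions and derive hallucination."""
    if not rows:
        raise ValueError("no evaluation rows to aggregate")
    summary = {metric: mean(float(row[metric]) for row in rows) for metric in METRICS}
    # The mean of (1 - faithfulness) equals 1 - mean(faithfulness).
    summary["hallucination_score"] = 1.0 - summary["faithfulness"]
    return summary


# Illustrative per-question scores, not real results.
detailed = [
    {"faithfulness": 1.0, "answer_relevancy": 1.0,
     "context_precision": 1.0, "context_recall": 1.0},
    {"faithfulness": 0.5, "answer_relevancy": 1.0,
     "context_precision": 1.0, "context_recall": 1.0},
]
summary = aggregate(detailed)
print(json.dumps(summary, indent=4))
```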

How to Extend Metrics / Models

Adding New Ragas Metrics

To incorporate additional Ragas metrics:

Import the desired metric from ragas.metrics in src/evaluation/evaluator.py.
Add the new metric to the metrics list passed to the evaluate function in src/evaluation/evaluator.py.

Integrating Different LLMs or Embedding Models

LLMs:
- In src/rag/rag_pipeline.py, extend the get_llm function to support other langchain_community or langchain_openai LLMs.
- Ensure any new LLM implements LangChain's BaseChatModel or BaseLanguageModel interface.
- Update src/config/settings.py to include any new environment variables required for the new LLM.

Embedding Models:
- In src/ingestion/embedding.py, extend the get_embedding_model function to support other langchain_community or langchain_openai embedding models.
- Ensure any new embedding model implements LangChain's Embeddings interface.
- Update src/config/settings.py to include any new environment variables required for the new embedding model.

Changing Vector Store

To use a different vector store (e.g., Chroma instead of FAISS):

Modify src/ingestion/vector_store.py to use the new vector store's API (e.g., Chroma.from_documents and Chroma.as_retriever).
Update requirements.txt with the necessary package for the new vector store (chromadb).
Ensure compatibility with langchain_core.vectorstores.VectorStore.

Common Failure Modes and Pitfalls

Missing API Keys: The most common issue. Ensure all required API keys (OPENAI_API_KEY, AZURE_OPENAI_API_KEY, etc.) are correctly set in your .env file.
Incorrect File Paths: Verify that DOCUMENT_PATH, GROUND_TRUTH_PATH, VECTOR_STORE_PATH, and EVALUATION_OUTPUT_PATH in .env (and settings.py) point to the correct directories relative to the project root.
Empty Documents Directory: If data/documents/ is empty, the vector store will not be created, leading to retrieval errors.
Ground Truth Format Mismatch: Ensure your ground_truth.csv (or JSON) adheres to the expected column names (question, ground_truth_answer, ground_truth_context).
LLM Rate Limits: Large datasets might hit LLM API rate limits. Consider implementing retry mechanisms or reducing the dataset size for initial runs.
Out-of-Memory Errors: Processing very large document corpora or generating responses for extensive ground-truth datasets can consume significant memory. Adjust chunk_size, process the data in smaller batches, or run on a machine with more RAM.
Dependency Conflicts: If you encounter ModuleNotFoundError or similar issues, verify that all packages in requirements.txt are installed and that your virtual environment is activated.
Azure OpenAI Configuration: For Azure, ensure AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT_NAME, and AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME are correctly configured, matching your Azure OpenAI resource. The AZURE_OPENAI_API_VERSION should also be compatible.
Dangerous Deserialization Warning: The FAISS vector store is loaded with allow_dangerous_deserialization=True because the saved store includes a pickle file, and unpickling can execute arbitrary code. This is acceptable for a vector store you built locally from trusted data; do not load a vector store from an untrusted source this way.
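
For the rate-limit item above, a retry wrapper with exponential backoff is one possible mitigation. The sketch below is generic and uses only the standard library; call_with_retries and its parameters are hypothetical, and the exception types to retry on depend on the LLM client in use.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    *,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying with exponential backoff plus a little jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")


# Simulated call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated rate limit")
    return "ok"


# sleep is replaced with a no-op so the demonstration runs instantly.
result = call_with_retries(flaky_call, retry_on=(TimeoutError,), sleep=lambda seconds: None)
print(result, attempts["count"])  # ok 3
```

Narrowing retry_on to the client's rate-limit and timeout exceptions avoids retrying errors that will never succeed, such as an invalid API key.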
