This project implements a comprehensive Retrieval Augmented Generation (RAG) evaluation pipeline to measure key performance indicators of LLM responses, particularly focusing on hallucination rates and answer faithfulness, using the Ragas framework with ground-truth datasets.
Large Language Models (LLMs) integrated into RAG systems can sometimes generate responses that are plausible but factually incorrect or inconsistent with the provided source context. This phenomenon, known as "hallucination," undermines the reliability and trustworthiness of RAG applications in production environments. This pipeline addresses the critical need for a systematic and automated approach to quantify and monitor hallucination rates, alongside other essential metrics like answer relevancy and context utilization, against ground-truth data. The goal is to ensure the factual accuracy and faithfulness of LLM outputs, enabling developers to build more robust and dependable RAG systems.
+-------------------+      +---------------------+      +---------------------+
|    Data Layer     |      |    RAG Pipeline     |      |  Evaluation Layer   |
|-------------------|      |---------------------|      |---------------------|
| - Ground-truth    |      | - Document Chunking |      | - Ragas Dataset     |
|   Dataset Loader  |<-----| - Embedding Gen.    |<-----|   Conversion        |
|   (CSV/JSON)      |      | - Vector Store      |      | - Run Ragas Metrics |
| - Document Corpus |      |   Indexing          |      |   (Faithfulness,    |
|   Loader          |      | - Retriever         |      |   Answer Relevancy, |
| - Dataset Schema  |      | - LLM Response      |      |   Context Precision,|
|   Validation      |      |   Generation        |      |   Context Recall)   |
+-------------------+      +---------------------+      | - Hallucination     |
          |                                             |   Score (1 - Faith.)|
          |                                             +----------^----------+
          |                                                        |
          +--------------------------------------------------------+
                                      |
                                      v
                           +---------------------+
                           |      Reporting      |
                           |---------------------|
                           | - Aggregate Scores  |
                           | - Global Metrics    |
                           |   Summary           |
                           | - Export Results    |
                           |   (JSON + CSV)      |
                           +---------------------+
Ragas defines Faithfulness as the degree to which the generated answer is grounded in the provided context. A high faithfulness score indicates that the LLM's response can be directly attributed to the retrieved documents, thereby minimizing hallucination.
The hallucination rate in this pipeline is explicitly defined and computed as:
hallucination = 1 - faithfulness
For example, a faithfulness score of 0.9 gives a hallucination score of 0.1 (10%). Ragas computes faithfulness in three steps:
- Fact Extraction: It first extracts atomic facts from the generated answer.
- Context Verification: For each extracted fact, it checks if that fact is supported by the retrieved context.
- Faithfulness Score: The faithfulness score is the ratio of supported facts to the total number of facts extracted from the answer.
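The arithmetic behind these steps can be illustrated with a minimal sketch. It is not Ragas code: in Ragas, fact extraction and context verification are performed by an LLM, whereas here the `supported` flags are set by hand, and the `Claim` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """One atomic statement extracted from a generated answer."""
    text: str
    supported: bool  # hand-set here; an LLM judge decides this in Ragas


def faithfulness_score(claims: list[Claim]) -> float:
    """Ratio of context-supported claims to all extracted claims."""
    if not claims:
        # Assumption: with no extractable claims the ratio is undefined.
        return float("nan")
    return sum(c.supported for c in claims) / len(claims)


def hallucination_score(faithfulness: float) -> float:
    """Hallucination as defined in this pipeline: 1 - faithfulness."""
    return 1.0 - faithfulness


# Illustrative claims; the third one is deliberately unsupported.
claims = [
    Claim("Paris is the capital of France.", supported=True),
    Claim("France is in Western Europe.", supported=True),
    Claim("Paris has 20 million residents.", supported=False),
    Claim("The Eiffel Tower is in Paris.", supported=True),
]
faith = faithfulness_score(claims)
print(f"faithfulness={faith:.2f} hallucination={hallucination_score(faith):.2f}")
# prints: faithfulness=0.75 hallucination=0.25
```

Three of the four claims are supported, so faithfulness is 0.75 and the hallucination score is 0.25.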
- Clone the repository:
    git clone https://github.com/your-repo/rag-evaluation-pipeline.git
    cd rag-evaluation-pipeline
- Create a virtual environment (Python 3.10+ recommended):
    python -m venv venv
    # On Windows
    .\venv\Scripts\activate
    # On macOS/Linux
    source venv/bin/activate
- Install dependencies:
    pip install -r requirements.txt
- Configure environment variables:
  Copy the .env.example file to .env and fill in your API keys.
    cp .env.example .env
    # Open .env and add your OpenAI or Azure OpenAI keys
  Example .env content:
    OPENAI_API_KEY="sk-YOUR_OPENAI_API_KEY"
    # Or for Azure OpenAI
    # AZURE_OPENAI_API_KEY="your_azure_openai_api_key_here"
    # AZURE_OPENAI_ENDPOINT="https://your-resource-name.openai.azure.com/"
    # AZURE_OPENAI_API_VERSION="2023-05-15"
    # AZURE_OPENAI_DEPLOYMENT_NAME="your-gpt-35-turbo-deployment-name"
    # AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME="your-text-embedding-ada-002-deployment-name"
    VECTOR_STORE_PATH="./vectorstore"
    DOCUMENT_PATH="./data/documents"
    GROUND_TRUTH_PATH="./data/ground_truth"
    EVALUATION_OUTPUT_PATH="./evaluation_results"
    LLM_MODEL_NAME="gpt-3.5-turbo"
    EMBEDDING_MODEL_NAME="text-embedding-ada-002"
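As a rough illustration of how these variables might be consumed, the sketch below reads them from the process environment and falls back to the example values shown above. It assumes the variables have already been exported (for instance by a dotenv loader); the `Settings` class and `load_settings` function are hypothetical and are not the contents of src/config/settings.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in .env."""
    openai_api_key: Optional[str]
    azure_openai_api_key: Optional[str]
    vector_store_path: str
    document_path: str
    ground_truth_path: str
    evaluation_output_path: str
    llm_model_name: str
    embedding_model_name: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    settings = Settings(
        openai_api_key=env.get("OPENAI_API_KEY"),
        azure_openai_api_key=env.get("AZURE_OPENAI_API_KEY"),
        vector_store_path=env.get("VECTOR_STORE_PATH", "./vectorstore"),
        document_path=env.get("DOCUMENT_PATH", "./data/documents"),
        ground_truth_path=env.get("GROUND_TRUTH_PATH", "./data/ground_truth"),
        evaluation_output_path=env.get("EVALUATION_OUTPUT_PATH", "./evaluation_results"),
        llm_model_name=env.get("LLM_MODEL_NAME", "gpt-3.5-turbo"),
        embedding_model_name=env.get("EMBEDDING_MODEL_NAME", "text-embedding-ada-002"),
    )
    # Fail early if neither provider key is present.
    if not (settings.openai_api_key or settings.azure_openai_api_key):
        raise ValueError("Set OPENAI_API_KEY or AZURE_OPENAI_API_KEY in .env")
    return settings


demo = load_settings({"OPENAI_API_KEY": "sk-YOUR_OPENAI_API_KEY"})
print(demo.llm_model_name, demo.vector_store_path)
```

Passing a plain dictionary, as in the last lines, keeps the loader easy to test without touching the real environment.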
The ground-truth dataset should be a CSV or JSON file containing at least the following columns:
- question: The query posed to the RAG system.
- ground_truth_answer: The expected correct answer.
- ground_truth_context: A relevant context snippet that supports the ground_truth_answer.
Example ground_truth.csv:
question,ground_truth_answer,ground_truth_context
"What is the capital of France?","Paris is the capital of France.","Paris is a major European city and the capital of France. It is known for its art, fashion, gastronomy, and culture."
"What is the largest ocean on Earth?","The Pacific Ocean is the largest ocean on Earth.","The Pacific Ocean is the largest and deepest of Earth's five oceanic divisions. It extends from the Arctic in the north to the Southern Ocean in the south."
"Who painted the Mona Lisa?","Leonardo da Vinci painted the Mona Lisa.","The Mona Lisa is a half-length portrait painting by Italian artist Leonardo da Vinci. Considered an archetypal masterpiece of the Italian Renaissance, it has been described as 'the best known, the most visited, the most written about, the most sung about, the most parodied work of art in the world'."

The document corpus can consist of various file types (e.g., .txt, .pdf, .csv, .json). These documents will be chunked and indexed into the vector store.
Example data/documents/france.txt:
France is a country located in Western Europe. Its capital city is Paris. Paris is a major global center for art, fashion, gastronomy, and culture. The Eiffel Tower is a famous landmark in Paris.
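The ground-truth column requirements described earlier can be checked before any LLM call is made. The following is a minimal sketch using only the standard library; the `validate_ground_truth` function and `REQUIRED_COLUMNS` constant are hypothetical names, and the project's own schema validation may work differently.

```python
import csv
import io

REQUIRED_COLUMNS = ("question", "ground_truth_answer", "ground_truth_context")


def validate_ground_truth(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text and verify the required columns are present and non-empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = []
    for record_no, row in enumerate(reader, start=1):
        # DictReader fills short rows with None, so guard before stripping.
        empty = [c for c in REQUIRED_COLUMNS if not (row[c] or "").strip()]
        if empty:
            raise ValueError(f"record {record_no} has empty fields: {empty}")
        rows.append({c: row[c] for c in REQUIRED_COLUMNS})
    return rows


sample = (
    "question,ground_truth_answer,ground_truth_context\n"
    '"What is the capital of France?","Paris is the capital of France.",'
    '"Paris is a major European city and the capital of France."\n'
)
rows = validate_ground_truth(sample)
print(len(rows), rows[0]["question"])
```

Raising on the first bad record keeps the failure close to its cause; collecting all errors into a list would be a reasonable alternative for large datasets.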
To run the entire pipeline, execute the main.py script:
python src/main.py
This script will perform the following steps:
- Load documents from data/documents/.
- Chunk the documents and create/load a FAISS vector store in vectorstore/.
- Initialize the RAG pipeline using the configured LLM and embeddings.
- Load the ground-truth dataset from data/ground_truth/ground_truth.csv.
- Generate RAG responses for each question in the ground-truth dataset.
- Run Ragas evaluation metrics (faithfulness, answer relevancy, context precision, context recall).
- Calculate the explicit hallucination score (1 - faithfulness).
- Aggregate and export the detailed and summary results to evaluation_results/.
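To make the chunking step above more concrete, here is a simplified character-based chunker with overlap. It is only a stand-in for illustration: the pipeline's actual text splitter is not shown in this README, and the `chunk_text` function and its `chunk_overlap` parameter are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints: ['abcd', 'defg', 'ghij']
```

Smaller chunk sizes produce more, shorter chunks to embed and index; the overlap keeps sentences that straddle a boundary retrievable from either side.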
After running python src/main.py, you will find two files in the evaluation_results/ directory:
evaluation_results_detailed.csv: Contains individual scores for each question.
| question | answer | contexts | ground_truth_answer | ground_truth_context | faithfulness | answer_relevancy | context_precision | context_recall | hallucination_score |
|---|---|---|---|---|---|---|---|---|---|
| What is the capital of France? | Paris is the capital of France. | [Document(page_content='France is a country located in Western Europe. Its capital city is Paris...')] | Paris is the capital of France. | [Document(page_content='Paris is a major European city and the capital of France. It is known for its art, fashion, gastronomy, and culture.')] | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| What is the largest ocean? | The Pacific Ocean is the largest ocean. | [Document(page_content='The Pacific Ocean is the largest and deepest of Earth's oceanic divisions...')] | The Pacific Ocean is the largest ocean. | [Document(page_content='The Pacific Ocean is the largest and deepest of Earth's five oceanic divisions. It extends from the Arctic in the north to the Southern Ocean in the south.')] | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
evaluation_results_aggregated.json: Contains the mean scores for all metrics.
{
  "faithfulness": 0.95,
  "answer_relevancy": 0.98,
  "context_precision": 0.92,
  "context_recall": 0.90,
  "hallucination_score": 0.05
}
(Note: The example output above is illustrative. Actual scores will vary based on the LLM, embedding model, documents, and ground-truth data.)
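The relationship between the detailed and aggregated files can be sketched as follows: each row gains a hallucination score, and the summary is the per-metric mean. The `aggregate` helper and the two sample rows are hypothetical and only illustrate the arithmetic, not the project's reporting code.

```python
import json
from statistics import fmean

METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def add_hallucination(row: dict) -> dict:
    """Return a copy of the row with hallucination_score = 1 - faithfulness."""
    return {**row, "hallucination_score": 1.0 - row["faithfulness"]}


def aggregate(rows: list[dict]) -> dict:
    """Mean of every metric (plus hallucination_score) across all rows."""
    if not rows:
        raise ValueError("no rows to aggregate")
    scored = [add_hallucination(r) for r in rows]
    names = (*METRICS, "hallucination_score")
    return {name: round(fmean(r[name] for r in scored), 4) for name in names}


# Made-up per-question scores, for illustration only.
detailed = [
    {"faithfulness": 1.0, "answer_relevancy": 1.0, "context_precision": 1.0, "context_recall": 1.0},
    {"faithfulness": 0.5, "answer_relevancy": 0.9, "context_precision": 0.8, "context_recall": 0.7},
]
summary = aggregate(detailed)
print(json.dumps(summary, indent=2))
```

Because the mean is linear, the aggregated hallucination score always equals one minus the aggregated faithfulness, which matches the 0.95 and 0.05 pair in the illustrative output above.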
To incorporate additional Ragas metrics:
- Import the desired metric from ragas.metrics in src/evaluation/evaluator.py.
- Add the new metric to the metrics list passed to the evaluate function in src/evaluation/evaluator.py.
- LLMs:
  - In src/rag/rag_pipeline.py, the get_llm function can be extended to support other langchain_community or langchain_openai LLMs.
  - Ensure any new LLM is compatible with BaseChatModel or BaseLanguageModel.
  - Update src/config/settings.py to include any new environment variables required for the new LLM.
- Embedding Models:
  - In src/ingestion/embedding.py, the get_embedding_model function can be extended to support other langchain_community or langchain_openai embedding models.
  - Ensure any new embedding model is compatible with Embeddings.
  - Update src/config/settings.py to include any new environment variables required for the new embedding model.
To use a different vector store (e.g., Chroma instead of FAISS):
- Modify src/ingestion/vector_store.py to use the new vector store's API (e.g., Chroma.from_documents and Chroma.as_retriever).
- Update requirements.txt with the necessary package for the new vector store (chromadb).
- Ensure compatibility with langchain_core.vectorstores.VectorStore.
- Missing API Keys: The most common issue. Ensure all required API keys (OPENAI_API_KEY, AZURE_OPENAI_API_KEY, etc.) are correctly set in your .env file.
- Incorrect File Paths: Verify that DOCUMENT_PATH, GROUND_TRUTH_PATH, VECTOR_STORE_PATH, and EVALUATION_OUTPUT_PATH in .env (and settings.py) point to the correct directories relative to the project root.
- Empty Documents Directory: If data/documents/ is empty, the vector store will not be created, leading to retrieval errors.
- Ground Truth Format Mismatch: Ensure your ground_truth.csv (or JSON) adheres to the expected column names (question, ground_truth_answer, ground_truth_context).
- LLM Rate Limits: Large datasets might hit LLM API rate limits. Consider implementing retry mechanisms or reducing the dataset size for initial runs.
- Out-of-Memory Errors: Processing very large document corpuses or generating responses for extensive ground-truth datasets can consume significant memory. Adjust chunk_size or run on machines with more RAM.
- Dependency Conflicts: If you encounter ModuleNotFoundError or similar issues, verify that all packages in requirements.txt are installed and that your virtual environment is activated.
- Azure OpenAI Configuration: For Azure, ensure AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT_NAME, and AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME are correctly configured, matching your Azure OpenAI resource. The AZURE_OPENAI_API_VERSION should also be compatible.
- Dangerous Deserialization Warning: When loading FAISS vector stores, allow_dangerous_deserialization=True is used. This is acceptable for local, trusted data but should be handled with care in production if the vector store source is untrusted.
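For the rate-limit item in the list above, one possible retry mechanism is exponential backoff with jitter. The sketch below is generic and not part of this project: `with_retries` and `RateLimitError` are hypothetical names, and in practice `retry_on` would be set to whatever exception the LLM client raises when throttled.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on the given exceptions with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    raise RuntimeError("unreachable")


# Demo: a call that fails twice with a simulated rate limit, then succeeds.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


delays: list[float] = []  # record delays instead of actually sleeping
result = with_retries(flaky_call, retry_on=(RateLimitError,), sleep=delays.append)
print(result, attempts["count"], len(delays))
# prints: ok 3 2
```

Injecting the `sleep` function keeps the demo instant; in the pipeline the default `time.sleep` would be used around each LLM or evaluation call.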