RAG Evaluation with RAGAS and MLflow

Table of contents

Why This Tutorial?
What You'll Learn
Prerequisites
Setup and Configuration
- LLM Provider Configuration
Sample Knowledge Base
Minimal RAG Pipeline
- Enable MLflow Tracing
Load Golden Dataset
- Generate RAG Responses for Evaluation
RAGAS Evaluation with MLflow
- Prepare Evaluation Data
MLflow Results Analysis
- Interpreting RAGAS Scores
Common Pitfalls and Solutions
Extras: Comparing RAG Variants with MLflow
- Example: Comparing Chunk Sizes
- Comparing Results in MLflow UI
How to inspect results in MLflow UI:
More from MLflow
References, further reading

Why This Tutorial?

Evaluating RAG pipelines is surprisingly difficult. You can build a working retrieval system in an afternoon, but answering "Is it actually good?" requires systematic measurement.

The challenge: Manual evaluation doesn't scale. Eyeballing a few responses tells you almost nothing about overall quality. You need metrics that capture different aspects of RAG performance:

Is the retriever finding relevant documents?
Is the LLM staying faithful to the retrieved context (not hallucinating)?
Are the answers factually correct?

The solution: RAGAS provides standardized metrics for RAG evaluation. MLflow provides experiment tracking. Together, they enable systematic, reproducible evaluation that you can run on every pipeline change.

What made this tricky: The MLflow-RAGAS integration looks simple in the docs, but getting it to work with a real LangChain pipeline required navigating several non-obvious requirements:

Specific model URI formats for different providers
Function signatures that match MLflow's expectations
Proper tracing spans for context-aware metrics

This tutorial documents what actually works, including the gotchas I encountered along the way.

This tutorial demonstrates how to evaluate Retrieval-Augmented Generation (RAG) systems using RAGAS (Retrieval Augmented Generation Assessment) metrics through MLflow integration.

NOTE: RAGAS is a third-party evaluation library. For more details, visit the RAGAS GitHub repository. At the time of writing (January 2026), MLflow supports one other third-party scorer/evaluation library besides RAGAS: DeepEval.

What You'll Learn

Build a minimal RAG pipeline using LangChain and FAISS
Create a golden evaluation dataset with expected answers
Evaluate RAG quality using RAGAS metrics (Faithfulness, Context Precision, Context Recall, Factual Correctness)
Track results in MLflow for systematic comparison
Support multiple LLM providers: OpenAI, Azure OpenAI, and Ollama

Prerequisites

Python 3.10+
API key for your chosen LLM provider
Basic understanding of RAG concepts

Setup and Configuration

import os
import warnings
from enum import Enum

import pandas as pd

warnings.filterwarnings("ignore")

LLM Provider Configuration

This tutorial supports three LLM providers. Choose your provider and configure the appropriate environment variables:

Provider	Required Environment Variables
OpenAI	`OPENAI_API_KEY`
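Azure OpenAI	`AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY` (optional, with defaults in the code below: `AZURE_OPENAI_DEPLOYMENT_NAME`, `AZURE_OPENAI_EMBEDDING_DEPLOYMENT`, `AZURE_OPENAI_API_VERSION`)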
Ollama	None (runs locally on `http://localhost:11434`)

class LLMProvider(Enum):
    OPENAI = "openai"
    AZURE_OPENAI = "azure_openai"
    OLLAMA = "ollama"


# === CONFIGURE YOUR PROVIDER HERE ===
PROVIDER = LLMProvider.AZURE_OPENAI

# Model names per provider
MODEL_CONFIG = {
    LLMProvider.OPENAI: {
        "chat_model": "gpt-4o-mini",
        "embedding_model": "text-embedding-3-small",
    },
    LLMProvider.AZURE_OPENAI: {
        "chat_model": os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini"),
        "embedding_model": os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-3-small"),
    },
    LLMProvider.OLLAMA: {
        "chat_model": "llama3.2:3b",
        "embedding_model": "nomic-embed-text",
    },
}

print(f"Using provider: {PROVIDER.value}")
print(f"Chat model: {MODEL_CONFIG[PROVIDER]['chat_model']}")
print(f"Embedding model: {MODEL_CONFIG[PROVIDER]['embedding_model']}")

Using provider: azure_openai
Chat model: gpt-4o-mini
Embedding model: text-embedding-ada-002-v2

def validate_environment(provider: LLMProvider) -> None:
    """Validate required environment variables for the selected provider."""
    required_vars = {
        LLMProvider.OPENAI: ["OPENAI_API_KEY"],
        LLMProvider.AZURE_OPENAI: [
            "AZURE_OPENAI_ENDPOINT",
            "AZURE_OPENAI_API_KEY",
        ],
        LLMProvider.OLLAMA: [],
    }

    missing = [var for var in required_vars[provider] if not os.getenv(var)]
    if missing:
        raise EnvironmentError(
            f"Missing environment variables for {provider.value}: {missing}\n"
            f"Please set them before continuing."
        )
    
    # Set provider-specific env vars for litellm (used by RAGAS scorers)
    if provider == LLMProvider.OLLAMA:
        os.environ.setdefault("OLLAMA_API_BASE", "http://localhost:11434")
        print(f"OLLAMA_API_BASE set to: {os.environ['OLLAMA_API_BASE']}")
    elif provider == LLMProvider.AZURE_OPENAI:
        # Set litellm Azure env vars from Azure OpenAI vars
        if os.getenv("AZURE_OPENAI_API_KEY") and not os.getenv("AZURE_API_KEY"):
            os.environ["AZURE_API_KEY"] = os.environ["AZURE_OPENAI_API_KEY"]
        if os.getenv("AZURE_OPENAI_ENDPOINT") and not os.getenv("AZURE_API_BASE"):
            os.environ["AZURE_API_BASE"] = os.environ["AZURE_OPENAI_ENDPOINT"]
        os.environ.setdefault("AZURE_API_VERSION", os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-01"))
        print(f"Azure litellm env vars configured")
    
    print(f"Environment validated for {provider.value}")


validate_environment(PROVIDER)

Azure litellm env vars configured
Environment validated for azure_openai

from langchain_openai import ChatOpenAI, OpenAIEmbeddings, AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.chat_models import ChatOllama


def get_llm(provider: LLMProvider):
    """Factory function to create LLM instance based on provider."""
    config = MODEL_CONFIG[provider]

    if provider == LLMProvider.OPENAI:
        return ChatOpenAI(
            model=config["chat_model"],
            temperature=0,
        )
    elif provider == LLMProvider.AZURE_OPENAI:
        return AzureChatOpenAI(
            azure_deployment=config["chat_model"],
            api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-01"),
            temperature=0,
        )
    elif provider == LLMProvider.OLLAMA:
        return ChatOllama(
            model=config["chat_model"],
            temperature=0,
            base_url="http://localhost:11434",
        )


def get_embeddings(provider: LLMProvider):
    """Factory function to create embeddings instance based on provider."""
    config = MODEL_CONFIG[provider]

    if provider == LLMProvider.OPENAI:
        return OpenAIEmbeddings(model=config["embedding_model"])
    elif provider == LLMProvider.AZURE_OPENAI:
        return AzureOpenAIEmbeddings(
            azure_deployment=config["embedding_model"],
            api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-01"),
        )
    elif provider == LLMProvider.OLLAMA:
        return OllamaEmbeddings(
            model=config["embedding_model"],
            base_url="http://localhost:11434",
        )


def get_mlflow_model_uri(provider: LLMProvider) -> str:
    """Get MLflow model URI for RAGAS scorers, in the form <provider>:/<model>."""
    config = MODEL_CONFIG[provider]

    if provider == LLMProvider.OPENAI:
        return f"openai:/{config['chat_model']}"
    elif provider == LLMProvider.AZURE_OPENAI:
        # Azure format: azure:/<deployment_name>
        return f"azure:/{config['chat_model']}"
    elif provider == LLMProvider.OLLAMA:
        # Ollama format: ollama:/<model_name>
        # Note: the ollama_chat prefix has issues with litellm, so use the ollama prefix
        return f"ollama:/{config['chat_model']}"


llm = get_llm(PROVIDER)
embeddings = get_embeddings(PROVIDER)
mlflow_model_uri = get_mlflow_model_uri(PROVIDER)

print(f"LLM initialized: {type(llm).__name__}")
print(f"Embeddings initialized: {type(embeddings).__name__}")
print(f"MLflow model URI: {mlflow_model_uri}")

LLM initialized: AzureChatOpenAI
Embeddings initialized: AzureOpenAIEmbeddings
MLflow model URI: azure:/gpt-4o-mini

Sample Knowledge Base

We'll create a small knowledge base about MLflow - fitting for a tutorial that uses MLflow for evaluation! This dataset contains key concepts that our RAG system will retrieve from.

import json

# Load knowledge base from external file
with open("data/knowledge_base.json") as f:
    KNOWLEDGE_BASE = json.load(f)

print(f"Knowledge base contains {len(KNOWLEDGE_BASE)} documents")
for i, doc in enumerate(KNOWLEDGE_BASE, 1):
    preview = doc[:80].replace('\n', ' ')
    print(f"  {i}. {preview}...")

Knowledge base contains 20 documents
  1. MLflow Tracking is an API and UI for logging parameters, code versions, metrics,...
  2. The MLflow Model Registry is a centralized model store that provides model linea...
  3. MLflow GenAI provides specialized tools for developing and evaluating generative...
  4. RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework int...
  5. MLflow Projects package code in a reusable, reproducible form. A project is simp...
  6. MLflow's autolog feature automatically logs metrics, parameters, and models duri...
  7. The MLflow Model format is a standard format for packaging machine learning mode...
  8. Evaluation in MLflow can be performed using mlflow.evaluate() for traditional ML...
  9. MLflow Model Serving enables deploying models as REST API endpoints. You can ser...
  10. MLflow Recipes (formerly MLflow Pipelines) provide predefined templates for comm...
  11. The MLflow CLI provides commands for running projects, serving models, and manag...
  12. MLflow's REST API allows programmatic access to the tracking server. Endpoints i...
  13. MLflow experiments organize runs into logical groups. Each experiment has a uniq...
  14. MLflow provides run comparison capabilities through the UI and API. The Compare ...
  15. MLflow artifacts are files associated with runs, such as models, data files, and...
  16. Model signatures in MLflow define the expected input and output schema for model...
  17. MLflow on Databricks provides managed MLflow tracking, model registry, and model...
  18. MLflow supports multiple environment managers for reproducibility. Projects can ...
  19. MLflow Prompt Engineering tools help develop and version prompts for LLM applica...
  20. MLflow integrates with LangChain through mlflow.langchain module. The integratio...

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

documents = [Document(page_content=doc.strip()) for doc in KNOWLEDGE_BASE]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
splits = text_splitter.split_documents(documents)

print(f"Split into {len(splits)} chunks")

Split into 20 chunks
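
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a deliberately simplified sliding-window splitter. It is a sketch only: `sliding_window_chunks` is a hypothetical helper, and the real `RecursiveCharacterTextSplitter` additionally tries to break on natural separators such as paragraphs and sentences. It does show why short documents come out as a single chunk each, which is consistent with 20 documents producing 20 chunks above.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration: no separator handling, unlike the real splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]  # short documents stay in one piece
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


short_doc = "x" * 300   # hypothetical 300-character document
long_doc = "y" * 1200   # hypothetical 1200-character document
print(len(sliding_window_chunks(short_doc)))
print([len(chunk) for chunk in sliding_window_chunks(long_doc)])
```

With these settings, only documents longer than 500 characters would be split, and consecutive chunks would share 50 characters.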

vectorstore = FAISS.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

print(f"Vector store created with {vectorstore.index.ntotal} vectors")

Vector store created with 20 vectors

test_query = "What metrics does RAGAS provide?"
test_results = retriever.invoke(test_query)

print(f"Test query: '{test_query}'")
print(f"Retrieved {len(test_results)} documents:")
for i, doc in enumerate(test_results, 1):
    print(f"\n--- Document {i} ---")
    print(doc.page_content[:200] + "...")

Test query: 'What metrics does RAGAS provide?'
Retrieved 3 documents:

--- Document 1 ---
RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework integrated with MLflow for assessing RAG pipelines. Key metrics include: Faithfulness (measures if the answer is grounded i...

--- Document 2 ---
MLflow GenAI provides specialized tools for developing and evaluating generative AI applications. It includes mlflow.genai.evaluate() for systematic assessment of LLM outputs using configurable scorer...

--- Document 3 ---
Evaluation in MLflow can be performed using mlflow.evaluate() for traditional ML models or mlflow.genai.evaluate() for generative AI applications. For GenAI, evaluation uses Scorer objects that can be...
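
What `k=3` retrieval does can be sketched without FAISS: embed the query, measure its distance to every stored vector, and keep the three closest. The snippet below assumes Euclidean (L2) distance and uses made-up two-dimensional vectors; `top_k` is a hypothetical helper, not part of LangChain or FAISS.

```python
import math


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k document vectors closest to the query (L2 distance)."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Toy "embeddings" with hypothetical values, for illustration only
doc_vecs = [
    [0.9, 0.1],  # doc 0: close to the query
    [0.1, 0.9],  # doc 1
    [0.8, 0.3],  # doc 2: fairly close to the query
    [0.0, 1.0],  # doc 3: farthest from the query
]
query_vec = [1.0, 0.0]

print(top_k(query_vec, doc_vecs, k=3))  # [0, 2, 1]
```

Because the retriever always returns exactly `k` results, the third document can be only loosely related to the question. Context Precision later checks whether the relevant chunks are ranked ahead of such weaker matches.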

Minimal RAG Pipeline

We'll build a simple RAG chain using LangChain's LCEL (LangChain Expression Language) that:

Retrieves relevant context from our FAISS vector store
Formats a prompt with the context and question
Generates an answer using the LLM

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

RAG_PROMPT = ChatPromptTemplate.from_template("""
You are a helpful assistant answering questions about MLflow.
Use ONLY the following context to answer the question.
If the context doesn't contain the answer, say "I don't have enough information to answer this question."

Context:
{context}

Question: {question}

Answer:
""")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | RAG_PROMPT
    | llm
    | StrOutputParser()
)

print("RAG chain created successfully")

RAG chain created successfully

test_answer = rag_chain.invoke("What is MLflow Tracking?")
print("Test Question: What is MLflow Tracking?")
print(f"\nAnswer: {test_answer}")

Test Question: What is MLflow Tracking?

Answer: MLflow Tracking is an API and UI for logging parameters, code versions, metrics, and artifacts when running your machine learning code. It allows you to log and query experiments using Python, REST, R API, and Java API. The MLflow Tracking component lets you log source code, models, and visualizations. Each run records: code version, start and end time, source, parameters, metrics, and artifacts.

Enable MLflow Tracing

MLflow's LangChain integration can automatically capture traces of our RAG pipeline invocations. This is essential for evaluation - RAGAS scorers analyze these traces to compute metrics.

import mlflow

mlflow.set_experiment("RAG-Evaluation-Tutorial")

mlflow.langchain.autolog(log_traces=True)

print(f"MLflow experiment: {mlflow.get_experiment_by_name('RAG-Evaluation-Tutorial').name}")
print("LangChain autologging enabled with tracing")

2026/01/11 09:22:25 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2026/01/11 09:22:25 INFO mlflow.store.db.utils: Updating database tables
2026/01/11 09:22:25 INFO alembic.runtime.migration: Context impl SQLiteImpl.
2026/01/11 09:22:25 INFO alembic.runtime.migration: Will assume non-transactional DDL.
2026/01/11 09:22:25 INFO alembic.runtime.migration: Context impl SQLiteImpl.
2026/01/11 09:22:25 INFO alembic.runtime.migration: Will assume non-transactional DDL.

MLflow experiment: RAG-Evaluation-Tutorial
LangChain autologging enabled with tracing

NOTE: If tracing is not enabled in your RAG pipeline, you can still pass the evaluation data to MLflow directly to ease analysis. In that case, please refer to the post on my blog explaining how to do it: RAG Evaluation with RAGAS and MLflow - without tracing

Load Golden Dataset

A golden dataset (also called ground truth or evaluation dataset) contains:

Questions: User queries we want to evaluate
Expected Answers: The correct/ideal responses
Expected Contexts (optional): Which documents should be retrieved

This dataset allows us to systematically measure our RAG system's quality.
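
As a rough illustration of the record shape, the sketch below builds one record with the three fields used in this tutorial (`question`, `ground_truth`, `contexts`) and checks it. The record content is hypothetical, written for this example rather than taken from `data/golden_dataset.json`, and `validate_golden_record` is a helper I made up for the sketch.

```python
import json

REQUIRED_KEYS = {"question", "ground_truth"}


def validate_golden_record(record: dict) -> list[str]:
    """Return a list of problems found in one golden-dataset record (empty if none)."""
    problems = []
    for key in sorted(REQUIRED_KEYS):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")
    # 'contexts' is optional, but when present it should be a list of strings
    contexts = record.get("contexts", [])
    if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
        problems.append("'contexts' must be a list of strings")
    return problems


# Hypothetical record for illustration only
example = {
    "question": "What does a golden dataset contain?",
    "ground_truth": "Questions, expected answers, and optionally expected contexts.",
    "contexts": ["A golden dataset contains questions and expected answers."],
}

print(json.dumps(example, indent=2))
print(validate_golden_record(example))  # [] means no problems found
```

Running a check like this over every record before evaluation catches empty expected answers early, before any judge calls are spent on them.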

# Load golden dataset from external file
with open("data/golden_dataset.json") as f:
    GOLDEN_DATASET = json.load(f)

eval_df = pd.DataFrame(GOLDEN_DATASET)
print(f"Golden dataset contains {len(GOLDEN_DATASET)} evaluation samples")
eval_df.head(2)

Golden dataset contains 20 evaluation samples

	question	ground_truth	contexts
0	What is MLflow Tracking used for?	MLflow Tracking is used for logging parameters...	[MLflow Tracking is an API and UI for logging ...
1	What features does the MLflow Model Registry p...	The MLflow Model Registry provides model linea...	[The MLflow Model Registry is a centralized mo...

Generate RAG Responses for Evaluation

We'll run our RAG pipeline on each question and collect the responses along with the retrieved contexts. This data will be used by RAGAS scorers.

# Define traced RAG function for evaluation
# IMPORTANT: Function parameter names must match keys in data['inputs']
# Since inputs={'question': ...}, the function must accept 'question' parameter

@mlflow.trace(span_type="CHAIN")
def traced_rag_predict(question: str) -> dict:
    """Traced RAG prediction function for mlflow.genai.evaluate().
    
    Args:
        question: The question to answer (matches inputs['question'] key)
    
    Returns:
        dict with 'response' and 'retrieved_contexts' for RAGAS scorers
    """
    # Retrieval step - creates RETRIEVER span
    with mlflow.start_span(name="retriever", span_type="RETRIEVER") as span:
        retrieved_docs = retriever.invoke(question)
        contexts = [doc.page_content for doc in retrieved_docs]
        span.set_inputs({"question": question})
        span.set_outputs({"retrieved_contexts": contexts})
    
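    # Generation step - creates LLM span
    # Note: rag_chain runs its own retrieval for the same question, separate from the retriever span above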
    with mlflow.start_span(name="generator", span_type="LLM") as span:
        answer = rag_chain.invoke(question)
        span.set_inputs({"question": question, "contexts": contexts})
        span.set_outputs({"response": answer})
    
    return {
        "response": answer,
        "retrieved_contexts": contexts,
    }

# Preview a sample from the golden dataset
# Note: With predict_fn approach, answers are generated during evaluation
sample_idx = 2
print(f"Sample evaluation record #{sample_idx + 1}:")
print(f"\nQuestion: {eval_df.iloc[sample_idx]['question']}")
print(f"\nExpected Answer: {eval_df.iloc[sample_idx]['ground_truth']}")

# Show what the traced function would produce for this question
print(f"\n--- Testing RAG response for this question ---")
test_output = traced_rag_predict(question=eval_df.iloc[sample_idx]['question'])
print(f"\nRAG Answer: {test_output['response']}")
print(f"\nRetrieved Contexts ({len(test_output['retrieved_contexts'])}):\n")
for j, ctx in enumerate(test_output['retrieved_contexts'], 1):
    print(f"  Context {j}: {ctx[:100]}...\n")

Sample evaluation record #3:

Question: What metrics does RAGAS provide for RAG evaluation?

Expected Answer: RAGAS provides four key metrics: Faithfulness (measures if the answer is grounded in context), Context Precision (evaluates if relevant documents are ranked higher), Context Recall (checks if context contains all needed information), and Factual Correctness (compares output against expected answers).

--- Testing RAG response for this question ---

RAG Answer: RAGAS provides the following key metrics for RAG evaluation: Faithfulness, Context Precision, Context Recall, and Factual Correctness.

Retrieved Contexts (3):

  Context 1: RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework integrated with MLflow ...

  Context 2: Evaluation in MLflow can be performed using mlflow.evaluate() for traditional ML models or mlflow.ge...

  Context 3: MLflow GenAI provides specialized tools for developing and evaluating generative AI applications. It...

RAGAS Evaluation with MLflow

Now we'll use MLflow's RAGAS integration to evaluate our RAG pipeline. The key metrics we'll compute:

Metric	What it measures	Required Data	Common Failure
Faithfulness	Is the answer grounded in retrieved context?	answer, contexts	Missing RETRIEVER spans
Context Precision	Are relevant docs ranked higher?	question, contexts, ground_truth	No ground_truth provided
Context Recall	Does context contain needed info?	contexts, ground_truth	No ground_truth provided
Factual Correctness	Does answer match expected?	answer, ground_truth	Semantic mismatch (strict)

These metrics provide an initial quantitative assessment of RAG quality across multiple dimensions. The MLflow integration exposes further RAGAS metrics beyond these four.
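
The arithmetic behind three of these scores can be sketched in a few lines. This reflects my understanding of the RAGAS definitions and is not the library's code: in the real scorers an LLM judge extracts claims and produces the true/false verdicts, while here the verdicts are hypothetical inputs and the function names are my own.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Share of claims in the answer that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_recall(reference_claim_found: list[bool]) -> float:
    """Share of claims in the expected answer that can be found in the retrieved context."""
    if not reference_claim_found:
        return 0.0
    return sum(reference_claim_found) / len(reference_claim_found)


def context_precision(chunk_relevant: list[bool]) -> float:
    """Average of precision@k, taken over the ranks k that hold a relevant chunk."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision@k at this rank
    return total / hits if hits else 0.0


# Hypothetical judge verdicts
print(faithfulness([True, True, True, False]))              # 0.75
print(round(context_recall([True, False, False]), 3))       # 0.333
print(round(context_precision([True, False, True]), 3))     # 0.833
print(context_precision([False, True, False]))              # 0.5
```

With only three retrieved chunks, Context Precision can take just a few discrete values, so a single misranked chunk moves a per-sample score a long way.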

Note on LLM Judge: RAGAS metrics use an LLM as a judge. For best results, use OpenAI (gpt-4o-mini) as the judge model even if you're using Ollama for RAG generation. Ollama/local models may have issues with litellm's structured output parsing. Set JUDGE_PROVIDER = LLMProvider.OPENAI below if you encounter scoring errors with Ollama.

# Test litellm connectivity (optional - helps debug scoring issues)
import litellm

def test_litellm_connection(model_uri: str) -> bool:
    """Test if litellm can connect to the model."""
    try:
        response = litellm.completion(
            model=model_uri,
            messages=[{"role": "user", "content": "Say 'test' and nothing else."}],
            max_tokens=10,
        )
        print(f"✓ litellm connection successful: {model_uri}")
        print(f"  Response: {response.choices[0].message.content[:50]}...")
        return True
    except Exception as e:
        print(f"✗ litellm connection failed: {model_uri}")
        print(f"  Error: {type(e).__name__}: {str(e)[:100]}")
        return False

# Test the judge model connection
judge_model_uri = get_mlflow_model_uri(PROVIDER)
print(f"Testing judge model: {judge_model_uri}\n")
litellm_ok = test_litellm_connection(judge_model_uri)

if not litellm_ok:
    print("\n⚠️  Consider using OpenAI as judge model for reliable scoring.")
    print("   Set JUDGE_PROVIDER = LLMProvider.OPENAI in the next cell.")

Testing judge model: azure:/gpt-4o-mini


Provider List: https://docs.litellm.ai/docs/providers


Provider List: https://docs.litellm.ai/docs/providers

✗ litellm connection failed: azure:/gpt-4o-mini
  Error: BadRequestError: litellm.BadRequestError: LLM Provider NOT provided. Pass in the LLM provider you are trying to call.

⚠️  Consider using OpenAI as judge model for reliable scoring.
   Set JUDGE_PROVIDER = LLMProvider.OPENAI in the next cell.
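
The error above looks like a model-string format mismatch rather than a connectivity problem: the cell passes the MLflow-style URI `azure:/gpt-4o-mini` straight to litellm, which expects model strings in the form `provider/model`. A minimal sketch of the conversion follows; `mlflow_uri_to_litellm` is a hypothetical helper, not part of MLflow or litellm.

```python
def mlflow_uri_to_litellm(model_uri: str) -> str:
    """Convert an MLflow-style '<provider>:/<model>' URI to litellm's '<provider>/<model>'."""
    provider, sep, model = model_uri.partition(":/")
    if not sep or not provider or not model:
        raise ValueError(f"Expected '<provider>:/<model>', got {model_uri!r}")
    return f"{provider}/{model}"


for uri in ["openai:/gpt-4o-mini", "azure:/gpt-4o-mini", "ollama:/llama3.2:3b"]:
    print(uri, "->", mlflow_uri_to_litellm(uri))
```

Calling `test_litellm_connection(mlflow_uri_to_litellm(judge_model_uri))` would then exercise the connection itself. The RAGAS scorers in the next cell still take the MLflow-style URI unchanged.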

from mlflow.genai.scorers.ragas import (
    Faithfulness,
    ContextPrecision,
    ContextRecall,
    FactualCorrectness,
)

# Configure the judge model for RAGAS evaluation
# For reliable scoring, use OpenAI even when using Ollama for RAG generation
JUDGE_PROVIDER = PROVIDER  # Change to LLMProvider.OPENAI for better results
judge_model_uri = get_mlflow_model_uri(JUDGE_PROVIDER)

print(f"Judge model: {judge_model_uri}")
print(f"(Change JUDGE_PROVIDER to LLMProvider.OPENAI if scoring fails with Ollama)\n")

# Note: Faithfulness, ContextPrecision and ContextRecall need the retrieved contexts,
# which come from RETRIEVER spans in the trace
# FactualCorrectness only compares the answer with the expected answer
scorers = [
    Faithfulness(model=judge_model_uri),
    FactualCorrectness(model=judge_model_uri),
    # These require traces with retriever spans - may show errors without proper tracing:
    ContextPrecision(model=judge_model_uri),
    ContextRecall(model=judge_model_uri),
]

print(f"Configured {len(scorers)} RAGAS scorers:")
for scorer in scorers:
    print(f"  - {type(scorer).__name__}")

Judge model: azure:/gpt-4o-mini
(Change JUDGE_PROVIDER to LLMProvider.OPENAI if scoring fails with Ollama)

Configured 4 RAGAS scorers:
  - Faithfulness
  - FactualCorrectness
  - ContextPrecision
  - ContextRecall

Prepare Evaluation Data

MLflow's genai.evaluate() expects data in a specific format. We need to map our data to the expected schema.

# Prepare evaluation data for predict_fn approach
# With predict_fn, we pass inputs and expectations - outputs come from the traced function
eval_data = []
for _, row in eval_df.iterrows():
    eval_data.append({
        "inputs": {"question": row["question"]},
        "expectations": {
            "ground_truth": row["ground_truth"],
            "contexts": row.get("contexts", []),  # For ContextRecall
        },
    })

print(f"Prepared {len(eval_data)} samples for evaluation")
print(f"\nSample format:")
print(f"  inputs: {list(eval_data[0]['inputs'].keys())}")
print(f"  expectations: {list(eval_data[0]['expectations'].keys())}")
if eval_data[0]['expectations'].get('ground_truth'):
    print(f"  ground_truth length: {len(eval_data[0]['expectations']['ground_truth'])} characters")
print(f"\nNote: outputs will be generated by traced_rag_predict() during evaluation")
print(f"      ground_truth enables ContextPrecision and ContextRecall metrics")

Prepared 20 samples for evaluation

Sample format:
  inputs: ['question']
  expectations: ['ground_truth', 'contexts']
  ground_truth length: 205 characters

Note: outputs will be generated by traced_rag_predict() during evaluation
      ground_truth enables ContextPrecision and ContextRecall metrics
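
Since the keys of each `inputs` dict must match the parameter names of the prediction function, a quick pre-flight check can save a failed evaluation run. The sketch below uses only the standard library. `check_inputs_match_signature` and `demo_predict` are hypothetical names, the exact-match rule is stricter than strictly necessary (it ignores default arguments), and it assumes any decorator on the function preserves its signature.

```python
import inspect


def check_inputs_match_signature(predict_fn, records: list[dict]) -> list[str]:
    """Report records whose 'inputs' keys differ from predict_fn's parameter names."""
    params = set(inspect.signature(predict_fn).parameters)
    problems = []
    for i, record in enumerate(records):
        keys = set(record.get("inputs", {}))
        if keys != params:
            problems.append(
                f"record {i}: inputs {sorted(keys)} != parameters {sorted(params)}"
            )
    return problems


def demo_predict(question: str) -> dict:
    """Stand-in for traced_rag_predict, so the sketch runs without a RAG pipeline."""
    return {"response": f"echo: {question}"}


good = [{"inputs": {"question": "What is MLflow Tracking?"}}]
bad = [{"inputs": {"query": "What is MLflow Tracking?"}}]

print(check_inputs_match_signature(demo_predict, good))  # []
print(check_inputs_match_signature(demo_predict, bad))
```

The second call reports the mismatch between `query` and `question`, which is the kind of naming slip that otherwise only surfaces once evaluation starts.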

print("Running RAGAS evaluation with traced predict_fn...")
print("This generates traces with RETRIEVER spans for Faithfulness metric.\n")

with mlflow.start_run(run_name="ragas-evaluation-traced") as run:
    mlflow.log_param("provider", PROVIDER.value)
    mlflow.log_param("model", MODEL_CONFIG[PROVIDER]["chat_model"])
    mlflow.log_param("num_samples", len(eval_data))
    mlflow.log_param("retriever_k", 3)
    mlflow.log_param("evaluation_mode", "predict_fn")

    # Use predict_fn to generate traces with RETRIEVER spans
    # This allows Faithfulness scorer to access retrieved_contexts
    eval_results = mlflow.genai.evaluate(
        predict_fn=traced_rag_predict,
        data=eval_data,
        scorers=scorers,
    )

    run_id = run.info.run_id

print(f"\nEvaluation complete! Run ID: {run_id}")

2026/01/11 09:22:33 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.
2026/01/11 09:22:33 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset. To disable this check, set the MLFLOW_GENAI_EVAL_SKIP_TRACE_VALIDATION environment variable to True.
2026/01/11 09:22:33 WARNING mlflow.tracing.fluent: Failed to start span VectorStoreRetriever: 'NonRecordingSpan' object has no attribute 'context'. For full traceback, set logging level to debug.

Running RAGAS evaluation with traced predict_fn...
This generates traces with RETRIEVER spans for Faithfulness metric.

2026/01/11 09:22:34 WARNING mlflow.tracing.fluent: Failed to start span RunnableSequence: 'NonRecordingSpan' object has no attribute 'context'. For full traceback, set logging level to debug.

✨ Evaluation completed.

Metrics and evaluation results are logged to the MLflow run:
  Run name: ragas-evaluation-traced
  Run ID: 35029b87d0e542128dedd53531ba0710

To view the detailed evaluation results with sample-wise scores,
open the Traces tab in the Run page in the MLflow UI.


Evaluation complete! Run ID: 35029b87d0e542128dedd53531ba0710

MLflow Results Analysis

Let's examine the evaluation results both programmatically and understand how to view them in the MLflow UI.

print("=" * 60)
print("RAGAS EVALUATION RESULTS")
print("=" * 60)

results_df = eval_results.tables["eval_results"]

# Find RAGAS scorer columns (Faithfulness, FactualCorrectness, Context*)
import pandas as pd
ragas_metrics = ['Faithfulness', 'FactualCorrectness', 'ContextPrecision', 'ContextRecall']
value_columns = [col for col in results_df.columns 
                 if col.endswith('/value') and any(m in col for m in ragas_metrics)]
error_columns = [col for col in results_df.columns 
                 if col.endswith('/error') and any(m in col for m in ragas_metrics)]

print("\nRAGAS Metrics:")
print("-" * 40)
successful_metrics = 0
failed_metrics = 0

for col in value_columns:
    # Convert to numeric, coercing errors to NaN
    numeric_col = pd.to_numeric(results_df[col], errors='coerce')
    non_null = numeric_col.dropna()
    total = len(results_df)
    success_count = len(non_null)

    if success_count > 0:
        mean_val = non_null.mean()
        std_val = non_null.std() if len(non_null) > 1 else 0
        print(f"  ✓ {col}: {mean_val:.3f} (±{std_val:.3f}) [{success_count}/{total} samples]")
        successful_metrics += 1
    else:
        print(f"  ✗ {col}: NO SCORES (0/{total} samples succeeded)")
        failed_metrics += 1

print(f"\nSummary: {successful_metrics} metrics succeeded, {failed_metrics} metrics failed")
print(f"   Total samples: {len(results_df)}")

# Error diagnostics (if any metrics failed)
if failed_metrics > 0:
    print("\n" + "=" * 60)
    print("🔍 DIAGNOSTIC: Error Details for Failed Metrics")
    print("=" * 60)
    
    for col in error_columns:
        metric_name = col.replace('/error', '')
        errors = results_df[col].dropna()
        
        if len(errors) > 0:
            print(f"\n❌ {metric_name}:")
            # Get first unique error message
            unique_errors = errors.unique()
            for err in unique_errors[:2]:  # Show max 2 unique errors
                # Truncate long error messages
                err_str = str(err)[:300]
                if len(str(err)) > 300:
                    err_str += "..."
                print(f"   {err_str}")
    
    print("\n" + "-" * 60)
    print("Common fixes:")
    print("   1. Use OpenAI as judge: JUDGE_PROVIDER = LLMProvider.OPENAI")
    print("   2. For Ollama: ensure model is running and OLLAMA_API_BASE is set")
    print("   3. ContextPrecision/ContextRecall require traces with RETRIEVER spans")
else:
    print("\n✅ All metrics computed successfully!")

============================================================
RAGAS EVALUATION RESULTS
============================================================

RAGAS Metrics:
----------------------------------------
  ✓ ContextRecall/value: 0.853 (±0.196) [20/20 samples]
  ✓ ContextPrecision/value: 0.967 (±0.116) [20/20 samples]
  ✓ Faithfulness/value: 0.984 (±0.047) [20/20 samples]
  ✓ FactualCorrectness/value: 0.613 (±0.250) [20/20 samples]

Summary: 4 metrics succeeded, 0 metrics failed
   Total samples: 20

✅ All metrics computed successfully!

# Helper function to extract question from request column
def extract_question(request_data):
    """Extract question from MLflow request column."""
    if isinstance(request_data, dict):
        return str(request_data.get("question", "N/A"))[:60]
    elif isinstance(request_data, str):
        return request_data[:60]
    return "N/A"

# Display results summary with metric columns
available_cols = [col for col in value_columns if col in results_df.columns]
results_summary = results_df[available_cols].copy()

# Add question column from request data
if "request" in results_df.columns:
    results_summary.insert(0, "question", results_df["request"].apply(extract_question))

results_summary


print("\nIdentifying Low-Scoring Samples:")
print("-" * 40)

for col in value_columns:
    if col in results_df.columns:
        numeric_col = pd.to_numeric(results_df[col], errors='coerce')
        low_mask = numeric_col < 0.5
        low_scores = results_df[low_mask]
        if len(low_scores) > 0:
            print(f"\n⚠️  {col} < 0.5: {len(low_scores)} samples")
            for idx, row in low_scores.iterrows():
                question = extract_question(row.get("request", {}))
                score = numeric_col.loc[idx]
                if pd.notna(score):
                    print(f"    - [{score:.2f}] {question}...")

Identifying Low-Scoring Samples:
----------------------------------------

⚠️  ContextRecall/value < 0.5: 1 samples
    - [0.33] What is MLflow GenAI used for?...

⚠️  ContextPrecision/value < 0.5: 1 samples
    - [0.50] How can you run MLflow Projects?...

⚠️  FactualCorrectness/value < 0.5: 6 samples
    - [0.46] What is MLflow Tracking used for?...
    - [0.30] What is MLflow GenAI used for?...
    - [0.47] How can you run MLflow Projects?...
    - [0.12] What is Faithfulness in RAGAS?...
    - [0.31] What frameworks support MLflow autolog?...
    - [0.40] How can you access MLflow programmatically via REST API?...
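Some questions appear under more than one metric in the output above. A minimal sketch of cross-referencing them follows; the `low_scores` dictionary is copied by hand from that output, and the variable names are illustrative, not part of the tutorial's code.

```python
from collections import defaultdict

# Low-scoring samples copied from the output above: metric -> [(score, question)]
low_scores = {
    "ContextRecall": [(0.33, "What is MLflow GenAI used for?")],
    "ContextPrecision": [(0.50, "How can you run MLflow Projects?")],
    "FactualCorrectness": [
        (0.46, "What is MLflow Tracking used for?"),
        (0.30, "What is MLflow GenAI used for?"),
        (0.47, "How can you run MLflow Projects?"),
        (0.12, "What is Faithfulness in RAGAS?"),
        (0.31, "What frameworks support MLflow autolog?"),
        (0.40, "How can you access MLflow programmatically via REST API?"),
    ],
}

# Invert the mapping: question -> list of metrics that flagged it
metrics_by_question = defaultdict(list)
for metric, samples in low_scores.items():
    for _score, question in samples:
        metrics_by_question[question].append(metric)

# Keep only questions flagged by more than one metric
flagged_repeatedly = {q: m for q, m in metrics_by_question.items() if len(m) > 1}
for question, metrics in flagged_repeatedly.items():
    print(f"{question} -> {', '.join(metrics)}")
```

A question that is low on both a retrieval metric and FactualCorrectness is probably worth debugging at the retriever first. One that is low only on FactualCorrectness more likely points at generation or at the reference answer.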

Interpreting RAGAS Scores

All RAGAS metrics return scores between 0.0 and 1.0. Here's rough guidance:

Score Range	Interpretation
0.9 - 1.0	Excellent - production ready
0.7 - 0.9	Good - minor improvements needed
0.5 - 0.7	Fair - significant room for improvement
< 0.5	Poor - investigate specific failures

Important caveats:

These thresholds are guidelines, not absolutes
Different applications have different quality requirements
Low FactualCorrectness often reflects differences in wording or level of detail between the answer and the reference answer, not actual incorrectness
Focus on relative improvements when comparing variants, not absolute scores
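The score table above can be turned into a small helper for labeling aggregate scores. This is only a sketch: `interpret_score` is a hypothetical name, and because the table's ranges share their endpoints, it assumes that boundary values (0.9, 0.7, 0.5) belong to the higher band.

```python
def interpret_score(score: float) -> str:
    """Map a RAGAS score in [0.0, 1.0] to the rough bands from the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"RAGAS scores are expected in [0, 1], got {score}")
    if score >= 0.9:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.5:
        return "Fair"
    return "Poor"


# Aggregate means reported earlier in this tutorial
reported_means = [
    ("ContextPrecision", 0.967),
    ("Faithfulness", 0.984),
    ("FactualCorrectness", 0.613),
]
for name, mean_score in reported_means:
    print(f"{name}: {mean_score:.3f} -> {interpret_score(mean_score)}")
```

As the caveats say, treat the labels as a starting point for investigation, not as a pass/fail gate.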

To view detailed results in the MLflow UI:

Start MLflow UI (if not running): $ mlflow ui --port 5000
Open http://localhost:5000 in your browser
Navigate to:
- Experiment: 'RAG-Evaluation-Tutorial'
- Run: 'ragas-evaluation'
In the run details, you'll find:
- Parameters: model configuration
- Metrics: aggregate RAGAS scores
- Artifacts: detailed evaluation tables
- Traces: individual RAG invocations

print(f"Run ID: {run_id}")

Run ID: 35029b87d0e542128dedd53531ba0710

print("\n" + "=" * 60)
print("🎉 Tutorial Complete!")
print("=" * 60)
print(f"""
Summary:
  - Provider: {PROVIDER.value}
  - Model: {MODEL_CONFIG[PROVIDER]['chat_model']}
  - Samples evaluated: {len(eval_data)}
  - MLflow Run ID: {run_id}

View results: mlflow ui --port 5000
""")

============================================================
🎉 Tutorial Complete!
============================================================

Summary:
  - Provider: azure_openai
  - Model: gpt-4o-mini
  - Samples evaluated: 20
  - MLflow Run ID: 35029b87d0e542128dedd53531ba0710

View results: mlflow ui --port 5000

Common Pitfalls and Solutions

During development of this tutorial, several non-obvious issues emerged:

Model URI Format

MLflow uses litellm under the hood. The URI format matters:

OpenAI: openai:/gpt-4o-mini (note the colon-slash)
Azure: azure:/deployment-name
Ollama: ollama:/llama3.2:3b

Using azure/ instead of azure:/ will fail silently or produce cryptic errors.
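Because a malformed URI can fail late or unclearly, it may help to check the format before building scorers. The sketch below is a hypothetical pre-flight check, not MLflow's own parsing: `parse_model_uri` and the regular expression are assumptions based only on the three URI examples above.

```python
import re

# Pattern for the "<provider>:/<model>" form shown above. The model part may
# itself contain colons (e.g. "llama3.2:3b"), so only the first ":/" matters.
_MODEL_URI = re.compile(r"^(?P<provider>[a-z][a-z0-9_]*):/(?P<model>[^/\s].*)$")


def parse_model_uri(uri: str) -> tuple[str, str]:
    """Split a judge-model URI into (provider, model) or raise ValueError."""
    match = _MODEL_URI.match(uri)
    if match is None:
        raise ValueError(
            f"Expected '<provider>:/<model>' (e.g. 'openai:/gpt-4o-mini'), got {uri!r}"
        )
    return match.group("provider"), match.group("model")


for uri in ["openai:/gpt-4o-mini", "ollama:/llama3.2:3b", "azure/deployment-name"]:
    try:
        print(uri, "->", parse_model_uri(uri))
    except ValueError as exc:
        print(f"Rejected: {exc}")
```

Running such a check once, right after reading the configuration, turns a cryptic downstream failure into an immediate and readable error.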

Function Signature Must Match Input Keys

When using predict_fn, the function parameter names must exactly match the keys in your inputs dictionary:

# If your data has: {"inputs": {"question": "..."}}
# Your function MUST be: def predict(question: str)  # NOT def predict(query: str)
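A mismatch like this can be caught before starting an evaluation run. The sketch below assumes that each inputs dictionary is unpacked into keyword arguments, which is what the requirement above implies. `check_predict_signature` is a hypothetical helper, and the two `predict` functions are toy stand-ins for a real RAG pipeline.

```python
import inspect


def check_predict_signature(predict_fn, inputs: dict) -> None:
    """Raise TypeError if `inputs` cannot be passed to predict_fn as keyword arguments."""
    try:
        # bind() applies the same matching rules as a real call, without calling
        inspect.signature(predict_fn).bind(**inputs)
    except TypeError as exc:
        raise TypeError(
            f"{predict_fn.__name__} cannot accept input keys {sorted(inputs)}: {exc}"
        ) from exc


def predict(question: str) -> str:      # parameter matches the "question" key
    return f"answer to: {question}"


def predict_wrong(query: str) -> str:   # parameter name does not match
    return f"answer to: {query}"


sample = {"inputs": {"question": "What is MLflow Tracking used for?"}}

check_predict_signature(predict, sample["inputs"])  # passes silently

try:
    check_predict_signature(predict_wrong, sample["inputs"])
except TypeError as exc:
    print(f"Caught: {exc}")
```

Checking the first record of the evaluation dataset this way is cheap and avoids spending judge-model calls on a run that cannot succeed.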

RETRIEVER Spans for Context Metrics

Faithfulness, ContextPrecision, and ContextRecall require traces with RETRIEVER-type spans. Without them, these metrics return errors or incorrect values. The traced_rag_predict function in this tutorial creates these spans explicitly.

Judge Model Limitations

RAGAS metrics use an LLM as a judge. Local models (Ollama) may struggle with the structured output parsing that RAGAS requires. For reliable scoring, consider using OpenAI/Azure as the judge even when your RAG uses a different provider.

Extras: Comparing RAG Variants with MLflow

One of MLflow's key strengths is enabling systematic A/B comparisons between different RAG configurations. Here's how to structure experiments comparing variants like chunk sizes, models, or retrieval strategies.

Example: Comparing Chunk Sizes

# Comparing RAG Variants: Different Chunk Sizes
# This demonstrates how to evaluate the same RAG pipeline with different configurations

print("Running chunk size comparison experiments...")
print("=" * 60)

CHUNK_SIZES = [50, 150]
experiment_run_ids = []

for chunk_size in CHUNK_SIZES:
    print(f"\nTesting chunk_size={chunk_size}")
    
    # Rebuild the vector store with new chunk size
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_size // 10
    )
    chunks = text_splitter.split_documents(documents)
    
    # Rebind the module-level vectorstore and retriever used by traced_rag_predict.
    # At the top level of a notebook cell these names are already global.
    vectorstore = FAISS.from_documents(chunks, embeddings)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
    
    print(f"   Created {len(chunks)} chunks")
    
    # Run evaluation
    with mlflow.start_run(run_name=f"chunk-size-{chunk_size}") as run:
        mlflow.log_param("chunk_size", chunk_size)
        mlflow.log_param("chunk_overlap", chunk_size // 10)
        mlflow.log_param("num_chunks", len(chunks))
        
        eval_results = mlflow.genai.evaluate(
            predict_fn=traced_rag_predict,
            data=eval_data,
            scorers=scorers,
        )
        experiment_run_ids.append(run.info.run_id)
        print(f"   ✓ Run ID: {run.info.run_id}")

print("\n" + "=" * 60)
print(f"Completed {len(CHUNK_SIZES)} experiments. Run IDs saved for comparison.")

Running chunk size comparison experiments...
============================================================

Testing chunk_size=50

2026/01/11 09:23:10 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset. To disable this check, set the MLFLOW_GENAI_EVAL_SKIP_TRACE_VALIDATION environment variable to True.
2026/01/11 09:23:10 WARNING mlflow.tracing.fluent: Failed to start span VectorStoreRetriever: 'NonRecordingSpan' object has no attribute 'context'. For full traceback, set logging level to debug.
2026/01/11 09:23:10 WARNING mlflow.tracing.fluent: Failed to start span RunnableSequence: 'NonRecordingSpan' object has no attribute 'context'. For full traceback, set logging level to debug.

   Created 175 chunks

Evaluating:   0%|          | 0/20 [Elapsed: 00:00, Remaining: ?]

✨ Evaluation completed.

Metrics and evaluation results are logged to the MLflow run:
  Run name: chunk-size-50
  Run ID: 34464d4c3bb34a0a8ce4d43760149f1e

To view the detailed evaluation results with sample-wise scores,
open the Traces tab in the Run page in the MLflow UI.

   ✓ Run ID: 34464d4c3bb34a0a8ce4d43760149f1e

Testing chunk_size=150

2026/01/11 09:23:33 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset. To disable this check, set the MLFLOW_GENAI_EVAL_SKIP_TRACE_VALIDATION environment variable to True.
2026/01/11 09:23:33 WARNING mlflow.tracing.fluent: Failed to start span VectorStoreRetriever: 'NonRecordingSpan' object has no attribute 'context'. For full traceback, set logging level to debug.
2026/01/11 09:23:33 WARNING mlflow.tracing.fluent: Failed to start span RunnableSequence: 'NonRecordingSpan' object has no attribute 'context'. For full traceback, set logging level to debug.

   Created 62 chunks

Evaluating:   0%|          | 0/20 [Elapsed: 00:00, Remaining: ?]

✨ Evaluation completed.

Metrics and evaluation results are logged to the MLflow run:
  Run name: chunk-size-150
  Run ID: c77546d4d930426b98ebeb1bf1a09a3f

To view the detailed evaluation results with sample-wise scores,
open the Traces tab in the Run page in the MLflow UI.

   ✓ Run ID: c77546d4d930426b98ebeb1bf1a09a3f

============================================================
Completed 2 experiments. Run IDs saved for comparison.

Comparing Results in MLflow UI

After running multiple variants:

Open MLflow UI: mlflow ui --port 5000
Navigate to your experiment
Select runs to compare using checkboxes
Click Compare to see side-by-side metrics
Use Chart view to visualize metric differences

You can also compare programmatically:

# Compare Results Programmatically
# Query MLflow for runs and display a formatted comparison table

experiment_name = "RAG-Evaluation-Tutorial"

# Get runs with chunk_size parameter (our comparison experiments)
runs_df = mlflow.search_runs(
    experiment_names=[experiment_name],
    filter_string="params.chunk_size != ''",
    order_by=["params.chunk_size ASC"]
)

if len(runs_df) == 0:
    print("No chunk size comparison runs found. Run the comparison cell above first.")
else:
    # Debug: show available metric columns
    metric_cols = [c for c in runs_df.columns if c.startswith("metrics.")]
    print(f"Available metric columns ({len(metric_cols)} total):")
    for col in metric_cols[:8]:  # Show first 8
        print(f"  - {col}")
    
    # Define metrics we want (will search for partial matches)
    metric_names = ["Faithfulness", "FactualCorrectness", "ContextPrecision", "ContextRecall"]
    
    # Find actual column names (may have backticks or different format)
    def find_metric_col(df, metric_name):
        """Find column containing metric_name in its name."""
        for col in df.columns:
            if metric_name in col and "mean" in col:
                return col
        return None
    
    comparison_data = []
    for _, run in runs_df.iterrows():
        row = {
            "Run Name": run.get("tags.mlflow.runName", "N/A"),
            "Chunk Size": run.get("params.chunk_size", "N/A"),
            "Num Chunks": run.get("params.num_chunks", "N/A"),
        }
        for metric_name in metric_names:
            col = find_metric_col(runs_df, metric_name)
            if col:
                value = run.get(col)
                row[metric_name] = f"{value:.3f}" if pd.notna(value) else "N/A"
            else:
                row[metric_name] = "N/A"
        comparison_data.append(row)
    
    comparison_df = pd.DataFrame(comparison_data)
    print("\nChunk Size Comparison Results")
    print("=" * 80)
    print(comparison_df.to_string(index=False))
    
    # Find best configuration
    if "FactualCorrectness" in comparison_df.columns:
        best_idx = comparison_df["FactualCorrectness"].apply(
            lambda x: float(x) if x != "N/A" else 0
        ).idxmax()
        # idxmax() returns an index label, so look the row up with .loc
        print(f"\n✨ Best configuration: {comparison_df.loc[best_idx, 'Run Name']}")

Available metric columns (4 total):
  - metrics.Faithfulness/mean
  - metrics.FactualCorrectness/mean
  - metrics.ContextRecall/mean
  - metrics.ContextPrecision/mean

Chunk Size Comparison Results
================================================================================
      Run Name Chunk Size Num Chunks Faithfulness FactualCorrectness ContextPrecision ContextRecall
chunk-size-150        150         62        0.911              0.768            0.992         0.890
chunk-size-150        150         62        0.915              0.781            0.983         0.834
 chunk-size-50         50        175        0.942              0.743            0.983         0.840
 chunk-size-50         50        175        0.944              0.758            0.983         0.844

✨ Best configuration: chunk-size-150
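The table above lists each chunk size twice, and the "best configuration" is picked from a single run's FactualCorrectness. When a configuration has several runs, averaging them before ranking gives a steadier comparison. A minimal sketch in plain Python, with the FactualCorrectness values copied from the table (variable names are illustrative):

```python
from collections import defaultdict
from statistics import mean

# (run name, FactualCorrectness mean) pairs copied from the table above
rows = [
    ("chunk-size-150", 0.768),
    ("chunk-size-150", 0.781),
    ("chunk-size-50", 0.743),
    ("chunk-size-50", 0.758),
]

# Group the per-run scores by configuration
scores_by_config = defaultdict(list)
for run_name, score in rows:
    scores_by_config[run_name].append(score)

# Average repeated runs, then rank configurations by the averaged score
averaged = {name: mean(scores) for name, scores in scores_by_config.items()}
best = max(averaged, key=averaged.get)

for name, avg in sorted(averaged.items()):
    print(f"{name}: {avg:.4f} over {len(scores_by_config[name])} runs")
print(f"Best by averaged FactualCorrectness: {best}")
```

Note also that the rows are ordered 150 before 50: MLflow stores parameters as strings, so ordering by params.chunk_size sorts lexicographically, not numerically.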

How to inspect results in MLflow UI:

Select the experiment to inspect

Figure 1: Select the experiment you want to analyze.

MLflow should recognize the experiment type as "GenAI Evaluation" automatically, but you need to confirm this the first time you open the experiment. There may be a way to set this when creating the experiment in code, but I have not found it yet.

Configure comparison of the runs

Figure 2: Runs overview - a high-level view of the results for each configuration (variant) of the RAG pipeline under evaluation.

You can edit the experiment name and description here. Informative names help when comparing multiple experiments, and the description can provide additional context about the experiment's purpose.

NOTE: There may be a way to pass a description when creating the experiment in code, but I have not found it yet.

This view lists all the runs (evaluated RAG variants) that belong to the experiment. Each run corresponds to a different RAG configuration (e.g., different chunk sizes or models). For each run you can see its parameters (e.g., model name, chunk size) and aggregated metrics (e.g., mean Faithfulness, mean ContextPrecision). Use the "Columns" button in the top right to choose which parameters and metrics are displayed, so you can focus on the most relevant information.
To compare two runs, check the boxes next to them. Note that the second run you select is treated as the "baseline": score changes for the first selected run are calculated against it.
You can select multiple runs to compare their metrics side by side. This is useful for evaluating different RAG configurations.
You can also select which columns to display in the comparison table.

View Comparison Results

Figure 3: Comparison of individual question results for selected runs. This view shows a detailed, per-question comparison of the selected runs, which helps you analyze how each RAG configuration performed on specific queries. A close-up of a single panel is shown in Figure 4 below. You can inspect the full RAG output and details such as the retrieved contexts for each question.

Figure 4: Close-up of the detailed results for a single variant.

Figure 5: Comparison of individual question results for selected runs - detailed view showing metrics for each question.

More from MLflow

This tutorial focused on RAG evaluation using RAGAS metrics. MLflow offers many more features for RAG or GenAI model management, including:

Built-in metrics: MLflow's predefined metrics for GenAI models
Guidelines-based LLM scorers

From the MLflow documentation:

Guidelines is a powerful scorer class designed to let you quickly and easily customize evaluation by defining natural language criteria that are framed as pass/fail conditions. It is ideal for checking compliance with rules, style guides, or information inclusion/exclusion.

Guidelines have the distinct advantage of being easy to explain to business stakeholders ("we are evaluating if the app delivers upon this set of rules") and, as such, can often be directly written by domain experts.

MCP server: see the MLflow MCP Server documentation for how to add the MLflow MCP server to popular IDEs and agentic coding tools.

...and many more features. Explore the MLflow GenAI documentation for more details.

References, further reading

The code for this tutorial is available on my GitHub: 2026-01-08-ragas-in-mlfow-rag-eval-demo
MLflow Documentation
Introduction to RAG with MLflow and LangChain - MLflow documentation - an example implementation of RAG with LangChain and MLflow (without RAGAS evaluation).
GitHub - rag_evaluation_and_tracking - This project houses a Retrieval Augmented Generation (RAG) LLM application built for robust and context-aware text generation. It leverages the combined power of LangChain for orchestration, MLflow for tracking and experimentation, DVC for version control, and RAGAS for evaluation.
