RAG Evaluators

On this page, you will learn how to configure and use RAG evaluators for your custom-built application. RAG evaluators are unique because they need intermediate outputs and internal variables (such as context) to run.

By the end of this tutorial, you will know how to write a RAG application that saves the required internal variables and how to configure the evaluator in Agenta to use these in evaluation.

info

Access to internal variables (internals) and intermediate outputs (traces or inline traces) requires version 0.20.0+ of the Python SDK.

info

Access to RAG evaluators and the view trace feature in the Web UI requires Agenta Cloud; neither is available in Agenta OSS.

Developing the RAG Application

Let's consider a simple RAG Application (the source code can be found here).

The application fetches count movies matching the given topic and genre, generates a report from those movies, and then, arbitrarily, summarizes that report twice in a row.

async def rag(topic: str, genre: str, count: int = 5):
    result = await retriever(topic, genre, count)
    result = await reporter(topic, genre, count, result["movies"])
    result = await summarizer(topic, genre, result["report"])
    result = await summarizer(topic, genre, result["report"])

    return result["report"]

The retriever simply performs a semantic search using the formatted prompt, whose template is fetched from Agenta's configuration.

async def retriever(topic: str, genre: str, count: int) -> dict:
    prompt = ag.config.retriever_prompt.format(topic=topic, genre=genre)
    topk = count * ag.config.retriever_multiplier

    query = embed(prompt)
    result = search(query["embedding"], topk)

    movies = [
        f"{movie['title']} ({movie['year']}) in {movie['genres']}: {movie['plot']}"
        for movie in result
    ]

    return {"movies": movies}
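The embed and search helpers are part of the application's source and are omitted here. For orientation, here is a minimal sketch of what they might look like, assuming OpenAI embeddings and a simple in-memory index (both are assumptions; the linked source may use a different embedding provider and vector store):

import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical in-memory index: each entry holds an "embedding" plus the movie
# fields the retriever formats (title, year, genres, plot).
MOVIE_INDEX: list[dict] = []

def embed(prompt: str) -> dict:
    # Assumption: an OpenAI embedding model; any embedding provider works.
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=prompt,
    )
    return {"embedding": response.data[0].embedding}

def search(embedding: list[float], topk: int) -> list[dict]:
    # Rank the index by cosine similarity to the query embedding and keep topk.
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm if norm else 0.0

    ranked = sorted(
        MOVIE_INDEX,
        key=lambda movie: cosine(embedding, movie["embedding"]),
        reverse=True,
    )
    return ranked[:topk]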

The reporter processes the set of movies retrieved in the previous step and generates a report.

async def reporter(topic: str, genre: str, count: int, movies: list) -> dict:
    context = ag.config.generator_context_prompt.format(movies="\n".join(movies))
    instructions = ag.config.generator_instructions_prompt.format(
        count=count, topic=topic, genre=genre
    )

    prompts = {"system": context, "user": instructions}
    opts = {
        "model": ag.config.generator_model,
        "temperature": ag.config.generator_temperature,
    }

    result = await chat(prompts, opts)

    report = result["message"]

    return {"report": report}

The summarizer finally improves on the reporter's output, in two passes.

async def summarizer(topic: str, genre: str, report: str) -> dict:
    context = ag.config.summarizer_context_prompt
    instructions = ag.config.summarizer_instructions_prompt.format(
        topic=topic, genre=genre, report=report
    )

    prompts = {"system": context, "user": instructions}
    opts = {
        "model": ag.config.summarizer_model,
        "temperature": ag.config.summarizer_temperature,
    }

    result = await chat(prompts, opts)

    report = result["message"]

    return {"report": report}
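The chat helper wraps the underlying LLM call. A minimal sketch, assuming the OpenAI async chat completions client (an assumption; the linked source may use a different client):

from openai import AsyncOpenAI

aclient = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

async def chat(prompts: dict, opts: dict) -> dict:
    # Send the system/user pair with the model options chosen above.
    response = await aclient.chat.completions.create(
        model=opts["model"],
        temperature=opts["temperature"],
        messages=[
            {"role": "system", "content": prompts["system"]},
            {"role": "user", "content": prompts["user"]},
        ],
    )
    return {"message": response.choices[0].message.content}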

Testing the RAG Application

Upon instrumenting and serving the RAG Application with Agenta, you should see the output of the second summarizer.
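Instrumenting here means decorating each stage of the workflow so that Agenta records a span for it. A minimal sketch of what that might look like, using the @ag.entrypoint and @ag.instrument decorators from the Agenta Python SDK (the specific span kinds below are illustrative assumptions; check the SDK reference for the exact values):

import agenta as ag

# Sketch only: span kinds are assumptions for illustration.
@ag.entrypoint
@ag.instrument(spankind="WORKFLOW")
async def rag(topic: str, genre: str, count: int = 5):
    ...

@ag.instrument(spankind="RETRIEVER")
async def retriever(topic: str, genre: str, count: int) -> dict:
    ...

@ag.instrument(spankind="LLM")
async def reporter(topic: str, genre: str, count: int, movies: list) -> dict:
    ...

@ag.instrument(spankind="LLM")
async def summarizer(topic: str, genre: str, report: str) -> dict:
    ...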

Upon clicking on view trace, you should be able to see the inputs and outputs for each instrumented stage in the RAG workflow.

Now that we have a RAG Application, let's create a RAG evaluator.

Creating the RAG Evaluator

On the evaluators page, click on RAG Faithfulness, for example.

RAG evaluators, based on RAGAS, differ from other evaluators in Agenta in that they often require internal variables. For instance, Faithfulness and Context Relevancy both require question, answer, and contexts.

From the trace we saw earlier, answer maps to the output report of the second summarizer, denoted by rag.summarizer[1].outputs.report. Similarly, contexts maps to the retriever output movies, denoted by rag.retriever.outputs.movies.

However, the question, which corresponds to the formatted prompt sent to the retriever (or the reporter), is not part of the inputs or outputs of any stage in the workflow.

Next, let's see how and where to find that variable.

Updating the RAG Application

In Agenta, when instrumenting stages of a workflow, there is a utility to add internal variables, called internals, to the stage's span.

async def retriever(topic: str, genre: str, count: int) -> dict:
    prompt = ag.config.retriever_prompt.format(topic=topic, genre=genre)
    topk = count * ag.config.retriever_multiplier

    # Store the formatted prompt as an internal variable on this stage's span
    ag.tracing.store_internals({"prompt": prompt})

    query = embed(prompt)
    result = search(query["embedding"], topk)

    movies = [
        f"{movie['title']} ({movie['year']}) in {movie['genres']}: {movie['plot']}"
        for movie in result
    ]

    return {"movies": movies}

The utility ag.tracing.store_internals(...) attaches the given variables to the current stage's span as internals, as shown in the snippet above.

Upon serving the RAG Application again, the trace should now contain the formatted prompt.

We are ready to resume the configuration of our RAG Evaluator.

Configuring the RAG Evaluator

Back on the evaluators page, we can now map the question to the retriever's internal prompt, denoted by rag.retriever.internals.prompt, as shown below.
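Putting the three mappings together, the configuration reads as follows (written as a Python dict purely for illustration; in practice you fill these paths into the evaluator's settings form in the Web UI):

# Illustration only: these mappings are entered in the Web UI, not in code.
faithfulness_settings = {
    "question": "rag.retriever.internals.prompt",  # the internal we just stored
    "answer": "rag.summarizer[1].outputs.report",  # second summarizer pass
    "contexts": "rag.retriever.outputs.movies",    # retrieved movie snippets
}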

Once the RAG Evaluator is set up, you are ready to go.

Running the Evaluation

You can now run evaluations on RAG Applications using RAG Evaluators.