Evaluate RAG Accuracy: Q&A Evals
Know whether your Kiln Search tools find the right answer with RAG evals and synthetic Q&A data
Evaluating RAG is tricky. An LLM-as-judge doesn't have the knowledge from your documents, so it can't tell if a response is actually correct. But giving the judge access to RAG biases the evaluation.
The solution is reference-answer evals: the judge compares results to a known correct answer. Building these datasets used to be a long manual process, but Kiln makes it fast and easy.
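To make the idea concrete, here's a minimal sketch of what a reference-answer check looks like. The function name and prompt wording are illustrative placeholders, not Kiln's internal implementation:

```python
# Illustrative only: Kiln builds and runs the judge for you from the Evals UI.
def build_judge_prompt(query: str, reference_answer: str, rag_response: str) -> str:
    """Prompt a judge model to grade a RAG response against a known-correct answer."""
    return (
        "Grade the model response against the reference answer.\n\n"
        f"Question: {query}\n"
        f"Reference answer (ground truth): {reference_answer}\n"
        f"Model response: {rag_response}\n\n"
        "Score 1-5 for factual agreement with the reference answer; "
        "ignore wording differences."
    )
```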
Overview
Reference answer accuracy evals measure how well your model leverages search tools (RAG): Kiln generates query-answer (Q&A) pairs from your document library and uses them as reference answers to test your RAG system's responses (a minimal sketch of this loop follows the list below). This approach includes:
Generating large eval datasets quickly from your existing documents using synthetic Q&A pair generation
Creating realistic queries that reflect user questions about your corpus
Using reference answers (ground truth) derived from your documents to evaluate accuracy
Systematically testing different search tool configurations (chunking strategies, embedding models, etc.) and task models to find optimal settings
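For orientation, here is a minimal sketch of the evaluation loop. The `answer` and `grade` callables stand in for your RAG pipeline and judge; they are assumptions for illustration, not Kiln APIs:

```python
from typing import Callable

# Illustrative only: Kiln runs this loop for you when you click "Run Eval".
def average_score(
    answer: Callable[[str], str],             # your RAG pipeline: query -> response
    grade: Callable[[str, str, str], float],  # judge: (query, reference, response) -> score
    qa_pairs: list[dict[str, str]],           # synthetic Q&A pairs from your documents
) -> float:
    """Average judge score for one run configuration over the Q&A eval set."""
    scores = [
        grade(pair["query"], pair["reference_answer"], answer(pair["query"]))
        for pair in qa_pairs
    ]
    return sum(scores) / len(scores)
```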
The Workflow
This guide walks through the RAG-specific workflow for reference answer accuracy evals:
Creating a Reference Answer Accuracy Eval
Generating Q&A pairs from your documents
Setting up a Judge
Finding the Ideal Run Configuration
For general eval concepts like judges, run configurations, and comparing results, see Evaluations.

Creating a Reference Answer Accuracy Eval
From the "Eval" tab in Kiln's UI, create a new evaluator using the "Reference Answer Accuracy Eval (RAG)" template.
Select the template, edit if desired, and save your eval.
Generate Q&A Pairs
Most commonly, you'll want to populate your eval dataset using synthetic Q&A pairs generated from your documents. These pairs include reference answers that serve as ground truth for evaluation. Clicking "Add Eval Data" in the Evals UI and selecting "Synthetic Data" will launch the Q&A generation tool with the proper eval tags already populated.
Select Documents
Choose which documents from your library to use for generating Q&A pairs:
All documents: Use every document in your library
Filter by tags: Select specific documents by applying tag filters. This is useful when you want to generate evals for a specific subset of your corpus.
You can add tags to documents in the Document Library UI (found in the Docs & Search tab) to organize them for filtering.
Extract Documents
Before generating Q&A pairs, you need to extract text from your documents. Choose an extractor config that will process your documents. The extractor converts your documents (PDFs, HTML, etc.) into markdown or plain text. If you've already extracted documents with this extractor, those extractions will be reused.
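As a rough illustration of what extraction means, the snippet below pulls plain text out of an HTML document using Python's standard library. Kiln's extractor configs handle this step for you, including PDFs and richer formatting:

```python
from html.parser import HTMLParser

# Illustrative only: Kiln's extractor configs do this (and much more) for you.
class TextExtractor(HTMLParser):
    """Collect the visible text from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())

extractor = TextExtractor()
extractor.feed("<h1>Pittsburgh</h1><p>Population: 302,971 (2020 census).</p>")
print(" ".join(extractor.parts))  # Pittsburgh Population: 302,971 (2020 census).
```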
Generate Pairs
Configure the generation process to create Q&A pairs from your documents.
Generation Settings
Pairs per document/chunk: How many Q&A pairs to generate from each document/chunk. More pairs give you a larger eval dataset but take longer to generate.
Model and provider: The AI model used to generate Q&A pairs. Larger models typically produce higher quality pairs.
Guidance: Optional instructions to steer the generation (an illustrative prompt sketch follows this list). You can:
Use the default Q&A generation template (recommended for most cases)
Provide custom guidance to focus on specific types of queries (e.g. "Focus on technical questions")
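For a sense of how guidance steers generation, here is an illustrative prompt sketch. The wording is a placeholder, not Kiln's actual Q&A generation template:

```python
# Illustrative only: Kiln's built-in Q&A template is more detailed than this.
def qa_generation_prompt(chunk_text: str, pairs_per_chunk: int, guidance: str = "") -> str:
    """Ask a model to produce Q&A pairs grounded in one document chunk."""
    prompt = (
        f"Read the excerpt below and write {pairs_per_chunk} question/answer pairs.\n"
        "Questions should read like real user queries. Answers must be factual, "
        "concise, and fully supported by the excerpt.\n"
    )
    if guidance:
        prompt += f"Additional guidance: {guidance}\n"
    return prompt + f"\nExcerpt:\n{chunk_text}"
```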
What Gets Generated
Queries: Realistic questions that users might ask about your document corpus. These can be:
Natural language questions (e.g. "What is the population of Pittsburgh?")
Search-style queries (e.g. "Pittsburgh population 2020")
Reference Answers: Factual, concise answers derived from the document content. These serve as ground truth for evaluating your RAG system's accuracy; an example pair is sketched below.
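A generated pair might look roughly like this; the field names are illustrative, not Kiln's storage schema:

```python
# Illustrative shape of one generated Q&A pair (field names are examples).
qa_pair = {
    "query": "What is the population of Pittsburgh?",
    "reference_answer": "About 303,000 people, per the 2020 census.",
    "source_document": "pittsburgh_overview.pdf",  # hypothetical file name
}
```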
Review and Save
Review generated pairs organized by document/chunk
Remove individual pairs, entire chunks, or entire documents if needed
Click "Save All" to save the Q&A pairs to your dataset.
Since these pairs contain reference answers, there's no need for a separate golden set. All pairs should be tagged with your eval tag (typically starting with qna_eval_set).
Pairs will also be saved with tags that identify:
They're synthetic Q&A data (synthetic, qna)
Their generation session ID (starting with synthetic_qna_session)
Setting up a Judge
Before evaluating different run configurations, you need to create a judge. The eval you created defines the goal, but the judge defines how it's run (judge algorithm, model, and instructions).
Click "Create Judge" to get started. For detailed guidance on selecting judge algorithms (LLM as Judge vs G-Eval), models, and customizing evaluation steps, see Add a Judge section.
Finding the Ideal Run Configuration
Once you have a judge set up, you can evaluate different configurations for running your RAG task. You can test different task models, prompts, and model parameters to find the best combination for answering questions from your document corpus. For detailed guidance on selecting and comparing task model options, see Finding the Ideal Run Method.
Since reference answer accuracy evals specifically test how well your RAG system retrieves and uses information from your documents, you'll also want to test different search tool configurations (a sketch of enumerating such a grid follows this list):
A range of extraction models
Different chunking strategies: fixed window or semantic, with varying chunk sizes and overlap
A range of embedding models for search
Different search index configurations: full-text search, vector search, or hybrid search, with varying K values
A range of reranking models, with varying N top results
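One systematic way to cover these options is to enumerate a grid of configurations and evaluate each one. The parameter names and values below are examples, not Kiln's exact setting names:

```python
from itertools import product

# Illustrative only: values and names are examples of settings you might compare.
chunking_strategies = ["fixed_512_overlap_64", "semantic"]
embedding_models = ["embedding_model_a", "embedding_model_b"]
search_modes = ["full_text", "vector", "hybrid"]
top_k_values = [5, 10]

search_tool_configs = [
    {"chunking": c, "embedding_model": e, "search": s, "k": k}
    for c, e, s, k in product(chunking_strategies, embedding_models, search_modes, top_k_values)
]
print(f"{len(search_tool_configs)} search tool configurations to compare")  # 2 * 2 * 3 * 2 = 24
```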
Once you've defined a set of run configurations (combining different task model options and search tool configurations), click "Run Eval" to test them against your Q&A dataset.
Comparing Results
After the eval completes, you'll see average scores for each run configuration. The highest average score indicates the best-performing configuration for your RAG system.
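Conceptually, comparing results amounts to averaging judge scores per configuration and ranking them. A tiny sketch with made-up configuration names and scores:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (run configuration, judge score) results for illustration.
results = [
    ("large model + hybrid search, k=10", 4.6),
    ("large model + hybrid search, k=10", 4.2),
    ("small model + vector search, k=5", 3.8),
    ("small model + vector search, k=5", 4.0),
]

by_config: dict[str, list[float]] = defaultdict(list)
for config_name, score in results:
    by_config[config_name].append(score)

# Highest average score first: the best-performing configuration.
for config_name, scores in sorted(by_config.items(), key=lambda kv: -mean(kv[1])):
    print(f"{config_name}: average score {mean(scores):.2f}")
```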
This systematic approach helps you find the optimal combination of task model and Search Tool settings for answering questions from your document corpus.