Evals & Specs
Two powerful tools to measure and optimize your AI systems
Evals vs Specs
Kiln has two features that help ensure your AI systems perform as expected, guide optimization, and don't regress in quality:
Evals: Build industry-standard evals with methods like LLM-as-Judge and G-Eval.
Specs: A Kiln spec includes an eval, and adds synthetic evaluation data generation, edge case detection, judge prompt generation, and more. It's an easier, faster, and more comprehensive way to build evals.
| | Evals | Specs |
| --- | --- | --- |
| LLM-as-Judge (including G-Eval) | ✅ | ✅ |
| Judge Prompt Creation | Manual | Automatic |
| Edge Case Discovery | Manual | Automatic |
| Eval Data Creation | Manual, with synthetic tooling | Automatic |
| Eval Accuracy | Variable | High, with human-in-the-loop validation and refinement |
| Approx. Effort | 30 mins+ | 5-10 mins |
| Needed Expertise | Data science basics: understand golden sets and data labeling | No experience necessary: fully guided UI |
| Kiln Account | Optional | Required |
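
As background on the G-Eval method mentioned above: a plain LLM-as-Judge takes the single rating the judge model outputs, while G-Eval weights each candidate rating by the probability the judge assigned to it. The snippet below is a minimal sketch of that weighting only. The function name `g_eval_score` and the input format (a mapping from rating to log-probability) are assumptions made for illustration, and this is not necessarily how Kiln implements it.

```python
import math


def g_eval_score(score_logprobs: dict[int, float]) -> float:
    """Return the probability-weighted rating over the candidate scores.

    score_logprobs maps each candidate rating (for example 1-5) to the
    log-probability the judge model gave that rating token. This input
    format is a hypothetical one chosen for the sketch.
    """
    # Convert log-probabilities to probabilities.
    probs = {score: math.exp(lp) for score, lp in score_logprobs.items()}
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no probability mass on any candidate score")
    # Normalize over the candidate scores, then take the expected value.
    return sum(score * p / total for score, p in probs.items())


# The judge is 20% sure of a 3, 50% sure of a 4, and 30% sure of a 5.
example = {3: math.log(0.2), 4: math.log(0.5), 5: math.log(0.3)}
print(round(g_eval_score(example), 2))  # prints 4.1
```

A plain LLM-as-Judge would report 4 for this example, the single most likely rating. The weighted score of 4.1 keeps the information that the judge leaned toward 5 more than toward 3.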
Guides
Specs Guide: build an eval, synthetic data, and align your judge in one interactive flow
Evals 101: build your first eval start to finish
Many Small Evals Beat One Big Eval: a blog post that walks through how to set up eval tooling and how to create an eval culture on your team.
Evaluate RAG Accuracy: Kiln can generate custom Q&A evals which test your RAG with knowledge from your documents
Evaluate Tool Use: use tool-use evals to ensure your agents call the right tools, at the right time, with the right parameters
Use Kiln Evals on External Agents: If you've built agents in another platform, you can still evaluate them in Kiln using our MCP connectors.