Evals & Specs

Two powerful tools to measure and optimize your AI systems

Evals vs Specs

Kiln has two powerful features that help ensure your AI systems perform as expected, drive optimizations, and don't regress in quality:

  • Evals: Build industry-standard evals with methods like LLM-as-Judge and G-Eval.

  • Specs: A Kiln spec includes an eval, but adds synthetic evaluation data generation, edge case detection, judge prompt generation, and more. It's an easier, faster, and more comprehensive way to build evals.

| | Kiln Evals | Kiln Specs |
| --- | --- | --- |
| LLM-as-Judge (including G-Eval) | Yes | Yes |
| Judge Prompt Creation | Manual | Automatic |
| Edge Case Discovery | Manual | Automatic |
| Eval Data Creation | Manual, with synthetic tooling | Automatic |
| Eval Accuracy | Variable | High, with human-in-the-loop validation and refinement |
| Approx. Effort | 30+ mins | 5-10 mins |
| Needed Expertise | Data science basics: understand golden sets and data labeling | No experience necessary: fully guided UI |
| Kiln Account | Optional | Required |
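
The "Eval Accuracy" row comes down to how often a judge's verdicts match human judgment. As a rough illustration only (this is not Kiln's API; the function name `judge_agreement` and the sample labels are hypothetical), agreement against a small hand-labeled golden set could be measured like this:

```python
from typing import Sequence


def judge_agreement(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of golden-set items where the judge's pass/fail verdict matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("judge and human label lists must be the same length")
    if not human_labels:
        raise ValueError("golden set is empty")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Hypothetical golden set of five items: human pass/fail labels vs. judge verdicts.
human = [True, True, False, True, False]
judge = [True, False, False, True, False]
agreement = judge_agreement(judge, human)
print(f"Judge agrees with human labels on {agreement:.0%} of items")
```

A low agreement rate suggests the judge prompt needs refinement before its scores can be trusted. A real golden set would be much larger than five items.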

Guides

  • Specs Guide: build an eval, generate synthetic data, and align your judge in one interactive flow

  • Evals 101: build your first eval start to finish

  • Many Small Evals Beat One Big Eval: a blog post that walks through how to set up eval tooling and how to create an eval culture on your team

  • Evaluate RAG Accuracy: Kiln can generate custom Q&A evals which test your RAG with knowledge from your documents

  • Evaluate Tool Use: use tool use evals to ensure your agents call the right tools, at the right time, with the right parameters
