Evals & Specs

Two powerful tools to measure and optimize your AI systems

Evals vs Specs

Kiln has two features that help ensure your AI systems perform as expected, guide optimization, and don't regress in quality:

  • Evals: Build industry-standard evals with methods like LLM-as-Judge and G-Eval.

  • Specs: A Kiln spec includes an eval, and adds synthetic evaluation data generation, edge case detection, judge prompt generation, and more. It's an easier, faster, and more comprehensive way to build evals.
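To make the two methods named above concrete, here is a minimal sketch of the general ideas, not Kiln's implementation. It assumes a judge is any callable that returns pass/fail for an output, and that a G-Eval-style score is the probability-weighted average of the ratings a judge model could emit. The names `llm_as_judge_pass_rate`, `g_eval_weighted_score`, and `stub_judge` are hypothetical, and the stub stands in for a real model call.

```python
from typing import Callable, Mapping, Sequence


def llm_as_judge_pass_rate(outputs: Sequence[str], judge: Callable[[str], bool]) -> float:
    """Return the fraction of outputs that the judge marks as passing."""
    if not outputs:
        raise ValueError("need at least one output to evaluate")
    return sum(1 for output in outputs if judge(output)) / len(outputs)


def g_eval_weighted_score(score_probs: Mapping[int, float]) -> float:
    """Return the probability-weighted average of candidate ratings.

    score_probs maps each possible rating (e.g. 1 to 5) to the probability
    the judge model assigned to it. Probabilities are normalized here in
    case they do not sum exactly to 1.
    """
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(score * prob for score, prob in score_probs.items()) / total


def stub_judge(output: str) -> bool:
    # Hypothetical stand-in for a real LLM call: fail any apology-style refusal.
    return "sorry" not in output.lower()


rate = llm_as_judge_pass_rate(
    ["Here is the answer.", "Sorry, I can't help with that."], stub_judge
)
score = g_eval_weighted_score({1: 0.1, 2: 0.2, 3: 0.7})
```

In a real setup the judge would prompt a model with the evaluation criteria, and the rating probabilities would come from the model's token log-probabilities rather than a hand-written dictionary.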

| Feature | Kiln Evals | Kiln Specs |
|---|---|---|
| LLM-as-Judge (including G-Eval) | Yes | Yes |
| Judge Prompt Creation | Manual | Automatic |
| Edge Case Discovery | Manual | Automatic |
| Eval Data Creation | Manual, with synthetic tooling | Automatic |
| Eval Accuracy | Variable | High, with human-in-the-loop validation and refinement |
| Approx. Effort | 30+ mins | 5-10 mins |
| Needed Expertise | Data science basics: understand golden sets and data labeling | No experience necessary: fully guided UI |
| Kiln Account | Optional | Required |
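Eval accuracy and golden sets, both mentioned in the comparison above, can be illustrated with a small sketch. It assumes accuracy means how often a judge's pass/fail verdicts match human labels on a golden set. This is one common definition, and Kiln's own metrics may differ. The function name `judge_agreement` and the sample labels are hypothetical.

```python
from typing import Sequence


def judge_agreement(judge_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Return the fraction of golden-set items where judge and human agree."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("judge and human label lists must have the same length")
    if not human_labels:
        raise ValueError("golden set must not be empty")
    matches = sum(1 for j, h in zip(judge_labels, human_labels) if j == h)
    return matches / len(human_labels)


# Hypothetical golden set: human pass/fail labels for five model outputs.
human = [True, True, False, False, True]
# Hypothetical verdicts from an automated judge on the same five outputs.
judge = [True, False, False, False, True]

agreement = judge_agreement(judge, human)
```

A low agreement rate suggests the judge prompt needs refinement before its scores are trusted for comparing models or prompts.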

Guides
