What Are Evals?
Evaluations (Evals) are systematic assessments designed to measure the performance, reliability, and accuracy of Generative AI (GenAI) applications. They help prompt engineers and AI teams optimize their prompts, models, and responses to ensure alignment with business objectives and user expectations.
By leveraging Evals, teams can continuously refine their AI applications, improve response quality, and identify potential failure points before deploying models into production.
Types of Evals
Evals can be broadly categorized into three main types:
Human-in-the-Loop (HITL) Evals: Manual assessments where human reviewers rate model outputs for quality, consistency, and effectiveness.
Automated Evals: Predefined tests that assess model performance based on structured criteria such as accuracy, relevance, and coherence. These can either be deterministic-based Evals (using code with regular expressions for example) or LLM-based Evals.
LLM-as-a-Judge (Custom Evals): AI-driven evaluations where another language model (LLM) assesses responses using predefined scoring mechanisms or heuristics.
Each evaluation type has its place depending on the complexity of the task, the stage of development, and the required level of accuracy.
When and How to Use Evals?
Evals should be incorporated throughout the lifecycle of a GenAI application, from initial development to production deployment. Key use cases include:
Prompt Optimization: Testing different prompt structures to determine the most effective phrasing.
Model Comparison: Benchmarking different AI models to select the best-performing option.
Regression Testing: Ensuring new changes do not degrade existing performance.
Bias and Safety Checks: Detecting unwanted biases or potentially harmful outputs.
Inputs Sanitation: Assuring only certain inputs are indeed processed.
Using Evals effectively involves:
Defining the evaluation criteria and desired outcomes.
Selecting the appropriate type of Eval (Automated, HITL, or LLM-as-a-Judge).
Running evaluations on sample queries and reviewing the results.
Iterating on prompts and model configurations based on findings.