How to Use Custom LLM Eval
Open the Evaluations Hub by clicking the “Edit Evaluations” button on the prompt header.
In the Evaluations Hub, click the “Create New Custom LLM Eval” button.
Select the Eval Type:
Binary (Yes/No) – A simple pass/fail evaluation.
Multiple Choice – A set of categories or labels with individual scores (e.g., 1-5 rating).
Numeric Score – A numerical evaluation (e.g., 0-100).
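If it helps to think about these types concretely, the sketch below shows one possible way to represent them as data. The class names and fields are hypothetical illustrations, not part of the product.

```python
from dataclasses import dataclass
from enum import Enum

class EvalType(Enum):
    BINARY = "binary"            # simple pass/fail (Yes/No)
    MULTIPLE_CHOICE = "choice"   # categories or labels, each with its own score
    NUMERIC = "numeric"          # a number within a range, e.g. 0-100

@dataclass
class CustomEval:
    name: str
    eval_type: EvalType
    prompt: str
    # Only used for MULTIPLE_CHOICE: maps each label to its score
    choices: dict[str, int] | None = None

summary_quality = CustomEval(
    name="Summary quality",
    eval_type=EvalType.MULTIPLE_CHOICE,
    prompt="Evaluate the quality of this summary based on accuracy, completeness, and clarity.",
    choices={"Poor": 1, "Fair": 2, "Good": 3, "Excellent": 4},
)
```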
Enter a meaningful name for your eval. This step is important: a descriptive name for your Custom Eval prompt helps the model used to assess that eval understand the goal of the eval prompt.
Write a prompt that explains the evaluation criteria to the model. Your prompt should match the type of eval you are creating; for example, if you’re defining a Binary eval, you can use a yes/no question as your prompt. A few examples:
Evaluate the quality of this summary based on accuracy, completeness, and clarity. Assign one of these categories as the score: Poor, Fair, Good, Excellent.
Scores: Poor - 1 | Fair - 2 | Good - 3 | Excellent - 4
-------
Does the response use information provided as part of the input, and is it grounded in one or more of the documents provided?
Scores: Yes - 1 | No - 0
-------
You are a math exam checker. Grade the correctness of the answer provided in the response with a score between 0 and 100; deduct a significant number of points only for critical mathematical mistakes.
Scores: Range 0-100
Choose a vendor and model from the supported list. If required, enter an API key for model access.
Choose whether the eval should run on:
The prompt text.
The response from the model.
(Optional) Apply preprocessing filters such as regex or JSON key extraction (see the sketch after these steps).
Click Save to add the eval to the library.
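For the optional preprocessing step above, a filter typically isolates the part of the output you actually want to score. The snippet below is a hedged illustration of what a JSON key extraction filter with a regex fallback might do; the field name and helper function are hypothetical, not the product’s API.

```python
import json
import re

def extract_for_eval(response_text: str) -> str:
    """Hypothetical preprocessing filter: pull the 'summary' field out of a
    JSON response, falling back to a regex when the payload is not valid JSON."""
    try:
        # JSON key extraction: score only the field that matters
        return json.loads(response_text)["summary"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Regex fallback: capture text after a "Summary:" label, if present
        match = re.search(r"Summary:\s*(.+)", response_text, re.DOTALL)
        return match.group(1).strip() if match else response_text

print(extract_for_eval('{"summary": "The report covers Q3 revenue."}'))
# -> The report covers Q3 revenue.
```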
Apply the Eval to a Prompt Run
Navigate to a Prompt you would like to experiment with and apply the new Eval.
Run your experiment and view aggregated scores and detailed Eval breakdowns.
Understanding the Eval Results
Once an eval is applied, results are summarized based on the eval type:
Multiple Choice: Displays the distribution of scores across runs.
Binary: Shows pass/fail rates.
Numeric Score: Includes mean, median, mode, and standard deviation.
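To make those summary statistics concrete, the short sketch below computes them for a hypothetical set of numeric eval scores using Python’s standard library; it is only an illustration, not how the platform computes them.

```python
import statistics

# Hypothetical numeric eval scores (0-100 scale) collected across runs
scores = [88, 92, 75, 92, 60, 81, 92, 70]

print("mean:   ", statistics.mean(scores))    # average score
print("median: ", statistics.median(scores))  # middle value
print("mode:   ", statistics.mode(scores))    # most frequent score
print("stdev:  ", statistics.stdev(scores))   # sample standard deviation
```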
Custom Eval Best Practices
Keep eval prompts clear and specific to ensure reliable results.
Use test runs on your Eval prompt to optimize the eval criteria before applying them at scale.
Leverage filters to focus on relevant parts of responses.
Analyze score distributions to spot trends and improve model performance.