Using Custom LLM Evals

Leveraging AI-driven evaluations to assess model responses.

How to Use Custom LLM Eval

  • Open the Evaluations Hub by clicking the “Edit Evaluations” button on the prompt header.

  • In the Evaluations Hub, click the “Create New Custom LLM Eval” button.

  • Select the Eval Type:

    • Binary (Yes/No) – A simple pass/fail evaluation.

    • Multiple Choice – A set of categories or labels with individual scores (e.g., 1-5 rating).

    • Numeric Score – A numerical evaluation (e.g., 0-100).

  • Enter a meaningful name for your eval. This step is important: a meaningful name helps the model used to assess the eval understand the goal of the eval prompt.

  • Write a prompt that explains the evaluation criteria to the model. Your prompt should match the type of eval you are creating. For example, if you’re defining a Binary Eval, you can use a yes/no question as your prompt. A few examples follow; a minimal code sketch of the binary case appears after these steps:

Evaluate the quality of this summary based on accuracy, completeness, and clarity. Assign one of these categories as the score: Poor, Fair, Good, Excellent.

Scores: Poor - 1 | Fair - 2 | Good - 3 | Excellent - 4

-------

Does the response use information provided as part of the input, and is it grounded in one or more of the documents provided?

Scores: Yes - 1 | No - 0

-------

You are a math exam checker. Grade the correctness of the answer provided in the response with a score between 0 and 100. Reduce the score significantly only for critical mathematical mistakes.

Scores: Range 0 - 100

  • Choose a vendor and model from the supported list. If required, enter an API key for model access.

  • Choose whether the eval should run on:

    • The prompt text.

    • The response from the model.

  • (Optional) Apply preprocessing filters such as regex or JSON key extraction.

  • Click Save to add the eval to the library.

Apply the Eval to a Prompt Run

  • Navigate to a prompt you would like to experiment on with the new Eval, and apply it.

  • Run your experiment and view aggregated scores and detailed eval breakdowns.
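
For illustration, here is a minimal Python sketch of how a Binary eval such as the grounding example above could be scored. It is a sketch only: call_judge_model and run_binary_eval are hypothetical names standing in for whichever vendor and model you configured, not part of the product.

    # Minimal sketch of a Binary (Yes/No) custom LLM eval.
    # call_judge_model is a hypothetical stand-in for the judge model
    # selected in the Evaluations Hub; run_binary_eval is illustrative only.

    EVAL_PROMPT = (
        "Does the response use information provided as part of the input, "
        "and is it grounded in one or more of the documents provided? "
        "Answer with exactly one word: Yes or No."
    )

    def call_judge_model(prompt: str) -> str:
        """Send the eval prompt to the configured judge model and return its reply."""
        raise NotImplementedError("Wire this up to your model provider.")

    def run_binary_eval(model_response: str) -> int:
        """Score one prompt run: Yes -> 1, No -> 0."""
        judge_input = f"{EVAL_PROMPT}\n\nResponse to evaluate:\n{model_response}"
        verdict = call_judge_model(judge_input).strip().lower()
        return 1 if verdict.startswith("yes") else 0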

Understanding the Eval Results

Once an eval is applied, results are summarized based on the eval type; a short sketch of these aggregations follows the list:

  • Multiple Choice: Displays the distribution of scores across runs.

  • Binary: Shows pass/fail rates.

  • Numeric Score: Includes mean, median, mode, and standard deviation.
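
As a rough illustration of these summaries, the sketch below computes each aggregation with Python's standard library; the score lists are made-up examples, not real eval output.

    # Sketch of the per-type aggregations, using only the standard library.
    from collections import Counter
    from statistics import mean, median, mode, stdev

    binary_scores = [1, 1, 0, 1, 1]          # Yes/No eval across runs
    choice_scores = [3, 4, 4, 2, 4, 3]       # Poor=1 ... Excellent=4
    numeric_scores = [88, 92, 75, 92, 81]    # 0-100 eval

    pass_rate = sum(binary_scores) / len(binary_scores)   # Binary: pass/fail rate
    distribution = Counter(choice_scores)                  # Multiple Choice: score distribution
    numeric_summary = {                                     # Numeric Score: summary statistics
        "mean": mean(numeric_scores),
        "median": median(numeric_scores),
        "mode": mode(numeric_scores),
        "stdev": stdev(numeric_scores),
    }

    print(pass_rate, distribution, numeric_summary)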

Custom Eval Best Practices

  • Keep eval prompts clear and specific to ensure reliable results.

  • Use test runs on your Eval prompt to optimize the eval criteria before applying them at scale.

  • Leverage filters to focus on relevant parts of responses (see the sketch after this list).

  • Analyze score distributions to spot trends and improve model performance.
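
To make the filter idea concrete, here is a small sketch of regex and JSON key extraction applied to a raw response before an eval runs; extract_for_eval is an illustrative name, not a product API.

    # Sketch of preprocessing filters: keep only the part of the
    # response the eval should look at. Illustrative names, Python 3.10+.
    import json
    import re

    def extract_for_eval(raw_response: str, json_key: str | None = None,
                         pattern: str | None = None) -> str:
        text = raw_response
        if json_key is not None:
            # JSON key extraction, e.g. keep only the "summary" field
            text = json.loads(text)[json_key]
        if pattern is not None:
            # Regex extraction, e.g. keep only the text after "Final answer:"
            match = re.search(pattern, text, flags=re.DOTALL)
            text = match.group(1) if match else text
        return text

    raw = '{"summary": "Final answer: Paris is the capital of France."}'
    print(extract_for_eval(raw, json_key="summary", pattern=r"Final answer:\s*(.*)"))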
