The Run Results dashboard gives a comprehensive view of your experiment’s performance, combining high-level aggregated insights with detailed row-level drill-down capabilities.
Understanding the Aggregated Graphs
At the top of the Run Results page, you’ll find key performance metrics visualized in graphs. These provide an aggregated overview of your AI model’s performance for the specific Run:
Semantic Similarity Score
This gauge-style chart displays the overall similarity between AI-generated responses and expected outputs.
A higher score indicates greater alignment with the intended response.
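To make the idea concrete, here is a minimal sketch of one common way to compute semantic similarity: cosine similarity between embedding vectors of the expected output and the generated response. The embedding values below are made up for illustration, and Arato’s actual scoring method may differ.

```python
# Minimal sketch: cosine similarity between an expected output and a generated
# response, assuming you already have embedding vectors for both (e.g. from any
# embedding model). Arato's exact similarity metric is not documented here.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return cosine similarity in [-1, 1]; closer to 1 means stronger alignment."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embedding vectors for an expected answer and a generated response.
expected_vec = np.array([0.12, 0.84, 0.33, 0.41])
generated_vec = np.array([0.10, 0.80, 0.35, 0.45])

print(f"Semantic similarity: {cosine_similarity(expected_vec, generated_vec):.3f}")
```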
Response Word Count Distribution
A histogram that breaks down responses by length.
Helps identify whether responses are too brief or excessively verbose.
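The sketch below shows the kind of bucketing such a histogram is based on: counting words per response and grouping them into length ranges. The responses and bucket width are arbitrary examples, not Arato’s internal logic.

```python
# Illustrative sketch of bucketing responses by word count, similar in spirit
# to the word-count histogram on the dashboard.
from collections import Counter

responses = [
    "Sure, here is the summary you asked for.",
    "Yes.",
    "The report covers Q3 revenue, churn, and the updated hiring plan in detail.",
]

def word_count_bucket(text: str, width: int = 10) -> str:
    """Map a response to a word-count range such as '0-9 words'."""
    n = len(text.split())
    lower = (n // width) * width
    return f"{lower}-{lower + width - 1} words"

distribution = Counter(word_count_bucket(r) for r in responses)
print(distribution)  # e.g. Counter({'0-9 words': 2, '10-19 words': 1})
```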
Estimated Run Cost (USD)
Visualizes the cost incurred for executing the run.
Useful for budget management and cost-efficiency analysis.
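As a rough mental model, run cost is typically token usage multiplied by per-token prices, summed across rows. The prices and token counts below are placeholders for illustration only, not Arato’s or any provider’s actual rates.

```python
# Back-of-the-envelope run cost estimate: tokens used multiplied by per-token
# prices, summed over all rows in the run. All numbers are hypothetical.
PRICE_PER_1K_INPUT = 0.0005   # USD per 1K input tokens (placeholder)
PRICE_PER_1K_OUTPUT = 0.0015  # USD per 1K output tokens (placeholder)

def estimate_row_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

rows = [(1200, 300), (950, 410), (1600, 220)]  # (input_tokens, output_tokens) per row
run_cost = sum(estimate_row_cost(i, o) for i, o in rows)
print(f"Estimated run cost: ${run_cost:.4f}")
```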
Average Latency (sec)
Displays the average time taken to generate responses.
Helps measure responsiveness and efficiency.
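The underlying calculation is a simple mean over per-row response times, as in this small sketch with made-up latency values.

```python
# Simple sketch of the average-latency calculation over per-row response times.
from statistics import mean

latencies_sec = [1.8, 2.4, 1.1, 3.0, 2.2]  # one entry per row in the run (example values)
print(f"Average latency: {mean(latencies_sec):.2f} sec")
```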
Validation Success Percent
Shows how well responses align with predefined evaluation criteria.
Includes separate bars for different evaluation dimensions (e.g., Politeness, Relevance, Greeting, JSON Schema compliance).
A high percentage indicates strong model performance.
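A per-dimension success percentage can be thought of as the share of rows that pass each check, as sketched below. The dimension names mirror the examples above, but the data layout is hypothetical, not Arato’s internal format.

```python
# Sketch of computing a per-dimension validation success percentage from
# pass/fail eval results. Structure and values are hypothetical.
results = [
    {"Politeness": True,  "Relevance": True,  "Greeting": False, "JSON Schema": True},
    {"Politeness": True,  "Relevance": False, "Greeting": True,  "JSON Schema": True},
    {"Politeness": True,  "Relevance": True,  "Greeting": True,  "JSON Schema": True},
]

for dim in results[0].keys():
    passed = sum(1 for row in results if row[dim])
    print(f"{dim}: {passed / len(results):.0%} success")
```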
Drilling Down to Specific Run Details
Beyond high-level insights, Arato allows you to dive into individual results to analyze specific inputs and outputs:
Row-Level Breakdown
For each record in the run, you can view:
Input Variables: The exact user query, document references, or contextual parameters.
Generated Response: The AI’s output, including structured JSON responses when applicable.
Eval Scores: A breakdown of evaluation metrics (e.g., Politeness, Relevance, Correct JSON Schema).
Similarity Score: How well the generated response matches the expected answer.
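Putting these fields together, a single row-level result might look something like the record below. The field names and values are illustrative; Arato’s actual data shape and export format may differ.

```python
# Hypothetical shape of one row-level result, combining the fields listed above.
row_result = {
    "input_variables": {
        "user_query": "What is your refund policy?",
        "document_reference": "policies/refunds.md",
    },
    "generated_response": {
        "answer": "You can request a refund within 30 days of purchase.",
        "format": "json",
    },
    "eval_scores": {
        "Politeness": 1.0,
        "Relevance": 0.9,
        "Correct JSON Schema": 1.0,
    },
    "similarity_score": 0.87,
}

print(row_result["eval_scores"]["Relevance"])
```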