1. Question Text = Title
Keep the question text short (1–3 words).
Think of it as a reporting label, not the full question.
Good examples: Empathy, Verification, Resolution, Tone
❌ Bad: “Did the agent demonstrate empathy throughout the conversation?”
✅ Good: “Empathy”
2. Tip/Guidance = Structured Rubric
Each question should have a detailed Tip/Guidance section that acts like a mini scoring rubric. This ensures evaluators (and AI) know exactly what is being measured and how to apply each answer option.
Structure your guidance as follows:
Full Question (in natural language) – state what you are asking.
Scoring Instructions – explain what each answer option means.
Examples – give concrete examples of poor, mid, and excellent performance.
Example: Empathy (1–5 Scale + N/A)
Question: Empathy

Tip:
Did the agent demonstrate empathy towards the customer?

Scoring Guidance:
1 – Poor: No acknowledgement of the customer’s feelings; dismissive or robotic.
2 – Weak: Minimal attempt at empathy; a generic phrase without personalisation.
3 – Adequate: Acknowledges the customer but in a limited or scripted way.
4 – Strong: Genuine empathy with personalised acknowledgement.
5 – Exceptional: Consistently empathetic throughout, with tailored and reassuring responses.
N/A – Not Applicable: Use if the customer’s statements did not provide any opportunity for empathy.

Examples:
Score 1 (Poor): “You’ll have to wait.” (No acknowledgement of frustration)
Score 3 (Adequate): “I understand this must be frustrating.” (Acknowledges but generic)
Score 5 (Exceptional): “I completely understand how important this is for you — let’s fix it right away.” (Personalised and reassuring)
N/A: Customer only asked for store opening hours (no emotional content).
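If you define questions programmatically, or assemble the LLM prompt yourself, this structure maps naturally onto a small record. Here is a minimal Python sketch; the `ScorecardQuestion` class and its field names are illustrative assumptions, not any particular platform’s schema:

```python
from dataclasses import dataclass, field

@dataclass
class ScorecardQuestion:
    title: str                        # short, report-friendly label (1-3 words)
    full_question: str                # the natural-language question, kept in the guidance
    options: list[str]                # short answer labels the model must output
    scoring_guidance: dict[str, str]  # what each answer option means
    examples: list[str] = field(default_factory=list)

empathy = ScorecardQuestion(
    title="Empathy",
    full_question="Did the agent demonstrate empathy towards the customer?",
    options=["1", "2", "3", "4", "5", "N/A"],
    scoring_guidance={
        "1": "Poor: no acknowledgement of feelings; dismissive or robotic.",
        "2": "Weak: minimal attempt; generic phrase without personalisation.",
        "3": "Adequate: acknowledges the customer, but limited or scripted.",
        "4": "Strong: genuine empathy with personalised acknowledgement.",
        "5": "Exceptional: consistently empathetic, tailored and reassuring.",
        "N/A": "No opportunity for empathy in the conversation.",
    },
    examples=[
        'Score 1: "You\'ll have to wait." (no acknowledgement of frustration)',
        'Score 3: "I understand this must be frustrating." (generic)',
        "N/A: customer only asked for store opening hours.",
    ],
)
```

Keeping the short title and the full question as separate fields is what lets reports stay clean while the AI still sees the real question.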
3. Always Include N/A Where Relevant
Why: Not all behaviours apply in every interaction. Forcing a score without N/A leads to inconsistent results and may cause the AI to hallucinate.
When: Add N/A to questions on compliance, emotion, resolution, or any behaviour that may not arise in every conversation.
Benefit: Keeps scoring accurate and reporting clean.
❌ Without N/A: The AI invents empathy where none exists.
✅ With N/A: The AI cleanly excludes irrelevant behaviours.
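In downstream reporting, the point of N/A is that it is excluded rather than counted as a zero. A sketch of that aggregation, assuming the short numeric labels plus “N/A” recommended in this guide:

```python
def section_score(answers: dict[str, str]) -> float | None:
    """Average the numeric answers for a section, skipping N/A.

    `answers` maps question titles to the short labels the model
    returned, e.g. {"Empathy": "4", "Tone": "N/A"}.
    """
    scores = [int(v) for v in answers.values() if v != "N/A"]
    return sum(scores) / len(scores) if scores else None

# N/A drops out of the average instead of dragging it down:
print(section_score({"Empathy": "4", "Tone": "N/A", "Clarity": "5"}))  # 4.5
print(section_score({"Compliance": "N/A"}))                           # None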
4. Match Answer Labels to Question Type
Use the right labels for the behaviour being measured, and keep them short and simple.
Yes / No (+ N/A) → best for factual behaviours (Verification, Greeting).
Pass / Fail / N/A → best for compliance behaviours.
1–5 (+ N/A) → best for graded behaviours (Empathy, Tone, Product Knowledge).
⚠️ Important: Avoid long or descriptive answer labels.
If you use labels like “5 – Exceptional”, the LLM may only output “Exceptional”, causing a mismatch.
Instead, keep the label short (1, 2, 3, 4, 5) and explain what each means in the scoring guidance.
Examples:
❌ Bad: “5 – Exceptional empathy demonstrated”
✅ Good: “5” (with explanation in guidance: Consistently empathetic, personalised, and reassuring)
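If you generate or validate questions in code, the pairing of question type to label set is easy to make explicit, and over-long labels are easy to catch before they reach the model. A sketch under the same assumptions as above (`ANSWER_SETS` and `check_labels` are illustrative names, not a real API):

```python
# Illustrative label sets per question type (not any specific tool's API):
ANSWER_SETS = {
    "factual":    ["Yes", "No", "N/A"],              # e.g. Verification, Greeting
    "compliance": ["Pass", "Fail", "N/A"],
    "graded":     ["1", "2", "3", "4", "5", "N/A"],  # e.g. Empathy, Tone
}

def check_labels(options: list[str], max_len: int = 4) -> list[str]:
    """Flag answer labels long enough that an LLM may paraphrase them."""
    return [o for o in options if len(o) > max_len]

print(check_labels(ANSWER_SETS["graded"]))                     # []
print(check_labels(["5 - Exceptional empathy demonstrated"]))  # flagged
```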
5. One Behaviour per Question (Granularity Matters)
Manual scorecards are often designed for humans, combining multiple behaviours into one broad question (e.g. Communication).
Problem for AI: Grouped questions force the model to assess several behaviours at once, reducing accuracy and consistency.
AI Best Practice: Break down broad behaviours into granular questions that measure one behaviour each.
Example:
❌ Manual-style: “Communication” (covers empathy, tone, clarity, grammar).
✅ AI-style:
Empathy (Did the agent acknowledge the customer’s feelings?)
Clarity (Were explanations clear and easy to follow?)
Tone (Was the tone professional and appropriate?)
Spelling & Grammar (Was written communication free of errors?)
Tip:
Granularity improves both scoring quality and reporting insight — making it clear if an agent struggles with empathy but excels in clarity.
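To make the reporting benefit concrete: with granular questions, per-behaviour averages surface exactly where an agent needs coaching. A small illustration with made-up scores:

```python
from statistics import mean

# Answers to granular questions across three conversations (made-up data):
results = [
    {"Empathy": 2, "Clarity": 5, "Tone": 4, "Spelling & Grammar": 5},
    {"Empathy": 1, "Clarity": 4, "Tone": 4, "Spelling & Grammar": 5},
    {"Empathy": 2, "Clarity": 5, "Tone": 3, "Spelling & Grammar": 4},
]

# A single grouped "Communication" score would blur this; per-behaviour
# averages show the agent struggles with empathy but excels at clarity:
for behaviour in results[0]:
    print(f"{behaviour}: {mean(r[behaviour] for r in results):.1f}")
# Empathy: 1.7 | Clarity: 4.7 | Tone: 3.7 | Spelling & Grammar: 4.7
```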
6. Quick Checklist
✅ Title-style question text (short & report-friendly)
✅ Full question written in natural language
✅ Scoring rules for each answer option
✅ N/A included where appropriate
✅ Examples to illustrate performance levels
✅ One behaviour per question (granularity over grouping)
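Most of this checklist can be enforced mechanically. A sketch, assuming a question definition shaped like the section 2 example (the “one behaviour per question” rule still needs a human judgement call):

```python
def lint_question(q: dict) -> list[str]:
    """Run the checklist against a question definition (a dict shaped
    like the section 2 sketch)."""
    issues = []
    if len(q.get("title", "").split()) > 3:
        issues.append("title should be 1-3 words")
    if not q.get("full_question"):
        issues.append("write the full question in natural language")
    if not q.get("scoring_guidance"):
        issues.append("add scoring rules for each answer option")
    if "N/A" not in q.get("options", []):
        issues.append("consider adding an N/A option")
    if not q.get("examples"):
        issues.append("add examples of performance levels")
    if any(len(o) > 4 for o in q.get("options", [])):
        issues.append("keep answer labels short ('5', not '5 - Exceptional')")
    return issues

print(lint_question({
    "title": "Empathy",
    "full_question": "Did the agent demonstrate empathy towards the customer?",
    "options": ["1", "2", "3", "4", "5", "N/A"],
    "scoring_guidance": {"1": "Poor", "5": "Exceptional"},
    "examples": ['Score 1: "You\'ll have to wait."'],
}))  # [] - passes every check
```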