
AI Scorecard Guidelines

The aim is to create clear, concise, and structured questions that deliver consistent, high-quality results and accurate reporting.

Written by Jordan McGovern
Updated over a week ago

1. Question Text = Title

  • Keep the question text short (1–3 words).

  • Think of it as a reporting label, not the full question.

Examples: Empathy, Verification, Resolution, Tone

❌ Bad: “Did the agent demonstrate empathy throughout the conversation?”
✅ Good: “Empathy”


2. Tip/Guidance = Structured Rubric

Each question should have a detailed Tip/Guidance section that acts like a mini scoring rubric. This ensures evaluators (and AI) know exactly what is being measured and how to apply each answer option.

Structure your guidance as follows:

  1. Full Question (in natural language) – state what you are asking.

  2. Scoring Instructions – explain what each answer option means.

  3. Examples – give concrete examples of poor, average, and excellent performance.


Example: Empathy (1–5 Scale + N/A)

Question:

Empathy

Tip:

Did the agent demonstrate empathy towards the customer?

Scoring Guidance:

1 – Poor: No acknowledgement of customer’s feelings; dismissive or robotic.

2 – Weak: Minimal attempt at empathy; generic phrase without personalisation.

3 – Adequate: Acknowledges the customer but in a limited or scripted way.

4 – Strong: Genuine empathy with personalised acknowledgement.

5 – Exceptional: Consistently empathetic throughout, with tailored and reassuring responses.

N/A – Not Applicable: Use if the customer’s statements did not provide any opportunity for empathy.

Examples:

Score 1 (Poor): “You’ll have to wait.” (No acknowledgement of frustration)

Score 3 (Adequate): “I understand this must be frustrating.” (Acknowledges but generic)

Score 5 (Exceptional): “I completely understand how important this is for you. Let’s fix it right away.” (Personalised and reassuring)

N/A: Customer only asked for store opening hours (no emotional content).
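
One way to keep every Tip in the same three-part shape (full question, scoring instructions, examples) is to build it from structured data. The sketch below is only an illustration: `Rubric` and `render_tip` are hypothetical names, not part of any product API, and both the output layout and the mapping of 1 to 5 onto the Poor-to-Exceptional levels are assumptions modelled on the Empathy example above.

```python
from dataclasses import dataclass, field


@dataclass
class Rubric:
    """Hypothetical container for one scorecard question."""
    title: str                # short reporting label, e.g. "Empathy"
    full_question: str        # the question in natural language
    scoring: dict[str, str]   # answer label -> what that answer means
    examples: dict[str, str] = field(default_factory=dict)  # answer label -> sample


def render_tip(rubric: Rubric) -> str:
    """Assemble the Tip text: full question, scoring guidance, then examples."""
    lines = [rubric.full_question, "", "Scoring Guidance:"]
    lines += [f"{label}: {meaning}" for label, meaning in rubric.scoring.items()]
    if rubric.examples:
        lines += ["", "Examples:"]
        lines += [f"Score {label}: {sample}" for label, sample in rubric.examples.items()]
    return "\n".join(lines)


empathy = Rubric(
    title="Empathy",
    full_question="Did the agent demonstrate empathy towards the customer?",
    scoring={
        # Assumed mapping: 1 = Poor ... 5 = Exceptional
        "1": "No acknowledgement of customer's feelings; dismissive or robotic.",
        "2": "Minimal attempt at empathy; generic phrase without personalisation.",
        "3": "Acknowledges the customer but in a limited or scripted way.",
        "4": "Genuine empathy with personalised acknowledgement.",
        "5": "Consistently empathetic throughout, with tailored and reassuring responses.",
        "N/A": "The customer's statements gave no opportunity for empathy.",
    },
    examples={
        "1": "You'll have to wait.",
        "3": "I understand this must be frustrating.",
        "N/A": "Customer only asked for store opening hours.",
    },
)

tip = render_tip(empathy)
print(tip)
```

Keeping the title separate from the rendered Tip mirrors the split between the short reporting label and the detailed rubric.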


3. Always Include N/A Where Relevant

  • Why: Not all behaviours apply in every interaction. Forcing a score without N/A leads to inconsistent results and may cause the AI to hallucinate.

  • When: Add N/A to questions on compliance, emotion, resolution, or any behaviour that may not arise in every conversation.

  • Benefit: Keeps scoring accurate and reporting clean.

❌ Without N/A: The AI invents empathy where none exists.
✅ With N/A: The AI cleanly excludes irrelevant behaviours.
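
To see why N/A keeps reporting clean, consider how an average behaves with and without it. This is a minimal sketch using made-up scores; `average_score` is a hypothetical helper, and treating N/A as excluded from the average is an assumption about how a report might be calculated.

```python
from typing import Optional

NA = "N/A"


def average_score(answers: list[str]) -> Optional[float]:
    """Mean of the numeric answers, skipping N/A; None if nothing was scoreable."""
    numeric = [int(a) for a in answers if a != NA]
    if not numeric:
        return None
    return sum(numeric) / len(numeric)


# Made-up Empathy answers for four conversations.
# Two of them had no emotional content at all.
with_na = ["4", "N/A", "5", "N/A"]
# The same conversations when N/A is not offered and a score is forced.
forced = ["4", "1", "5", "1"]

print(average_score(with_na))  # 4.5
print(average_score(forced))   # 2.75
```

In this illustration the forced scores drag the average down even though the agent did nothing wrong in the two conversations where empathy never came up.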


4. Match Answer Labels to Question Type

Use the right labels for the behaviour being measured, and keep them short and simple.

  • Yes / No (+ N/A) → best for factual behaviours (Verification, Greeting).

  • Pass / Fail (+ N/A) → best for compliance behaviours.

  • 1–5 (+ N/A) → best for graded behaviours (Empathy, Tone, Product Knowledge).

⚠️ Important: Avoid long or descriptive answer labels.

  • If you use labels like “5 – Exceptional”, the LLM may output only “Exceptional”, which no longer matches the configured label.

  • Instead, keep the label short (1, 2, 3, 4, 5) and explain what each means in the scoring guidance.

Examples:

  • ❌ Bad: “5 – Exceptional empathy demonstrated”

  • ✅ Good: “5” (with explanation in guidance: Consistently empathetic, personalised, and reassuring)
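
The mismatch described above can be illustrated with a small sketch. It assumes the model's raw output is compared against the configured labels by exact string match after trimming whitespace; the source does not say how matching actually works, and `match_label` is a hypothetical function.

```python
from typing import Optional


def match_label(model_output: str, labels: list[str]) -> Optional[str]:
    """Return the configured label equal to the model's output, or None if there is no match."""
    cleaned = model_output.strip()
    return cleaned if cleaned in labels else None


long_labels = ["1 – Poor", "2 – Weak", "3 – Adequate", "4 – Strong", "5 – Exceptional", "N/A"]
short_labels = ["1", "2", "3", "4", "5", "N/A"]

# The model shortens a descriptive label, so nothing matches.
print(match_label("Exceptional", long_labels))  # None

# A short label leaves little room for the output to drift.
print(match_label(" 5 ", short_labels))  # 5
```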


5. One Behaviour per Question (Granularity Matters)

  • Manual scorecards are often designed for humans, combining multiple behaviours into one broad question (e.g. Communication).

  • Problem for AI: Grouped questions force the model to assess several behaviours at once, reducing accuracy and consistency.

  • AI Best Practice: Break down broad behaviours into granular questions that measure one behaviour each.

Example:

  • ❌ Manual-style: “Communication” (covers empathy, tone, clarity, grammar).

  • ✅ AI-style:

    • Empathy (Did the agent acknowledge the customer’s feelings?)

    • Clarity (Were explanations clear and easy to follow?)

    • Tone (Was the tone professional and appropriate?)

    • Spelling & Grammar (Was written communication free of errors?)

Tip:
Granularity improves both scoring quality and reporting insight: it shows, for example, when an agent struggles with empathy but excels in clarity.
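
A short arithmetic sketch of that reporting benefit follows. The scores are made up, `mean_by_question` is a hypothetical helper, and approximating a single grouped "Communication" score as the mean of all the underlying scores is a simplifying assumption.

```python
from collections import defaultdict

# Made-up results for one agent across two conversations: (question title, score).
results = [
    ("Empathy", 2), ("Clarity", 5),
    ("Empathy", 1), ("Clarity", 4),
]


def mean_by_question(rows: list[tuple[str, int]]) -> dict[str, float]:
    """Average score per question title."""
    scores = defaultdict(list)
    for title, score in rows:
        scores[title].append(score)
    return {title: sum(values) / len(values) for title, values in scores.items()}


granular = mean_by_question(results)
# One broad question, approximated here as the mean of everything.
grouped = sum(score for _, score in results) / len(results)

print(granular)  # {'Empathy': 1.5, 'Clarity': 4.5}
print(grouped)   # 3.0
```

The grouped figure looks average, while the granular figures show where coaching is needed.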


6. Quick Checklist

✅ Title-style question text (short & report-friendly)
✅ Full question written in natural language
✅ Scoring rules for each answer option
✅ N/A included where appropriate
✅ Examples to illustrate performance levels
✅ One behaviour per question (granularity over grouping)
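
Most of this checklist can be checked mechanically before a scorecard goes live. The sketch below is illustrative only: `check_question` and the dictionary keys it reads (`title`, `full_question`, `labels`, `scoring`, `examples`) are hypothetical, not a real product schema. The final item, one behaviour per question, still needs human judgement and is not checked here.

```python
def check_question(q: dict) -> list[str]:
    """Return the checklist items a question fails; an empty list means it passes."""
    problems = []
    # Title-style question text: 1 to 3 words.
    if not 1 <= len(q["title"].split()) <= 3:
        problems.append("Title should be 1-3 words")
    # Full question written in natural language.
    if not q.get("full_question", "").strip().endswith("?"):
        problems.append("Full question missing or not phrased as a question")
    # Scoring rules for each answer option.
    missing = [label for label in q["labels"] if label not in q.get("scoring", {})]
    if missing:
        problems.append("No scoring rule for: " + ", ".join(missing))
    # N/A included (a reviewer decides whether it is appropriate to omit).
    if "N/A" not in q["labels"]:
        problems.append("No N/A option")
    # Examples to illustrate performance levels.
    if not q.get("examples"):
        problems.append("No examples")
    return problems


bad = {
    "title": "Did the agent demonstrate empathy throughout the conversation?",
    "full_question": "",
    "labels": ["Yes", "No"],
    "scoring": {"Yes": "Agent acknowledged the customer's feelings."},
    "examples": {},
}
good = {
    "title": "Verification",
    "full_question": "Did the agent verify the customer's identity?",
    "labels": ["Yes", "No", "N/A"],
    "scoring": {
        "Yes": "Identity was verified before account details were discussed.",
        "No": "Account details were discussed without verification.",
        "N/A": "No account-specific information was requested.",
    },
    "examples": {"Yes": "Can you confirm your postcode and date of birth?"},
}

print(check_question(bad))   # five problems
print(check_question(good))  # []
```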
