At Pencil, we measure the accuracy and grounding of AI-generated responses using an optimised Retrieval-Augmented Generation (RAG) pipeline across various large language models (LLMs). This ensures that responses are based on reliable, brand-specific information.
How We Measure Performance
We assess the performance of the brand library by asking a series of standardised chat-based questions. These questions help us test whether the AI can:
Recall – Can the AI accurately retrieve relevant information from the documents uploaded to the brand library?
Apply – Can the AI apply the correct facts from the brand documents to generate relevant, accurate responses?
By using these benchmarks, we ensure that the AI responses are grounded in the correct data and align with the brand’s guidelines.
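The recall check described above can be sketched as a simple scoring loop: run each standardised question through the model and verify that the expected brand facts appear in the answer. This is a minimal illustration, not Pencil's actual pipeline; the question set, the `generate_answer` function, and the substring-based fact check are all hypothetical placeholders.

```python
def contains_fact(answer: str, fact: str) -> bool:
    """Check whether an expected brand fact appears in the answer (naive match)."""
    return fact.lower() in answer.lower()

def score_responses(benchmark, generate_answer):
    """Ask each standardised question and score recall of expected facts."""
    results = []
    for item in benchmark:
        answer = generate_answer(item["question"])
        hits = [f for f in item["expected_facts"] if contains_fact(answer, f)]
        results.append({
            "question": item["question"],
            "recall": len(hits) / len(item["expected_facts"]),
        })
    return results

# Example usage with a stubbed model standing in for the real LLM.
benchmark = [
    {
        "question": "What is the brand's primary colour?",
        "expected_facts": ["teal"],
    },
]
stub_model = lambda question: "The brand's primary colour is teal."
scores = score_responses(benchmark, stub_model)
print(scores)  # each entry reports per-question fact recall
```

In practice, a production evaluation would replace the substring match with something more robust (for example, an LLM-based judge or semantic similarity), but the structure of the loop stays the same.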
For more details or questions, don’t hesitate to contact our support team!