
Evaluate AI with Microsoft AI Evaluation

December 2024

Implementing a 'demo' RAG is straightforward with AI libraries like Kernel Memory or Semantic Kernel. Production-grade AI applications, however, are significantly more challenging: AI programs are sensitive to change and inherently probabilistic, so traditional deterministic tests are a poor fit. Tools like Ragas address this. Following Microsoft.Extensions.AI, Microsoft has released a preview of Microsoft.Extensions.AI.Evaluation. Notably, Microsoft.Extensions.AI.Evaluation uses an 'LLM-as-a-Judge' approach to rate outcomes.

Let's implement a fact evaluation based on OpenAI's fact prompt.

Implement IEvaluator

  • generate a prompt from the input, the expected output and the actual output
  • interpret the prompt result as a rating (see the sketch after this list)
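
A minimal sketch of such an evaluator is below. It assumes the preview API shape at the time of writing (EvaluateAsync receiving the model response as a ChatMessage, and ChatConfiguration exposing the judge IChatClient via CompleteAsync); EvaluationExpert, the "Facts" metric name and the grading scale are illustrative choices for this post, not library types.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

public sealed class FactEvaluator : IEvaluator
{
    public const string FactsMetricName = "Facts"; // metric name chosen for this sample

    public IReadOnlyCollection<string> EvaluationMetricNames => [FactsMetricName];

    // Custom EvaluationContext carrying the expert (expected) answer.
    public sealed class EvaluationExpert(string expertAnswer) : EvaluationContext
    {
        public string ExpertAnswer { get; } = expertAnswer;
    }

    public async ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatMessage modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        string question = messages.Last().Text ?? string.Empty;
        string expertAnswer =
            additionalContext?.OfType<EvaluationExpert>().FirstOrDefault()?.ExpertAnswer ?? string.Empty;

        // 1. Generate a prompt from the input, the expected (expert) answer and the actual answer,
        //    following the A-E grading of OpenAI's fact prompt.
        string prompt = $"""
            You are comparing a submitted answer to an expert answer on a given question.
            [Question]: {question}
            [Expert]: {expertAnswer}
            [Submission]: {modelResponse.Text}
            Reply with a single letter:
            (A) subset, (B) superset, (C) same content, (D) disagreement, (E) differences don't matter.
            """;

        // 2. Ask the judge LLM (preview API: CompleteAsync; renamed in later versions).
        var completion = await chatConfiguration!.ChatClient.CompleteAsync(prompt, cancellationToken: cancellationToken);
        char grade = (completion.Message.Text ?? string.Empty).ToUpperInvariant().FirstOrDefault(char.IsLetter);

        // 3. Interpret the grade as a numeric rating on a 1-5 scale (scale chosen for this sample).
        var metric = new NumericMetric(FactsMetricName)
        {
            Value = grade switch { 'C' or 'E' => 5, 'A' or 'B' => 3, _ => 1 }
        };
        metric.Interpretation = new EvaluationMetricInterpretation(
            metric.Value >= 4 ? EvaluationRating.Good : EvaluationRating.Poor);

        return new EvaluationResult(metric);
    }
}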

Note that Microsoft.Extensions.AI.Evaluation ships evaluators out of the box, such as RelevanceTruthAndCompletenessEvaluator, CoherenceEvaluator and FluencyEvaluator.
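
For example, built-in and custom evaluators can be mixed in the same list (assuming the built-ins come from the Microsoft.Extensions.AI.Evaluation.Quality preview package):

IEvaluator[] evaluators =
[
    new FactEvaluator(),      // custom evaluator from this post
    new CoherenceEvaluator(), // built-in
    new FluencyEvaluator()    // built-in
];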

Set up IChatClient and create the chat configuration

Note that the Microsoft.ML.Tokenizers package needs to be added along with the tokenizer data package (names all start with Microsoft.ML.Tokenizers.Data) matching the LLM's encoding. For example, "gpt-4o" uses "Microsoft.ML.Tokenizers.Data.O200kBase".
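
A sketch of this setup, assuming an Azure OpenAI endpoint, the preview AsChatClient extension from Microsoft.Extensions.AI.OpenAI and the preview ChatConfiguration constructor that also accepts a token counter built from a Microsoft.ML.Tokenizers tokenizer:

using System;
using System.ClientModel;
using Azure.AI.OpenAI;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.ML.Tokenizers;

// "gpt-4o" uses the o200k_base encoding, so Microsoft.ML.Tokenizers.Data.O200kBase must also be
// referenced, otherwise CreateForModel throws at runtime.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

// Adapt an Azure OpenAI chat deployment to IChatClient (endpoint, key and deployment name are placeholders).
IChatClient chatClient =
    new AzureOpenAIClient(new Uri("https://<resource>.openai.azure.com"), new ApiKeyCredential("<api-key>"))
        .AsChatClient("gpt-4o"); // preview extension; renamed in later versions

// Bundle the judge client with a token counter for the model
// (preview ToTokenCounter extension; 6000 = input token limit for the judge model).
var chatConfiguration = new ChatConfiguration(chatClient, tokenizer.ToTokenCounter(6000));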

Set up ReportingConfiguration with evaluators

Create ScenarioRun and execute against actual/expected results

var answerEvaluator = new FactEvaluator();
var reportConfiguration = DiskBasedReportingConfiguration.Create(
    storageRootPath: "./reports", // Json result files in this folder
    chatConfiguration: chatConfiguration,
    evaluators: [
        answerEvaluator
      ],
    executionName: documentId);

await using var scenario = await reportConfiguration.CreateScenarioRunAsync(indexName);

var evalResult = await scenario.EvaluateAsync(
  messages: [
    new ChatMessage(ChatRole.User, question)
  ],
  modelResponse: new ChatMessage(ChatRole.Assistant, response.Result),
  additionalContext: [new FactEvaluator.EvaluationExpert("Brazil and Bolivia")]);

Note that additional parameters, such as the expert answer above, need to be implemented as an EvaluationContext (the EvaluationExpert class in the sketch earlier).
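
After the scenario run, the metric can also be read straight from the returned result (assuming the preview's EvaluationResult.Get<T> accessor and the "Facts" metric name from the sketch above); the same data is written to the JSON files under ./reports:

NumericMetric facts = evalResult.Get<NumericMetric>(FactEvaluator.FactsMetricName);
Console.WriteLine($"Facts: {facts.Value} ({facts.Interpretation?.Rating})");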

Issues/Feedback

  • Built-in evaluators such as CoherenceEvaluator have their prompts hardcoded, which makes their outcomes a bit hard to understand.
  • No support for a prompt templating mechanism.
  • Evaluation results need more 'what does this mean' context in the metrics (only Diagnostics at the moment).

Sample code here