Evaluate AI with Microsoft AI Evaluation
December 2024
Implementing a 'demo' RAG is straightforward with AI libraries like Kernel Memory or Semantic Kernel. However, production-grade AI applications are significantly more challenging: AI programs are sensitive to change and inherently probabilistic, so traditional deterministic tests do not fit them well. Tools like Ragas address these issues. Following Microsoft.Extensions.AI, Microsoft has released a preview of Microsoft.Extensions.AI.Evaluation. Notably, Microsoft.Extensions.AI.Evaluation uses an 'LLM-as-a-Judge' approach to rate outcomes.
Let's implement a fact evaluation based on OpenAI's fact prompt.
Implement IEvaluator
- generate a prompt from the input, the expected output, and the actual outcome
- interpret the judge LLM's reply as a rating
Note that Microsoft.Extensions.AI.Evaluation ships with out-of-the-box evaluators such as RelevanceTruthAndCompletenessEvaluator, CoherenceEvaluator, and FluencyEvaluator.
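Below is a trimmed-down sketch of such a custom evaluator. FactEvaluator, the 'Fact' metric name, the judge prompt, and the nested EvaluationExpert context class are names made up for this post, and the IEvaluator/EvaluationContext member shapes are assumed from the December 2024 preview (where the model response is passed as a ChatMessage), so double-check them against the package version you install.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Sketch of an LLM-as-a-Judge evaluator based on OpenAI's fact prompt.
// Member signatures are assumed from the December 2024 preview of Microsoft.Extensions.AI.Evaluation.
public sealed class FactEvaluator : IEvaluator
{
    public const string MetricName = "Fact";

    public IReadOnlyCollection<string> EvaluationMetricNames => [MetricName];

    // Hypothetical EvaluationContext carrying the expected facts for the judge prompt.
    // Assumes the preview's EvaluationContext base class can be subclassed without constructor arguments.
    public sealed class EvaluationExpert(string expectedFacts) : EvaluationContext
    {
        public string ExpectedFacts { get; } = expectedFacts;
    }

    public async ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatMessage modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        string question = string.Join(Environment.NewLine, messages.Select(m => m.Text));
        string expectedFacts = additionalContext?.OfType<EvaluationExpert>()
            .FirstOrDefault()?.ExpectedFacts ?? string.Empty;

        // 1. Generate the judge prompt from the input, the expected output and the actual outcome.
        string prompt =
            $"""
            You are comparing a submitted answer to an expert answer on a given question.
            [Question]: {question}
            [Expert]: {expectedFacts}
            [Submission]: {modelResponse.Text}
            Compare the factual content of the submission with the expert answer and
            answer with a single letter: A, B, C, D or E.
            """;

        // 2. Ask the judge LLM and interpret its reply as a rating.
        // (Assumes the preview IChatClient, which exposes CompleteAsync returning a ChatCompletion;
        // an LLM-as-a-Judge evaluator cannot run without a chat configuration.)
        ChatCompletion completion = await chatConfiguration!.ChatClient.CompleteAsync(
            [new ChatMessage(ChatRole.User, prompt)],
            cancellationToken: cancellationToken);

        string verdict = completion.Message.Text?.Trim() ?? string.Empty;
        return new EvaluationResult(new StringMetric(MetricName, verdict));
    }
}
```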
Set up IChatClient and create a chat configuration
- Configure a chat client using the Microsoft.Extensions.AI abstractions
Note that the Microsoft.ML.Tokenizers package needs to be added along with the tokenizer data package for the model in use (their names all start with Microsoft.ML.Tokenizers.Data). For example, "gpt-4o" uses "Microsoft.ML.Tokenizers.Data.O200kBase".
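A minimal setup could look like the following sketch. The Azure OpenAI endpoint, deployment name ("gpt-4o") and token limit are placeholders, and the AsChatClient, ToTokenCounter and ChatConfiguration shapes are assumed from the preview packages (Microsoft.Extensions.AI.OpenAI, Microsoft.ML.Tokenizers and Microsoft.Extensions.AI.Evaluation), so verify them against the versions you install.

```csharp
using Azure.AI.OpenAI;
using Azure.Identity;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.ML.Tokenizers;

// Wrap an Azure OpenAI deployment as an IChatClient (endpoint and deployment name are placeholders).
IChatClient chatClient =
    new AzureOpenAIClient(
            new Uri("https://<your-resource>.openai.azure.com"),
            new DefaultAzureCredential())
        .AsChatClient("gpt-4o");

// "gpt-4o" uses the o200k_base encoding, so Microsoft.ML.Tokenizers.Data.O200kBase must be referenced.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

// ChatConfiguration bundles the judge LLM with a token counter for the evaluators
// (the token counter / input-token-limit shape is assumed from the preview API).
var chatConfiguration = new ChatConfiguration(chatClient, tokenizer.ToTokenCounter(6000));
```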
Set up ReportingConfiguration with evaluators
Create a ScenarioRun and evaluate the actual response against the expected result
```csharp
var answerEvaluator = new FactEvaluator();

var reportConfiguration = DiskBasedReportingConfiguration.Create(
    storageRootPath: "./reports", // JSON result files are stored in this folder
    chatConfiguration: chatConfiguration,
    evaluators: [answerEvaluator],
    executionName: documentId);

await using var scenario = await reportConfiguration.CreateScenarioRunAsync(indexName);

var evalResult = await scenario.EvaluateAsync(
    messages: [new ChatMessage(ChatRole.User, question)],
    modelResponse: new ChatMessage(ChatRole.Assistant, response.Result),
    additionalContext: [new FactEvaluator.EvaluationExpert("Brazil and Bolivia")]);
```
Note that additional inputs (such as the expected facts above) have to be passed to the evaluator as EvaluationContext implementations.
Issues/Feedback
- Built-in evaluators such as CoherenceEvaluator have their prompts 'hardcoded', which makes their outcomes a bit hard to understand.
- No support for a prompt templating mechanism.
- Evaluation results need more 'what does this mean' context on the metrics (only Diagnostics is available at the moment).