The current screen does not properly render some details of the eval results, which can make them hard to understand:
- Show what is expected and what the actual result is only when a mistake is flagged: no repetition on success, so attention stays on the mistakes to analyze.
- Fix the false positives caused by an outright failure on the LLM side: if an example is expected to detect nothing, we currently flag a success even when the pipeline itself failed (see the sketch below).
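
A minimal sketch of both points, assuming the eval result can carry a pipeline error separately from its detections; every name here (PipelineResult, Verdict, evaluate, render_row) is hypothetical, not the app's actual API:

```python
from dataclasses import dataclass
from enum import Enum, auto


class Verdict(Enum):
    SUCCESS = auto()
    MISTAKE = auto()
    PIPELINE_ERROR = auto()  # distinct outcome: the LLM call itself failed


@dataclass
class PipelineResult:
    detections: list[str]
    error: str | None = None  # set when the pipeline failed outright


def evaluate(expected: list[str], result: PipelineResult) -> Verdict:
    # A pipeline failure must never count as a match, even when the
    # example expects no detections (empty list vs. empty list).
    if result.error is not None:
        return Verdict.PIPELINE_ERROR
    return Verdict.SUCCESS if result.detections == expected else Verdict.MISTAKE


def render_row(expected: list[str], result: PipelineResult) -> str:
    # Only mistakes repeat the expected/actual pair; successes stay terse.
    verdict = evaluate(expected, result)
    if verdict is Verdict.SUCCESS:
        return "success"
    if verdict is Verdict.PIPELINE_ERROR:
        return f"pipeline error: {result.error}"
    return f"mistake: expected {expected}, got {result.detections}"
```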
Planned
Feature requests and bug reports
About 1 month ago

Hervé Labas