Clean up and improve UX of the Analysis pipeline eval screen

The current screen does not properly render some details of the eval results, which sometimes makes them hard to understand:

  • Show what’s expected and what the actual result was only when a mistake is flagged: there is no need to repeat this for successful cases, so the screen stays focused on the mistakes worth analyzing

  • Fix the false positives caused by pure failures on the LLM side: if an example is expected to detect nothing, we currently flag a success even when the pipeline itself failed, because an empty result from a crash looks identical to a genuine “nothing detected” (see the sketch after this list)
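
To make the second point concrete, here is a minimal sketch of the scoring logic it asks for, assuming the pipeline returns a list of detections and raises on LLM failure. All names here (Verdict, EvalResult, run_pipeline) are hypothetical, not the actual codebase: the idea is simply a third, explicit “pipeline error” state so a crash is never scored as “nothing detected”.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class Verdict(Enum):
    SUCCESS = "success"                # result matched the expectation
    MISTAKE = "mistake"                # mismatch: surface expected vs actual
    PIPELINE_ERROR = "pipeline_error"  # LLM call failed: no verdict possible

@dataclass
class EvalResult:
    verdict: Verdict
    expected: Optional[list[str]] = None  # populated only for mistakes
    actual: Optional[list[str]] = None

def score_example(
    run_pipeline: Callable[[], list[str]],
    expected_detections: list[str],
) -> EvalResult:
    try:
        actual_detections = run_pipeline()
    except Exception:
        # An LLM-side failure must not be scored as "nothing detected",
        # otherwise negative examples (expected == []) are falsely marked
        # as successes -- the false positive described above.
        return EvalResult(Verdict.PIPELINE_ERROR)

    if actual_detections == expected_detections:
        # Successes carry no expected/actual payload, so the screen
        # doesn't repeat details that add nothing to the analysis.
        return EvalResult(Verdict.SUCCESS)

    return EvalResult(Verdict.MISTAKE, expected_detections, actual_detections)

# Example: a negative example whose pipeline call crashes is now reported
# as a pipeline error instead of a passing eval.
def crashing_pipeline() -> list[str]:
    raise RuntimeError("LLM call timed out")

print(score_example(crashing_pipeline, []).verdict)  # Verdict.PIPELINE_ERROR
```

Separating “pipeline error” from “success” also gives the screen an unambiguous signal for the first point: expected/actual details are attached only to mistakes, so successful rows have nothing to repeat.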

Status: Planned
Board: Feature requests and bug reports
Date: About 1 month ago
Author: Hervé Labas
