The table below summarizes the key results of the testing.
Note: The Score column shows the actual score achieved by the LLM, while the Percentile Rank indicates how the model compares to other leading models selected by IBM. For example, a rank of 75 means the model outperformed 75% of its peers. Thresholds below each bar define performance categories and may be default or user-defined. If lower scores indicate better performance, custom thresholds are automatically adjusted to maintain consistent evaluation logic.
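To make the percentile and threshold logic concrete, here is a minimal Python sketch of how a percentile rank could be computed against peer scores and how thresholds could be adjusted when lower raw scores are better. The `peer_scores` list, the `adjust_thresholds` helper, and the `lower_is_better` flag are illustrative assumptions, not the report generator's actual API.

```python
# Minimal sketch (not the report generator's actual code): how a percentile rank
# and the threshold adjustment for lower-is-better metrics could work.

def percentile_rank(score: float, peer_scores: list[float]) -> float:
    """Percentage of peer models that this score outperforms."""
    if not peer_scores:
        raise ValueError("no reference data available for comparison")
    beaten = sum(1 for peer in peer_scores if score > peer)
    return 100.0 * beaten / len(peer_scores)

def adjust_thresholds(lower: float, upper: float, lower_is_better: bool) -> tuple[float, float]:
    """Mirror user-defined thresholds when a lower raw score is better, so the
    red/yellow/green ordering stays consistent on the percentile axis
    (an assumed adjustment rule)."""
    if lower_is_better:
        return 100.0 - upper, 100.0 - lower
    return lower, upper

print(percentile_rank(0.82, [0.71, 0.75, 0.79, 0.84]))      # 75.0 -> outperforms 75% of peers
print(adjust_thresholds(30.0, 80.0, lower_is_better=True))  # (20.0, 70.0)
```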
| Benchmark | Score | Percentile Rank |
| --- | --- | --- |
| **{{ risk_dim }}** | | |
{% if score_category == "RED" %}
| <span style="color:red">{{ card_name_translation }}</span> |
{% elif score_category == "GREEN" %}
| <span style="color:green">{{ card_name_translation }}</span> |
{% elif score_category == "ORANGE" %}
| <span style="color:orange">{{ card_name_translation }}</span> |
{% else %}
| {{ card_name_translation }} |
{% endif %}
{{ metrics.get("score") }} (Percentile score: {{ percentile_value }}) |
{% if percentile_value is not none %}
{% set percentile_ranges_lower_upper = metrics_interpretation.get(risk_dim, {}).get(card_name, {}).get("percentile_ranges_lower_upper") %}
{# Segment widths (in %) of the threshold bar shown under the percentile rank: red below the lower threshold, yellow between the two thresholds, green above the upper threshold #}
{% set red_segment_width = percentile_ranges_lower_upper[0] %}
{% set yellow_segment_width = percentile_ranges_lower_upper[1] - percentile_ranges_lower_upper[0] %}
{% set green_segment_width = 100 - percentile_ranges_lower_upper[1] %}
<div style="display:flex"><div style="width:{{ red_segment_width }}%;background:#e57373">&nbsp;</div><div style="width:{{ yellow_segment_width }}%;background:#ffd54f">&nbsp;</div><div style="width:{{ green_segment_width }}%;background:#81c784">&nbsp;</div></div> |
{% else %}
Not Applicable |
{% endif %}
Note: If the percentile rank column shows "Not Applicable" for a risk or benchmark, it means there was no reference data available for comparison.
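The category and bar-segment logic used above can be summarized in plain Python. This is a sketch of how `score_category` and the segment widths could be derived from `percentile_ranges_lower_upper`; the `categorize` and `segment_widths` helper names are hypothetical, not the template's backing implementation.

```python
# Minimal sketch, an assumption about how the color category and the bar
# segment widths relate to percentile_ranges_lower_upper.

def categorize(percentile_value, percentile_ranges_lower_upper):
    """Map a percentile rank onto a color category; None means no reference data."""
    if percentile_value is None:
        return "NOT_APPLICABLE"
    lower, upper = percentile_ranges_lower_upper
    if percentile_value < lower:
        return "RED"
    if percentile_value < upper:
        return "ORANGE"
    return "GREEN"

def segment_widths(percentile_ranges_lower_upper):
    """Widths (in %) of the red, yellow and green bar segments; they sum to 100."""
    lower, upper = percentile_ranges_lower_upper
    return lower, upper - lower, 100 - upper

print(categorize(75, (25, 50)))   # GREEN
print(segment_widths((25, 50)))   # (25, 25, 50)
```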