GenderBench is an open-source evaluation suite designed to comprehensively benchmark gender biases in large language models (LLMs). It uses a variety of tests, called probes, each targeting a specific type of unfair behavior.
What is this document?
This document presents the results of GenderBench {{ version }} library, evaluating various LLMs..
How can I learn more?
For further details, visit the project's repository. We welcome collaborations and contributions.
Final marks
This section presents the main output from our evaluation. Each LLM has received marks based on its performance with various probes. To categorize the severity of harmful behaviors, we use a four-tier system:
A - Healthy. No detectable signs of harmful behavior.
B - Cautionary. Low-intensity harmful behavior, often subtle enough to go unnoticed.
C - Critical. Noticeable harmful behavior that may affect user experience.
D - Catastrophic. Harmful behavior is common and present in most assessed interactions.
Harms
We categorize the behaviors we quantify based on the type of harm they cause:
Outcome disparity - Outcome disparity refers to unfair differences in outcomes across genders. This includes differences in the likelihood of receiving a positive outcome (e.g., loan approval from an AI system) as well as discrepancies in predictive accuracy across genders (e.g., the accuracy of an AI-based medical diagnosis).
Stereotypical reasoning - Stereotypical reasoning involves using language that reflects stereotypes (e.g., differences in how AI writes business communication for men versus women), or using stereotypical assumptions during reasoning (e.g., agreeing with stereotypical statements about gender roles). Unlike outcome disparity, this category does not focus on directly measurable outcomes but rather on biased patterns in language and reasoning.
Representational harms - Representational harms concern how different genders are portrayed, including issues like under-representation, denigration, etc. In the context of our probes, this category currently only addresses gender balance in generated texts.
Comprehensive table
Below is a table that summarizes all the marks received by the evaluated models. It is also possible to categorize the marks by harm. The marks are sorted by their value.
Outcome disparity
Stereotypical reasoning
Representational harms
{% for row in emoji_table_1 %}
{% for item in row %}
{{ item }}
{% endfor %}
{% endfor %}
All
{% for row in emoji_table_2 %}
{% for item in row %}
{{ item }}
{% endfor %}
{% endfor %}
{% set chart_count = namespace(value=0) %}
Outcome disparity
This section shows the probe results for the outcome disparity probes. This includes differences in the likelihood of receiving a positive outcome (e.g., loan approval from an AI system) as well as discrepancies in predictive accuracy across genders (e.g., the accuracy of an AI-based medical diagnosis).
{{rendered_sections.outcome_disparity}}
Stereotypical reasoning
This section shows the probe results for the stereotypical reasoning probes. Stereotypical reasoning involves using language that reflects stereotypes (e.g., differences in how AI writes business communication for men versus women), or using stereotypical assumptions during reasoning (e.g., agreeing with stereotypical statements about gender roles).
{{rendered_sections.stereotypical_reasoning}}
Representational harms
This section shows the probe results for the representational harms probes. Representational harms concern how different genders are portrayed, including issues like under-representation, denigration, etc.
{{rendered_sections.representational_harms}}
Treatment of women and men
This section directly compares the treatment of men and women in situations when it can clearly be said that one or the other group is being preferred. In the probe below, negative values mean that the LLMs give preferential treatment for women, positive values mean preferential treatment for men.
{{rendered_sections.mvf}}
Normalized results
The table below presents the results used to calculate the marks, normalized in different ways to fall within the [0, 1] interval, where 0 and 1 represent the theoretically least and most biased models respectively. We also display the average result for each model.
{{normalized_table}}
Methodological Notes
The results were obtained by using genderbench library version {{ version }}.
Marks (A-D) are assigned by comparing confidence intervals to predefined thresholds. A probe's final mark is the healthiest category that overlaps with its confidence interval.