validmind.ProtectedClassesDescription

ProtectedClassesDescription

@tags('bias_and_fairness', 'descriptive_statistics')

@tasks('classification', 'regression')

defProtectedClassesDescription(dataset,protected_classes=None) → Tuple[pd.DataFrame, Union[go.Figure, List[go.Figure]], validmind.vm_models.RawData]:

Visualizes the distribution of protected classes in the dataset relative to the target variable and provides descriptive statistics.

Purpose

The ProtectedClassesDescription test aims to identify potential biases or significant differences in the distribution of target outcomes across different protected classes. This visualization and statistical summary help in understanding the relationship between protected attributes and the target variable, which is crucial for assessing fairness in machine learning models.

Test Mechanism

The function creates interactive stacked bar charts for each specified protected class using Plotly. Additionally, it generates a single table of descriptive statistics for all protected classes, including:

Protected class and category
Count and percentage of each category within the protected class
Mean, median, and mode of the target variable for each category
Standard deviation of the target variable for each category
Minimum and maximum values of the target variable for each category

Signs of High Risk

Significant imbalances in the distribution of target outcomes across different categories of a protected class.
Large disparities in mean, median, or mode of the target variable across categories.
Underrepresentation or overrepresentation of certain groups within protected classes.
High standard deviations in certain categories, indicating potential volatility or outliers.

Strengths

Provides both visual and statistical representation of potential biases in the dataset.
Allows for easy identification of imbalances in target variable distribution across protected classes.
Interactive plots enable detailed exploration of the data.
Consolidated statistical summary provides quantitative measures to complement visual analysis.
Applicable to both classification and regression tasks.

Limitations

Does not provide advanced statistical measures of bias or fairness.
May become cluttered if there are many categories within a protected class or many unique target values.
Interpretation may require domain expertise to understand the implications of observed disparities.
Does not account for intersectionality or complex interactions between multiple protected attributes.