ChiSquaredFeaturesTable

Assesses the statistical association between categorical features and a target variable using the Chi-Squared test.

Purpose

The ChiSquaredFeaturesTable function is designed to evaluate the relationship between categorical features and a target variable in a dataset. It performs a Chi-Squared test of independence for each categorical feature to determine whether a statistically significant association exists with the target variable. This is particularly useful in Model Risk Management for understanding the relevance of features and identifying potential biases in a classification model.

Test Mechanism

The function creates a contingency table for each categorical feature and the target variable, then applies the Chi-Squared test to compute the Chi-squared statistic and the p-value. The results for each feature include the variable name, Chi-squared statistic, p-value, p-value threshold, and a pass/fail status based on whether the p-value is below the specified threshold. The output is a DataFrame summarizing these results, sorted by p-value to highlight the most statistically significant associations.

Signs of High Risk

  • High p-values (greater than the set threshold) indicate a lack of significant association between a feature and the target variable, resulting in a ‘Fail’ status.
  • Features with a ‘Fail’ status might not be relevant for the model, which could negatively impact model performance.

Strengths

  • Provides a clear, statistical assessment of the relationship between categorical features and the target variable.
  • Produces an easily interpretable summary with a ‘Pass/Fail’ outcome for each feature, helping in feature selection.
  • The p-value threshold is adjustable, allowing for flexibility in statistical rigor.

Limitations

  • Assumes the dataset is tabular and consists of categorical variables, which may not be suitable for all datasets.
  • The test is designed for classification tasks and is not applicable to regression problems.
  • As with all hypothesis tests, the Chi-Squared test can only detect associations, not causal relationships.
  • The choice of p-value threshold can affect the interpretation of feature relevance, and different thresholds may lead to different conclusions.