validmind.tests.data_validation.HighPearsonCorrelation
Identifies highly correlated feature pairs in a dataset suggesting feature redundancy or multicollinearity.
Purpose
The High Pearson Correlation test measures the linear relationship between features in a dataset, with the main goal of identifying high correlations that might indicate feature redundancy or multicollinearity. Identification of such issues allows developers and risk management teams to properly deal with potential impacts on the machine learning model's performance and interpretability.
Test Mechanism
The test works by generating pairwise Pearson correlations for all features in the dataset, then sorting and
eliminating duplicate and self-correlations. It assigns a Pass or Fail based on whether the absolute value of the
correlation coefficient surpasses a pre-set threshold (defaulted at 0.3). It lastly returns the top n strongest
correlations regardless of passing or failing status (where n is 10 by default but can be configured by passing the
top_n_correlations
parameter).
Signs of High Risk
- A high risk indication would be the presence of correlation coefficients exceeding the threshold.
- If the features share a strong linear relationship, this could lead to potential multicollinearity and model overfitting.
- Redundancy of variables can undermine the interpretability of the model due to uncertainty over the authenticity of individual variable's predictive power.
Strengths
- Provides a quick and simple means of identifying relationships between feature pairs.
- Generates a transparent output that displays pairs of correlated variables, the Pearson correlation coefficient, and a Pass or Fail status for each.
- Aids in early identification of potential multicollinearity issues that may disrupt model training.
Limitations
- Can only delineate linear relationships, failing to shed light on nonlinear relationships or dependencies.
- Sensitive to outliers where a few outliers could notably affect the correlation coefficient.
- Limited to identifying redundancy only within feature pairs; may fail to spot more complex relationships among three or more variables.