KolmogorovSmirnov
Assesses whether each feature in the dataset aligns with a normal distribution using the Kolmogorov-Smirnov test.
Purpose
The Kolmogorov-Smirnov (KS) test evaluates the distribution of features in a dataset to determine their alignment with a normal distribution. This is important because many statistical methods and machine learning models assume normality in the data distribution.
Test Mechanism
This test calculates the KS statistic and corresponding p-value for each feature in the dataset. It does so by comparing the feature's empirical cumulative distribution function (CDF) with the CDF of a normal distribution; the KS statistic is the largest absolute difference between the two. The KS statistic and p-value for each feature are then stored in a dictionary. The p-value threshold for rejecting the hypothesis of normality is not preset, which leaves the choice of significance level to the application.
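The mechanism can be sketched with the Python standard library alone. This is a minimal illustration, not the test's actual implementation. It assumes the reference normal distribution uses each feature's sample mean and standard deviation (the source does not say how the parameters are chosen), and it approximates the p-value with the asymptotic Kolmogorov distribution. With estimated parameters, such p-values tend to be too large, so treat them as rough. The function names and the `stat` and `pvalue` keys are hypothetical.

```python
import math
import statistics


def kolmogorov_sf(t, terms=100):
    """Asymptotic P(K > t) for the Kolmogorov distribution (large-sample approximation)."""
    if t < 0.2:
        # The survival function is numerically 1 here, and the series converges slowly.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t) for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def ks_normal(values):
    """Return (KS statistic, approximate p-value) for a single feature."""
    xs = sorted(values)
    n = len(xs)
    mean = statistics.fmean(xs)
    std = statistics.stdev(xs)  # assumption: parameters are estimated from the sample
    if std == 0:
        raise ValueError("constant feature: the normal CDF is undefined for zero spread")
    reference = statistics.NormalDist(mean, std)
    d = 0.0
    for i, x in enumerate(xs):
        f = reference.cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d, kolmogorov_sf(math.sqrt(n) * d)


def ks_by_feature(dataset):
    """Map each feature name to its KS statistic and p-value."""
    results = {}
    for name, values in dataset.items():
        stat, pvalue = ks_normal(values)
        results[name] = {"stat": stat, "pvalue": pvalue}
    return results


# Deterministic toy data: evenly spaced quantiles of a normal and of an exponential distribution.
n = 200
dataset = {
    "normal_like": [statistics.NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)],
    "skewed": [-math.log(1.0 - (i + 0.5) / n) for i in range(n)],
}
results = ks_by_feature(dataset)
for name, r in results.items():
    print(f"{name}: stat={r['stat']:.3f}, pvalue={r['pvalue']:.4f}")
```

If SciPy is available, `scipy.stats.kstest(values, "norm", args=(mean, std))` computes the same one-sample statistic, with its own p-value calculation.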
Signs of High Risk
- Elevated KS statistic for a feature combined with a low p-value, indicating a significant divergence from a normal distribution.
- Features whose distributions deviate notably from normality, which could cause problems if the model assumes normally distributed data.
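Because no threshold is preset, deciding which features count as high risk is left to the caller. A minimal sketch, assuming the results are held in a dictionary keyed by feature name with hypothetical `stat` and `pvalue` entries. The 0.05 significance level is only a conventional example, and the feature names and numbers are made up for illustration:

```python
def flag_non_normal(results, alpha=0.05):
    """Return features with p-value below alpha, ordered by KS statistic (largest first)."""
    flagged = [name for name, r in results.items() if r["pvalue"] < alpha]
    return sorted(flagged, key=lambda name: results[name]["stat"], reverse=True)


# Illustrative values only, not output from a real dataset.
example_results = {
    "age": {"stat": 0.03, "pvalue": 0.71},
    "income": {"stat": 0.21, "pvalue": 0.0004},
    "balance": {"stat": 0.12, "pvalue": 0.02},
}
high_risk = flag_non_normal(example_results)
print(high_risk)
```

A stricter `alpha` flags fewer features, which may be appropriate when many features are tested at once.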
Strengths
- The KS test is sensitive to differences in the location and shape of empirical cumulative distribution functions.
- It is non-parametric and adaptable to various datasets, as it does not assume any specific data distribution.
- Reports a KS statistic and p-value for each feature, so deviations from normality can be traced to individual features.
Limitations
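- The KS statistic is most sensitive near the center of the distribution and less sensitive in the tails, so tail deviations from normality may go undetected; conversely, on large samples even small, practically unimportant deviations can produce low p-values.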
- Designed for univariate distributions: each feature is tested separately, so the test does not assess joint (multivariate) normality.
- Does not identify specific types of non-normality, such as skewness or kurtosis, which could impact model fitting.