validmind.DescriptiveStats
DescriptiveStats
@tags('tabular_data', 'statistics', 'data_quality')
@tasks('classification', 'regression', 'clustering')
defDescriptiveStats(dataset:validmind.vm_models.VMDataset,columns:Optional[List[str]]=None,include_advanced:bool=True,confidence_level:float=0.95) → Dict[str, Any]:
Provides comprehensive descriptive statistics for numerical features in a dataset.
Purpose
This test generates detailed descriptive statistics for numerical features, including basic statistics, distribution measures, confidence intervals, and normality tests. It provides a comprehensive overview of data characteristics essential for understanding data quality and distribution properties.
Test Mechanism
The test computes various statistical measures for each numerical column:
- Basic statistics: count, mean, median, std, min, max, quartiles
- Distribution measures: skewness, kurtosis, coefficient of variation
- Confidence intervals for the mean
- Normality tests (Shapiro-Wilk for small samples, Anderson-Darling for larger)
- Missing value analysis
Signs of High Risk
- High skewness or kurtosis indicating non-normal distributions
- Large coefficients of variation suggesting high data variability
- Significant results in normality tests when normality is expected
- High percentage of missing values
- Extreme outliers based on IQR analysis
Strengths
- Comprehensive statistical analysis in a single test
- Includes advanced statistical measures beyond basic descriptives
- Provides confidence intervals for uncertainty quantification
- Handles missing values appropriately
- Suitable for both exploratory and confirmatory analysis
Limitations
- Limited to numerical features only
- Normality tests may not be meaningful for all data types
- Large datasets may make some tests computationally expensive
- Interpretation requires statistical knowledge