validmind.TimeSeriesOutliers

TimeSeriesOutliers

@tags('time_series_data')

@tasks('regression')

defTimeSeriesOutliers(dataset:validmind.vm_models.VMDataset,zscore_threshold:int=3) → Tuple[pd.DataFrame, List[go.Figure], bool, validmind.vm_models.RawData]:

Identifies and visualizes outliers in time-series data using the z-score method.

Purpose

This test is designed to identify outliers in time-series data using the z-score method. It's vital for ensuring data quality before modeling, as outliers can skew predictive models and significantly impact their overall performance.

Test Mechanism

The test processes a given dataset which must have datetime indexing, checks if a 'zscore_threshold' parameter has been supplied, and identifies columns with numeric data types. After finding numeric columns, the implementer then applies the z-score method to each numeric column, identifying outliers based on the threshold provided. Each outlier is listed together with their variable name, z-score, timestamp, and relative threshold in a dictionary and converted to a DataFrame for convenient output. Additionally, it produces visual plots for each time series illustrating outliers in the context of the broader dataset. The 'zscore_threshold' parameter sets the limit beyond which a data point will be labeled as an outlier. The default threshold is set at 3, indicating that any data point that falls 3 standard deviations away from the mean will be marked as an outlier.

Signs of High Risk

Many or substantial outliers are present within the dataset, indicating significant anomalies.
Data points with z-scores higher than the set threshold.
Potential impact on the performance of machine learning models if outliers are not properly addressed.

Strengths

The z-score method is a popular and robust method for identifying outliers in a dataset.
Simplifies time series maintenance by requiring a datetime index.
Identifies outliers for each numeric feature individually.
Provides an elaborate report showing variables, dates, z-scores, and pass/fail tests.
Offers visual inspection for detected outliers through plots.

Limitations

The test only identifies outliers in numeric columns, not in categorical variables.
The utility and accuracy of z-scores can be limited if the data doesn't follow a normal distribution.
The method relies on a subjective z-score threshold for deciding what constitutes an outlier, which might not always be suitable depending on the dataset and use case.
It does not address possible ways to handle identified outliers in the data.
The requirement for a datetime index could limit its application.