Comprehensive statistical data preprocessing and outlier detection library with formal statistical testing and publication-quality reporting.
Removal and transformation methods return self, so calls can be chained. Access the cleaned data via cleaner.clean_df and per-column outlier details via cleaner.outlier_info.
pip install statclean
import pandas as pd
from statclean import StatClean
# Sample data with outliers
df = pd.DataFrame({
    'values': [1, 2, 3, 100, 4, 5, 6]
})
# Initialize and clean
cleaner = StatClean(df)
cleaner.remove_outliers_zscore('values')
cleaned_df = cleaner.clean_df
print(f"Original: {df.shape}")
print(f"Cleaned: {cleaned_df.shape}")
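For intuition about what a z-score filter does on a sample this small, here is a minimal NumPy sketch that is independent of StatClean's internals. It assumes the conventional cutoffs of 3.0 for the z-score and 3.5 for the MAD-based modified z-score; StatClean's own defaults may differ. With n = 7 the sample z-score cannot exceed (n - 1)/sqrt(n), about 2.27, so a cutoff of 3 does not flag the value 100, while the modified z-score does.

```python
import numpy as np

values = np.array([1, 2, 3, 100, 4, 5, 6], dtype=float)

# Classical z-score: distance from the mean in sample standard deviations.
z = (values - values.mean()) / values.std(ddof=1)
z_outliers = np.abs(z) > 3.0  # conventional cutoff, assumed here

# Modified z-score: built on the median and the median absolute deviation
# (MAD), which a single extreme value cannot inflate.
median = np.median(values)
mad = np.median(np.abs(values - median))
modified_z = 0.6745 * (values - median) / mad
mz_outliers = np.abs(modified_z) > 3.5  # conventional cutoff, assumed here

print("z-score of 100:", round(float(z[3]), 3))
print("flagged by z-score:", values[z_outliers])
print("flagged by modified z-score:", values[mz_outliers])
```

On very small samples it may be worth lowering the z-score threshold or preferring a median-based method.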
| Method | Univariate | Multivariate | Formal Test |
|---|---|---|---|
| IQR | Yes | - | - |
| Z-score | Yes | - | - |
| Modified Z-score | Yes | - | - |
| Mahalanobis | - | Yes | - |
| Grubbs | Yes | - | Yes |
| Dixon Q | Yes | - | Yes |
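As an illustration of the simplest method in the table, the IQR rule can be sketched in a few lines of NumPy. This is a generic version of Tukey's fences with the usual 1.5 multiplier, not StatClean's implementation; the multiplier and the quartile interpolation are assumptions.

```python
import numpy as np

values = np.array([1, 2, 3, 100, 4, 5, 6], dtype=float)

# Quartiles with NumPy's default linear interpolation.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey fences: points more than 1.5 * IQR outside the quartiles are flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = values[(values < lower) | (values > upper)]

print(f"fences: [{lower}, {upper}]")
print("flagged:", flagged)
```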
# Formal statistical testing
# (assumes cleaner wraps a DataFrame with numeric 'income' and 'age' columns)
result = cleaner.grubbs_test('income', alpha=0.05)
print(f"P-value: {result['p_value']:.6f}")
print(f"Outlier: {result['is_outlier']}")
# Dixon's Q-test for small samples
result = cleaner.dixon_q_test('age', alpha=0.05)
print(f"Statistic: {result['statistic']:.3f}")
# Mahalanobis distance
outliers = cleaner.detect_outliers_mahalanobis(
    ['income', 'age'],
    chi2_threshold=0.95
)
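To show what a formal test such as Grubbs' adds over a fixed cutoff, the sketch below computes the two-sided Grubbs statistic and its critical value with SciPy. The function name `grubbs_two_sided`, the keys of the returned dictionary, and the sample data are hypothetical and chosen for illustration only; they are not StatClean's return structure.

```python
import numpy as np
from scipy import stats


def grubbs_two_sided(x, alpha=0.05):
    """Two-sided Grubbs test for a single outlier (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean, sd = x.mean(), x.std(ddof=1)

    # The suspect is the observation farthest from the sample mean.
    idx = int(np.argmax(np.abs(x - mean)))
    g = abs(x[idx] - mean) / sd

    # Critical value from the t distribution with n - 2 degrees of freedom.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

    return {
        "statistic": float(g),
        "critical_value": float(g_crit),
        "suspect": float(x[idx]),
        "is_outlier": bool(g > g_crit),
    }


# Made-up incomes (in thousands) with one extreme value.
sample = [42, 45, 47, 44, 46, 43, 48, 45, 44, 120]
result = grubbs_two_sided(sample, alpha=0.05)
print(result)
```

Unlike a fixed z-score cutoff, the critical value depends on the sample size and on alpha, so the decision carries a stated significance level. The test assumes the data are approximately normal apart from the suspect point.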
# Method chaining
cleaned = (cleaner
    .transform_boxcox('income')
    .remove_outliers_modified_zscore('income')
    .clean_df)
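The chaining shown above works because each method returns the object it was called on. The sketch below illustrates that pattern with a hypothetical `MiniCleaner` class; it is not StatClean's code. A log transform stands in for Box-Cox (the log is the Box-Cox transform with lambda = 0), and the 3.5 threshold is an assumed default.

```python
import numpy as np


class MiniCleaner:
    """Hypothetical stand-in that illustrates the return-self pattern."""

    def __init__(self, values):
        self.clean_values = np.asarray(values, dtype=float)

    def transform_log(self):
        # Replace the data with its natural log, then return self for chaining.
        self.clean_values = np.log(self.clean_values)
        return self

    def remove_outliers_modified_zscore(self, threshold=3.5):
        # Drop points whose MAD-based score exceeds the threshold.
        median = np.median(self.clean_values)
        mad = np.median(np.abs(self.clean_values - median))
        score = 0.6745 * np.abs(self.clean_values - median) / mad
        self.clean_values = self.clean_values[score <= threshold]
        return self


cleaned_values = (
    MiniCleaner([1, 2, 3, 100, 4, 5, 6])
    .transform_log()
    .remove_outliers_modified_zscore()
    .clean_values
)
print(cleaned_values)
```

Because every step mutates the same object and hands it back, the final attribute access reads the state left by the last step in the chain.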