Comprehensive guide to the mathematical foundations and appropriate use cases for each statistical method in StatClean.
Method: detect_outliers_iqr()
Lower Bound = Q₁ - (lower_factor × IQR)
Upper Bound = Q₃ + (upper_factor × IQR)
where IQR = Q₃ - Q₁
```python
# Standard IQR (1.5 × IQR)
outliers = cleaner.detect_outliers_iqr('column')

# Conservative (2.0 × IQR)
outliers = cleaner.detect_outliers_iqr(
    'column',
    lower_factor=2.0,
    upper_factor=2.0
)
```
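For intuition, the fences can be computed directly with NumPy — a minimal, self-contained sketch mirroring the formula above, not StatClean's internal code:

```python
import numpy as np

def iqr_bounds(x, lower_factor=1.5, upper_factor=1.5):
    """Return (lower, upper) fences from the IQR rule."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - lower_factor * iqr, q3 + upper_factor * iqr

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lo, hi = iqr_bounds(data)
outliers = [v for v in data if v < lo or v > hi]
print(outliers)  # only the extreme value 100 falls outside the fences
```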
Method: detect_outliers_zscore()
Z = (x - μ) / σ
Outlier if |Z| > threshold (typically 3.0)
```python
# Standard threshold (|Z| > 3)
outliers = cleaner.detect_outliers_zscore('column')

# More sensitive (|Z| > 2.5)
outliers = cleaner.detect_outliers_zscore(
    'column',
    threshold=2.5
)
```
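The rule is easy to reproduce by hand. This standalone sketch uses the population standard deviation, matching Z = (x − μ) / σ above:

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Return the values whose standard score exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return x[np.abs(z) > threshold]

data = list(range(20)) + [500]
flagged = zscore_outliers(data, threshold=2.5)
print(flagged)  # only 500 exceeds the threshold
```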
Method: detect_outliers_modified_zscore()
Modified Z = 0.6745 × (x - median) / MAD
where MAD = median(|x - median(x)|)
Outlier if |Modified Z| > threshold (typically 3.5)
```python
# Standard MAD threshold
outliers = cleaner.detect_outliers_modified_zscore('column')

# More conservative
outliers = cleaner.detect_outliers_modified_zscore(
    'column',
    threshold=4.0
)
```
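Because the median and MAD are themselves robust, extreme points do not inflate the scale the way they inflate σ in the plain Z-score. A hand-rolled sketch of the formula (independent of StatClean):

```python
import numpy as np

def modified_zscore(x):
    """0.6745 × (x − median) / MAD, per the formula above."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / mad

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
mz = modified_zscore(data)
flagged = [v for v, m in zip(data, mz) if abs(m) > 3.5]
print(flagged)  # the extreme value 100 is flagged
```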
Method: detect_outliers_mahalanobis()
D² = (x - μ)ᵀ Σ⁻¹ (x - μ)
where μ is mean vector, Σ is covariance matrix
Outlier if D² > χ²(p, α) threshold
```python
# 95th percentile threshold
outliers = cleaner.detect_outliers_mahalanobis(
    ['var1', 'var2', 'var3']
)

# 99th percentile with shrinkage
outliers = cleaner.detect_outliers_mahalanobis(
    ['var1', 'var2'],
    chi2_threshold=0.99,
    use_shrinkage=True
)
```
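The D² formula can be sketched directly with NumPy and SciPy's chi-square quantile — an illustrative implementation under the usual assumption of multivariate normality, not StatClean's internal code:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.95):
    """Flag rows whose squared Mahalanobis distance exceeds chi2(p, alpha)."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # d2[i] = diff[i] @ inv_cov @ diff[i]
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    return d2 > chi2.ppf(alpha, df=X.shape[1])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[10.0, 10.0]]])
mask = mahalanobis_outliers(X)
print(mask[-1])  # the planted point (10, 10) is flagged
```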
Method: grubbs_test()
Purpose: Test for single outliers in univariate normal data
G = max|x - x̄| / s
where s is sample standard deviation
```python
result = cleaner.grubbs_test(
    'column',
    alpha=0.05
)
print(f"P-value: {result['p_value']}")
print(f"Outlier: {result['is_outlier']}")
print(f"Statistic: {result['statistic']}")
```
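Grubbs' statistic and its critical value follow directly from a Student-t quantile. The sketch below is independent of StatClean and uses the standard two-sided critical-value formula:

```python
import numpy as np
from scipy.stats import t

def grubbs_statistic(x):
    """G = max|x − x̄| / s, with s the sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return np.max(np.abs(x - x.mean())) / x.std(ddof=1)

def grubbs_critical(n, alpha=0.05):
    """Two-sided critical value: ((n−1)/√n) · √(t² / (n − 2 + t²))."""
    tq = t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(tq**2 / (n - 2 + tq**2))

data = [9.8, 10.1, 10.0, 10.2, 9.9, 10.0, 14.0]
g = grubbs_statistic(data)
print(g > grubbs_critical(len(data)))  # True: 14.0 is an outlier
```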
Method: dixon_q_test()
Purpose: Test for outliers in small samples (n < 30)
Q = gap / range
where gap is difference to nearest neighbor
```python
result = cleaner.dixon_q_test(
    'column',
    alpha=0.05
)
print(f"Q-statistic: {result['statistic']}")
print(f"Critical value: {result['critical_value']}")
print(f"P-value: {result['p_value']}")
```
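The Q statistic itself is a one-liner; only the critical values come from published tables. A standalone sketch of the basic r₁₀ variant (the data are Dean & Dixon's classic example, not StatClean output):

```python
def dixon_q(x):
    """Q = gap / range for the most extreme value (r10 variant, small n)."""
    s = sorted(x)
    full_range = s[-1] - s[0]
    q_low = (s[1] - s[0]) / full_range    # suspect low value
    q_high = (s[-1] - s[-2]) / full_range  # suspect high value
    return max(q_low, q_high)

data = [0.189, 0.167, 0.187, 0.183, 0.186, 0.182, 0.181]
q = dixon_q(data)
# Published two-sided tables give a critical value of about 0.57
# for n = 7 at the 95% confidence level; q ≈ 0.64 exceeds it,
# so 0.167 would be rejected as an outlier.
print(q)
```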
Method: transform_boxcox()
Formula:
y(λ) = (x^λ − 1) / λ  if λ ≠ 0
y(λ) = ln(x)          if λ = 0
(requires strictly positive data, x > 0)
Purpose: Normalize skewed distributions, stabilize variance
```python
result = cleaner.transform_boxcox('column')
print(f"Lambda: {result['lambda']}")
```
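`scipy.stats.boxcox` estimates λ by maximum likelihood. For lognormal data the fitted λ should land near 0, which is exactly the log-transform limit of the formula above (a sketch with synthetic data, independent of StatClean):

```python
import numpy as np
from scipy.stats import boxcox

# Strictly positive, right-skewed sample (Box-Cox requires x > 0).
rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

y, lam = boxcox(x)  # lam is the maximum-likelihood estimate of lambda
print(f"fitted lambda: {lam:.3f}")  # close to 0 for lognormal data
```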
Method: transform_log()
Variants: Natural log (ln), Base 10, Base 2
Purpose: Reduce right skewness, handle multiplicative relationships
```python
result = cleaner.transform_log(
    'column',
    base='natural'
)
```
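The effect on skewness is easy to demonstrate with synthetic right-skewed data (an illustrative sketch, not StatClean code):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)
x = rng.lognormal(size=1000)  # strongly right-skewed

# Skewness is large before the transform and near 0 after,
# since log of a lognormal sample is normally distributed.
print(skew(x), skew(np.log(x)))
```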
Method: transform_sqrt()
Formula: y = √x
Purpose: Moderate right skewness, variance stabilization for Poisson-distributed counts
```python
result = cleaner.transform_sqrt('column')
```
| Data characteristics | Primary Method | Alternative |
|---|---|---|
| Normal | Z-score | Grubbs' test |
| Skewed | Modified Z-score | IQR |
| Unknown | IQR | Modified Z-score |
| Small (n < 30) | Dixon's Q-test | IQR |
| Multivariate | Mahalanobis | Multiple univariate |
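The table can be encoded as a small lookup; the sketch below is illustrative only, and the labels are hypothetical rather than StatClean API names:

```python
# Primary and alternative methods keyed by data characteristics
# (labels are illustrative, not StatClean identifiers).
METHOD_TABLE = {
    "normal": ("zscore", "grubbs"),
    "skewed": ("modified_zscore", "iqr"),
    "unknown": ("iqr", "modified_zscore"),
    "small": ("dixon_q", "iqr"),
    "multivariate": ("mahalanobis", "multiple_univariate"),
}

def choose_method(characteristic: str) -> str:
    primary, _alternative = METHOD_TABLE[characteristic]
    return primary

print(choose_method("skewed"))  # modified_zscore
```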
StatClean can also recommend a method automatically based on the column's distribution:
```python
analysis = cleaner.analyze_distribution('column')
print(f"Recommended: {analysis['recommended_method']}")
print(f"Reason: {analysis['recommendation_reason']}")
```
To see how different detection methods behave on the same column, compare them side by side:
```python
comparison = cleaner.compare_methods(
    ['column'],
    methods=['iqr', 'zscore', 'modified_zscore']
)
```
Use agreement between methods as validation: a point flagged by several independent methods is a stronger outlier candidate than one flagged by a single method. Distribution diagnostics provide further context:

```python
analysis = cleaner.analyze_distribution('column')
```