Statistical Methods

Comprehensive guide to the mathematical foundations and appropriate use cases for each statistical method in StatClean.

Univariate Methods

Interquartile Range (IQR)

Method: detect_outliers_iqr()

Mathematical Formula:

Lower Bound = Q₁ - (lower_factor × IQR)

Upper Bound = Q₃ + (upper_factor × IQR)

where IQR = Q₃ - Q₁
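
The bounds above can be computed directly with NumPy as a minimal sketch, independent of StatClean (the sample data is illustrative):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 14, 11, 95])  # 95 is a clear outlier

# Quartiles and IQR, exactly as in the formula above
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr  # lower_factor = 1.5
upper = q3 + 1.5 * iqr  # upper_factor = 1.5

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [95]
```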

Use Cases:
  • Non-normal distributions
  • Skewed data
  • Robust to extreme values
  • No distributional assumptions
Code Example:
# Standard IQR (1.5 × IQR)
outliers = cleaner.detect_outliers_iqr('column')

# Conservative (2.0 × IQR)
outliers = cleaner.detect_outliers_iqr(
    'column', 
    lower_factor=2.0, 
    upper_factor=2.0
)

Z-Score Method

Method: detect_outliers_zscore()

Mathematical Formula:

Z = (x - μ) / σ

Outlier if |Z| > threshold (typically 3.0)
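
A minimal NumPy sketch of the formula (illustrative data; the threshold is lowered to 2.5 here because in a tiny sample the outlier inflates σ itself, pulling its own Z-score down):

```python
import numpy as np

data = np.array([10.0, 12, 11, 13, 12, 14, 11, 95])

# Z = (x - mean) / population standard deviation
z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) > 2.5]
print(outliers)  # [95.]
```

Note that the extreme point only reaches |Z| ≈ 2.6 despite being wildly out of range, which is exactly the sensitivity problem the MAD-based method below addresses.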

Use Cases:
  • Normally distributed data
  • Standardized thresholds needed
  • Large sample sizes (n > 30)
Assumptions:
  • Data follows normal distribution
  • Sample statistics represent population
Code Example:
# Standard threshold (|Z| > 3)
outliers = cleaner.detect_outliers_zscore('column')

# More sensitive (|Z| > 2.5)
outliers = cleaner.detect_outliers_zscore(
    'column', 
    threshold=2.5
)

Modified Z-Score (MAD-based)

Method: detect_outliers_modified_zscore()

Mathematical Formula:

Modified Z = 0.6745 × (x - median) / MAD

where MAD = median(|x - median(x)|)

Outlier if |Modified Z| > threshold (typically 3.5)
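
A minimal NumPy sketch of the same formula (illustrative data), which makes the robustness advantage over the classic Z-score concrete:

```python
import numpy as np

data = np.array([10.0, 12, 11, 13, 12, 14, 11, 95])

med = np.median(data)
mad = np.median(np.abs(data - med))          # MAD = median(|x - median(x)|)
modified_z = 0.6745 * (data - med) / mad
outliers = data[np.abs(modified_z) > 3.5]
print(outliers)  # [95.]
```

The same point that scored only |Z| ≈ 2.6 under the classic Z-score scores roughly 56 here, because the median and MAD are unaffected by the outlier itself.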

Advantages:
  • Robust to outliers (uses median)
  • Works with non-normal data
  • Less sensitive to extreme values
  • Suitable for small samples
Code Example:
# Standard MAD threshold
outliers = cleaner.detect_outliers_modified_zscore('column')

# More conservative
outliers = cleaner.detect_outliers_modified_zscore(
    'column', 
    threshold=4.0
)

Multivariate Methods

Mahalanobis Distance

Method: detect_outliers_mahalanobis()

Mathematical Formula:

D² = (x - μ)ᵀ Σ⁻¹ (x - μ)

where μ is mean vector, Σ is covariance matrix

Outlier if D² > χ²(p, α) threshold
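
The quadratic form and the chi-squared cutoff can be sketched with NumPy and SciPy (simulated data; the flagged point is individually mild on each axis but jointly implausible given the correlation):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
# Two strongly correlated variables, plus one unusual combination
X = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=200)
X = np.vstack([X, [2.5, -2.5]])  # row 200: against the correlation structure

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # D² for every row

threshold = chi2.ppf(0.99, df=X.shape[1])  # 99th percentile, p = 2
outliers = np.where(d2 > threshold)[0]
print(outliers)
```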

Use Cases:
  • Multivariate outlier detection
  • Correlated variables
  • Unusual combinations of values
  • High-dimensional data
Assumptions:
  • Multivariate normality (approximately)
  • Variables are correlated
  • Sufficient sample size for covariance
Code Example:
# 95th percentile threshold
outliers = cleaner.detect_outliers_mahalanobis(
    ['var1', 'var2', 'var3']
)

# 99th percentile with shrinkage
outliers = cleaner.detect_outliers_mahalanobis(
    ['var1', 'var2'], 
    chi2_threshold=0.99,
    use_shrinkage=True
)

Formal Statistical Tests

Grubbs' Test

Method: grubbs_test()

Purpose: Test for single outliers in univariate normal data

Mathematical Formula:

G = max|xᵢ - x̄| / s

where s is sample standard deviation

Hypotheses:
  • H₀: No outliers present
  • H₁: One outlier present
Assumptions:
  • Data follows normal distribution
  • Testing for at most one outlier
  • Independent observations
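
As a sketch, the statistic and its two-sided critical value (derived from the t-distribution, as in the textbook formulation) can be computed with SciPy; StatClean's grubbs_test() may differ in implementation details:

```python
import numpy as np
from scipy.stats import t

data = np.array([10.0, 12, 11, 13, 12, 14, 11, 95])
n = len(data)
alpha = 0.05

# G = max|x_i - mean| / sample standard deviation
G = np.max(np.abs(data - data.mean())) / data.std(ddof=1)

# Two-sided Grubbs critical value from the t distribution
t_crit = t.ppf(1 - alpha / (2 * n), df=n - 2)
G_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

print(G > G_crit)  # True: reject H0, the most extreme point is an outlier
```
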
Code Example:
result = cleaner.grubbs_test(
    'column', 
    alpha=0.05
)
print(f"P-value: {result['p_value']}")
print(f"Outlier: {result['is_outlier']}")
print(f"Statistic: {result['statistic']}")

Dixon's Q-Test

Method: dixon_q_test()

Purpose: Test for outliers in small samples (n < 30)

Mathematical Formula:

Q = gap / range

where gap is the absolute difference between the suspect value and its nearest neighbor, and range is the spread of the entire sample

Use Cases:
  • Small sample sizes (n < 30)
  • Formal statistical testing needed
  • Quality control applications
  • Laboratory measurements
Note: Recommended only for small sample sizes (n < 30)
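
The Q statistic is simple enough to sketch by hand for the suspect maximum (in practice both ends are tested); the critical value below is the standard tabulated 95% value for n = 8, which StatClean's implementation may source differently:

```python
import numpy as np

data = np.sort(np.array([10.0, 11, 11, 12, 12, 13, 14, 95]))

# Q = gap to nearest neighbor / full range, for the suspect maximum
q_stat = (data[-1] - data[-2]) / (data[-1] - data[0])

Q_CRIT_95 = 0.526  # tabulated two-sided critical value, n = 8
print(q_stat > Q_CRIT_95)  # True: the maximum is flagged
```
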
Code Example:
result = cleaner.dixon_q_test(
    'column', 
    alpha=0.05
)
print(f"Q-statistic: {result['statistic']}")
print(f"Critical value: {result['critical_value']}")
print(f"P-value: {result['p_value']}")

Data Transformations

Box-Cox Transformation

Formula:

y(λ) = (x^λ - 1) / λ if λ ≠ 0

y(λ) = ln(x) if λ = 0

Purpose: Normalize skewed distributions, stabilize variance; requires strictly positive data (x > 0)

result = cleaner.transform_boxcox('column')
print(f"Lambda: {result['lambda']}")
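
The effect can be sketched directly with SciPy's boxcox on simulated right-skewed data (illustrative; λ is estimated by maximum likelihood):

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(1)
data = rng.lognormal(mean=0, sigma=1, size=1000)  # strongly right-skewed, x > 0

transformed, lam = boxcox(data)  # raises an error if any value is <= 0
print(skew(data), skew(transformed))  # large positive skew -> roughly zero
```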
Logarithmic Transformation

Variants: Natural log (ln), Base 10, Base 2

Purpose: Reduce right skewness, handle multiplicative relationships

Use Cases:

  • Exponential growth data
  • Financial data
  • Count data with large ranges
result = cleaner.transform_log(
    'column', 
    base='natural'
)
Square Root Transformation

Formula: y = √x

Purpose: Moderate right skewness, Poisson data

Use Cases:

  • Count data transformation
  • Variance stabilization
  • Moderate skewness reduction
result = cleaner.transform_sqrt('column')

Method Selection Guidelines

Distribution-Based Selection

Data                Primary Method      Alternative
------------------  ------------------  -------------------
Normal              Z-score             Grubbs' test
Skewed              Modified Z-score    IQR
Unknown             IQR                 Modified Z-score
Small (n < 30)      Dixon's Q-test      IQR
Multivariate        Mahalanobis         Multiple univariate
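
The table can be encoded as a simple rule of thumb. The helper below is hypothetical (not part of StatClean's API, which performs its own recommendation via analyze_distribution()):

```python
def recommend_method(n, skewness, n_variables=1):
    """Rule-of-thumb method choice following the selection table (hypothetical helper)."""
    if n_variables > 1:
        return 'mahalanobis'
    if n < 30:
        return 'dixon_q'
    if abs(skewness) < 0.5:       # approximately normal
        return 'zscore'
    return 'modified_zscore'      # skewed or unknown distribution

print(recommend_method(n=500, skewness=1.8))  # modified_zscore
```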

Automatic Method Selection

StatClean automatically recommends methods based on:

  • Sample size
  • Skewness level
  • Kurtosis
  • Normality test results
analysis = cleaner.analyze_distribution('column')
print(f"Recommended: {analysis['recommended_method']}")
print(f"Reason: {analysis['recommendation_reason']}")

Statistical Considerations

Type I and Type II Errors
Type I Error (False Positive):
  • Incorrectly identifying normal points as outliers
  • Controlled by significance level (α)
  • Lower α = fewer false positives, more missed outliers
Type II Error (False Negative):
  • Missing actual outliers
  • Influenced by effect size and sample size
  • Higher sensitivity = more false positives
Robustness vs. Efficiency
Robust Methods (IQR, MAD-based):
  • Less affected by outliers
  • Work with various distributions
  • May be less efficient with normal data
Efficient Methods (Z-score, Grubbs'):
  • Optimal for normal distributions
  • More powerful when assumptions met
  • Sensitive to assumption violations
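
This trade-off is easy to demonstrate: a single gross outlier drags the mean (and hence the Z-score baseline) far more than the median (simulated data, illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
clean = rng.normal(50, 5, size=100)
contaminated = np.append(clean, 500.0)  # one gross outlier

mean_shift = abs(contaminated.mean() - clean.mean())        # shifts by several units
median_shift = abs(np.median(contaminated) - np.median(clean))  # barely moves
print(mean_shift, median_shift)
```
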
Multiple Testing Correction

When testing multiple variables, consider:

  • Bonferroni correction: α_adjusted = α / n_tests
  • False Discovery Rate (FDR) control
  • Sequential testing procedures
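
Both corrections are short enough to sketch directly (illustrative p-values; for production use, statsmodels provides vetted implementations):

```python
import numpy as np

p_values = np.array([0.001, 0.008, 0.029, 0.041, 0.20])
alpha = 0.05
n = len(p_values)

# Bonferroni: compare each p-value to alpha / n_tests
bonferroni_reject = p_values < alpha / n

# Benjamini-Hochberg FDR: find the largest rank k with p_(k) <= (k/n) * alpha
order = np.argsort(p_values)
ranked = p_values[order]
below = ranked <= (np.arange(1, n + 1) / n) * alpha
k = np.max(np.where(below)[0]) + 1 if below.any() else 0
bh_reject = np.zeros(n, dtype=bool)
bh_reject[order[:k]] = True

print(bonferroni_reject.sum(), bh_reject.sum())  # 2 3 (FDR is less conservative)
```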

Validation and Diagnostics

Method Comparison

comparison = cleaner.compare_methods(
    ['column'], 
    methods=['iqr', 'zscore', 'modified_zscore']
)

Use agreement between methods as validation:

  • High agreement: Confident identification
  • Low agreement: Investigate data characteristics
  • Consider multiple perspectives

Distribution Assessment

analysis = cleaner.analyze_distribution('column')

Key diagnostics:

  • Skewness: |skew| < 0.5 (approximately normal)
  • Kurtosis: Normal ≈ 3, Heavy tails > 3
  • Shapiro-Wilk: p > 0.05 suggests normality
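
The same diagnostics can be sketched with SciPy on simulated normal data (note that SciPy's kurtosis defaults to the Fisher definition, which subtracts 3; fisher=False matches the "Normal ≈ 3" convention above):

```python
import numpy as np
from scipy.stats import skew, kurtosis, shapiro

rng = np.random.default_rng(4)
data = rng.normal(size=500)

print(f"Skewness: {skew(data):.2f}")                    # near 0 for normal data
print(f"Kurtosis: {kurtosis(data, fisher=False):.2f}")  # near 3 (Pearson definition)
stat, p = shapiro(data)
print(f"Shapiro-Wilk p: {p:.3f}")                       # p > 0.05 suggests normality
```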