Statistical Methods

Comprehensive guide to the mathematical foundations and appropriate use cases for each statistical method in StatClean.

Univariate Methods

Interquartile Range (IQR)

Method: detect_outliers_iqr()

Mathematical Formula:

Lower Bound = Q₁ - (lower_factor × IQR)

Upper Bound = Q₃ + (upper_factor × IQR)

where IQR = Q₃ - Q₁
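
The bounds above can be computed directly with NumPy as a minimal sketch, independent of StatClean (the sample data is illustrative):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 14, 11, 95])  # 95 is a clear outlier

# Quartiles and IQR, exactly as in the formula above
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr  # lower_factor = 1.5
upper = q3 + 1.5 * iqr  # upper_factor = 1.5

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [95]
```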

Use Cases:
  • Non-normal distributions
  • Skewed data
  • Robust to extreme values
  • No distributional assumptions
Code Example:
# Standard IQR (1.5 × IQR)
outliers = cleaner.detect_outliers_iqr('column')

# Conservative (2.0 × IQR)
outliers = cleaner.detect_outliers_iqr(
    'column', 
    lower_factor=2.0, 
    upper_factor=2.0
)

Z-Score Method

Method: detect_outliers_zscore()

Mathematical Formula:

Z = (x - μ) / σ

Outlier if |Z| > threshold (typically 3.0)
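
A minimal NumPy sketch of the formula (illustrative data; the threshold is lowered to 2.5 here because in a tiny sample the outlier inflates σ itself, pulling its own Z-score down):

```python
import numpy as np

data = np.array([10.0, 12, 11, 13, 12, 14, 11, 95])

# Z = (x - mean) / population standard deviation
z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) > 2.5]
print(outliers)  # [95.]
```

Note that the extreme point only reaches |Z| ≈ 2.6 despite being wildly out of range, which is exactly the sensitivity problem the MAD-based method below addresses.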

Use Cases:
  • Normally distributed data
  • Standardized thresholds needed
  • Large sample sizes (n > 30)
Assumptions:
  • Data follows normal distribution
  • Sample statistics represent population
Code Example:
# Standard threshold (|Z| > 3)
outliers = cleaner.detect_outliers_zscore('column')

# More sensitive (|Z| > 2.5)
outliers = cleaner.detect_outliers_zscore(
    'column', 
    threshold=2.5
)

Modified Z-Score (MAD-based)

Method: detect_outliers_modified_zscore()

Mathematical Formula:

Modified Z = 0.6745 × (x - median) / MAD

where MAD = median(|x - median(x)|)

Outlier if |Modified Z| > threshold (typically 3.5)
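
A minimal NumPy sketch of the same formula (illustrative data), which makes the robustness advantage over the classic Z-score concrete:

```python
import numpy as np

data = np.array([10.0, 12, 11, 13, 12, 14, 11, 95])

med = np.median(data)
mad = np.median(np.abs(data - med))          # MAD = median(|x - median(x)|)
modified_z = 0.6745 * (data - med) / mad
outliers = data[np.abs(modified_z) > 3.5]
print(outliers)  # [95.]
```

The same point that scored only |Z| ≈ 2.6 under the classic Z-score scores roughly 56 here, because the median and MAD are unaffected by the outlier itself.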

Advantages:
  • Robust to outliers (uses median)
  • Works with non-normal data
  • Less sensitive to extreme values
  • Suitable for small samples
Code Example:
# Standard MAD threshold
outliers = cleaner.detect_outliers_modified_zscore('column')

# More conservative
outliers = cleaner.detect_outliers_modified_zscore(
    'column', 
    threshold=4.0
)

Multivariate Methods

Mahalanobis Distance

Method: detect_outliers_mahalanobis()

Mathematical Formula:

D² = (x - μ)ᵀ Σ⁻¹ (x - μ)

where μ is mean vector, Σ is covariance matrix

Outlier if D² > χ²(p, α) threshold
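
The quadratic form and the chi-squared cutoff can be sketched with NumPy and SciPy (simulated data; the flagged point is individually mild on each axis but jointly implausible given the correlation):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
# Two strongly correlated variables, plus one unusual combination
X = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=200)
X = np.vstack([X, [2.5, -2.5]])  # row 200: against the correlation structure

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # D² for every row

threshold = chi2.ppf(0.99, df=X.shape[1])  # 99th percentile, p = 2
outliers = np.where(d2 > threshold)[0]
print(outliers)
```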

Use Cases:
  • Multivariate outlier detection
  • Correlated variables
  • Unusual combinations of values
  • High-dimensional data
Assumptions:
  • Multivariate normality (approximately)
  • Variables are correlated
  • Sufficient sample size for covariance
Code Example:
# 95th percentile threshold
outliers = cleaner.detect_outliers_mahalanobis(
    ['var1', 'var2', 'var3']
)

# 99th percentile with shrinkage
outliers = cleaner.detect_outliers_mahalanobis(
    ['var1', 'var2'], 
    chi2_threshold=0.99,
    use_shrinkage=True
)

Formal Statistical Tests

Grubbs' Test

Method: grubbs_test()

Purpose: Test for single outliers in univariate normal data

Mathematical Formula:

G = max|xᵢ - x̄| / s

where s is sample standard deviation

Hypotheses:
  • H₀: No outliers present
  • H₁: One outlier present
Assumptions:
  • Data follows normal distribution
  • Testing for at most one outlier
  • Independent observations
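
As a sketch, the statistic and its two-sided critical value (derived from the t-distribution, as in the textbook formulation) can be computed with SciPy; StatClean's grubbs_test() may differ in implementation details:

```python
import numpy as np
from scipy.stats import t

data = np.array([10.0, 12, 11, 13, 12, 14, 11, 95])
n = len(data)
alpha = 0.05

# G = max|x_i - mean| / sample standard deviation
G = np.max(np.abs(data - data.mean())) / data.std(ddof=1)

# Two-sided Grubbs critical value from the t distribution
t_crit = t.ppf(1 - alpha / (2 * n), df=n - 2)
G_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

print(G > G_crit)  # True: reject H0, the most extreme point is an outlier
```
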
Code Example:
result = cleaner.grubbs_test(
    'column', 
    alpha=0.05
)
print(f"P-value: {result['p_value']}")
print(f"Outlier: {result['is_outlier']}")
print(f"Statistic: {result['statistic']}")

Dixon's Q-Test

Method: dixon_q_test()

Purpose: Test for outliers in small samples (n < 30)

Mathematical Formula:

Q = gap / range

where gap is the absolute difference between the suspect value and its nearest neighbor, and range is the spread of the entire sample

Use Cases:
  • Small sample sizes (n < 30)
  • Formal statistical testing needed
  • Quality control applications
  • Laboratory measurements
Note: Recommended only for small sample sizes (n < 30)
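
The Q statistic is simple enough to sketch by hand for the suspect maximum (in practice both ends are tested); the critical value below is the standard tabulated 95% value for n = 8, which StatClean's implementation may source differently:

```python
import numpy as np

data = np.sort(np.array([10.0, 11, 11, 12, 12, 13, 14, 95]))

# Q = gap to nearest neighbor / full range, for the suspect maximum
q_stat = (data[-1] - data[-2]) / (data[-1] - data[0])

Q_CRIT_95 = 0.526  # tabulated two-sided critical value, n = 8
print(q_stat > Q_CRIT_95)  # True: the maximum is flagged
```
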
Code Example:
result = cleaner.dixon_q_test(
    'column', 
    alpha=0.05
)
print(f"Q-statistic: {result['statistic']}")
print(f"Critical value: {result['critical_value']}")
print(f"P-value: {result['p_value']}")

Data Transformations

Box-Cox Transformation

Formula:

y(λ) = (x^λ - 1) / λ if λ ≠ 0

y(λ) = ln(x) if λ = 0

Purpose: Normalize skewed distributions, stabilize variance; requires strictly positive data (x > 0)

result = cleaner.transform_boxcox('column')
print(f"Lambda: {result['lambda']}")
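
The effect can be sketched directly with SciPy's boxcox on simulated right-skewed data (illustrative; λ is estimated by maximum likelihood):

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(1)
data = rng.lognormal(mean=0, sigma=1, size=1000)  # strongly right-skewed, x > 0

transformed, lam = boxcox(data)  # raises an error if any value is <= 0
print(skew(data), skew(transformed))  # large positive skew -> roughly zero
```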
Logarithmic Transformation

Variants: Natural log (ln), Base 10, Base 2

Purpose: Reduce right skewness, handle multiplicative relationships

Use Cases:

  • Exponential growth data
  • Financial data
  • Count data with large ranges
result = cleaner.transform_log(
    'column', 
    base='natural'
)
Square Root Transformation

Formula: y = √x

Purpose: Moderate right skewness, Poisson data

Use Cases:

  • Count data transformation
  • Variance stabilization
  • Moderate skewness reduction
result = cleaner.transform_sqrt('column')

Method Selection Guidelines

Distribution-Based Selection

Data                Primary Method      Alternative
------------------  ------------------  -------------------
Normal              Z-score             Grubbs' test
Skewed              Modified Z-score    IQR
Unknown             IQR                 Modified Z-score
Small (n < 30)      Dixon's Q-test      IQR
Multivariate        Mahalanobis         Multiple univariate
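
The table can be encoded as a simple rule of thumb. The helper below is hypothetical (not part of StatClean's API, which performs its own recommendation via analyze_distribution()):

```python
def recommend_method(n, skewness, n_variables=1):
    """Rule-of-thumb method choice following the selection table (hypothetical helper)."""
    if n_variables > 1:
        return 'mahalanobis'
    if n < 30:
        return 'dixon_q'
    if abs(skewness) < 0.5:       # approximately normal
        return 'zscore'
    return 'modified_zscore'      # skewed or unknown distribution

print(recommend_method(n=500, skewness=1.8))  # modified_zscore
```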

Automatic Method Selection

StatClean automatically recommends methods based on:

  • Sample size
  • Skewness level
  • Kurtosis
  • Normality test results
analysis = cleaner.analyze_distribution('column')
print(f"Recommended: {analysis['recommended_method']}")
print(f"Reason: {analysis['recommendation_reason']}")

Statistical Considerations

Type I and Type II Errors
Type I Error (False Positive):
  • Incorrectly identifying normal points as outliers
  • Controlled by significance level (α)
  • Lower α = fewer false positives, more missed outliers
Type II Error (False Negative):
  • Missing actual outliers
  • Influenced by effect size and sample size
  • Higher sensitivity = more false positives
Robustness vs. Efficiency
Robust Methods (IQR, MAD-based):
  • Less affected by outliers
  • Work with various distributions
  • May be less efficient with normal data
Efficient Methods (Z-score, Grubbs'):
  • Optimal for normal distributions
  • More powerful when assumptions met
  • Sensitive to assumption violations
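
This trade-off is easy to demonstrate: a single gross outlier drags the mean (and hence the Z-score baseline) far more than the median (simulated data, illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
clean = rng.normal(50, 5, size=100)
contaminated = np.append(clean, 500.0)  # one gross outlier

mean_shift = abs(contaminated.mean() - clean.mean())        # shifts by several units
median_shift = abs(np.median(contaminated) - np.median(clean))  # barely moves
print(mean_shift, median_shift)
```
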
Multiple Testing Correction

When testing multiple variables, consider:

  • Bonferroni correction: α_adjusted = α / n_tests
  • False Discovery Rate (FDR) control
  • Sequential testing procedures
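
Both corrections are short enough to sketch directly (illustrative p-values; for production use, statsmodels provides vetted implementations):

```python
import numpy as np

p_values = np.array([0.001, 0.008, 0.029, 0.041, 0.20])
alpha = 0.05
n = len(p_values)

# Bonferroni: compare each p-value to alpha / n_tests
bonferroni_reject = p_values < alpha / n

# Benjamini-Hochberg FDR: find the largest rank k with p_(k) <= (k/n) * alpha
order = np.argsort(p_values)
ranked = p_values[order]
below = ranked <= (np.arange(1, n + 1) / n) * alpha
k = np.max(np.where(below)[0]) + 1 if below.any() else 0
bh_reject = np.zeros(n, dtype=bool)
bh_reject[order[:k]] = True

print(bonferroni_reject.sum(), bh_reject.sum())  # 2 3 (FDR is less conservative)
```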

Validation and Diagnostics

Method Comparison

comparison = cleaner.compare_methods(
    ['column'], 
    methods=['iqr', 'zscore', 'modified_zscore']
)

Use agreement between methods as validation:

  • High agreement: Confident identification
  • Low agreement: Investigate data characteristics
  • Consider multiple perspectives

Distribution Assessment

analysis = cleaner.analyze_distribution('column')

Key diagnostics:

  • Skewness: |skew| < 0.5 (approximately normal)
  • Kurtosis: Normal ≈ 3, Heavy tails > 3
  • Shapiro-Wilk: p > 0.05 suggests normality
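
The same diagnostics can be sketched with SciPy on simulated normal data (note that SciPy's kurtosis defaults to the Fisher definition, which subtracts 3; fisher=False matches the "Normal ≈ 3" convention above):

```python
import numpy as np
from scipy.stats import skew, kurtosis, shapiro

rng = np.random.default_rng(4)
data = rng.normal(size=500)

print(f"Skewness: {skew(data):.2f}")                    # near 0 for normal data
print(f"Kurtosis: {kurtosis(data, fisher=False):.2f}")  # near 3 (Pearson definition)
stat, p = shapiro(data)
print(f"Shapiro-Wilk p: {p:.3f}")                       # p > 0.05 suggests normality
```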