StatClean

A Python library for statistical data preprocessing and outlier detection, with formal statistical tests (Grubbs, Dixon's Q) and publication-quality reporting.

Note: The remove_outliers_* methods return the cleaner itself (self) rather than a DataFrame, so calls can be chained. Read the cleaned data from cleaner.clean_df and the details from cleaner.outlier_info.

Quick Start

Installation
pip install statclean

Basic Usage

import pandas as pd
from statclean import StatClean

# Sample data with outliers
df = pd.DataFrame({
    'values': [1, 2, 3, 100, 4, 5, 6]
})

# Initialize and clean
cleaner = StatClean(df)
cleaner.remove_outliers_zscore('values')
cleaned_df = cleaner.clean_df

print(f"Original: {df.shape}")
print(f"Cleaned: {cleaned_df.shape}")
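
For intuition about what a Z-score filter computes, here is a minimal NumPy sketch. It is not StatClean's implementation, and the cutoff of 3.0 is an assumed conventional value, not a confirmed library default. It also illustrates a small-sample caveat: with the sample standard deviation, no |z| can exceed (n - 1) / sqrt(n), so in a 7-point sample like the one above a cutoff of 3.0 cannot flag any value, including 100.

```python
import numpy as np

values = np.array([1, 2, 3, 100, 4, 5, 6], dtype=float)

# Z-score: distance from the mean in units of the sample standard deviation.
z = np.abs(values - values.mean()) / values.std(ddof=1)

# Upper bound on |z| for a sample of size n (sample standard deviation).
n = len(values)
max_possible_z = (n - 1) / np.sqrt(n)

threshold = 3.0  # assumed conventional cutoff, not a confirmed StatClean default
flagged = values[z > threshold]

print(f"largest |z|: {z.max():.3f}")
print(f"bound for n={n}: {max_possible_z:.3f}")
print(f"flagged at {threshold}: {flagged}")
```

If the cleaned shape matches the original shape on a small sample, a lower threshold or a robust method such as the modified Z-score or IQR is worth trying.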

Key Features

Statistical Testing
  • Grubbs' test with p-values
  • Dixon's Q-test for small samples
  • Distribution analysis
  • Method comparison

Detection Methods
  • IQR, Z-score, Modified Z-score
  • Mahalanobis distance
  • Batch processing
  • Automatic selection

Treatment Options
  • Outlier removal
  • Winsorizing
  • Data transformations
  • Method chaining
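
To make the difference between removal and winsorizing concrete, the following pandas sketch applies the IQR rule (Tukey fences) to a small series. It is an illustration under stated assumptions, not StatClean's code: the 1.5 multiplier is the conventional choice, and capping at the IQR fences is only one way to winsorize (percentile limits are another).

```python
import pandas as pd

s = pd.Series([1, 2, 3, 100, 4, 5, 6], dtype=float)

# Tukey fences: 1.5 * IQR beyond the first and third quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (s < lower) | (s > upper)

# Removal drops the flagged rows; winsorizing keeps them but caps the values.
removed = s[~is_outlier]
winsorized = s.clip(lower=lower, upper=upper)

print(f"fences: [{lower}, {upper}]")
print(f"rows after removal: {len(removed)}")
print(f"rows after winsorizing: {len(winsorized)}")
```

Removal shrinks the dataset, while winsorizing preserves the row count, which matters when other columns in the same rows carry useful information.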

Method Overview

Method             Univariate   Multivariate   Formal Test
IQR                Yes          -              -
Z-score            Yes          -              -
Modified Z-score   Yes          -              -
Mahalanobis        -            Yes            -
Grubbs             Yes          -              Yes
Dixon Q            Yes          -              Yes

Advanced Usage

Statistical Testing

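# Grubbs' test for a single outlier.
# Assumes `cleaner` wraps a DataFrame with 'income' and 'age' columns
# (the Quick Start sample only has 'values').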
result = cleaner.grubbs_test('income', alpha=0.05)
print(f"P-value: {result['p_value']:.6f}")
print(f"Outlier: {result['is_outlier']}")
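
As a rough picture of what a Grubbs' test involves, the sketch below computes the two-sided statistic and its critical value with SciPy. The function name and the returned keys are hypothetical and chosen for illustration; this is not StatClean's implementation, and it does not compute a p-value. The test assumes approximately normal data.

```python
import numpy as np
from scipy import stats


def grubbs_sketch(x, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier (illustrative only)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if n < 3:
        raise ValueError("Grubbs' test needs at least 3 observations")

    # Statistic: largest absolute deviation from the mean, in sample SDs.
    deviations = np.abs(x - x.mean())
    g = deviations.max() / x.std(ddof=1)

    # Critical value derived from the t distribution with n - 2 degrees of freedom.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

    return {
        "statistic": float(g),
        "critical_value": float(g_crit),
        "is_outlier": bool(g > g_crit),
        "index": int(deviations.argmax()),
    }


result = grubbs_sketch([1, 2, 3, 100, 4, 5, 6])
print(result)
```

Unlike a fixed Z-score cutoff, the critical value here depends on the sample size, which is why a formal test can flag a point in a small sample that a threshold of 3.0 would miss.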

# Dixon's Q-test for small samples
result = cleaner.dixon_q_test('age', alpha=0.05)
print(f"Statistic: {result['statistic']:.3f}")
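
Dixon's Q compares the gap between a suspect extreme value and its nearest neighbour with the full range of the sample. The pure-Python sketch below uses the commonly tabulated two-sided critical values at 95% confidence for n = 3 to 10. The function name and return keys are hypothetical, and the library's own variant and table may differ.

```python
# Commonly tabulated two-sided critical values at 95% confidence, n = 3..10.
Q95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
       7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}


def dixon_q_sketch(x):
    """Dixon's Q test on the more extreme end of a small sample (illustrative)."""
    data = sorted(x)
    n = len(data)
    if n not in Q95:
        raise ValueError("this sketch only covers 3 <= n <= 10")
    spread = data[-1] - data[0]
    if spread == 0:
        raise ValueError("all values are identical")

    # Q = gap to the nearest neighbour divided by the range, for each end.
    q_low = (data[1] - data[0]) / spread
    q_high = (data[-1] - data[-2]) / spread
    statistic = max(q_low, q_high)
    suspect = data[0] if q_low > q_high else data[-1]

    return {
        "statistic": statistic,
        "critical_value": Q95[n],
        "suspect": suspect,
        "is_outlier": statistic > Q95[n],
    }


q_result = dixon_q_sketch([1, 2, 3, 100, 4, 5, 6])
print(q_result)
```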

Multivariate Detection

# Mahalanobis distance across several columns at once
outliers = cleaner.detect_outliers_mahalanobis(
    ['income', 'age'],
    chi2_threshold=0.95
)
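
The idea behind Mahalanobis-based detection can be sketched with NumPy and SciPy as follows. This is an illustration, not the library's code: it assumes the threshold is a chi-square quantile with degrees of freedom equal to the number of columns, which is one plausible reading of chi2_threshold=0.95, and the function name and synthetic data are hypothetical.

```python
import numpy as np
from scipy import stats


def mahalanobis_flags(X, quantile=0.95):
    """Squared Mahalanobis distances and flags against a chi-square cutoff."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # The pseudo-inverse tolerates a near-singular covariance matrix.
    inv_cov = np.linalg.pinv(cov)
    diff = X - center
    # d2[i] = diff[i] @ inv_cov @ diff[i]
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    cutoff = stats.chi2.ppf(quantile, df=X.shape[1])
    return d2, d2 > cutoff


# Synthetic two-column data with one planted joint outlier in row 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [10.0, -10.0]

d2, flags = mahalanobis_flags(X)
print(f"row 0 flagged: {flags[0]}, total flagged: {flags.sum()}")
```

Because the distance accounts for the covariance between columns, it can flag a row whose individual values look unremarkable but whose combination is unusual.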

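# Method chaining: Box-Cox transform first (requires strictly positive values),
# then outlier removal on the transformed column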
cleaned = (cleaner
    .transform_boxcox('income')
    .remove_outliers_modified_zscore('income')
    .clean_df)
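
Chaining works because each step returns the cleaner object itself and keeps its state in attributes. The toy class below sketches that pattern. MiniCleaner is a hypothetical stand-in written for this example, not the StatClean class: the method names mirror the ones above, but the bodies, the 3.5 threshold, and the contents of outlier_info are assumptions.

```python
import pandas as pd
from scipy import stats


class MiniCleaner:
    """Hypothetical stand-in for illustration; not the StatClean class."""

    def __init__(self, df):
        self.clean_df = df.copy()
        self.outlier_info = {}

    def transform_boxcox(self, column):
        # Box-Cox is only defined for strictly positive values.
        data = self.clean_df[column].to_numpy(dtype=float)
        transformed, lam = stats.boxcox(data)
        self.clean_df[column] = transformed
        self.outlier_info[column] = {"boxcox_lambda": float(lam)}
        return self  # returning self is what enables chaining

    def remove_outliers_modified_zscore(self, column, threshold=3.5):
        x = self.clean_df[column]
        median = x.median()
        # Median absolute deviation; a zero MAD would need special handling.
        mad = (x - median).abs().median()
        score = 0.6745 * (x - median).abs() / mad
        keep = score <= threshold
        info = self.outlier_info.setdefault(column, {})
        info["n_removed"] = int((~keep).sum())
        self.clean_df = self.clean_df[keep]
        return self


df = pd.DataFrame({"income": [30.0, 32.0, 35.0, 31.0, 33.0, 34.0, 500.0]})
mini = MiniCleaner(df)
cleaned = (mini
    .transform_boxcox("income")
    .remove_outliers_modified_zscore("income")
    .clean_df)

print(cleaned.shape, mini.outlier_info)
```

Note that a transform applied before detection changes which points look extreme, so the order of the chained steps affects the result.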