API Reference

Complete reference documentation for all StatClean classes, methods, and functions.

StatClean Class

Initialization

StatClean(df=None, preserve_index=True)
Parameters:
  • df (pandas.DataFrame, default None): The DataFrame to clean
  • preserve_index (bool, default True): Whether to preserve the original index
Example:
import pandas as pd
from statclean import StatClean

df = pd.DataFrame({'values': [1, 2, 3, 100, 4, 5]})
cleaner = StatClean(df, preserve_index=True)

Detection Methods

Methods for detecting outliers without removing them from the dataset.

detect_outliers_iqr(column, lower_factor=1.5, upper_factor=1.5)

Detect outliers using the Interquartile Range method.

Parameters:
  • column (str): Column name to analyze
  • lower_factor (float): Lower bound multiplier for IQR
  • upper_factor (float): Upper bound multiplier for IQR
Returns:

pandas.Series: Boolean mask indicating outliers

Example:
# Standard IQR
outliers = cleaner.detect_outliers_iqr('values')

# Conservative approach
outliers = cleaner.detect_outliers_iqr(
    'values', 
    lower_factor=2.0, 
    upper_factor=2.0
)
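To show what the two factors control, here is a minimal sketch of IQR fences in plain pandas. It is not StatClean's implementation: the helper name iqr_mask is hypothetical, and the sketch assumes a value is an outlier when it lies strictly outside [Q1 - lower_factor * IQR, Q3 + upper_factor * IQR].

```python
import pandas as pd

def iqr_mask(series: pd.Series, lower_factor: float = 1.5,
             upper_factor: float = 1.5) -> pd.Series:
    """Hypothetical helper: True where a value falls outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - lower_factor * iqr   # lower fence
    upper = q3 + upper_factor * iqr   # upper fence
    return (series < lower) | (series > upper)

values = pd.Series([1, 2, 3, 100, 4, 5])
mask = iqr_mask(values)                       # standard 1.5 * IQR fences
wide_mask = iqr_mask(values, 2.0, 2.0)        # wider fences flag fewer points
print(values[mask])
```

Larger factors widen the fences, so fewer points are flagged. Asymmetric factors can help when only one tail is suspect.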
detect_outliers_zscore(column, threshold=3.0)

Detect outliers using the Z-score method.

Parameters:
  • column (str): Column name to analyze
  • threshold (float): Z-score threshold for outlier detection
Returns:

pandas.Series: Boolean mask indicating outliers

Example:
# Standard threshold
outliers = cleaner.detect_outliers_zscore('values')

# More sensitive
outliers = cleaner.detect_outliers_zscore(
    'values', 
    threshold=2.5
)
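The Z-score rule can be sketched in a few lines of pandas. This is an illustration only, not StatClean's code: the helper name zscore_mask is hypothetical, and the choice of the population standard deviation (ddof=0) is an assumption.

```python
import pandas as pd

def zscore_mask(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Hypothetical helper: True where |z| exceeds the threshold."""
    std = series.std(ddof=0)          # population standard deviation (assumed)
    if std == 0:
        # Constant column: no value deviates from the mean
        return pd.Series(False, index=series.index)
    z = (series - series.mean()) / std
    return z.abs() > threshold

values = pd.Series([1, 2, 3, 100, 4, 5])
default_mask = zscore_mask(values)                 # threshold 3.0
loose_mask = zscore_mask(values, threshold=2.0)    # more sensitive
print(loose_mask)
```

Under this definition, the largest possible |z| in a sample of n points is sqrt(n - 1), about 2.24 for six observations. A threshold of 3.0 therefore cannot flag anything in a sample this small, which is one reason to prefer a robust method for short columns.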
detect_outliers_modified_zscore(column, threshold=3.5)

Detect outliers using the Modified Z-score (MAD-based) method.

Parameters:
  • column (str): Column name to analyze
  • threshold (float): Modified Z-score threshold
Returns:

pandas.Series: Boolean mask indicating outliers

Example:
# Standard MAD threshold
outliers = cleaner.detect_outliers_modified_zscore('values')

# More conservative
outliers = cleaner.detect_outliers_modified_zscore(
    'values', 
    threshold=4.0
)
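For comparison, a minimal sketch of the MAD-based score in plain pandas follows. The helper name modified_zscore_mask is hypothetical, and the sketch assumes the common definition M = 0.6745 * (x - median) / MAD. Returning no outliers when MAD is zero is a choice made for this sketch, not documented StatClean behaviour.

```python
import pandas as pd

def modified_zscore_mask(series: pd.Series, threshold: float = 3.5) -> pd.Series:
    """Hypothetical helper: True where the MAD-based score exceeds the threshold."""
    median = series.median()
    mad = (series - median).abs().median()   # median absolute deviation
    if mad == 0:
        # Score is undefined; this sketch simply flags nothing
        return pd.Series(False, index=series.index)
    score = 0.6745 * (series - median) / mad
    return score.abs() > threshold

values = pd.Series([1, 2, 3, 100, 4, 5])
mad_mask = modified_zscore_mask(values)
print(values[mad_mask])
```

Because the median and MAD are barely moved by a single extreme value, this score flags 100 in the six-point sample even at the default threshold.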
detect_outliers_mahalanobis(columns, chi2_threshold=None, use_shrinkage=False)

Detect multivariate outliers using Mahalanobis distance.

Parameters:
  • columns (list): List of column names for multivariate analysis
  • chi2_threshold (float): Chi-square threshold, or a percentile in the interval (0, 1]; None uses the 97.5th percentile
  • use_shrinkage (bool): Use Ledoit–Wolf shrinkage covariance
Returns:

pandas.Series: Boolean mask indicating outliers

Example:
# 97.5th percentile (default)
outliers = cleaner.detect_outliers_mahalanobis(
    ['var1', 'var2']
)

# 99th percentile with shrinkage
outliers = cleaner.detect_outliers_mahalanobis(
    ['var1', 'var2'], 
    chi2_threshold=0.99,
    use_shrinkage=True
)
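The idea behind the percentile cutoff can be sketched with NumPy and SciPy. This is an assumption-laden illustration, not StatClean's implementation: the helper name mahalanobis_mask is hypothetical, it uses the plain sample covariance, and it compares squared distances with a chi-square quantile whose degrees of freedom equal the number of columns.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_mask(X, percentile: float = 0.975) -> np.ndarray:
    """Hypothetical helper: True where the squared Mahalanobis distance
    exceeds the chi-square quantile for the given percentile."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)            # sample covariance matrix
    inv_cov = np.linalg.inv(cov)
    # Row-wise quadratic form: diff_i^T * inv_cov * diff_i
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    cutoff = chi2.ppf(percentile, df=X.shape[1])
    return d2 > cutoff

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [10.0, 10.0]                          # plant one obvious outlier
mask = mahalanobis_mask(X)
print(int(mask.sum()), "points flagged")
```

When columns are strongly correlated or rows are few, the sample covariance can be close to singular. A shrinkage estimate such as scikit-learn's LedoitWolf could be substituted for np.cov in that case.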

Treatment Methods

Methods for removing or treating detected outliers.

Important: All treatment methods return self to enable method chaining. Access results via cleaner.clean_df and cleaner.outlier_info.
Removal Methods
  • remove_outliers_iqr(column, ...)
  • remove_outliers_zscore(column, threshold=3.0)
  • remove_outliers_modified_zscore(column, threshold=3.5)
  • remove_outliers_mahalanobis(columns, ...)
Example:
cleaner.remove_outliers_zscore('values')
cleaned_df = cleaner.clean_df
Winsorizing Methods
  • winsorize_outliers_iqr(column, ...)
  • winsorize_outliers_zscore(column, threshold=3.0)
  • winsorize_outliers_percentile(column, lower=5, upper=95)
Example:
cleaner.winsorize_outliers_iqr('values')
winsorized_df = cleaner.clean_df
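Winsorizing caps extreme values instead of dropping rows. A minimal sketch of the percentile variant in plain pandas is shown here. The helper name winsorize_percentile is hypothetical, and using Series.quantile with Series.clip is an assumption about one reasonable approach, not a description of StatClean internals.

```python
import pandas as pd

def winsorize_percentile(series: pd.Series, lower: float = 5,
                         upper: float = 95) -> pd.Series:
    """Hypothetical helper: cap values at the given lower/upper percentiles."""
    lo = series.quantile(lower / 100)
    hi = series.quantile(upper / 100)
    return series.clip(lower=lo, upper=hi)

values = pd.Series([1, 2, 3, 100, 4, 5])
capped = winsorize_percentile(values)
print(capped)
```

The row count is unchanged, which matters when other columns in the same rows carry information worth keeping.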

Statistical Testing

Formal statistical tests for outliers that report test statistics, critical values, and p-values.

grubbs_test(column, alpha=0.05, two_sided=True)

Perform Grubbs' test for a single outlier at the given significance level.

Parameters:
  • column (str): Column name to test
  • alpha (float): Significance level
  • two_sided (bool): Whether to perform a two-sided test
Returns:

dict: Test results including statistic, p_value, critical_value, is_outlier, outlier_value, outlier_index

Example:
result = cleaner.grubbs_test('values', alpha=0.05)
print(f"P-value: {result['p_value']:.6f}")
print(f"Outlier: {result['is_outlier']}")
print(f"Value: {result['outlier_value']}")
print(f"Index: {result['outlier_index']}")
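To make the returned statistic and critical_value concrete, here is a sketch of the two-sided Grubbs computation using NumPy and SciPy. The function name grubbs_sketch and its return keys are hypothetical and cover only part of what grubbs_test reports; the formula is the standard one based on the t distribution with n - 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

def grubbs_sketch(x, alpha: float = 0.05) -> dict:
    """Hypothetical helper: two-sided Grubbs statistic and critical value."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean, sd = x.mean(), x.std(ddof=1)            # sample standard deviation
    idx = int(np.argmax(np.abs(x - mean)))        # most extreme observation
    g = abs(x[idx] - mean) / sd
    # Two-sided test: upper alpha / (2n) critical point of t with n - 2 df
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return {
        "statistic": g,
        "critical_value": g_crit,
        "is_outlier": bool(g > g_crit),
        "outlier_index": idx,
    }

result = grubbs_sketch([1, 2, 3, 100, 4, 5])
print(result)
```

Grubbs' test assumes approximately normal data and examines one candidate at a time, so it is usually rerun after removing a confirmed outlier.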
dixon_q_test(column, alpha=0.05)

Perform Dixon's Q-test for small samples (n < 30).

Parameters:
  • column (str): Column name to test
  • alpha (float): Significance level
Returns:

dict: Test results including statistic, critical_value, p_value, is_outlier

Example:
result = cleaner.dixon_q_test('values', alpha=0.05)
print(f"Q-statistic: {result['statistic']:.3f}")
print(f"Critical: {result['critical_value']:.3f}")
print(f"P-value: {result['p_value']:.6f}")
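The Q statistic itself is simple to compute. The sketch below uses the basic gap-over-range ratio and is only an illustration: the helper name dixon_q_statistic is hypothetical, and StatClean may use other Dixon ratios or a different critical-value table depending on sample size.

```python
import numpy as np

def dixon_q_statistic(x) -> float:
    """Hypothetical helper: largest gap-to-range ratio at either end of the sample."""
    s = np.sort(np.asarray(x, dtype=float))
    data_range = s[-1] - s[0]
    if data_range == 0:
        return 0.0                                # all values identical
    q_low = (s[1] - s[0]) / data_range            # gap at the low end
    q_high = (s[-1] - s[-2]) / data_range         # gap at the high end
    return float(max(q_low, q_high))

q = dixon_q_statistic([1, 2, 3, 100, 4, 5])
print(f"Q = {q:.4f}")
```

The statistic is then compared with a tabulated critical value for the sample size and chosen alpha; a larger Q means the suspect value sits further from its nearest neighbour relative to the overall range.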

Data Transformations

Methods for transforming data to improve normality and reduce skewness.

transform_boxcox(column, lambda_param=None)

Apply a Box-Cox transformation. Lambda is estimated automatically when lambda_param is None.

Parameters:
  • column (str): Column name to transform
  • lambda_param (float): Transformation parameter (auto if None)
Returns:

tuple: (self, transformation_info_dict)

Example:
_, info = cleaner.transform_boxcox('income')
print(f"Lambda: {info['lambda']:.3f}")
print(f"Skewness before: {info['skewness_before']:.3f}")
print(f"Skewness after: {info['skewness_after']:.3f}")
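For a fixed lambda, the Box-Cox formula is (x^lambda - 1) / lambda, with log(x) as the limiting case at lambda = 0. A minimal NumPy sketch follows. The helper name boxcox_fixed is hypothetical, and raising on non-positive input is a choice made for this sketch; how StatClean handles such values is not stated here.

```python
import numpy as np

def boxcox_fixed(x, lam: float) -> np.ndarray:
    """Hypothetical helper: Box-Cox transform for a known lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        # The transform is only defined for strictly positive values
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)
    return (x**lam - 1.0) / lam

data = np.array([1.0, 4.0, 9.0])
as_log = boxcox_fixed(data, 0)        # same as np.log(data)
as_sqrt = boxcox_fixed(data, 0.5)     # 2 * (sqrt(x) - 1)
print(as_sqrt)
```

When lambda is not supplied, it is typically chosen by maximum likelihood; scipy.stats.boxcox does this when called without a lambda value.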
recommend_transformation(column)

Recommend the most suitable transformation based on the column's distribution.

Parameters:
  • column (str): Column name to analyze
Returns:

dict: Recommendations including best transformation and improvement metrics

Example:
rec = cleaner.recommend_transformation('income')
print(f"Method: {rec['recommended_method']}")
print(f"Improvement: {rec['expected_improvement']:.3f}")
Additional Transformation Methods:
  • transform_log(column, base='natural') - Logarithmic transformation (natural, base 10, base 2)
  • transform_sqrt(column) - Square root transformation
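A quick way to judge whether a log or square root transformation helps is to compare skewness before and after. The sketch below uses plain pandas and NumPy on made-up numbers; it does not call StatClean, and the income values are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed data (not from any real dataset)
income = pd.Series([1, 2, 3, 4, 100], dtype=float)

skew_before = income.skew()
skew_log = np.log(income).skew()      # natural log; requires positive values
skew_sqrt = np.sqrt(income).skew()    # requires non-negative values

print(f"raw: {skew_before:.3f}, log: {skew_log:.3f}, sqrt: {skew_sqrt:.3f}")
```

For right-skewed data the log transform usually reduces skewness more than the square root, at the cost of requiring strictly positive input.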

Analysis Methods

Methods for analyzing data distribution and comparing detection methods.

analyze_distribution(column)

Comprehensive distribution analysis with statistical tests and method recommendations.

Parameters:
  • column (str): Column name to analyze
Returns:

dict: Distribution analysis including skewness, kurtosis, normality test, recommended method

Example:
analysis = cleaner.analyze_distribution('income')
print(f"Skewness: {analysis['skewness']:.3f}")
print(f"Kurtosis: {analysis['kurtosis']:.3f}")
print(f"Normal p-value: {analysis['normality_test']['p_value']:.6f}")
print(f"Recommended: {analysis['recommended_method']}")
print(f"Reason: {analysis['recommendation_reason']}")
compare_methods(columns, methods=['iqr', 'zscore', 'modified_zscore'])

Compare agreement between different detection methods.

Parameters:
  • columns (list): Column names to compare
  • methods (list): Detection methods to compare
Returns:

dict: Method comparison results and agreement statistics

Example:
comparison = cleaner.compare_methods(
    ['income'], 
    methods=['iqr', 'zscore', 'modified_zscore']
)

for method, stats in comparison['income']['method_stats'].items():
    print(f"{method}: {stats['outliers_detected']} outliers")
    
print(f"Summary: {comparison['income']['summary']}")
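One simple way to quantify agreement between two detection methods is the Jaccard overlap of their boolean masks. The sketch below is an assumption about a reasonable metric, not a description of what compare_methods computes; the helper name mask_agreement is hypothetical.

```python
import pandas as pd

def mask_agreement(a: pd.Series, b: pd.Series) -> float:
    """Hypothetical helper: Jaccard overlap between two outlier masks."""
    both = int((a & b).sum())      # flagged by both methods
    either = int((a | b).sum())    # flagged by at least one method
    return 1.0 if either == 0 else both / either

iqr_flags = pd.Series([False, False, False, True, False, False])
z_flags = pd.Series([False, False, True, True, False, False])
agreement = mask_agreement(iqr_flags, z_flags)
print(f"Agreement: {agreement:.2f}")
```

A value of 1.0 means both methods flag exactly the same rows. Low values suggest the column's distribution suits one method better than the other.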

Visualization

plot_outlier_analysis(columns=None, figsize=(15, 5))

Generate comprehensive outlier analysis plots for specified columns.

Parameters:
  • columns (list): Columns to plot (defaults to all numeric)
  • figsize (tuple): Base figure size for each subplot
Returns:

dict: Dictionary of matplotlib figures keyed by column names

Example:
figures = cleaner.plot_outlier_analysis(['income', 'age'])

# Save figures
for column, fig in figures.items():
    fig.savefig(f'{column}_analysis.png', 
                dpi=300, bbox_inches='tight')

# Display the figures (plt.show() takes no figure argument and shows all open figures)
import matplotlib.pyplot as plt
plt.show()

Utility Methods

get_outlier_stats(columns=None, include_indices=False)

Get comprehensive outlier statistics without removing data.

stats = cleaner.get_outlier_stats(
    ['income', 'age'], 
    include_indices=True
)
print(stats)
set_thresholds(**kwargs)

Configure default thresholds for detection methods.

cleaner.set_thresholds(
    iqr_lower_factor=2.0,
    iqr_upper_factor=2.0,
    zscore_threshold=2.5,
    modified_zscore_threshold=4.0
)
clean_columns(columns, method='auto', **kwargs)

Batch process multiple columns with progress tracking.

cleaned_df, info = cleaner.clean_columns(
    columns=['income', 'age'],
    method='auto',
    show_progress=True,
    include_indices=True
)

for col, details in info.items():
    print(f"{col}: {details['outliers_removed']} removed")
get_summary_report()

Generate a comprehensive summary report of all cleaning operations.

report = cleaner.get_summary_report()
print(report)

Utility Functions

Standalone utility functions in the statclean.utils module.

Visualization Functions
  • plot_outliers(data, outliers, title="Outlier Analysis")
  • plot_distribution(data, outliers, title="Distribution Analysis")
  • plot_boxplot(data, outliers, title="Box Plot Analysis")
  • plot_qq(data, outliers, title="Q-Q Plot")
  • plot_outlier_analysis(data, outliers, title="Comprehensive Analysis")
Example:
from statclean.utils import plot_outliers, plot_distribution

outliers = cleaner.detect_outliers_zscore('income')
plot_outliers(df['income'], outliers, 'Income Analysis')
plot_distribution(df['income'], outliers, 'Income Distribution')
Function Parameters
Common Parameters:
  • data (pandas.Series): The data to plot
  • outliers (pandas.Series): Boolean mask of outliers
  • title (str): Plot title
  • figsize (tuple): Figure size
All utility functions create matplotlib figures that can be displayed, saved, or customized further.
Key Properties

Access cleaned data and outlier information:

  • cleaner.clean_df - Cleaned DataFrame after outlier treatment
  • cleaner.outlier_info - Detailed information about removed outliers
  • cleaner.original_df - Original DataFrame (preserved)
  • cleaner.data - Current working DataFrame