Complete reference documentation for all StatClean classes, methods, and functions.
StatClean(df=None, preserve_index=True)
| Parameter | Type | Default | Description |
|---|---|---|---|
| df | pandas.DataFrame | None | The DataFrame to clean |
| preserve_index | bool | True | Whether to preserve the original index |
import pandas as pd
from statclean import StatClean
df = pd.DataFrame({'values': [1, 2, 3, 100, 4, 5]})
cleaner = StatClean(df, preserve_index=True)
Methods for detecting outliers without removing them from the dataset.
detect_outliers_iqr(column, lower_factor=1.5, upper_factor=1.5)
Detect outliers using the Interquartile Range (IQR) method.
Parameters:
- column (str): Column name to analyze
- lower_factor (float): Lower bound multiplier for IQR
- upper_factor (float): Upper bound multiplier for IQR
Returns:
- pandas.Series: Boolean mask indicating outliers
# Standard IQR
outliers = cleaner.detect_outliers_iqr('values')
# Conservative approach
outliers = cleaner.detect_outliers_iqr(
    'values',
    lower_factor=2.0,
    upper_factor=2.0
)
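For intuition, the IQR rule can be sketched with only the standard library. This is a minimal illustration, not StatClean's implementation: the helper name `iqr_outlier_mask` is hypothetical, it returns a plain list instead of a pandas.Series, and the quartile interpolation (`method="inclusive"`) is an assumption that may differ from what the library uses.

```python
import statistics


def iqr_outlier_mask(values, lower_factor=1.5, upper_factor=1.5):
    """Flag values outside [Q1 - lower_factor*IQR, Q3 + upper_factor*IQR]."""
    # Assumption: quartiles use linear interpolation between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - lower_factor * iqr
    upper = q3 + upper_factor * iqr
    return [v < lower or v > upper for v in values]


values = [1, 2, 3, 100, 4, 5]
mask = iqr_outlier_mask(values)  # fences are -1.5 and 8.5, so only 100 is flagged
```

Raising both factors to 2.0 widens the fences to -2.75 and 9.75, which still leaves 100 outside.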
detect_outliers_zscore(column, threshold=3.0)
Detect outliers using the Z-score method.
Parameters:
- column (str): Column name to analyze
- threshold (float): Z-score threshold for outlier detection
Returns:
- pandas.Series: Boolean mask indicating outliers
# Standard threshold
outliers = cleaner.detect_outliers_zscore('values')
# More sensitive
outliers = cleaner.detect_outliers_zscore(
    'values',
    threshold=2.5
)
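A standard-library sketch of the Z-score rule follows. The helper `zscore_outlier_mask` is hypothetical, and the use of the population standard deviation is an assumption; the library may use the sample form instead.

```python
import statistics


def zscore_outlier_mask(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation (divisor n)
    std = statistics.pstdev(values)
    if std == 0:
        return [False] * len(values)
    return [abs(v - mean) / std > threshold for v in values]


values = [1, 2, 3, 100, 4, 5]
strict = zscore_outlier_mask(values)                 # default threshold 3.0
loose = zscore_outlier_mask(values, threshold=2.0)   # lower threshold
```

In this six-value sample the z-score of 100 is only about 2.23, because the extreme value inflates the mean and standard deviation it is measured against. Neither 3.0 nor 2.5 flags it, while 2.0 does. For very small samples, the IQR or modified Z-score rules are usually more informative.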
detect_outliers_modified_zscore(column, threshold=3.5)
Detect outliers using the Modified Z-score (MAD-based) method.
Parameters:
- column (str): Column name to analyze
- threshold (float): Modified Z-score threshold
Returns:
- pandas.Series: Boolean mask indicating outliers
# Standard MAD threshold
outliers = cleaner.detect_outliers_modified_zscore('values')
# More conservative
outliers = cleaner.detect_outliers_modified_zscore(
    'values',
    threshold=4.0
)
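The modified Z-score replaces the mean and standard deviation with the median and the median absolute deviation (MAD), scaled by the conventional constant 0.6745. The sketch below assumes that formulation; `modified_zscore_mask` is a hypothetical helper, not the library's code.

```python
import statistics


def modified_zscore_mask(values, threshold=3.5):
    """Flag values whose MAD-based modified z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # No spread around the median: the score is undefined, so flag nothing
        return [False] * len(values)
    return [abs(0.6745 * (v - median) / mad) > threshold for v in values]


values = [1, 2, 3, 100, 4, 5]
mask = modified_zscore_mask(values)  # median 3.5, MAD 1.5; only 100 is flagged
```

Because the median and MAD barely move when one value is extreme, 100 scores about 43 here and is flagged at both the 3.5 and 4.0 thresholds.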
detect_outliers_mahalanobis(columns, chi2_threshold=None, use_shrinkage=False)
Detect multivariate outliers using Mahalanobis distance.
Parameters:
- columns (list): List of column names for multivariate analysis
- chi2_threshold (float): Chi-square threshold or percentile in (0, 1]
- use_shrinkage (bool): Use Ledoit–Wolf shrinkage covariance
Returns:
- pandas.Series: Boolean mask indicating outliers
# 97.5th percentile (default)
outliers = cleaner.detect_outliers_mahalanobis(
    ['var1', 'var2']
)
# 99th percentile with shrinkage
outliers = cleaner.detect_outliers_mahalanobis(
    ['var1', 'var2'],
    chi2_threshold=0.99,
    use_shrinkage=True
)
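To show what the percentile threshold means, here is a rough two-column sketch in plain Python. It is an illustration under stated assumptions, not the library's implementation: the helpers are hypothetical, the covariance is the ordinary sample covariance (no Ledoit–Wolf shrinkage), and the chi-square cutoff uses the closed form that holds only for two degrees of freedom and percentiles below 1.

```python
import math


def squared_mahalanobis_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sample covariance entries (divisor n - 1)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of the 2x2 covariance matrix applied to each deviation
    return [
        (syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my) + sxx * (y - my) ** 2) / det
        for x, y in zip(xs, ys)
    ]


def chi2_cutoff_2df(percentile):
    """Chi-square quantile for 2 degrees of freedom (valid for two columns only)."""
    return -2.0 * math.log(1.0 - percentile)


# Toy data: twelve points on the corners of a square plus one distant point
xs = [1, 1, -1, -1] * 3 + [0]
ys = [1, -1, 1, -1] * 3 + [10]
d2 = squared_mahalanobis_2d(xs, ys)
cutoff = chi2_cutoff_2df(0.975)   # about 7.38
mask = [d > cutoff for d in d2]   # only the distant point exceeds the cutoff
```

For more than two columns the cutoff needs a general chi-square quantile function, which the standard library does not provide.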
Methods for removing or treating detected outliers. All methods return self to enable method chaining; access results via cleaner.clean_df and cleaner.outlier_info.
- remove_outliers_iqr(column, ...)
- remove_outliers_zscore(column, threshold=3.0)
- remove_outliers_modified_zscore(column, threshold=3.5)
- remove_outliers_mahalanobis(columns, ...)

cleaner.remove_outliers_zscore('values')
cleaned_df = cleaner.clean_df
- winsorize_outliers_iqr(column, ...)
- winsorize_outliers_zscore(column, threshold=3.0)
- winsorize_outliers_percentile(column, lower=5, upper=95)

cleaner.winsorize_outliers_iqr('values')
winsorized_df = cleaner.clean_df
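Winsorizing keeps every row but pulls extreme values in to a boundary. The sketch below assumes the boundary is the IQR fence, which may not match how the library caps values; `winsorize_iqr` is a hypothetical helper.

```python
import statistics


def winsorize_iqr(values, lower_factor=1.5, upper_factor=1.5):
    """Clip values to the IQR fences instead of dropping them."""
    # Assumption: quartiles use linear interpolation between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - lower_factor * iqr
    upper = q3 + upper_factor * iqr
    return [min(max(v, lower), upper) for v in values]


values = [1, 2, 3, 100, 4, 5]
capped = winsorize_iqr(values)  # 100 is pulled in to the upper fence, 8.5
```

Unlike removal, the result has the same length as the input, which matters when the row count or index must be preserved.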
Formal statistical tests for outlier detection with p-values and significance testing.
grubbs_test(column, alpha=0.05, two_sided=True)
Perform Grubbs' test for outliers with statistical significance.
Parameters:
- column (str): Column name to test
- alpha (float): Significance level
- two_sided (bool): Whether to perform a two-sided test
Returns:
- dict: Test results including statistic, p_value, critical_value, is_outlier, outlier_value, outlier_index
result = cleaner.grubbs_test('values', alpha=0.05)
print(f"P-value: {result['p_value']:.6f}")
print(f"Outlier: {result['is_outlier']}")
print(f"Value: {result['outlier_value']}")
print(f"Index: {result['outlier_index']}")
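The Grubbs statistic itself is simple: the largest absolute deviation from the mean divided by the sample standard deviation. The sketch below computes only that statistic with a hypothetical helper. The critical value and p-value depend on Student's t quantiles, which the standard library does not provide, so they are left out.

```python
import math
import statistics


def grubbs_statistic(values):
    """Return (G, index) for the value farthest from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divisor n - 1)
    if sd == 0:
        raise ValueError("all values are identical")
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[idx] - mean) / sd, idx


values = [1, 2, 3, 100, 4, 5]
g, suspect_index = grubbs_statistic(values)
# For a sample of size n, G can never exceed (n - 1) / sqrt(n)
g_max = (len(values) - 1) / math.sqrt(len(values))
```

Here G is about 2.04, very close to its ceiling of about 2.041 for six observations, which shows how tightly small samples bound the statistic.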
dixon_q_test(column, alpha=0.05)
Perform Dixon's Q-test for small samples (n < 30).
Parameters:
- column (str): Column name to test
- alpha (float): Significance level
Returns:
- dict: Test results including statistic, critical_value, p_value, is_outlier
result = cleaner.dixon_q_test('values', alpha=0.05)
print(f"Q-statistic: {result['statistic']:.3f}")
print(f"Critical: {result['critical_value']:.3f}")
print(f"P-value: {result['p_value']:.6f}")
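In its simplest form, the Q statistic is the gap between the suspect value and its nearest neighbour divided by the full range. The sketch below assumes that basic gap-over-range ratio; other variants of the ratio exist for larger samples, and the library may choose among them. `dixon_q` is a hypothetical helper, and the comparison against a tabulated critical value is omitted.

```python
def dixon_q(values):
    """Return (Q, suspect) using the gap/range ratio at the more extreme end."""
    data = sorted(values)
    if len(data) < 3:
        raise ValueError("need at least three observations")
    spread = data[-1] - data[0]
    if spread == 0:
        raise ValueError("all values are identical")
    q_low = (data[1] - data[0]) / spread      # gap at the low end
    q_high = (data[-1] - data[-2]) / spread   # gap at the high end
    if q_high >= q_low:
        return q_high, data[-1]
    return q_low, data[0]


q, suspect = dixon_q([1, 2, 3, 100, 4, 5])  # Q = 95/99, suspect value 100
```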
Methods for transforming data to improve normality and reduce skewness.
transform_boxcox(column, lambda_param=None)
Apply Box-Cox transformation with automatic lambda estimation.
Parameters:
- column (str): Column name to transform
- lambda_param (float): Transformation parameter (auto if None)
Returns:
- tuple: (self, transformation_info_dict)
_, info = cleaner.transform_boxcox('income')
print(f"Lambda: {info['lambda']:.3f}")
print(f"Skewness before: {info['skewness_before']:.3f}")
print(f"Skewness after: {info['skewness_after']:.3f}")
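For a fixed lambda, the Box-Cox transform is a one-line formula: (x^lambda - 1) / lambda, with the natural log as the lambda = 0 case. The sketch below shows only that formula with a hypothetical `boxcox` helper; automatic lambda estimation, which the method performs when lambda_param is None, is not reproduced here.

```python
import math


def boxcox(values, lam):
    """Box-Cox transform for a given lambda; input must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


sqrt_like = boxcox([1, 4, 9], 0.5)     # [0.0, 2.0, 4.0]
log_like = boxcox([1, math.e], 0)      # natural log case
```

The positivity requirement is why columns containing zeros or negative values need a shift or a different transformation first.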
recommend_transformation(column)
Automatically recommend the best transformation based on the distribution.
Parameters:
- column (str): Column name to analyze
Returns:
- dict: Recommendations including best transformation and improvement metrics
rec = cleaner.recommend_transformation('income')
print(f"Method: {rec['recommended_method']}")
print(f"Improvement: {rec['expected_improvement']:.3f}")
- transform_log(column, base='natural') - Logarithmic transformation (natural, base 10, base 2)
- transform_sqrt(column) - Square root transformation

Methods for analyzing data distribution and comparing detection methods.
analyze_distribution(column)
Comprehensive distribution analysis with statistical tests and method recommendations.
Parameters:
- column (str): Column name to analyze
Returns:
- dict: Distribution analysis including skewness, kurtosis, normality test, recommended method
analysis = cleaner.analyze_distribution('income')
print(f"Skewness: {analysis['skewness']:.3f}")
print(f"Kurtosis: {analysis['kurtosis']:.3f}")
print(f"Normal p-value: {analysis['normality_test']['p_value']:.6f}")
print(f"Recommended: {analysis['recommended_method']}")
print(f"Reason: {analysis['recommendation_reason']}")
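The skewness and kurtosis figures can be understood through central moments. The sketch below uses the uncorrected (population) moment formulas and reports excess kurtosis; whether the library applies bias corrections or reports raw kurtosis is not stated here, so treat both choices as assumptions. The helper name is hypothetical.

```python
import statistics


def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no bias correction)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


sym_skew, sym_kurt = skewness_and_excess_kurtosis([1, 2, 3, 4, 5])   # symmetric sample
skew, kurt = skewness_and_excess_kurtosis([1, 2, 3, 100, 4, 5])      # one large value
```

The symmetric sample has zero skewness, while the single large value in the second sample produces strongly positive skewness, the kind of signal that would favour a robust detection method.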
compare_methods(columns, methods=['iqr', 'zscore', 'modified_zscore'])
Compare agreement between different detection methods.
Parameters:
- columns (list): Column names to compare
- methods (list): Detection methods to compare
Returns:
- dict: Method comparison results and agreement statistics
comparison = cleaner.compare_methods(
    ['income'],
    methods=['iqr', 'zscore', 'modified_zscore']
)
for method, stats in comparison['income']['method_stats'].items():
    print(f"{method}: {stats['outliers_detected']} outliers")
print(f"Summary: {comparison['income']['summary']}")
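One simple way to think about agreement between methods is the fraction of rows on which two boolean masks match. The sketch below uses hard-coded example masks for a six-row column; in practice the masks would come from the detect_outliers_* methods. The agreement measure and the variable names are illustrative assumptions and do not reproduce the structure of the dictionary returned by compare_methods.

```python
from itertools import combinations

# Illustrative masks for six rows (hard-coded for the example)
masks = {
    'iqr': [False, False, False, True, False, False],
    'zscore': [False, False, False, False, False, False],
    'modified_zscore': [False, False, False, True, False, False],
}

# Number of rows each method flags
counts = {name: sum(mask) for name, mask in masks.items()}

# Pairwise agreement: share of rows where both methods give the same answer
agreement = {}
for a, b in combinations(masks, 2):
    same = sum(x == y for x, y in zip(masks[a], masks[b]))
    agreement[(a, b)] = same / len(masks[a])
```

High agreement on a column dominated by non-outliers is easy to achieve, so the per-method counts are worth reading alongside the agreement rate.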
plot_outlier_analysis(columns=None, figsize=(15, 5))
Generate comprehensive outlier analysis plots for specified columns.
Parameters:
- columns (list): Columns to plot (defaults to all numeric)
- figsize (tuple): Base figure size for each subplot
Returns:
- dict: Dictionary of matplotlib figures keyed by column names
figures = cleaner.plot_outlier_analysis(['income', 'age'])
# Save figures
for column, fig in figures.items():
    fig.savefig(f'{column}_analysis.png',
                dpi=300, bbox_inches='tight')
# Display the figures (plt.show() takes no figure argument and shows all open figures)
import matplotlib.pyplot as plt
plt.show()
get_outlier_stats(columns=None, include_indices=False)
Get comprehensive outlier statistics without removing data.
stats = cleaner.get_outlier_stats(
    ['income', 'age'],
    include_indices=True
)
print(stats)
set_thresholds(**kwargs)
Configure default thresholds for detection methods.
cleaner.set_thresholds(
    iqr_lower_factor=2.0,
    iqr_upper_factor=2.0,
    zscore_threshold=2.5,
    modified_zscore_threshold=4.0
)
clean_columns(columns, method='auto', **kwargs)
Batch process multiple columns with progress tracking.
cleaned_df, info = cleaner.clean_columns(
    columns=['income', 'age'],
    method='auto',
    show_progress=True,
    include_indices=True
)
for col, details in info.items():
    print(f"{col}: {details['outliers_removed']} removed")
get_summary_report()
Generate a comprehensive summary report of all cleaning operations.
report = cleaner.get_summary_report()
print(report)
Standalone utility functions in the statclean.utils module.
- plot_outliers(data, outliers, title="Outlier Analysis")
- plot_distribution(data, outliers, title="Distribution Analysis")
- plot_boxplot(data, outliers, title="Box Plot Analysis")
- plot_qq(data, outliers, title="Q-Q Plot")
- plot_outlier_analysis(data, outliers, title="Comprehensive Analysis")

from statclean.utils import plot_outliers, plot_distribution
outliers = cleaner.detect_outliers_zscore('income')
plot_outliers(df['income'], outliers, 'Income Analysis')
plot_distribution(df['income'], outliers, 'Income Distribution')
Parameters:
- data (pandas.Series): The data to plot
- outliers (pandas.Series): Boolean mask of outliers
- title (str): Plot title
- figsize (tuple): Figure size

Access cleaned data and outlier information:
- cleaner.clean_df - Cleaned DataFrame after outlier treatment
- cleaner.outlier_info - Detailed information about removed outliers
- cleaner.original_df - Original DataFrame (preserved)
- cleaner.data - Current working DataFrame