Data Cleaning and Model Training: A Practical Guide
Introduction
Real-world data is messy. Missing values, inconsistent formats, duplicates, outliers, and encoding errors are the norm, not the exception. The popular claim that data scientists spend 80% of their time on data preparation is not an exaggeration — it reflects the reality that model performance is bounded by data quality.
This article covers the full pipeline from raw data to trained model:
- Data Quality Assessment — profiling, identifying issues, understanding distributions
- Cleaning Techniques — handling missing values, duplicates, outliers, type conversions
- Feature Engineering — transforming raw columns into model-ready features
- Model Training — scikit-learn workflows, cross-validation, hyperparameter tuning
- Evaluation and Iteration — metrics, error analysis, feedback loops
Section 1: Data Quality Assessment
The Data Cleaning Pipeline
Before cleaning anything, you need to understand what's broken. The assessment phase produces a structured inventory of issues:
Profiling with pandas
Start by loading the data and getting a high-level view:
```python
import pandas as pd
import numpy as np

df = pd.read_csv('data/raw_dataset.csv')

# Shape and memory
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print(f"Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

# Data types
print(df.dtypes)

# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print(pd.DataFrame({'count': missing, 'pct': missing_pct}).query('count > 0'))
```

> [!TIP]
> Always check memory usage upfront. A 50M-row dataset that fits in memory on your laptop may not fit on a shared production server. Use `dtypes` to spot `object` columns that should be `category` or `datetime`.
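As a quick illustration, here is a sketch that converts low-cardinality `object` columns and compares memory before and after (the 50%-unique cutoff is a heuristic assumption, not a rule):

```python
# Convert object columns with few unique values to category and compare memory.
# The 50%-unique cutoff is a rough heuristic for "low cardinality".
before = df.memory_usage(deep=True).sum()
for col in df.select_dtypes(include='object').columns:
    if df[col].nunique() / len(df) < 0.5:
        df[col] = df[col].astype('category')
after = df.memory_usage(deep=True).sum()
print(f"Memory: {before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```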
Statistical Summary
```python
# Numeric columns
df.describe(percentiles=[.01, .05, .25, .5, .75, .95, .99])

# Categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(10))
```

| Metric | What to look for |
|---|---|
| `count` | Less than total rows = missing data |
| `mean` vs `50%` | Large gap = skewed distribution |
| `std` | Zero = constant column (no signal) |
| `min`/`max` | Impossible values (negative age, future dates) |
| `unique` | High cardinality categorical = potential feature engineering |
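These checks are easy to script. A small sketch that flags two of the common offenders (the `age` column and its bounds are hypothetical, for illustration only):

```python
# Flag columns with no signal (constant) and rows with impossible values.
# 'age' and the 0-120 bounds are hypothetical examples.
constant_cols = [c for c in df.select_dtypes(include='number').columns
                 if df[c].std() == 0]
print(f"Constant columns: {constant_cols}")

if 'age' in df.columns:
    impossible = df[(df['age'] < 0) | (df['age'] > 120)]
    print(f"Rows with impossible ages: {len(impossible)}")
```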
Section 2: Cleaning Techniques
Handling Missing Values
Missing data is the most common issue. The right strategy depends on the mechanism behind the missingness (completely at random, related to other observed columns, or related to the missing value itself):
Drop rows — only safe when missing is random and sparse:
```python
# Drop rows where critical columns are missing
critical_cols = ['id', 'target', 'timestamp']
df = df.dropna(subset=critical_cols)

# Drop columns with more than 50% missing
# (thresh = minimum non-null values a column needs to survive; must be an int)
threshold = int(len(df) * 0.5)
df = df.dropna(axis=1, thresh=threshold)
```

Impute — fill with statistical estimates:
```python
from sklearn.impute import SimpleImputer

# Column lists (derived here so the snippet is self-contained)
num_cols = df.select_dtypes(include='number').columns
cat_cols = df.select_dtypes(include=['object', 'category']).columns

# Numeric: median is robust to outliers
num_imputer = SimpleImputer(strategy='median')
df[num_cols] = num_imputer.fit_transform(df[num_cols])

# Categorical: mode (most frequent)
cat_imputer = SimpleImputer(strategy='most_frequent')
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])
```

> [!WARNING]
> Never impute before splitting train/test. Fit the imputer on training data only, then transform both sets. Otherwise you leak test information into training.
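A minimal sketch of the leak-free ordering, assuming the `X_train`/`X_test` split introduced in Section 4 and the `num_cols` list from above:

```python
from sklearn.impute import SimpleImputer

# Fit on training data only, then apply the same statistics to both sets
imputer = SimpleImputer(strategy='median')
X_train[num_cols] = imputer.fit_transform(X_train[num_cols])
X_test[num_cols] = imputer.transform(X_test[num_cols])  # transform only, no refit
```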
Flag missingness — sometimes the fact that data is missing is itself informative:
```python
# Create indicator columns before imputing
cols_with_missing = df.columns[df.isnull().any()]
for col in cols_with_missing:
    df[f'{col}_was_missing'] = df[col].isnull().astype(int)
```

Removing Duplicates
```python
# Exact duplicates
n_dupes = df.duplicated().sum()
print(f"Exact duplicates: {n_dupes}")
df = df.drop_duplicates()

# Near-duplicates (same name, slight variations)
from rapidfuzz import fuzz
# Check pairs with similarity > 90%
# (expensive — use blocking keys to reduce comparisons; see the sketch below)
```
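A minimal sketch of blocking with `rapidfuzz`, assuming a hypothetical `name` column: group rows by a cheap key (here, the first character) so fuzzy comparison runs within small groups instead of across all pairs:

```python
from itertools import combinations
from rapidfuzz import fuzz

# Hypothetical example: block on the first character of a 'name' column,
# then fuzzy-compare only within each block.
df['_block'] = df['name'].str[:1].str.lower()
near_dupes = []
for _, group in df.groupby('_block'):
    for i, j in combinations(group.index, 2):
        if fuzz.ratio(group.loc[i, 'name'], group.loc[j, 'name']) > 90:
            near_dupes.append((i, j))
print(f"Candidate near-duplicate pairs: {len(near_dupes)}")
df = df.drop(columns='_block')
```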
Handling Outliers
```python
# IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Cap instead of remove (winsorization)
df['salary'] = df['salary'].clip(lower, upper)

# Z-score method (for normally distributed data)
from scipy import stats
z_scores = np.abs(stats.zscore(df['age']))
df = df[z_scores < 3]
```

> [!NOTE]
> Outliers are not always errors. A salary of $500K might be a CEO — legitimate data. Domain knowledge determines whether to cap, remove, or keep extreme values.
Type Conversions
```python
# Datetime parsing
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)

# Categorical downcast (saves memory)
df['status'] = df['status'].astype('category')

# Numeric coercion (invalid values become NaN)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# String cleanup
df['email'] = df['email'].str.strip().str.lower()
df['phone'] = df['phone'].str.replace(r'[^0-9+]', '', regex=True)
```

Section 3: Feature Engineering
Raw cleaned data rarely performs well. Feature engineering transforms columns into representations that machine learning models can learn from effectively.
Numeric Features
```python
# Scaling (required for linear models, neural networks)
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# Log transform for skewed distributions
df['income_log'] = np.log1p(df['income'])

# Binning continuous variables
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 55, 100],
                         labels=['young', 'adult', 'middle', 'senior'])

# Interaction features
df['price_per_sqft'] = df['price'] / df['sqft']
df['rooms_per_person'] = df['rooms'] / df['household_size']
```

Categorical Features
```python
# One-hot encoding (low cardinality)
df = pd.get_dummies(df, columns=['color', 'size'], drop_first=True)

# Target encoding (high cardinality — e.g., zip code)
from sklearn.preprocessing import TargetEncoder
te = TargetEncoder(smooth='auto')
# .ravel() flattens the (n, 1) output for single-column assignment
df['zip_encoded'] = te.fit_transform(df[['zip_code']], df['target']).ravel()

# Frequency encoding
freq = df['category'].value_counts(normalize=True)
df['category_freq'] = df['category'].map(freq)
```

Text Features
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF for text columns
tfidf = TfidfVectorizer(max_features=100, stop_words='english')
text_features = tfidf.fit_transform(df['description'])
text_df = pd.DataFrame(text_features.toarray(),
                       columns=tfidf.get_feature_names_out(),
                       index=df.index)  # align index so concat doesn't misplace rows
df = pd.concat([df, text_df], axis=1)
```

Feature Engineering Pipeline
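There is no single canonical pipeline, but a minimal sketch of how these steps can compose with scikit-learn's `ColumnTransformer` might look like this (reusing the `num_cols`/`cat_cols` lists from Section 2; `OneHotEncoder` stands in for the encoders above):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Each branch imputes then encodes/scales; ColumnTransformer routes columns
preprocess = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), num_cols),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), cat_cols),
])
```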
Section 4: Model Training
Train/Test Split
```python
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

> [!IMPORTANT]
> Always set `random_state` for reproducibility. Use `stratify=y` for classification to preserve class distribution in both splits.
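A quick sanity check that stratification preserved the class balance:

```python
# Class proportions should match closely between the two splits
print(y_train.value_counts(normalize=True).round(3))
print(y_test.value_counts(normalize=True).round(3))
```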
Baseline Model
Start simple. A baseline tells you whether complex models are actually learning:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Baseline: logistic regression
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print(classification_report(y_test, y_pred))
```
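If you want an even cheaper floor, scikit-learn's `DummyClassifier` predicts without learning anything from the features; a quick sketch:

```python
from sklearn.dummy import DummyClassifier

# Predicts the majority class; any real model should beat this
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"Majority-class accuracy: {dummy.score(X_test, y_test):.3f}")
```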
Cross-Validation
Never evaluate on a single train/test split. Cross-validation gives a more reliable estimate:
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(baseline, X_train, y_train,
                         cv=5, scoring='f1_weighted')
print(f"CV F1: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Model Selection
```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    'logistic': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100),
    'gradient_boost': GradientBoostingClassifier(n_estimators=100),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_weighted')
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Hyperparameter Tuning
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    scoring='f1_weighted',
    random_state=42,
    n_jobs=-1
)
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.3f}")
```

> [!TIP]
> Use `RandomizedSearchCV` over `GridSearchCV` for most cases. Grid search is exhaustive but exponentially expensive. Random search explores the space more efficiently — 50 random trials often outperform a full grid on 5 parameters.
Section 5: Evaluation and Error Analysis
Metrics
```python
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay,
    roc_auc_score, precision_recall_curve
)

# Evaluate the tuned model, not the earlier baseline
y_pred = search.best_estimator_.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()

# ROC-AUC (for probabilistic predictions)
y_proba = search.best_estimator_.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC: {roc_auc:.3f}")
```
Error Analysis
The most productive iteration loop is examining what the model gets wrong:
```python
# Find misclassified samples
mask = y_test != y_pred
errors = X_test[mask].copy()
errors['true_label'] = y_test[mask]
errors['predicted'] = y_pred[mask]

# Look for patterns
print(errors.groupby('true_label').describe())
```

The Full Pipeline
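As a minimal sketch of the end-to-end idea, the preprocessing and the tuned model can be chained into a single object so they travel together (reusing the `preprocess` ColumnTransformer sketched in Section 3; names here are illustrative):

```python
from sklearn.pipeline import Pipeline
import joblib

# One object: preprocessing is fit on training data only, so evaluation stays leak-free
full_pipeline = Pipeline([
    ('preprocess', preprocess),  # ColumnTransformer sketched in Section 3
    ('model', RandomForestClassifier(**search.best_params_, random_state=42)),
])
full_pipeline.fit(X_train, y_train)
print(f"Test score: {full_pipeline.score(X_test, y_test):.3f}")

# Persist the whole pipeline so serving uses identical preprocessing
joblib.dump(full_pipeline, 'model_pipeline.joblib')
```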
Production Checklist
| Step | Action | Why |
|---|---|---|
| 1 | Profile data before cleaning | You can't fix what you don't understand |
| 2 | Split train/test before any transformation | Prevents data leakage |
| 3 | Use pipelines (`sklearn.pipeline.Pipeline`) | Ensures consistent preprocessing |
| 4 | Save preprocessing artifacts | Imputers, scalers, encoders must match training |
| 5 | Log all experiments | Reproducibility requires knowing what you tried |
| 6 | Version your data | Models are only reproducible with the same data |
| 7 | Monitor in production | Data drift degrades model performance over time |
Conclusion
Data cleaning and model training are iterative, not sequential. A clean dataset with poor features will underperform. Great features on dirty data will mislead. The feedback loop between cleaning, engineering, and training is where real ML work happens.
Key Takeaways:
- Profile first — understand what's broken before fixing it
- Prevent leakage — fit transformers on training data only
- Start simple — baseline models reveal whether complexity helps
- Iterate on errors — error analysis drives the next improvement
- Automate pipelines — `sklearn.pipeline.Pipeline` ensures consistency from training to production