Data Cleaning and Model Training: A Practical Guide
Introduction
Real-world data is messy. Missing values, inconsistent formats, duplicates, outliers, and encoding errors are the norm, not the exception. The popular claim that data scientists spend 80% of their time on data preparation is not an exaggeration — it reflects the reality that model performance is bounded by data quality.
This article covers the full pipeline from raw data to trained model:
- Data Quality Assessment — profiling, identifying issues, understanding distributions
- Cleaning Techniques — handling missing values, duplicates, outliers, type conversions
- Feature Engineering — transforming raw columns into model-ready features
- Model Training — scikit-learn workflows, cross-validation, hyperparameter tuning
- Evaluation and Iteration — metrics, error analysis, feedback loops
Section 1: Data Quality Assessment
The Data Cleaning Pipeline
Before cleaning anything, you need to understand what's broken. The assessment phase produces a structured inventory of issues:
Profiling with pandas
Start by loading the data and getting a high-level view:
```python
import pandas as pd
import numpy as np

df = pd.read_csv('data/raw_dataset.csv')

# Shape and memory
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print(f"Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

# Data types
print(df.dtypes)

# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print(pd.DataFrame({'count': missing, 'pct': missing_pct}).query('count > 0'))
```

> [!TIP]
> Always check memory usage upfront. A 50M-row dataset that fits in memory on your laptop may not fit on a shared production server. Use `dtypes` to spot `object` columns that should be `category` or `datetime`.
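As a quick illustration, here is a sketch that converts low-cardinality `object` columns and compares memory before and after (the 50%-unique cutoff is a heuristic assumption, not a rule):

```python
# Convert object columns with few unique values to category and compare memory.
# The 50%-unique cutoff is a rough heuristic for "low cardinality".
before = df.memory_usage(deep=True).sum()
for col in df.select_dtypes(include='object').columns:
    if df[col].nunique() / len(df) < 0.5:
        df[col] = df[col].astype('category')
after = df.memory_usage(deep=True).sum()
print(f"Memory: {before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```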
Statistical Summary
```python
# Numeric columns
df.describe(percentiles=[.01, .05, .25, .5, .75, .95, .99])

# Categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col}: {df[col].nunique()} unique values")
    print(df[col].value_counts().head(10))
```

| Metric | What to look for |
|---|---|
| `count` | Less than total rows = missing data |
| `mean` vs `50%` | Large gap = skewed distribution |
| `std` | Zero = constant column (no signal) |
| `min`/`max` | Impossible values (negative age, future dates) |
| `unique` | High cardinality categorical = potential feature engineering |
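These checks are easy to script. A small sketch that flags two of the common offenders (the `age` column and its bounds are hypothetical, for illustration only):

```python
# Flag columns with no signal (constant) and rows with impossible values.
# 'age' and the 0-120 bounds are hypothetical examples.
constant_cols = [c for c in df.select_dtypes(include='number').columns
                 if df[c].std() == 0]
print(f"Constant columns: {constant_cols}")

if 'age' in df.columns:
    impossible = df[(df['age'] < 0) | (df['age'] > 120)]
    print(f"Rows with impossible ages: {len(impossible)}")
```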
Section 2: Cleaning Techniques
Handling Missing Values
Missing data is the most common issue. The right strategy depends on the mechanism behind the missingness (completely at random, related to other observed columns, or related to the missing value itself):
Drop rows — only safe when missing is random and sparse:
```python
# Drop rows where critical columns are missing
critical_cols = ['id', 'target', 'timestamp']
df = df.dropna(subset=critical_cols)

# Drop columns with more than 50% missing
# (thresh = minimum non-null values a column needs to survive; must be an int)
threshold = int(len(df) * 0.5)
df = df.dropna(axis=1, thresh=threshold)
```

Impute — fill with statistical estimates:
```python
from sklearn.impute import SimpleImputer

# Column lists (derived here so the snippet is self-contained)
num_cols = df.select_dtypes(include='number').columns
cat_cols = df.select_dtypes(include=['object', 'category']).columns

# Numeric: median is robust to outliers
num_imputer = SimpleImputer(strategy='median')
df[num_cols] = num_imputer.fit_transform(df[num_cols])

# Categorical: mode (most frequent)
cat_imputer = SimpleImputer(strategy='most_frequent')
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])
```

> [!WARNING]
> Never impute before splitting train/test. Fit the imputer on training data only, then transform both sets. Otherwise you leak test information into training.
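A minimal sketch of the leak-free ordering, assuming the `X_train`/`X_test` split introduced in Section 4 and the `num_cols` list from above:

```python
from sklearn.impute import SimpleImputer

# Fit on training data only, then apply the same statistics to both sets
imputer = SimpleImputer(strategy='median')
X_train[num_cols] = imputer.fit_transform(X_train[num_cols])
X_test[num_cols] = imputer.transform(X_test[num_cols])  # transform only, no refit
```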
Flag missingness — sometimes the fact that data is missing is itself informative:
```python
# Create indicator columns before imputing
cols_with_missing = df.columns[df.isnull().any()]
for col in cols_with_missing:
    df[f'{col}_was_missing'] = df[col].isnull().astype(int)
```

Removing Duplicates
```python
# Exact duplicates
n_dupes = df.duplicated().sum()
print(f"Exact duplicates: {n_dupes}")
df = df.drop_duplicates()

# Near-duplicates (same name, slight variations)
from rapidfuzz import fuzz
# Check pairs with similarity > 90%
# (expensive — use blocking keys to reduce comparisons; see the sketch below)
```
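A minimal sketch of blocking with `rapidfuzz`, assuming a hypothetical `name` column: group rows by a cheap key (here, the first character) so fuzzy comparison runs within small groups instead of across all pairs:

```python
from itertools import combinations
from rapidfuzz import fuzz

# Hypothetical example: block on the first character of a 'name' column,
# then fuzzy-compare only within each block.
df['_block'] = df['name'].str[:1].str.lower()
near_dupes = []
for _, group in df.groupby('_block'):
    for i, j in combinations(group.index, 2):
        if fuzz.ratio(group.loc[i, 'name'], group.loc[j, 'name']) > 90:
            near_dupes.append((i, j))
print(f"Candidate near-duplicate pairs: {len(near_dupes)}")
df = df.drop(columns='_block')
```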
Handling Outliers
```python
# IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Cap instead of remove (winsorization)
df['salary'] = df['salary'].clip(lower, upper)

# Z-score method (for normally distributed data)
from scipy import stats
z_scores = np.abs(stats.zscore(df['age']))
df = df[z_scores < 3]
```

> [!NOTE]
> Outliers are not always errors. A salary of $500K might be a CEO — legitimate data. Domain knowledge determines whether to cap, remove, or keep extreme values.
Type Conversions
```python
# Datetime parsing
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)

# Categorical downcast (saves memory)
df['status'] = df['status'].astype('category')

# Numeric coercion (invalid values become NaN)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# String cleanup
df['email'] = df['email'].str.strip().str.lower()
df['phone'] = df['phone'].str.replace(r'[^0-9+]', '', regex=True)
```

Section 3: Feature Engineering
Raw cleaned data rarely performs well. Feature engineering transforms columns into representations that machine learning models can learn from effectively.
Numeric Features
```python
# Scaling (required for linear models, neural networks)
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# Log transform for skewed distributions
df['income_log'] = np.log1p(df['income'])

# Binning continuous variables
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 55, 100],
                         labels=['young', 'adult', 'middle', 'senior'])

# Interaction features
df['price_per_sqft'] = df['price'] / df['sqft']
df['rooms_per_person'] = df['rooms'] / df['household_size']
```

Categorical Features
```python
# One-hot encoding (low cardinality)
df = pd.get_dummies(df, columns=['color', 'size'], drop_first=True)

# Target encoding (high cardinality — e.g., zip code)
from sklearn.preprocessing import TargetEncoder
te = TargetEncoder(smooth='auto')
# .ravel() flattens the (n, 1) output for single-column assignment
df['zip_encoded'] = te.fit_transform(df[['zip_code']], df['target']).ravel()

# Frequency encoding
freq = df['category'].value_counts(normalize=True)
df['category_freq'] = df['category'].map(freq)
```

Text Features
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF for text columns
tfidf = TfidfVectorizer(max_features=100, stop_words='english')
text_features = tfidf.fit_transform(df['description'])
text_df = pd.DataFrame(text_features.toarray(),
                       columns=tfidf.get_feature_names_out(),
                       index=df.index)  # align index so concat doesn't misplace rows
df = pd.concat([df, text_df], axis=1)
```

Feature Engineering Pipeline
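There is no single canonical pipeline, but a minimal sketch of how these steps can compose with scikit-learn's `ColumnTransformer` might look like this (reusing the `num_cols`/`cat_cols` lists from Section 2; `OneHotEncoder` stands in for the encoders above):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Each branch imputes then encodes/scales; ColumnTransformer routes columns
preprocess = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), num_cols),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), cat_cols),
])
```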
Section 4: Model Training
Train/Test Split
```python
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

> [!IMPORTANT]
> Always set `random_state` for reproducibility. Use `stratify=y` for classification to preserve class distribution in both splits.
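A quick sanity check that stratification preserved the class balance:

```python
# Class proportions should match closely between the two splits
print(y_train.value_counts(normalize=True).round(3))
print(y_test.value_counts(normalize=True).round(3))
```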
Baseline Model
Start simple. A baseline tells you whether complex models are actually learning:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Baseline: logistic regression
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print(classification_report(y_test, y_pred))
```
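If you want an even cheaper floor, scikit-learn's `DummyClassifier` predicts without learning anything from the features; a quick sketch:

```python
from sklearn.dummy import DummyClassifier

# Predicts the majority class; any real model should beat this
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"Majority-class accuracy: {dummy.score(X_test, y_test):.3f}")
```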
Cross-Validation
Never evaluate on a single train/test split. Cross-validation gives a more reliable estimate:
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(baseline, X_train, y_train,
                         cv=5, scoring='f1_weighted')
print(f"CV F1: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Model Selection
```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    'logistic': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100),
    'gradient_boost': GradientBoostingClassifier(n_estimators=100),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_weighted')
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Hyperparameter Tuning
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    scoring='f1_weighted',
    random_state=42,
    n_jobs=-1
)
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.3f}")
```

> [!TIP]
> Use `RandomizedSearchCV` over `GridSearchCV` for most cases. Grid search is exhaustive but exponentially expensive. Random search explores the space more efficiently — 50 random trials often outperform a full grid on 5 parameters.
Section 5: Evaluation and Error Analysis
Metrics
```python
from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay,
    roc_auc_score, precision_recall_curve
)

# Evaluate the tuned model, not the earlier baseline
y_pred = search.best_estimator_.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()

# ROC-AUC (for probabilistic predictions)
y_proba = search.best_estimator_.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC: {roc_auc:.3f}")
```
Error Analysis
The most productive iteration loop is examining what the model gets wrong:
```python
# Find misclassified samples
mask = y_test != y_pred
errors = X_test[mask].copy()
errors['true_label'] = y_test[mask]
errors['predicted'] = y_pred[mask]

# Look for patterns
print(errors.groupby('true_label').describe())
```

The Full Pipeline
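As a minimal sketch of the end-to-end idea, the preprocessing and the tuned model can be chained into a single object so they travel together (reusing the `preprocess` ColumnTransformer sketched in Section 3; names here are illustrative):

```python
from sklearn.pipeline import Pipeline
import joblib

# One object: preprocessing is fit on training data only, so evaluation stays leak-free
full_pipeline = Pipeline([
    ('preprocess', preprocess),  # ColumnTransformer sketched in Section 3
    ('model', RandomForestClassifier(**search.best_params_, random_state=42)),
])
full_pipeline.fit(X_train, y_train)
print(f"Test score: {full_pipeline.score(X_test, y_test):.3f}")

# Persist the whole pipeline so serving uses identical preprocessing
joblib.dump(full_pipeline, 'model_pipeline.joblib')
```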
Production Checklist
| Step | Action | Why |
|---|---|---|
| 1 | Profile data before cleaning | You can't fix what you don't understand |
| 2 | Split train/test before any transformation | Prevents data leakage |
| 3 | Use pipelines (`sklearn.pipeline.Pipeline`) | Ensures consistent preprocessing |
| 4 | Save preprocessing artifacts | Imputers, scalers, encoders must match training |
| 5 | Log all experiments | Reproducibility requires knowing what you tried |
| 6 | Version your data | Models are only reproducible with the same data |
| 7 | Monitor in production | Data drift degrades model performance over time |
Conclusion
Data cleaning and model training are iterative, not sequential. A clean dataset with poor features will underperform. Great features on dirty data will mislead. The feedback loop between cleaning, engineering, and training is where real ML work happens.
Key Takeaways:
- Profile first — understand what's broken before fixing it
- Prevent leakage — fit transformers on training data only
- Start simple — baseline models reveal whether complexity helps
- Iterate on errors — error analysis drives the next improvement
- Automate pipelines — `sklearn.pipeline.Pipeline` ensures consistency from training to production