What is Data Cleaning?
Data cleaning is the process of identifying and correcting errors, inconsistencies, and other quality problems in your dataset. Real-world data is almost never perfect: it often contains missing values, duplicates, outliers, and formatting issues that can hurt your ML model's performance.
Cleaning your data is one of the most important steps in the ML pipeline. A well-cleaned dataset can dramatically improve model accuracy.
Handling Missing Values
Missing values are one of the most common data quality issues. Let's see different strategies to handle them:
# Sample data with missing values (None)
data = [10, 20, None, 40, None, 60, 70]
print("Original data with missing values:", data)

# Strategy 1: drop the missing entries entirely
data_no_missing = [x for x in data if x is not None]
print("\n1. Remove missing values:", data_no_missing)
print("   Use when: Missing values are few and random")

# Strategy 2: fill with the mean of the observed values
values_only = [x for x in data if x is not None]
mean_value = sum(values_only) / len(values_only)
data_mean_filled = [x if x is not None else mean_value for x in data]
print("\n2. Fill with mean:", data_mean_filled)
print(f"   Mean value: {mean_value:.1f}")
print("   Use when: Numerical data, missing values are random")
# Strategy 3: fill with the median (robust to outliers)
sorted_values = sorted(values_only)
n = len(sorted_values)
mid = n // 2
# Average the two middle values when the count is even
median_value = sorted_values[mid] if n % 2 else (sorted_values[mid - 1] + sorted_values[mid]) / 2
data_median_filled = [x if x is not None else median_value for x in data]
print("\n3. Fill with median:", data_median_filled)
print(f"   Median value: {median_value}")
print("   Use when: Data has outliers")
# Strategy 4: fill categorical data with the mode (most frequent value)
categorical_data = ["red", "blue", None, "red", "green", None, "red"]
observed = [x for x in categorical_data if x is not None]
mode_value = max(set(observed), key=observed.count)
categorical_filled = [x if x is not None else mode_value for x in categorical_data]
print("\n4. Fill with mode (categorical):", categorical_filled)
print(f"   Mode value: {mode_value}")
print("   Use when: Categorical data")
Handling Outliers
Outliers are extreme values that differ significantly from other observations. They can skew your model:
data = [10, 12, 11, 13, 12, 11, 14, 200, 12, 13]
print("Original data:", data)
print("Notice: 200 is much larger than other values (outlier)")

# IQR method: compute quartiles on the full dataset
# (simple index-based approximation of Q1 and Q3)
sorted_data = sorted(data)
q1 = sorted_data[len(sorted_data) // 4]
q3 = sorted_data[3 * len(sorted_data) // 4]
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print("\nIQR Method:")
print(f"   Q1: {q1}, Q3: {q3}, IQR: {iqr}")
print(f"   Lower bound: {lower_bound:.1f}, Upper bound: {upper_bound:.1f}")
# Flag anything outside the IQR fences
outliers = [x for x in data if x < lower_bound or x > upper_bound]
print("   Outliers detected:", outliers)

# Option 1: drop the outliers
data_no_outliers = [x for x in data if lower_bound <= x <= upper_bound]
print("\nData without outliers:", data_no_outliers)

# Option 2: cap (winsorize) the outliers at the fence values
data_capped = [min(max(x, lower_bound), upper_bound) for x in data]
print("\nData with capped outliers:", data_capped)
print("   Outliers are set to boundary values")
Removing Duplicates
Duplicate records can bias your model. Let's see how to identify and remove them:
data = [
    {"name": "Alice", "age": 25, "city": "NYC"},
    {"name": "Bob", "age": 30, "city": "LA"},
    {"name": "Alice", "age": 25, "city": "NYC"},
    {"name": "Charlie", "age": 35, "city": "Chicago"},
    {"name": "Bob", "age": 30, "city": "LA"},
]
print("Original data:", len(data), "records")

# Dicts aren't hashable, so convert each record to a tuple of items
# and track which tuples have already been seen
seen = set()
unique_data = []
for record in data:
    record_tuple = tuple(record.items())
    if record_tuple not in seen:
        seen.add(record_tuple)
        unique_data.append(record)

print("After removing duplicates:", len(unique_data), "records")
print("Removed:", len(data) - len(unique_data), "duplicate records")
print("\nUnique records:")
for i, record in enumerate(unique_data, 1):
    print(f"   {i}. {record}")
Normalization and Standardization
Different features often live on very different scales. Putting them on a common scale helps many ML algorithms, especially distance-based and gradient-based ones, perform better:
age = [25, 30, 35, 40, 45]
income = [50000, 75000, 100000, 125000, 150000]
print("Original data (different scales):")
print("   Age:", age, "(range: 25-45)")
print("   Income:", income, "(range: 50k-150k)")
# Min-max normalization rescales values to the [0, 1] interval
# (assumes max_val > min_val; a constant column would divide by zero)
def normalize(data):
    min_val = min(data)
    max_val = max(data)
    return [(x - min_val) / (max_val - min_val) for x in data]

age_normalized = normalize(age)
income_normalized = normalize(income)
print("\nMin-Max Normalization (0-1 scale):")
print("   Age normalized:", [round(x, 2) for x in age_normalized])
print("   Income normalized:", [round(x, 2) for x in income_normalized])
# Standardization rescales to mean 0 and standard deviation 1
# (this uses the population standard deviation, dividing by n)
def standardize(data):
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    std = variance ** 0.5
    return [(x - mean) / std for x in data]

age_standardized = standardize(age)
income_standardized = standardize(income)
print("\nStandardization (mean=0, std=1):")
print("   Age standardized:", [round(x, 2) for x in age_standardized])
print("   Income standardized:", [round(x, 2) for x in income_standardized])
print("\nWhy normalize/standardize?")
print(" - Features on same scale")
print(" - Prevents one feature from dominating")
print(" - Improves model performance")
Data Cleaning Checklist
Follow this checklist when cleaning your data:
- Check for missing values: Identify and handle them appropriately
- Detect outliers: Decide whether to remove, cap, or keep them
- Remove duplicates: Eliminate exact duplicate records
- Fix inconsistencies: Standardize formats (dates, text, etc.); see the sketch after this checklist
- Normalize scales: Ensure features are on similar scales
- Validate data types: Ensure each column has the correct type
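The format-standardization and type-validation items are the only ones not demonstrated earlier. A minimal sketch of both, using hypothetical raw values chosen purely for illustration:

# Fix inconsistencies: strip whitespace, unify casing, map known aliases
raw_cities = ["  NYC", "nyc", "New York City", "LA ", "la"]
alias = {"nyc": "NYC", "new york city": "NYC", "la": "LA"}
cities = [alias.get(c.strip().lower(), c.strip()) for c in raw_cities]
print("Standardized cities:", cities)

# Validate data types: coerce numeric strings, mark unparseable values as missing
raw_ages = ["25", "30", "unknown", "42"]
ages = []
for value in raw_ages:
    try:
        ages.append(int(value))
    except ValueError:
        ages.append(None)   # now handled by the missing-value strategies above
print("Validated ages:", ages)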
Remember: Clean data leads to better models!
Exercise: Clean a Dataset
Complete the exercise by working through these tasks:
- Task 1: Fill missing values in the 'age' column with the mean age
- Task 2: Remove duplicate records from the dataset
- Task 3: Normalize the 'income' column using min-max normalization
- Task 4: Print the cleaned dataset
Write your code to clean the dataset step by step!
💡 Learning Tip
Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!