Chapter 2: Data Fundamentals / Lesson 8

Data Cleaning

What is Data Cleaning?

Data cleaning is the process of identifying and correcting errors, inconsistencies, and problems in your dataset. Real-world data is almost never perfect: it often contains missing values, duplicates, outliers, and formatting issues that can hurt your ML model's performance.

Cleaning your data is one of the most important steps in the ML pipeline. A well-cleaned dataset can dramatically improve model accuracy.

Handling Missing Values

Missing values are one of the most common data quality issues. Let's see different strategies to handle them:

missing_values.py
# Handling missing values in data

# Example dataset with missing values
data = [10, 20, None, 40, None, 60, 70]
print("Original data with missing values:", data)

# Strategy 1: Remove rows with missing values
data_no_missing = [x for x in data if x is not None]
print("\n1. Remove missing values:", data_no_missing)
print("   Use when: Missing values are few and random")

# Strategy 2: Fill with mean value
values_only = [x for x in data if x is not None]
mean_value = sum(values_only) / len(values_only)
data_mean_filled = [x if x is not None else mean_value for x in data]
print("\n2. Fill with mean:", data_mean_filled)
print(f"   Mean value: {mean_value:.1f}")
print("   Use when: Numerical data, missing values are random")

# Strategy 3: Fill with median (more robust to outliers)
# Note: this simple index lookup gives the true median only for
# odd-length lists, which is the case here
sorted_values = sorted(values_only)
median_value = sorted_values[len(sorted_values) // 2]
data_median_filled = [x if x is not None else median_value for x in data]
print("\n3. Fill with median:", data_median_filled)
print(f"   Median value: {median_value}")
print("   Use when: Data has outliers")

# Strategy 4: Fill with mode (for categorical data)
categorical_data = ["red", "blue", None, "red", "green", None, "red"]
categories = [x for x in categorical_data if x is not None]
mode_value = max(set(categories), key=categories.count)
categorical_filled = [x if x is not None else mode_value for x in categorical_data]
print("\n4. Fill with mode (categorical):", categorical_filled)
print(f"   Mode value: {mode_value}")
print("   Use when: Categorical data")
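
In practice you would usually apply these strategies with a library such as pandas rather than writing the loops by hand. Below is a minimal sketch of rough equivalents for the four strategies, assuming pandas is installed; the DataFrame, its column names, and the filename are made up for illustration.

pandas_missing.py
# Illustrative pandas equivalents of the four fill strategies
import pandas as pd

df = pd.DataFrame({
    "age": [10, 20, None, 40, None, 60, 70],
    "color": ["red", "blue", None, "red", "green", None, "red"],
})

df_dropped = df.dropna()                                      # Strategy 1: drop rows
df["age_mean"] = df["age"].fillna(df["age"].mean())           # Strategy 2: mean
df["age_median"] = df["age"].fillna(df["age"].median())       # Strategy 3: median
df["color_mode"] = df["color"].fillna(df["color"].mode()[0])  # Strategy 4: mode

print(df)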

Handling Outliers

Outliers are extreme values that differ significantly from other observations. They can skew your model:

outliers.py
# Detecting and handling outliers

# Data with an outlier
data = [10, 12, 11, 13, 12, 11, 14, 200, 12, 13]
print("Original data:", data)
print("Notice: 200 is much larger than other values (outlier)")

# Method 1: Identify outliers using IQR (Interquartile Range)
# The IQR is robust, so we can compute it on the full data, outlier
# included. Quartiles are approximated by simple index positions; a
# statistics library would interpolate between values.
sorted_data = sorted(data)
q1 = sorted_data[len(sorted_data) // 4]
q3 = sorted_data[3 * len(sorted_data) // 4]
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print("\nIQR Method:")
print(f"   Q1: {q1}, Q3: {q3}, IQR: {iqr}")
print(f"   Lower bound: {lower_bound:.1f}, Upper bound: {upper_bound:.1f}")

outliers = [x for x in data if x < lower_bound or x > upper_bound]
print("   Outliers detected:", outliers)

# Method 2: Remove outliers
data_no_outliers = [x for x in data if lower_bound <= x <= upper_bound]
print("\nData without outliers:", data_no_outliers)

# Method 3: Cap outliers (clip to the IQR bounds)
data_capped = [min(max(x, lower_bound), upper_bound) for x in data]
print("\nData with capped outliers:", data_capped)
print("   Outliers are set to boundary values")
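
The IQR method is not the only detector. Another common technique, not shown in the lesson code above, is the Z-score: flag values more than a chosen number of standard deviations from the mean. Here is a minimal pure-Python sketch; the 2.5 threshold and the filename are illustrative choices, and the comments note one reason the IQR method is often preferred.

zscore_outliers.py
# Z-score outlier detection (illustrative sketch)
data = [10, 12, 11, 13, 12, 11, 14, 200, 12, 13]

mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5

# A value's Z-score is its distance from the mean in standard deviations.
# Caveat: the outlier itself inflates both the mean and the std, so a
# strict |z| > 3 cutoff would narrowly miss 200 on this small sample;
# the IQR method does not suffer from this masking effect.
outliers = [x for x in data if abs((x - mean) / std) > 2.5]

print(f"Mean: {mean:.1f}, Std: {std:.1f}")
print("Outliers (|z| > 2.5):", outliers)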

Removing Duplicates

Duplicate records can bias your model. Let's see how to identify and remove them:

duplicates.py
# Removing duplicate data

# Data with duplicates
data = [
    {"name": "Alice", "age": 25, "city": "NYC"},
    {"name": "Bob", "age": 30, "city": "LA"},
    {"name": "Alice", "age": 25, "city": "NYC"},  # Duplicate
    {"name": "Charlie", "age": 35, "city": "Chicago"},
    {"name": "Bob", "age": 30, "city": "LA"},  # Duplicate
]
print("Original data:", len(data), "records")

# Remove duplicates (keep first occurrence)
# Dicts are not hashable, so each record is converted to a tuple of
# its items to serve as a set key
seen = set()
unique_data = []
for record in data:
    record_tuple = tuple(record.items())
    if record_tuple not in seen:
        seen.add(record_tuple)
        unique_data.append(record)

print("After removing duplicates:", len(unique_data), "records")
print("Removed:", len(data) - len(unique_data), "duplicate records")
print("\nUnique records:")
for i, record in enumerate(unique_data, 1):
    print(f"   {i}. {record}")
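
For larger datasets, pandas does the same deduplication in one call. A minimal sketch, assuming pandas is installed and using the same records (the filename is illustrative):

dedupe_pandas.py
# Deduplication with pandas (illustrative sketch)
import pandas as pd

df = pd.DataFrame([
    {"name": "Alice", "age": 25, "city": "NYC"},
    {"name": "Bob", "age": 30, "city": "LA"},
    {"name": "Alice", "age": 25, "city": "NYC"},
    {"name": "Charlie", "age": 35, "city": "Chicago"},
    {"name": "Bob", "age": 30, "city": "LA"},
])

# keep="first" matches the loop above: the first occurrence survives
unique_df = df.drop_duplicates(keep="first")
print(unique_df)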

Normalization and Standardization

Different features often live on very different scales. Putting them on a common scale helps many ML algorithms, especially distance-based and gradient-based ones, perform better:

normalization.py
# Normalization and standardization

# Data with different scales
age = [25, 30, 35, 40, 45]                        # Range: 25-45
income = [50000, 75000, 100000, 125000, 150000]   # Range: 50k-150k

print("Original data (different scales):")
print("   Age:", age, "(range: 25-45)")
print("   Income:", income, "(range: 50k-150k)")

# Min-Max Normalization (scale to 0-1)
def normalize(data):
    min_val = min(data)
    max_val = max(data)
    return [(x - min_val) / (max_val - min_val) for x in data]

age_normalized = normalize(age)
income_normalized = normalize(income)
print("\nMin-Max Normalization (0-1 scale):")
print("   Age normalized:", [round(x, 2) for x in age_normalized])
print("   Income normalized:", [round(x, 2) for x in income_normalized])

# Standardization (mean=0, std=1)
# Uses the population standard deviation (divide by n)
def standardize(data):
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    std = variance ** 0.5
    return [(x - mean) / std for x in data]

age_standardized = standardize(age)
income_standardized = standardize(income)
print("\nStandardization (mean=0, std=1):")
print("   Age standardized:", [round(x, 2) for x in age_standardized])
print("   Income standardized:", [round(x, 2) for x in income_standardized])

print("\nWhy normalize/standardize?")
print("   - Features on the same scale")
print("   - Prevents one feature from dominating")
print("   - Improves model performance")
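
In real projects these transforms usually come from a library. Below is a minimal sketch using scikit-learn's built-in scalers, assuming scikit-learn and NumPy are installed (the filename is illustrative). Note that StandardScaler also divides by the population standard deviation, matching the standardize() function above.

scalers_sklearn.py
# Scaling with scikit-learn (illustrative sketch)
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One row per sample, one column per feature (age, income)
X = np.array([
    [25, 50000],
    [30, 75000],
    [35, 100000],
    [40, 125000],
    [45, 150000],
], dtype=float)

print("Min-max normalized:\n", MinMaxScaler().fit_transform(X))
print("Standardized:\n", StandardScaler().fit_transform(X))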

Data Cleaning Checklist

Follow this checklist when cleaning your data:

  • Check for missing values: Identify and handle them appropriately
  • Detect outliers: Decide whether to remove, cap, or keep them
  • Remove duplicates: Eliminate exact duplicate records
  • Fix inconsistencies: Standardize formats (dates, text, etc.); see the sketch after this list
  • Normalize scales: Ensure features are on similar scales
  • Validate data types: Ensure each column has the correct type; also covered in the sketch below
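
Format fixes and type validation have no example earlier in the lesson, so here is a minimal sketch of both; the values, the date formats, and the filename are made up for illustration, and the exact rules depend on your dataset.

consistency.py
# Fixing inconsistent formats and validating data types (illustrative)
from datetime import datetime

# Inconsistent text: mixed casing and stray whitespace
cities = ["  NYC", "nyc", "NYC ", "La", "LA"]
cities_clean = [c.strip().upper() for c in cities]
print("Standardized text:", cities_clean)

# Inconsistent dates: two formats mixed together
dates = ["2024-01-15", "15/01/2024"]
parsed = []
for d in dates:
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            parsed.append(datetime.strptime(d, fmt).date())
            break
        except ValueError:
            continue
print("Standardized dates:", parsed)

# Type validation: ages loaded as strings should be integers
ages_raw = ["25", "30", "thirty", "40"]
ages_valid = []
for a in ages_raw:
    try:
        ages_valid.append(int(a))
    except ValueError:
        print(f"Invalid age value skipped: {a!r}")
print("Validated ages:", ages_valid)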

Remember: Clean data leads to better models!

Exercise: Clean a Dataset

Complete the exercise on the right side:

  • Task 1: Fill missing values in the 'age' column with the mean age
  • Task 2: Remove duplicate records from the dataset
  • Task 3: Normalize the 'income' column using min-max normalization
  • Task 4: Print the cleaned dataset

Write your code to clean the dataset step by step!

💡 Learning Tip

Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!

🎉

Lesson Complete!

Great work! Continue to the next lesson.
