What is Data Cleaning?
Data cleaning is the process of identifying and correcting errors, inconsistencies, and other quality problems in your dataset. Real-world data is almost never perfect: it often contains missing values, duplicates, outliers, and formatting issues that can hurt your ML model's performance.
Cleaning your data is one of the most important steps in the ML pipeline. A well-cleaned dataset can dramatically improve model accuracy.
Handling Missing Values
Missing values are one of the most common data quality issues. Let's see different strategies to handle them:
# Sample data with missing values (None)
data = [10, 20, None, 40, None, 60, 70]
print("Original data with missing values:", data)

# Strategy 1: drop the missing entries entirely
data_no_missing = [x for x in data if x is not None]
print("\n1. Remove missing values:", data_no_missing)
print("   Use when: Missing values are few and random")

# Strategy 2: fill with the mean of the observed values
values_only = [x for x in data if x is not None]
mean_value = sum(values_only) / len(values_only)
data_mean_filled = [x if x is not None else mean_value for x in data]
print("\n2. Fill with mean:", data_mean_filled)
print(f"   Mean value: {mean_value:.1f}")
print("   Use when: Numerical data, missing values are random")
# Strategy 3: fill with the median (robust to outliers)
sorted_values = sorted(values_only)
n = len(sorted_values)
mid = n // 2
# Average the two middle values when the count is even
median_value = sorted_values[mid] if n % 2 else (sorted_values[mid - 1] + sorted_values[mid]) / 2
data_median_filled = [x if x is not None else median_value for x in data]
print("\n3. Fill with median:", data_median_filled)
print(f"   Median value: {median_value}")
print("   Use when: Data has outliers")
# Strategy 4: fill categorical data with the mode (most frequent value)
categorical_data = ["red", "blue", None, "red", "green", None, "red"]
observed = [x for x in categorical_data if x is not None]
mode_value = max(set(observed), key=observed.count)
categorical_filled = [x if x is not None else mode_value for x in categorical_data]
print("\n4. Fill with mode (categorical):", categorical_filled)
print(f"   Mode value: {mode_value}")
print("   Use when: Categorical data")
Handling Outliers
Outliers are extreme values that differ significantly from other observations. They can skew your model:
data = [10, 12, 11, 13, 12, 11, 14, 200, 12, 13]
print("Original data:", data)
print("Notice: 200 is much larger than other values (outlier)")

# IQR method: compute quartiles on the full dataset
# (simple index-based approximation of Q1 and Q3)
sorted_data = sorted(data)
q1 = sorted_data[len(sorted_data) // 4]
q3 = sorted_data[3 * len(sorted_data) // 4]
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
print("\nIQR Method:")
print(f"   Q1: {q1}, Q3: {q3}, IQR: {iqr}")
print(f"   Lower bound: {lower_bound:.1f}, Upper bound: {upper_bound:.1f}")
# Flag anything outside the IQR fences
outliers = [x for x in data if x < lower_bound or x > upper_bound]
print("   Outliers detected:", outliers)

# Option 1: drop the outliers
data_no_outliers = [x for x in data if lower_bound <= x <= upper_bound]
print("\nData without outliers:", data_no_outliers)

# Option 2: cap (winsorize) the outliers at the fence values
data_capped = [min(max(x, lower_bound), upper_bound) for x in data]
print("\nData with capped outliers:", data_capped)
print("   Outliers are set to boundary values")
Removing Duplicates
Duplicate records can bias your model. Let's see how to identify and remove them:
data = [
    {"name": "Alice", "age": 25, "city": "NYC"},
    {"name": "Bob", "age": 30, "city": "LA"},
    {"name": "Alice", "age": 25, "city": "NYC"},
    {"name": "Charlie", "age": 35, "city": "Chicago"},
    {"name": "Bob", "age": 30, "city": "LA"},
]
print("Original data:", len(data), "records")

# Dicts aren't hashable, so convert each record to a tuple of items
# and track which tuples have already been seen
seen = set()
unique_data = []
for record in data:
    record_tuple = tuple(record.items())
    if record_tuple not in seen:
        seen.add(record_tuple)
        unique_data.append(record)

print("After removing duplicates:", len(unique_data), "records")
print("Removed:", len(data) - len(unique_data), "duplicate records")
print("\nUnique records:")
for i, record in enumerate(unique_data, 1):
    print(f"   {i}. {record}")
Normalization and Standardization
Different features often live on very different scales. Putting them on a common scale helps many ML algorithms, especially distance-based and gradient-based ones, perform better:
age = [25, 30, 35, 40, 45]
income = [50000, 75000, 100000, 125000, 150000]
print("Original data (different scales):")
print("   Age:", age, "(range: 25-45)")
print("   Income:", income, "(range: 50k-150k)")
# Min-max normalization rescales values to the [0, 1] interval
# (assumes max_val > min_val; a constant column would divide by zero)
def normalize(data):
    min_val = min(data)
    max_val = max(data)
    return [(x - min_val) / (max_val - min_val) for x in data]

age_normalized = normalize(age)
income_normalized = normalize(income)
print("\nMin-Max Normalization (0-1 scale):")
print("   Age normalized:", [round(x, 2) for x in age_normalized])
print("   Income normalized:", [round(x, 2) for x in income_normalized])
# Standardization rescales to mean 0 and standard deviation 1
# (this uses the population standard deviation, dividing by n)
def standardize(data):
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    std = variance ** 0.5
    return [(x - mean) / std for x in data]

age_standardized = standardize(age)
income_standardized = standardize(income)
print("\nStandardization (mean=0, std=1):")
print("   Age standardized:", [round(x, 2) for x in age_standardized])
print("   Income standardized:", [round(x, 2) for x in income_standardized])
print("\nWhy normalize/standardize?")
print(" - Features on same scale")
print(" - Prevents one feature from dominating")
print(" - Improves model performance")
Data Cleaning Checklist
Follow this checklist when cleaning your data:
- Check for missing values: Identify and handle them appropriately
- Detect outliers: Decide whether to remove, cap, or keep them
- Remove duplicates: Eliminate exact duplicate records
- Fix inconsistencies: Standardize formats (dates, text, etc.); see the sketch after this checklist
- Normalize scales: Ensure features are on similar scales
- Validate data types: Ensure each column has the correct type
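The format-standardization and type-validation items are the only ones not demonstrated earlier. A minimal sketch of both, using hypothetical raw values chosen purely for illustration:

# Fix inconsistencies: strip whitespace, unify casing, map known aliases
raw_cities = ["  NYC", "nyc", "New York City", "LA ", "la"]
alias = {"nyc": "NYC", "new york city": "NYC", "la": "LA"}
cities = [alias.get(c.strip().lower(), c.strip()) for c in raw_cities]
print("Standardized cities:", cities)

# Validate data types: coerce numeric strings, mark unparseable values as missing
raw_ages = ["25", "30", "unknown", "42"]
ages = []
for value in raw_ages:
    try:
        ages.append(int(value))
    except ValueError:
        ages.append(None)   # now handled by the missing-value strategies above
print("Validated ages:", ages)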
Remember: Clean data leads to better models!
Exercise: Clean a Dataset
Complete the exercise by working through these tasks:
- Task 1: Fill missing values in the 'age' column with the mean age
- Task 2: Remove duplicate records from the dataset
- Task 3: Normalize the 'income' column using min-max normalization
- Task 4: Print the cleaned dataset
Write your code to clean the dataset step by step!
💡 Learning Tip
Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!