🎯 Project: Data Preprocessing Pipeline
This project will help you apply everything you've learned about data preparation. You'll build a complete data preprocessing pipeline that handles missing values, outliers, encoding, and normalization.
Projects are where theory meets practice. This is your chance to integrate multiple concepts and create something real.
Project Overview
We'll work with a sample dataset containing house information. Our goal is to clean and preprocess this data for machine learning:
```python
raw_data = [
    {"size": 1200, "bedrooms": 2, "age": 5, "location": "urban", "price": 250000},
    {"size": 1500, "bedrooms": None, "age": 10, "location": "suburban", "price": 320000},
    {"size": 2000, "bedrooms": 4, "age": 2, "location": "urban", "price": 450000},
    {"size": None, "bedrooms": 3, "age": 15, "location": "rural", "price": 280000},
    {"size": 1600, "bedrooms": 3, "age": 8, "location": "suburban", "price": 350000},
]

print("Raw Data (with issues):")
print(" - Missing values (None)")
print(" - Categorical data (location)")
print(" - Different scales (size vs age)")
print(" - Need preprocessing for ML!")
```
Step 1: Handle Missing Values
First, we need to identify and handle missing values:
```python
data = [
    {"size": 1200, "bedrooms": 2, "age": 5},
    {"size": 1500, "bedrooms": None, "age": 10},
    {"size": 2000, "bedrooms": 4, "age": 2},
    {"size": None, "bedrooms": 3, "age": 15},
    {"size": 1600, "bedrooms": 3, "age": 8},
]

# Compute means from the non-missing values only
sizes = [d["size"] for d in data if d["size"] is not None]
bedrooms = [d["bedrooms"] for d in data if d["bedrooms"] is not None]
mean_size = sum(sizes) / len(sizes)
mean_bedrooms = sum(bedrooms) / len(bedrooms)

print("Missing Value Handling:")
print(f" Mean size: {mean_size:.0f}")
print(f" Mean bedrooms: {mean_bedrooms:.1f}")

# Fill each missing value with the corresponding mean
for record in data:
    if record["size"] is None:
        record["size"] = mean_size
    if record["bedrooms"] is None:
        record["bedrooms"] = round(mean_bedrooms)

print("\nData after filling missing values:")
for i, record in enumerate(data, 1):
    print(f" {i}. {record}")
```
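Mean imputation works well here, but the mean is pulled around by extreme values. A common alternative is the median, which is more robust to outliers. Here is a minimal sketch (the helper name `fill_missing_with_median` is my own, not part of the lesson's code):

```python
def fill_missing_with_median(records, key):
    """Replace None values for `key` with the median of the present values."""
    values = sorted(r[key] for r in records if r[key] is not None)
    n = len(values)
    mid = n // 2
    median = values[mid] if n % 2 == 1 else (values[mid - 1] + values[mid]) / 2
    for r in records:
        if r[key] is None:
            r[key] = median
    return records

data = [{"size": 1200}, {"size": None}, {"size": 2000}, {"size": 1600}]
fill_missing_with_median(data, "size")
print([r["size"] for r in data])  # the missing size becomes the median, 1600
```

For the small dataset above the mean and median are close, but with a single mansion-sized outlier in the data they would diverge sharply.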
Step 2: Encode Categorical Data
Convert categorical features to numerical format:
```python
data = [
    {"location": "urban"},
    {"location": "suburban"},
    {"location": "urban"},
    {"location": "rural"},
    {"location": "suburban"},
]

# Sort the unique categories so the mapping is deterministic across runs
# (iterating a bare set can produce a different order each time)
unique_locations = sorted(set(d["location"] for d in data))
location_to_num = {loc: i for i, loc in enumerate(unique_locations)}

print("Categorical Encoding:")
print(" Mapping:", location_to_num)

for record in data:
    record["location_encoded"] = location_to_num[record["location"]]

print("\nEncoded Data:")
for i, record in enumerate(data, 1):
    print(f" {i}. {record['location']} → {record['location_encoded']}")
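One caveat with integer labels: they imply an ordering (rural < suburban < urban) that locations don't actually have, which can mislead some models. One-hot encoding avoids this by giving each category its own 0/1 column. A minimal sketch (the `one_hot` helper is my own illustration):

```python
locations = ["urban", "suburban", "urban", "rural", "suburban"]
categories = sorted(set(locations))  # ['rural', 'suburban', 'urban']

def one_hot(value, categories):
    """Return a 0/1 vector with a 1 in the position of `value`."""
    return [1 if value == c else 0 for c in categories]

for loc in locations:
    print(loc, one_hot(loc, categories))  # e.g. urban -> [0, 0, 1]
```

The trade-off is dimensionality: one-hot encoding adds one column per category, so it is best suited to features with a modest number of distinct values.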
Step 3: Normalize Features
Scale features to similar ranges for better model performance:
```python
sizes = [1200, 1500, 2000, 1575, 1600]  # 1575 is the mean-filled value from Step 1
ages = [5, 10, 2, 15, 8]

print("Original features (different scales):")
print(" Sizes:", sizes)
print(" Ages:", ages)

def normalize(values):
    """Min-max scale values into the range [0, 1]."""
    min_val = min(values)
    max_val = max(values)
    return [(v - min_val) / (max_val - min_val) for v in values]

sizes_normalized = normalize(sizes)
ages_normalized = normalize(ages)

print("\nNormalized features (0-1 scale):")
print(" Sizes normalized:", [round(x, 3) for x in sizes_normalized])
print(" Ages normalized:", [round(x, 3) for x in ages_normalized])
print("\nNow both features are on the same scale!")
```
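Two practical notes on the `normalize` function above: it divides by zero when every value is identical, and min-max scaling is only one option; z-score standardization (mean 0, standard deviation 1) is a common alternative that is less sensitive to extreme values. A sketch of both, with the edge case guarded (the function names are my own):

```python
def normalize_safe(values):
    """Min-max normalize, guarding against a constant feature."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values identical: no spread to scale
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: subtract the mean, divide by the std dev."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]

ages = [5, 10, 2, 15, 8]
print([round(x, 3) for x in normalize_safe(ages)])
print([round(x, 3) for x in standardize(ages)])
```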
Step 4: Complete Preprocessing Pipeline
Putting it all together - a complete preprocessing workflow:
```python
raw_data = [
    {"size": 1200, "bedrooms": 2, "location": "urban"},
    {"size": 1500, "bedrooms": None, "location": "suburban"},
    {"size": 2000, "bedrooms": 4, "location": "urban"},
]

print("Step 1: Handle missing values")
mean_bedrooms = 3.0  # mean of the known bedroom counts (2 and 4)
for record in raw_data:
    if record["bedrooms"] is None:
        record["bedrooms"] = mean_bedrooms

print("Step 2: Encode categorical data")
location_map = {"urban": 0, "suburban": 1, "rural": 2}
for record in raw_data:
    record["location_encoded"] = location_map[record["location"]]

print("Step 3: Normalize numerical features")
sizes = [r["size"] for r in raw_data]
min_size, max_size = min(sizes), max(sizes)
for record in raw_data:
    record["size_normalized"] = (record["size"] - min_size) / (max_size - min_size)

print("\nPreprocessed Data (ready for ML):")
for i, record in enumerate(raw_data, 1):
    print(f" {i}. Size: {record['size_normalized']:.3f}, "
          f"Bedrooms: {record['bedrooms']}, Location: {record['location_encoded']}")
```
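In a real project you would typically wrap these steps in a single reusable function so the same transformations can be applied to any batch of records. One possible sketch (the `preprocess` function and its structure are my own, not a fixed API):

```python
def preprocess(records, location_map):
    """Run the three pipeline steps on a list of house records (sketch)."""
    # Step 1: fill missing bedrooms with the mean of the known values
    known = [r["bedrooms"] for r in records if r["bedrooms"] is not None]
    mean_bedrooms = sum(known) / len(known)
    for r in records:
        if r["bedrooms"] is None:
            r["bedrooms"] = mean_bedrooms
    # Step 2: encode the location category as an integer
    for r in records:
        r["location_encoded"] = location_map[r["location"]]
    # Step 3: min-max normalize size
    sizes = [r["size"] for r in records]
    lo, hi = min(sizes), max(sizes)
    for r in records:
        r["size_normalized"] = (r["size"] - lo) / (hi - lo)
    return records

raw = [
    {"size": 1200, "bedrooms": 2, "location": "urban"},
    {"size": 1500, "bedrooms": None, "location": "suburban"},
    {"size": 2000, "bedrooms": 4, "location": "urban"},
]
result = preprocess(raw, {"urban": 0, "suburban": 1, "rural": 2})
for r in result:
    print(r["size_normalized"], r["bedrooms"], r["location_encoded"])
```

Note that the means and min/max used for scaling should, in a real workflow, be computed on training data only and then reused for new data, so the same record is always transformed the same way.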
Project Checklist
Complete these steps for your project:
- ✅ Handle Missing Values: Identify and fill or remove missing data
- ✅ Remove Duplicates: Eliminate duplicate records
- ✅ Encode Categorical: Convert text categories to numbers
- ✅ Normalize Features: Scale features to similar ranges
- ✅ Handle Outliers: Detect and treat extreme values
- ✅ Validate Data: Check data quality and consistency
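Two checklist items, duplicate removal and outlier handling, aren't covered by the worked examples above. A minimal sketch of each, assuming exact-match duplicates and the common 1.5 × IQR rule with simple index-based quartiles (both helper names are my own):

```python
def remove_duplicates(records):
    """Drop exact duplicate dicts, keeping the first occurrence."""
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def iqr_outliers(values):
    """Flag values outside 1.5 * IQR beyond the quartiles (simple quartile picks)."""
    s = sorted(values)
    q1 = s[len(s) // 4]
    q3 = s[(3 * len(s)) // 4]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

rows = [{"size": 1200}, {"size": 1200}, {"size": 1500}]
print(remove_duplicates(rows))          # duplicate 1200 row dropped
print(iqr_outliers([5, 10, 2, 15, 8, 100]))  # 100 flagged as an outlier
```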
Once preprocessing is complete, your data is ready for machine learning models!
Exercise: Complete Preprocessing Pipeline
Complete the exercise on the right side to build a full preprocessing pipeline:
- Step 1: Fill missing values in 'bedrooms' with the mean
- Step 2: Encode the 'location' categorical feature
- Step 3: Normalize the 'size' feature using min-max normalization
- Step 4: Print the final preprocessed dataset
Write your code to complete all preprocessing steps!
💡 Project Tips
Break the project into smaller tasks. Complete and test each part before moving to the next. Don't try to do everything at once: iterative development leads to better results!
🎉 Lesson Complete!
Great work! Continue to the next lesson.