Lesson 4: ML Workflow

The Machine Learning Workflow

Most successful ML projects follow a structured workflow: a roadmap that takes you from a problem to a working solution. This lesson walks through its seven steps, from problem definition to deployment.

Working through the steps in order helps you avoid common mistakes, such as evaluating a model on the same data it was trained on, and makes your results easier to reproduce.

Step 1: Problem Definition

Before writing any code, clearly define what you're trying to solve:

problem_definition.py
# Step 1: Define the Problem

problem = {
    "goal": "Predict house prices",
    "input": "House features (size, bedrooms, location)",
    "output": "Price (continuous number)",
    "type": "Supervised Learning - Regression"
}

print("Problem Definition:")
print("=" * 50)
for key, value in problem.items():
    print(f"{key.capitalize()}: {value}")

print("\nKey Questions:")
print("  - What are we trying to predict?")
print("  - What data do we have?")
print("  - What type of ML problem is this?")

Step 2: Data Collection

Gather the data you need. This might come from databases, APIs, files, or experiments:

data_collection.py
# Step 2: Collect Data

# Example: House price data
raw_data = [
    {"size": 1000, "bedrooms": 2, "price": 200000},
    {"size": 1500, "bedrooms": 3, "price": 300000},
    {"size": 2000, "bedrooms": 4, "price": 400000},
]

print("Collected Data:")
print("=" * 50)
for i, house in enumerate(raw_data, 1):
    print(f"House {i}: {house}")

print("\nData Sources:")
print("  - Databases")
print("  - APIs")
print("  - CSV/Excel files")
print("  - Web scraping")
print("  - Experiments/surveys")
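
As one illustration of the file-based sources listed above, the sketch below parses CSV text into the same list-of-dicts shape as raw_data. The CSV content and the load_houses helper are made up for this example, and the text is read from an in-memory string so the snippet runs on its own. With a real file you would pass an open file handle to csv.DictReader instead.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file on disk.
csv_text = """size,bedrooms,price
1000,2,200000
1500,3,300000
2000,4,400000
"""

def load_houses(handle):
    """Read house records from a CSV file handle and convert every field to int."""
    reader = csv.DictReader(handle)
    return [{key: int(value) for key, value in row.items()} for row in reader]

raw_data = load_houses(io.StringIO(csv_text))

for i, house in enumerate(raw_data, 1):
    print(f"House {i}: {house}")
```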

Step 3: Data Preprocessing

Raw data is rarely perfect. Clean it, handle missing values, and prepare it for modeling:

data_preprocessing.py
# Step 3: Preprocess Data

# Raw data with issues
raw_data = [
    {"size": 1000, "bedrooms": 2, "price": 200000},
    {"size": 1500, "bedrooms": None, "price": 300000},  # Missing value
    {"size": 2000, "bedrooms": 4, "price": 400000},
]

print("Preprocessing Steps:")
print("=" * 50)

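# Handle missing values by filling with the mean of the known values
print("1. Handle Missing Values")
known = [h["bedrooms"] for h in raw_data if h["bedrooms"] is not None]
mean_bedrooms = round(sum(known) / len(known))  # (2 + 4) / 2 = 3
for house in raw_data:
    if house["bedrooms"] is None:
        house["bedrooms"] = mean_bedrooms
        print(f"  Fixed missing bedrooms: {house}")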

# Normalize/scale features
print("\n2. Normalize Features")
print("  Scale features to similar ranges")

# Split into features and target
print("\n3. Prepare Features and Target")
X = [[house["size"], house["bedrooms"]] for house in raw_data]
y = [house["price"] for house in raw_data]

print(f"  Features (X): {X}")
print(f"  Target (y): {y}")
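
The normalization step in the listing is only described, not performed. Below is a minimal sketch of one common choice, standardization (subtract each column's mean and divide by its standard deviation), using NumPy on the same three houses. The variable names are illustrative.

```python
import numpy as np

# Features from the preprocessed houses: [size, bedrooms]
X = np.array([[1000, 2], [1500, 3], [2000, 4]], dtype=float)

# Standardize each column to zero mean and unit standard deviation
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print("Column means:", mean)
print("Column std devs:", std)
print("Scaled features:")
print(X_scaled)
```

In a real project, compute the mean and standard deviation on the training set only and reuse them for the test set, so that no information from the test data leaks into training.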

Step 4: Train-Test Split

Split your data into training and testing sets. Train on one set, evaluate on the other:

train_test_split.py
# Step 4: Split Data

from sklearn.model_selection import train_test_split
import numpy as np

# Prepare data
X = np.array([[1000, 2], [1500, 3], [2000, 4], [1200, 2]])
y = np.array([200000, 300000, 400000, 250000])

# Hold out 20% for testing (with only 4 samples this rounds to 3 train, 1 test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train-Test Split:")
print("=" * 50)
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print("\nTrain on training set, evaluate on test set!")
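
To see what random_state does, the sketch below splits a slightly larger dataset twice with the same seed and checks that both splits match. The ten data points are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up houses: size in sqft, priced at $200 per sqft
X = np.arange(1000, 2000, 100).reshape(-1, 1)
y = X.ravel() * 200

# Two splits with the same seed
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_test, y_train, y_test = split_a
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

# Compare the four arrays of each split pairwise
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(f"Same split both times: {same}")
```

Fixing the seed makes results repeatable from run to run, which matters when you compare models against each other.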

Step 5: Model Training

Train your ML model on the training data. The algorithm learns patterns from the examples:

model_training.py
# Step 5: Train Model

from sklearn.linear_model import LinearRegression
import numpy as np

# Training data
X_train = np.array([[1000, 2], [1500, 3], [2000, 4]])
y_train = np.array([200000, 300000, 400000])

print("Model Training:")
print("=" * 50)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

print("Model trained successfully!")
print(f"Learned coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
print("\nThe model has learned the relationship between features and price!")
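
The learned coefficients and intercept fully describe a linear model: a prediction is the dot product of the features with coef_, plus intercept_. The sketch below checks this against model.predict. It uses made-up training data in which bedrooms is not a fixed multiple of size, so the two coefficients are well defined.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: [size, bedrooms] -> price
X_train = np.array([[1000, 2], [1500, 3], [2000, 3], [1200, 2], [1800, 4]])
y_train = np.array([200000, 300000, 380000, 240000, 370000])

model = LinearRegression()
model.fit(X_train, y_train)

# Reproduce model.predict by hand: y = X @ coef + intercept
new_house = np.array([[1600, 3]])
manual = new_house @ model.coef_ + model.intercept_
library = model.predict(new_house)

print(f"Manual prediction:  ${manual[0]:,.0f}")
print(f"Library prediction: ${library[0]:,.0f}")
```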

Step 6: Model Evaluation

Evaluate how well your model performs on the test set (unseen data):

model_evaluation.py
# Step 6: Evaluate Model

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Train model (same training data as Step 5)
X_train = np.array([[1000, 2], [1500, 3], [2000, 4]])
y_train = np.array([200000, 300000, 400000])
model = LinearRegression()
model.fit(X_train, y_train)

# Test data
X_test = np.array([[1200, 2]])
y_test = np.array([250000])

# Make predictions
y_pred = model.predict(X_test)

print("Model Evaluation:")
print("=" * 50)
print(f"Actual price: ${y_test[0]:,}")
print(f"Predicted price: ${y_pred[0]:,.0f}")

# Calculate error
mse = mean_squared_error(y_test, y_pred)
print(f"\nMean Squared Error: {mse:,.0f}")
print("Lower error = better model!")
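
Mean squared error is expressed in squared units (dollars squared here), which is hard to interpret. The sketch below computes MSE by hand along with two related metrics in the original units: root mean squared error (RMSE) and mean absolute error (MAE). The actual and predicted prices are invented for illustration.

```python
import numpy as np

# Invented actual and predicted prices for two test houses
y_test = np.array([250000, 310000])
y_pred = np.array([240000, 320000])

errors = y_test - y_pred
mse = np.mean(errors ** 2)      # mean squared error, in dollars squared
rmse = np.sqrt(mse)             # root mean squared error, in dollars
mae = np.mean(np.abs(errors))   # mean absolute error, in dollars

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: ${rmse:,.0f}")
print(f"MAE:  ${mae:,.0f}")
```

Here both predictions are off by $10,000, so RMSE and MAE both come out to $10,000 while MSE is 100,000,000.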

Step 7: Model Deployment

Once your model performs well, deploy it to make predictions on new, real-world data:

model_deployment.py
# Step 7: Deploy Model

# Trained model (simplified)
def predict_house_price(size, bedrooms):
    # This would use your trained model
    # Simplified for demonstration
    base_price = 100000
    price_per_sqft = 100
    bedroom_value = 20000
    return base_price + (size * price_per_sqft) + (bedrooms * bedroom_value)

print("Model Deployment:")
print("=" * 50)

# Use model to make predictions
new_houses = [
    {"size": 1800, "bedrooms": 3},
    {"size": 2200, "bedrooms": 4}
]

print("Predictions for New Houses:")
for house in new_houses:
    price = predict_house_price(house["size"], house["bedrooms"])
    print(f"  {house['size']} sqft, {house['bedrooms']} bedrooms: ${price:,.0f}")

print("\nModel is now deployed and making real predictions!")
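
The function in the listing hard-codes its numbers. One way to serve an actual trained model is to save it after training and load it wherever predictions are needed. The sketch below uses Python's pickle module with a temporary file. The file name and training data are placeholders for this example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Train a model on made-up data: [size, bedrooms] -> price
X_train = np.array([[1000, 2], [1500, 3], [2000, 3], [1200, 2], [1800, 4]])
y_train = np.array([200000, 300000, 380000, 240000, 370000])
model = LinearRegression().fit(X_train, y_train)

# Save the trained model to disk (placeholder path in a temp directory)
model_path = os.path.join(tempfile.mkdtemp(), "house_price_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, possibly in another process: load the model and predict
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

def predict_house_price(size, bedrooms):
    """Predict a price using the loaded model."""
    return float(loaded_model.predict(np.array([[size, bedrooms]]))[0])

price = predict_house_price(1800, 3)
print(f"1800 sqft, 3 bedrooms: ${price:,.0f}")
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.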

The Complete Workflow Summary

Here's the complete ML workflow in one place:

complete_workflow.py
# Complete ML Workflow

workflow_steps = [
    ("1. Problem Definition", "Define what you want to predict"),
    ("2. Data Collection", "Gather relevant data"),
    ("3. Data Preprocessing", "Clean and prepare data"),
    ("4. Train-Test Split", "Separate training and testing data"),
    ("5. Model Training", "Train algorithm on training data"),
    ("6. Model Evaluation", "Test on unseen data"),
    ("7. Model Deployment", "Use model for real predictions")
]

print("Complete ML Workflow:")
print("=" * 60)
for step, description in workflow_steps:
    print(f"{step}")
    print(f"  → {description}")
    print()

print("This workflow applies to most supervised ML projects!")
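
Steps 3 through 6 can also be chained in a single script. The sketch below is one possible arrangement using scikit-learn's make_pipeline with StandardScaler and LinearRegression. The synthetic data (roughly $200 per sqft plus $10,000 per bedroom, with noise) is generated only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
size = rng.uniform(800, 2500, 100)
bedrooms = rng.integers(1, 6, 100)
price = 200 * size + 10000 * bedrooms + rng.normal(0, 5000, 100)

X = np.column_stack([size, bedrooms])
y = price

# Step 4: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Steps 3 and 5: scale features and train; the scaler is fit on training data only
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Step 6: evaluate on unseen data
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Test RMSE: ${rmse:,.0f}")
```

Wrapping the scaler and the model in one pipeline means the same preprocessing is applied automatically at prediction time.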

📝 Exercise: Complete the ML Workflow

Complete the exercise in the code editor on the right. You'll practice going through the complete ML workflow:

Step 1: Define a problem (e.g., predict house prices, classify emails, etc.)
Step 2: Create sample data for your problem
Step 3: Extract features (X) and target (y) from your data
Step 4: Split data into training and testing sets
Step 5: Train a simple model (for example, compute an average price or an average price per square foot)
Step 6: Evaluate your model on test data
Step 7: Create a prediction function for deployment

This exercise walks you through the complete workflow. Start with a simple problem like predicting house prices, then work through each step. Don't worry about perfect code—focus on understanding the process!
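
If you are unsure where to start with Step 5, one possible baseline is to predict using the average price per square foot from the training data. The sketch below is only a starting point with invented numbers, not the expected solution.

```python
# Invented training and test data: (size in sqft, price)
train = [(1000, 200000), (1500, 300000), (2000, 400000)]
test = [(1200, 250000)]

# "Train": average price per square foot over the training houses
price_per_sqft = sum(price / size for size, price in train) / len(train)

def predict(size):
    """Baseline prediction: size times the average price per sqft."""
    return size * price_per_sqft

# Evaluate: mean absolute error on the test data
mae = sum(abs(price - predict(size)) for size, price in test) / len(test)

print(f"Average price per sqft: ${price_per_sqft:.2f}")
print(f"Mean absolute error: ${mae:,.0f}")
```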
