The Machine Learning Workflow
Most successful ML projects follow a structured workflow. Understanding this process is crucial: it is the roadmap that guides you from problem to solution. The workflow consists of seven steps, which we'll explore in detail.
Following a systematic workflow helps you avoid common mistakes, ensures you don't skip important steps, and makes your projects more successful and reproducible.
Step 1: Problem Definition
Before writing any code, clearly define what you're trying to solve:
problem = {
    "goal": "Predict house prices",
    "input": "House features (size, bedrooms, location)",
    "output": "Price (continuous number)",
    "type": "Supervised Learning - Regression"
}
print("Problem Definition:")
print("=" * 50)
for key, value in problem.items():
    print(f"{key.capitalize()}: {value}")
print("\nKey Questions:")
print(" - What are we trying to predict?")
print(" - What data do we have?")
print(" - What type of ML problem is this?")
Step 2: Data Collection
Gather the data you need. This might come from databases, APIs, files, or experiments:
raw_data = [
    {"size": 1000, "bedrooms": 2, "price": 200000},
    {"size": 1500, "bedrooms": 3, "price": 300000},
    {"size": 2000, "bedrooms": 4, "price": 400000},
]
print("Collected Data:")
print("=" * 50)
for i, house in enumerate(raw_data, 1):
    print(f"House {i}: {house}")
print("\nData Sources:")
print(" - Databases")
print(" - APIs")
print(" - CSV/Excel files")
print(" - Web scraping")
print(" - Experiments/surveys")
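When the data lives in a CSV file, the rows can be loaded into the same list-of-dictionaries shape used above. The following is a minimal sketch using Python's standard csv module. The CSV text is an in-memory stand-in for a real file, the column names are assumed to match the example, and the helper name load_houses is hypothetical.

```python
import csv
import io

# In-memory stand-in for a real CSV file (assumed column names).
CSV_TEXT = """size,bedrooms,price
1000,2,200000
1500,3,300000
2000,4,400000
"""


def load_houses(file_obj):
    """Read house rows from a CSV file object into a list of dicts."""
    reader = csv.DictReader(file_obj)
    # csv yields strings, so convert each field to an integer.
    return [{key: int(value) for key, value in row.items()} for row in reader]


raw_data = load_houses(io.StringIO(CSV_TEXT))
for i, house in enumerate(raw_data, 1):
    print(f"House {i}: {house}")
```

With a real file, the io.StringIO(CSV_TEXT) argument would be replaced by an opened file, for example open("houses.csv", newline=""), where the filename is hypothetical.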
Step 3: Data Preprocessing
Raw data is rarely perfect. Clean it, handle missing values, and prepare it for modeling:
raw_data = [
    {"size": 1000, "bedrooms": 2, "price": 200000},
    {"size": 1500, "bedrooms": None, "price": 300000},
    {"size": 2000, "bedrooms": 4, "price": 400000},
]
print("Preprocessing Steps:")
print("=" * 50)
print("1. Handle Missing Values")
for house in raw_data:
    if house["bedrooms"] is None:
        # Placeholder imputation: fill the gap with a typical value (3 here)
        house["bedrooms"] = 3
        print(f" Fixed missing bedrooms: {house}")
print("\n2. Normalize Features")
print(" Scale features to similar ranges")
print("\n3. Prepare Features and Target")
X = [[house["size"], house["bedrooms"]] for house in raw_data]
y = [house["price"] for house in raw_data]
print(f" Features (X): {X}")
print(f" Target (y): {y}")
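The code above fills the missing value with a fixed number and only mentions normalization. As a rough sketch of how both could be computed from the data itself, the example below fills missing bedrooms with the column mean and applies min-max scaling. The helper names fill_missing_with_mean and min_max_scale are hypothetical, and mean imputation with min-max scaling is just one common choice among several.

```python
def fill_missing_with_mean(rows, key):
    """Return copies of rows where None values for `key` become the mean."""
    known = [row[key] for row in rows if row[key] is not None]
    mean = sum(known) / len(known)
    return [
        {**row, key: mean if row[key] is None else row[key]} for row in rows
    ]


def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


raw_data = [
    {"size": 1000, "bedrooms": 2, "price": 200000},
    {"size": 1500, "bedrooms": None, "price": 300000},
    {"size": 2000, "bedrooms": 4, "price": 400000},
]

clean = fill_missing_with_mean(raw_data, "bedrooms")
sizes = min_max_scale([row["size"] for row in clean])
bedrooms = min_max_scale([row["bedrooms"] for row in clean])

# Pair the scaled columns back up into one feature row per house.
X = [list(pair) for pair in zip(sizes, bedrooms)]
y = [row["price"] for row in clean]
print(f"Scaled features (X): {X}")
print(f"Target (y): {y}")
```

In a real project, the mean and the min/max would normally be computed from the training set only and then reused on the test set, so that no information from the test data leaks into training.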
Step 4: Train-Test Split
Split your data into training and testing sets. Train on one set, evaluate on the other:
from sklearn.model_selection import train_test_split
import numpy as np
X = np.array([[1000, 2], [1500, 3], [2000, 4], [1200, 2]])
y = np.array([200000, 300000, 400000, 250000])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Train-Test Split:")
print("=" * 50)
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
print("\nTrain on training set, evaluate on test set!")
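To make the split less of a black box, here is a simplified sketch of the idea behind it: shuffle the row indices with a fixed seed, then hold out a fraction for testing. This is an illustration only, not scikit-learn's actual implementation. The function name simple_train_test_split is hypothetical, and it will generally not select the same rows as train_test_split.

```python
import math
import random


def simple_train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle indices with a seeded RNG and hold out a fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable

    # Round up so a small, non-empty dataset still gets a test row.
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


X = [[1000, 2], [1500, 3], [2000, 4], [1200, 2]]
y = [200000, 300000, 400000, 250000]

X_train, X_test, y_train, y_test = simple_train_test_split(X, y)
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
```

Fixing the seed plays the same role as random_state in the example above: rerunning the code produces the same split, which keeps results comparable between experiments.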
Step 5: Model Training
Train your ML model on the training data. The algorithm learns patterns from the examples:
from sklearn.linear_model import LinearRegression
import numpy as np
X_train = np.array([[1000, 2], [1500, 3], [2000, 4]])
y_train = np.array([200000, 300000, 400000])
print("Model Training:")
print("=" * 50)
model = LinearRegression()
model.fit(X_train, y_train)
print("Model trained successfully!")
print(f"Learned coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
print("\nThe model has learned the relationship between features and price!")
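To see what "learning" means here, the sketch below fits a one-feature line (price from size only) using the closed-form least-squares formulas, without scikit-learn. The function name fit_line is hypothetical, and using a single feature is a simplification of the two-feature model above.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = (sum of co-deviations of x and y) / (sum of squared x deviations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


sizes = [1000, 1500, 2000]
prices = [200000, 300000, 400000]

slope, intercept = fit_line(sizes, prices)
print(f"Learned slope (price per sqft): {slope}")
print(f"Learned intercept: {intercept}")
print(f"Prediction for 1800 sqft: {slope * 1800 + intercept:,.0f}")
```

Note that in the three training rows above, bedrooms is exactly size / 500 for every house, so the two features carry the same information and more than one pair of coefficients fits the data equally well. The coefficients printed for that two-feature model should be read with this in mind.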
Step 6: Model Evaluation
Evaluate how well your model performs on the test set (unseen data):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
X_train = np.array([[1000, 2], [1500, 3]])
y_train = np.array([200000, 300000])
model = LinearRegression()
model.fit(X_train, y_train)
X_test = np.array([[1200, 2]])
y_test = np.array([250000])
y_pred = model.predict(X_test)
print("Model Evaluation:")
print("=" * 50)
print(f"Actual price: ${y_test[0]:,}")
print(f"Predicted price: ${y_pred[0]:,.0f}")
mse = mean_squared_error(y_test, y_pred)
print(f"\nMean Squared Error: {mse:,.0f}")
print("Lower error = better model!")
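Mean squared error is the average of the squared differences between actual and predicted values. The sketch below computes it by hand, along with RMSE (its square root, which is in the same units as the price) and mean absolute error. The function names are hypothetical, and the prediction values are made up for illustration rather than taken from the model above.

```python
import math


def mean_squared_error_manual(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error_manual(y_true, y_pred):
    """Average of absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only (assumed, not produced by a trained model).
y_true = [250000, 300000, 400000]
y_pred = [240000, 310000, 380000]

mse = mean_squared_error_manual(y_true, y_pred)
rmse = math.sqrt(mse)
mae = mean_absolute_error_manual(y_true, y_pred)

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"MAE:  {mae:,.0f}")
```

Because MSE squares each error, its value is in squared dollars and is dominated by the largest misses. RMSE and MAE are often easier to interpret as a typical error in dollars.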
Step 7: Model Deployment
Once your model performs well, deploy it to make predictions on new, real-world data:
def predict_house_price(size, bedrooms):
    # Hand-picked example values standing in for a trained model's parameters
    base_price = 100000
    price_per_sqft = 100
    bedroom_value = 20000
    return base_price + (size * price_per_sqft) + (bedrooms * bedroom_value)
print("Model Deployment:")
print("=" * 50)
new_houses = [
    {"size": 1800, "bedrooms": 3},
    {"size": 2200, "bedrooms": 4}
]
print("Predictions for New Houses:")
for house in new_houses:
    price = predict_house_price(house["size"], house["bedrooms"])
    print(f" {house['size']} sqft, {house['bedrooms']} bedrooms: ${price:,.0f}")
print("\nModel is now deployed and making real predictions!")
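In practice, deployment often involves saving the trained model's parameters and loading them wherever predictions are served, rather than hard-coding numbers in a function. Below is a minimal sketch of that idea using JSON from the standard library. The parameter values are copied from the example function above as stand-ins for learned values, and the names save_model, load_model, and predict, as well as the file name, are hypothetical.

```python
import json
import os
import tempfile

# Stand-in parameters (copied from the example above, not actually learned).
params = {"intercept": 100000, "per_sqft": 100, "per_bedroom": 20000}


def save_model(params, path):
    """Write model parameters to a JSON file."""
    with open(path, "w") as f:
        json.dump(params, f)


def load_model(path):
    """Read model parameters back from a JSON file."""
    with open(path) as f:
        return json.load(f)


def predict(params, size, bedrooms):
    """Apply the linear formula using the loaded parameters."""
    return (
        params["intercept"]
        + size * params["per_sqft"]
        + bedrooms * params["per_bedroom"]
    )


# Save to a temporary directory, then load as a serving process would.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "house_model.json")
    save_model(params, model_path)
    loaded = load_model(model_path)

price = predict(loaded, size=1800, bedrooms=3)
print(f"1800 sqft, 3 bedrooms: ${price:,.0f}")
```

Separating the saved parameters from the prediction code means the model can be retrained and replaced without changing the function that serves predictions. For fitted scikit-learn objects, tools such as pickle or joblib are commonly used for the saving step instead of JSON.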
The Complete Workflow Summary
Here's the complete ML workflow in one place:
workflow_steps = [
    ("1. Problem Definition", "Define what you want to predict"),
    ("2. Data Collection", "Gather relevant data"),
    ("3. Data Preprocessing", "Clean and prepare data"),
    ("4. Train-Test Split", "Separate training and testing data"),
    ("5. Model Training", "Train algorithm on training data"),
    ("6. Model Evaluation", "Test on unseen data"),
    ("7. Model Deployment", "Use model for real predictions")
]
print("Complete ML Workflow:")
print("=" * 60)
for step, description in workflow_steps:
    print(f"{step}")
    print(f" → {description}")
    print()
print("This workflow applies to most ML projects!")
📝 Exercise: Complete the ML Workflow
Complete the exercise in the code editor on the right. You'll practice going through the complete ML workflow:
- Step 1: Define a problem (e.g., predict house prices, classify emails, etc.)
- Step 2: Create sample data for your problem
- Step 3: Extract features (X) and target (y) from your data
- Step 4: Split data into training and testing sets
- Step 5: Train a simple model (for example, compute the average target value or fit a simple pattern)
- Step 6: Evaluate your model on test data
- Step 7: Create a prediction function for deployment
This exercise walks you through the complete workflow. Start with a simple problem like predicting house prices, then work through each step. Don't worry about perfect code—focus on understanding the process!
💡 Important Note
The workflow is iterative! You'll often go back to earlier steps: maybe you need more data, the preprocessing needs adjustment, or a different model is worth trying. Don't expect to go through it once and be done.