Chapter 1: Introduction to Machine Learning / Lesson 4

The Machine Learning Workflow

Every successful ML project follows a structured workflow. Understanding this process is crucial: it is the roadmap that guides you from problem to solution. The workflow consists of seven steps, from problem definition to deployment, and each one is covered below.

Following a systematic workflow helps you avoid common mistakes, ensures you don't skip important steps, and makes your projects more successful and reproducible.

Step 1: Problem Definition

Before writing any code, clearly define what you're trying to solve:

problem_definition.py
    # Step 1: Define the Problem
    problem = {
        "goal": "Predict house prices",
        "input": "House features (size, bedrooms, location)",
        "output": "Price (continuous number)",
        "type": "Supervised Learning - Regression"
    }

    print("Problem Definition:")
    print("=" * 50)
    for key, value in problem.items():
        print(f"{key.capitalize()}: {value}")

    print("\nKey Questions:")
    print(" - What are we trying to predict?")
    print(" - What data do we have?")
    print(" - What type of ML problem is this?")

Step 2: Data Collection

Gather the data you need. This might come from databases, APIs, files, or experiments:

data_collection.py
    # Step 2: Collect Data
    # Example: House price data
    raw_data = [
        {"size": 1000, "bedrooms": 2, "price": 200000},
        {"size": 1500, "bedrooms": 3, "price": 300000},
        {"size": 2000, "bedrooms": 4, "price": 400000},
    ]

    print("Collected Data:")
    print("=" * 50)
    for i, house in enumerate(raw_data, 1):
        print(f"House {i}: {house}")

    print("\nData Sources:")
    print(" - Databases")
    print(" - APIs")
    print(" - CSV/Excel files")
    print(" - Web scraping")
    print(" - Experiments/surveys")
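When the data lives in a CSV file, Python's standard-library csv module is usually enough to read it into the same list-of-dictionaries shape used above. The sketch below reads from an in-memory string so that it runs as-is; the column names, the `load_houses` helper, and the `houses.csv` filename mentioned in the comment are assumptions for illustration.

```python
import csv
import io

# Stand-in for a real file. With a file on disk you would pass
# open("houses.csv", newline="") instead (hypothetical filename).
CSV_TEXT = """size,bedrooms,price
1000,2,200000
1500,3,300000
2000,4,400000
"""

def load_houses(file_obj):
    """Read house rows from a CSV file object and convert each field to int."""
    reader = csv.DictReader(file_obj)
    return [
        {
            "size": int(row["size"]),
            "bedrooms": int(row["bedrooms"]),
            "price": int(row["price"]),
        }
        for row in reader
    ]

raw_data = load_houses(io.StringIO(CSV_TEXT))
for i, house in enumerate(raw_data, 1):
    print(f"House {i}: {house}")
```

Converting the fields matters because csv returns every value as a string; leaving them unconverted would break the arithmetic in later steps.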

Step 3: Data Preprocessing

Raw data is rarely perfect. Clean it, handle missing values, and prepare it for modeling:

data_preprocessing.py
    # Step 3: Preprocess Data
    # Raw data with issues
    raw_data = [
        {"size": 1000, "bedrooms": 2, "price": 200000},
        {"size": 1500, "bedrooms": None, "price": 300000},  # Missing value
        {"size": 2000, "bedrooms": 4, "price": 400000},
    ]

    print("Preprocessing Steps:")
    print("=" * 50)

    # Handle missing values: fill with the average of the known values
    print("1. Handle Missing Values")
    known_bedrooms = [h["bedrooms"] for h in raw_data if h["bedrooms"] is not None]
    average_bedrooms = round(sum(known_bedrooms) / len(known_bedrooms))
    for house in raw_data:
        if house["bedrooms"] is None:
            house["bedrooms"] = average_bedrooms
            print(f" Fixed missing bedrooms: {house}")

    # Normalize/scale features (described only; not applied in this example)
    print("\n2. Normalize Features")
    print(" Scale features to similar ranges")

    # Split into features and target
    print("\n3. Prepare Features and Target")
    X = [[house["size"], house["bedrooms"]] for house in raw_data]
    y = [house["price"] for house in raw_data]
    print(f" Features (X): {X}")
    print(f" Target (y): {y}")
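The normalization step above is only described, not carried out. One common way to do it is min-max scaling, which maps each feature column to the range 0 to 1. The following is a minimal sketch in plain Python; `min_max_scale` is a hypothetical helper name, and other scaling methods (such as standardization) are equally valid choices.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread to rescale; map it to zeros
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

sizes = [1000, 1500, 2000]
bedrooms = [2, 3, 4]

scaled_sizes = min_max_scale(sizes)
scaled_bedrooms = min_max_scale(bedrooms)

# Recombine the scaled columns into feature rows
X_scaled = [list(row) for row in zip(scaled_sizes, scaled_bedrooms)]
print(X_scaled)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

In a real project, the minimum and maximum should be computed from the training data only and then reused for the test data, so that no information about the test set leaks into preprocessing.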

Step 4: Train-Test Split

Split your data into training and testing sets. Train on one set, evaluate on the other:

train_test_split.py
    # Step 4: Split Data
    from sklearn.model_selection import train_test_split
    import numpy as np

    # Prepare data
    X = np.array([[1000, 2], [1500, 3], [2000, 4], [1200, 2]])
    y = np.array([200000, 300000, 400000, 250000])

    # Split: hold out 20% for testing
    # (with only 4 samples, that rounds up to 1 test sample and 3 training samples)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    print("Train-Test Split:")
    print("=" * 50)
    print(f"Training set: {len(X_train)} samples")
    print(f"Test set: {len(X_test)} samples")
    print("\nTrain on training set, evaluate on test set!")
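To make the idea behind the split concrete, here is a rough plain-Python sketch of what such a function does: shuffle the row indices, then set aside a fraction of them for testing. This is not scikit-learn's implementation; `simple_train_test_split` is a hypothetical helper, and its rounding rule is a simplification.

```python
import random

def simple_train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle row indices, then hold out a fraction of the rows for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable

    # Keep at least one test sample, even for tiny datasets
    n_test = max(1, round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

X = [[1000, 2], [1500, 3], [2000, 4], [1200, 2]]
y = [200000, 300000, 400000, 250000]

X_train, X_test, y_train, y_test = simple_train_test_split(X, y)
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
```

Shuffling before splitting matters when the data is stored in some order (for example, sorted by price); without it, the test set could contain only one kind of house.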

Step 5: Model Training

Train your ML model on the training data. The algorithm learns patterns from the examples:

model_training.py
    # Step 5: Train Model
    from sklearn.linear_model import LinearRegression
    import numpy as np

    # Training data
    X_train = np.array([[1000, 2], [1500, 3], [2000, 4]])
    y_train = np.array([200000, 300000, 400000])

    print("Model Training:")
    print("=" * 50)

    # Create and train model
    model = LinearRegression()
    model.fit(X_train, y_train)

    print("Model trained successfully!")
    print(f"Learned coefficients: {model.coef_}")
    print(f"Intercept: {model.intercept_}")
    print("\nThe model has learned the relationship between features and price!")
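To see what "learning patterns" means for a linear model, the sketch below fits a straight line by ordinary least squares using the closed-form formulas for slope and intercept. It uses only house size as a feature, a simplification chosen for illustration; `fit_line` is a hypothetical helper, not part of scikit-learn.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Slope = covariance of x and y divided by variance of x
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance

    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept

sizes = [1000, 1500, 2000]
prices = [200000, 300000, 400000]

slope, intercept = fit_line(sizes, prices)
print(f"Learned price per sqft: {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
```

For this tiny dataset the prices lie exactly on a line, so the fit recovers a slope of 200 per square foot and an intercept of 0. Real data is noisier, and the fitted line is then the one that minimizes the total squared error.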

Step 6: Model Evaluation

Evaluate how well your model performs on the test set (unseen data):

model_evaluation.py
    # Step 6: Evaluate Model
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    import numpy as np

    # Train model (same training data as in Step 5)
    X_train = np.array([[1000, 2], [1500, 3], [2000, 4]])
    y_train = np.array([200000, 300000, 400000])
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Test data (not seen during training)
    X_test = np.array([[1200, 2]])
    y_test = np.array([250000])

    # Make predictions
    y_pred = model.predict(X_test)

    print("Model Evaluation:")
    print("=" * 50)
    print(f"Actual price: ${y_test[0]:,}")
    print(f"Predicted price: ${y_pred[0]:,.0f}")

    # Calculate error
    mse = mean_squared_error(y_test, y_pred)
    print(f"\nMean Squared Error: {mse:,.0f}")
    print("Lower error = better model!")
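Mean squared error is reported in squared units (here, dollars squared), which can be hard to interpret. The sketch below computes a few related regression metrics by hand so the formulas are visible: mean absolute error, root mean squared error, and R². The actual and predicted prices are made-up values for illustration, not output from the model above.

```python
import math

# Illustrative values only (not produced by a real model)
y_true = [250000, 310000, 405000]
y_pred = [240000, 300000, 400000]

n = len(y_true)
errors = [actual - predicted for actual, predicted in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # average size of the error, in dollars
mse = sum(e ** 2 for e in errors) / n   # penalizes large errors more heavily
rmse = math.sqrt(mse)                   # back in dollars, comparable to MAE

# R^2: share of the variance in y_true that the predictions explain
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((actual - mean_true) ** 2 for actual in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:,.0f}")
print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R^2:  {r2:.3f}")
```

MAE and RMSE are in the same units as the target, so they are often easier to explain to others than MSE.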

Step 7: Model Deployment

Once your model performs well, deploy it to make predictions on new, real-world data:

model_deployment.py
    # Step 7: Deploy Model
    # Trained model (simplified)
    def predict_house_price(size, bedrooms):
        # This would use your trained model
        # Simplified for demonstration
        base_price = 100000
        price_per_sqft = 100
        bedroom_value = 20000
        return base_price + (size * price_per_sqft) + (bedrooms * bedroom_value)

    print("Model Deployment:")
    print("=" * 50)

    # Use model to make predictions
    new_houses = [
        {"size": 1800, "bedrooms": 3},
        {"size": 2200, "bedrooms": 4}
    ]

    print("Predictions for New Houses:")
    for house in new_houses:
        price = predict_house_price(house["size"], house["bedrooms"])
        print(f" {house['size']} sqft, {house['bedrooms']} bedrooms: ${price:,.0f}")

    print("\nModel is now deployed and making real predictions!")
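In practice, deployment usually means saving what the model learned during training and loading it again wherever predictions are served. The sketch below illustrates that save-and-load round trip with JSON, reusing the simplified parameters from the function above. The `load_model` helper and the parameter names are assumptions for illustration; real projects typically rely on the persistence tools that come with their ML library.

```python
import json

# Parameters a training step might have produced
# (values taken from the simplified function above)
trained_params = {
    "base_price": 100000,
    "price_per_sqft": 100,
    "bedroom_value": 20000,
}

# "Save" the model: serialize the parameters to a string.
# A real project would write this to a file or a model store.
saved_model = json.dumps(trained_params)

def load_model(serialized):
    """Rebuild a prediction function from serialized parameters."""
    params = json.loads(serialized)

    def predict(size, bedrooms):
        return (
            params["base_price"]
            + size * params["price_per_sqft"]
            + bedrooms * params["bedroom_value"]
        )

    return predict

# "Load" the model where predictions are needed
predict_house_price = load_model(saved_model)
print(f"1800 sqft, 3 bedrooms: ${predict_house_price(1800, 3):,.0f}")
```

Separating the saved parameters from the prediction code means the model can be retrained and replaced without changing the code that serves predictions.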

The Complete Workflow Summary

Here's the complete ML workflow in one place:

complete_workflow.py
    # Complete ML Workflow
    workflow_steps = [
        ("1. Problem Definition", "Define what you want to predict"),
        ("2. Data Collection", "Gather relevant data"),
        ("3. Data Preprocessing", "Clean and prepare data"),
        ("4. Train-Test Split", "Separate training and testing data"),
        ("5. Model Training", "Train algorithm on training data"),
        ("6. Model Evaluation", "Test on unseen data"),
        ("7. Model Deployment", "Use model for real predictions")
    ]

    print("Complete ML Workflow:")
    print("=" * 60)
    for step, description in workflow_steps:
        print(f"{step}")
        print(f" → {description}")
        print()

    print("This workflow applies to ALL ML projects!")

📝 Exercise: Complete the ML Workflow

In this exercise, you'll practice going through the complete ML workflow:

  • Step 1: Define a problem (e.g., predict house prices or classify emails)
  • Step 2: Create sample data for your problem
  • Step 3: Extract features (X) and target (y) from your data
  • Step 4: Split data into training and testing sets
  • Step 5: Train a simple model (calculate average or pattern)
  • Step 6: Evaluate your model on test data
  • Step 7: Create a prediction function for deployment

This exercise walks you through the complete workflow. Start with a simple problem like predicting house prices, then work through each step. Don't worry about perfect code—focus on understanding the process!
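One possible way to work through these seven steps, using only plain Python, is sketched below. The data values are made up, the split simply holds out the last house instead of shuffling, and the "model" is just an average price per square foot, in the spirit of the "calculate average or pattern" hint. `predict_price` is a hypothetical helper name.

```python
# Steps 1-2: the problem is predicting price from size; sample data (made-up values)
houses = [
    {"size": 1000, "price": 200000},
    {"size": 1500, "price": 300000},
    {"size": 2000, "price": 400000},
    {"size": 1200, "price": 250000},
    {"size": 1800, "price": 350000},
]

# Step 3: extract the feature (X) and the target (y)
X = [house["size"] for house in houses]
y = [house["price"] for house in houses]

# Step 4: hold out the last house for testing (no shuffling, to keep it simple)
X_train, X_test = X[:-1], X[-1:]
y_train, y_test = y[:-1], y[-1:]

# Step 5: "train" by computing the average price per square foot
price_per_sqft = sum(y_train) / sum(X_train)

# Step 7: prediction function that uses what was learned in Step 5
def predict_price(size):
    return size * price_per_sqft

# Step 6: evaluate on the held-out house with mean absolute error
errors = [abs(predict_price(size) - actual) for size, actual in zip(X_test, y_test)]
mae = sum(errors) / len(errors)

print(f"Learned price per sqft: {price_per_sqft:.2f}")
print(f"Mean absolute error on test data: ${mae:,.0f}")
print(f"Prediction for 1600 sqft: ${predict_price(1600):,.0f}")
```

A baseline this simple is also useful later on: any more sophisticated model should at least beat it on the test data.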

💡 Important Note

The workflow is iterative! You'll often go back to earlier steps: maybe you need more data, the preprocessing needs adjustment, or a different model is worth trying. Don't expect to go through it once and be done.
