Chapter 6: Regression Models / Lesson 26

Linear Regression

Understanding Linear Regression

Linear regression is one of the most fundamental and widely used machine learning algorithms. It's used to predict continuous numerical values based on input features. The name "linear" comes from the fact that it assumes a linear relationship between the input features and the output.

Think of linear regression as finding the best straight line through your data points. This line can then be used to make predictions for new data points.

How Linear Regression Works

Linear regression finds the line that minimizes the sum of the squared vertical distances (the residuals) between the line and the data points. The equation of a line is:

y = mx + b

Where:

  • y is the predicted output (dependent variable)
  • x is the input feature (independent variable)
  • m is the slope (coefficient) - how much y changes for each unit change in x
  • b is the intercept - the value of y when x is 0

In machine learning terminology, m is called the coefficient (or weight) and b the intercept (or bias).
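
To make "best line" concrete, here is a minimal sketch (not part of the lesson's examples) that computes the least-squares slope and intercept by hand. The slope is the covariance of x and y divided by the variance of x, and the intercept is chosen so the line passes through the point of means:

least_squares.py
# Computing the least-squares line by hand
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)

# m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
# b = mean(y) - m * mean(x)
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

print(f"slope m = {m}, intercept b = {b}")  # slope m = 2.0, intercept b = 0.0

These are exactly the values scikit-learn's LinearRegression recovers on the same data later in this lesson.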

Simple Example

Let's say you want to predict house prices based on size. You have data showing that larger houses cost more. Linear regression will find the best line that describes this relationship:

house_prices.py
# Example: House size (sq ft) vs Price
# Data: (size, price)
# (1000, 200000), (1500, 300000), (2000, 400000)
# The relationship appears to be: price = 200 * size
# This is a linear relationship!
sizes = [1000, 1500, 2000]
prices = [200000, 300000, 400000]

# Linear regression will learn: coefficient = 200, intercept = 0
print("House sizes:", sizes)
print("House prices:", prices)
print("\nLinear relationship: price = 200 * size")

Using scikit-learn

Scikit-learn makes it easy to implement linear regression. The LinearRegression class handles all the complex math for you:

sklearn_linear.py
# Linear regression with scikit-learn
from sklearn.linear_model import LinearRegression
import numpy as np

# Prepare data (X must be a 2D array)
X = np.array([[1], [2], [3], [4], [5]])  # Features
y = np.array([2, 4, 6, 8, 10])           # Target values

# Create and train the model
model = LinearRegression()
model.fit(X, y)

# Access learned parameters
print("Coefficient (slope):", model.coef_[0])
print("Intercept:", model.intercept_)

# Make predictions
new_X = np.array([[6], [7], [8]])
predictions = model.predict(new_X)
print("\nPredictions:")
for x, pred in zip(new_X, predictions):
    print(f"x={x[0]}, predicted y={pred:.2f}")
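
Since this toy data lies exactly on the line y = 2x, the model learns a coefficient of 2.0 and an intercept of 0.0, and the predictions for x = 6, 7, and 8 come out as 12, 14, and 16.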

💡 Key Methods and Attributes

  • fit(X, y) - Trains the model on your data
  • predict(X) - Makes predictions on new data
  • coef_ - The learned coefficient(s); one slope per feature
  • intercept_ - The learned intercept

Multiple Features

Linear regression can handle multiple input features. This is called multiple linear regression: the model learns one coefficient per feature, so the equation becomes y = m1*x1 + m2*x2 + ... + mn*xn + b:

multiple_features.py
# Multiple linear regression
from sklearn.linear_model import LinearRegression
import numpy as np

# Multiple features: [size, age, rooms]
X = np.array([
    [1000, 5, 2],
    [1500, 3, 3],
    [2000, 1, 4],
    [1200, 4, 2],
    [1800, 2, 3]
])

# Target: house price
y = np.array([200000, 300000, 400000, 250000, 350000])

# Train model
model = LinearRegression()
model.fit(X, y)

print("Coefficients for each feature:", model.coef_)
print("Intercept:", model.intercept_)

# Predict price for new house
new_house = np.array([[1600, 2, 3]])
predicted_price = model.predict(new_house)
print("\nPredicted price:", predicted_price[0])
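
A note on reading the output: each coefficient tells you how much the predicted price changes when that feature increases by one unit while the other features stay fixed. Because the features here live on very different scales (square feet vs. number of rooms), the raw coefficient sizes are not directly comparable with each other.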

Evaluating Model Performance

It's important to evaluate how well your model performs on data it hasn't seen. Two common metrics are Mean Squared Error (MSE), the average of the squared differences between predicted and actual values (lower is better), and R-squared, the fraction of the target's variance the model explains (1.0 is a perfect fit):

evaluation.py
# Evaluating a linear regression model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Training data
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])

# Test data
X_test = np.array([[6], [7], [8]])
y_test = np.array([12, 14, 16])

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared score:", r2)

print("\nPredictions vs Actual:")
for actual, pred in zip(y_test, y_pred):
    print(f"Actual: {actual}, Predicted: {pred:.2f}")
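
Because the toy data above lies exactly on the line y = 2x, the MSE comes out to essentially 0 and R-squared to 1. In practice, you would usually split a single dataset into training and test sets rather than hand-crafting both. Here is a minimal sketch using scikit-learn's train_test_split (the 80/20 split and fixed random_state are arbitrary choices for illustration):

train_test.py
# Splitting one dataset into train and test sets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[i] for i in range(1, 11)])    # features 1..10
y = np.array([2 * i for i in range(1, 11)])  # targets follow y = 2x

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))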

Real-World Applications

Linear regression is used in many real-world scenarios:

  • Sales Forecasting: Predict sales based on advertising spend
  • Price Prediction: Predict house prices, car prices, etc.
  • Risk Analysis: Predict insurance claims based on customer data
  • Medical: Predict patient outcomes based on treatment data

Assumptions of Linear Regression

Linear regression works best when:

  • The relationship between features and target is linear
  • There's little or no multicollinearity (features aren't highly correlated)
  • Errors are normally distributed
  • There's homoscedasticity (constant variance of errors)

When these assumptions aren't met, you might need alternatives such as polynomial regression (for non-linear relationships) or regularization techniques (which help when features are highly correlated), as sketched below.
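
To give a flavor of one such alternative, here is a minimal polynomial regression sketch (assuming scikit-learn, with made-up data; degree=2 is an illustrative choice). PolynomialFeatures expands the inputs with squared terms, and an ordinary LinearRegression is then fit on the expanded features:

polynomial.py
# Polynomial regression: fit a curve when the relationship isn't linear
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])  # y = x**2, clearly not a straight line

# Expand features with squared terms (degree=2 is an assumption)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

model = LinearRegression()
model.fit(X_poly, y)

# Predict for x = 6; should be close to 36
print(model.predict(poly.transform([[6]])))

Regularized models such as Ridge and Lasso (also in sklearn.linear_model) follow the same fit/predict pattern as LinearRegression.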

🎉

Lesson Complete!

Great work! Continue to the next lesson.
