Chapter 6: Regression Models / Lesson 26

Linear Regression

Understanding Linear Regression

Linear regression is one of the most fundamental and widely used machine learning algorithms. It's used to predict continuous numerical values based on input features. The name "linear" comes from the fact that it assumes a linear relationship between the input features and the output.

Think of linear regression as finding the best straight line through your data points. This line can then be used to make predictions for new data points.

How Linear Regression Works

Linear regression finds the line that minimizes the sum of the squared vertical distances (the residuals) between the line and the data points. The equation of a line is:

y = mx + b

Where:

  • y is the predicted output (dependent variable)
  • x is the input feature (independent variable)
  • m is the slope (coefficient) - how much y changes for each unit change in x
  • b is the intercept - the value of y when x is 0

In machine learning terminology, m is called the coefficient (or weight) and b the intercept (or bias).
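
To make "best line" concrete, here is a minimal sketch (not part of the lesson's examples) that computes the least-squares slope and intercept by hand. The slope is the covariance of x and y divided by the variance of x, and the intercept is chosen so the line passes through the point of means:

least_squares.py
# Computing the least-squares line by hand
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)

# m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
# b = mean(y) - m * mean(x)
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

print(f"slope m = {m}, intercept b = {b}")  # slope m = 2.0, intercept b = 0.0

These are exactly the values scikit-learn's LinearRegression recovers on the same data later in this lesson.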

Simple Example

Let's say you want to predict house prices based on size. You have data showing that larger houses cost more. Linear regression will find the best line that describes this relationship:

house_prices.py
# Example: House size (sq ft) vs Price
# Data: (size, price)
# (1000, 200000), (1500, 300000), (2000, 400000)
# The relationship appears to be: price = 200 * size
# This is a linear relationship!
sizes = [1000, 1500, 2000]
prices = [200000, 300000, 400000]

# Linear regression will learn: coefficient = 200, intercept = 0
print("House sizes:", sizes)
print("House prices:", prices)
print("\nLinear relationship: price = 200 * size")

Using scikit-learn

Scikit-learn makes it easy to implement linear regression. The LinearRegression class handles all the complex math for you:

sklearn_linear.py
# Linear regression with scikit-learn
from sklearn.linear_model import LinearRegression
import numpy as np

# Prepare data (X must be a 2D array)
X = np.array([[1], [2], [3], [4], [5]])  # Features
y = np.array([2, 4, 6, 8, 10])           # Target values

# Create and train the model
model = LinearRegression()
model.fit(X, y)

# Access learned parameters
print("Coefficient (slope):", model.coef_[0])
print("Intercept:", model.intercept_)

# Make predictions
new_X = np.array([[6], [7], [8]])
predictions = model.predict(new_X)
print("\nPredictions:")
for x, pred in zip(new_X, predictions):
    print(f"x={x[0]}, predicted y={pred:.2f}")
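
Since this toy data lies exactly on the line y = 2x, the model learns a coefficient of 2.0 and an intercept of 0.0, and the predictions for x = 6, 7, and 8 come out as 12, 14, and 16.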

💡 Key Methods and Attributes

  • fit(X, y) - Trains the model on your data
  • predict(X) - Makes predictions on new data
  • coef_ - The learned coefficient(s); one slope per feature
  • intercept_ - The learned intercept

Multiple Features

Linear regression can handle multiple input features. This is called multiple linear regression: the model learns one coefficient per feature, so the equation becomes y = m1*x1 + m2*x2 + ... + mn*xn + b:

multiple_features.py
# Multiple linear regression
from sklearn.linear_model import LinearRegression
import numpy as np

# Multiple features: [size, age, rooms]
X = np.array([
    [1000, 5, 2],
    [1500, 3, 3],
    [2000, 1, 4],
    [1200, 4, 2],
    [1800, 2, 3]
])

# Target: house price
y = np.array([200000, 300000, 400000, 250000, 350000])

# Train model
model = LinearRegression()
model.fit(X, y)

print("Coefficients for each feature:", model.coef_)
print("Intercept:", model.intercept_)

# Predict price for new house
new_house = np.array([[1600, 2, 3]])
predicted_price = model.predict(new_house)
print("\nPredicted price:", predicted_price[0])
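
A note on reading the output: each coefficient tells you how much the predicted price changes when that feature increases by one unit while the other features stay fixed. Because the features here live on very different scales (square feet vs. number of rooms), the raw coefficient sizes are not directly comparable with each other.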

Evaluating Model Performance

It's important to evaluate how well your model performs on data it hasn't seen. Two common metrics are Mean Squared Error (MSE), the average of the squared differences between predicted and actual values (lower is better), and R-squared, the fraction of the target's variance the model explains (1.0 is a perfect fit):

evaluation.py
# Evaluating a linear regression model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Training data
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])

# Test data
X_test = np.array([[6], [7], [8]])
y_test = np.array([12, 14, 16])

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared score:", r2)

print("\nPredictions vs Actual:")
for actual, pred in zip(y_test, y_pred):
    print(f"Actual: {actual}, Predicted: {pred:.2f}")
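
Because the toy data above lies exactly on the line y = 2x, the MSE comes out to essentially 0 and R-squared to 1. In practice, you would usually split a single dataset into training and test sets rather than hand-crafting both. Here is a minimal sketch using scikit-learn's train_test_split (the 80/20 split and fixed random_state are arbitrary choices for illustration):

train_test.py
# Splitting one dataset into train and test sets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[i] for i in range(1, 11)])    # features 1..10
y = np.array([2 * i for i in range(1, 11)])  # targets follow y = 2x

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))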

Real-World Applications

Linear regression is used in many real-world scenarios:

  • Sales Forecasting: Predict sales based on advertising spend
  • Price Prediction: Predict house prices, car prices, etc.
  • Risk Analysis: Predict insurance claims based on customer data
  • Medical: Predict patient outcomes based on treatment data

Assumptions of Linear Regression

Linear regression works best when:

  • The relationship between features and target is linear
  • There's little or no multicollinearity (features aren't highly correlated)
  • Errors are normally distributed
  • There's homoscedasticity (constant variance of errors)

When these assumptions aren't met, you might need alternatives such as polynomial regression (for non-linear relationships) or regularization techniques (which help when features are highly correlated), as sketched below.
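
To give a flavor of one such alternative, here is a minimal polynomial regression sketch (assuming scikit-learn, with made-up data; degree=2 is an illustrative choice). PolynomialFeatures expands the inputs with squared terms, and an ordinary LinearRegression is then fit on the expanded features:

polynomial.py
# Polynomial regression: fit a curve when the relationship isn't linear
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])  # y = x**2, clearly not a straight line

# Expand features with squared terms (degree=2 is an assumption)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

model = LinearRegression()
model.fit(X_poly, y)

# Predict for x = 6; should be close to 36
print(model.predict(poly.transform([[6]])))

Regularized models such as Ridge and Lasso (also in sklearn.linear_model) follow the same fit/predict pattern as LinearRegression.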

🎉

Lesson Complete!

Great work! Continue to the next lesson.
