Spam Classifier Project

🎯 Project: Spam Email Classifier

In this project, you'll build a spam email classifier that can distinguish between spam and legitimate emails. This is a classic binary classification problem that demonstrates text processing, feature extraction, and model training.

You'll work through the complete ML pipeline: data preparation, text vectorization, model training, and evaluation.

Project Goal

Build a classifier that can automatically identify spam emails. The model should learn patterns from example emails and correctly classify new emails as spam or not spam.

project_overview.py
# Spam Classifier Project Overview

# Training emails
emails = [
    "Win money now! Click here!",  # Spam
    "Meeting at 3pm tomorrow",      # Not spam
    "Free prize! Claim now!",       # Spam
    "Project update: Status report"  # Not spam
]

labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

print("Training Data:")
for email, label in zip(emails, labels):
    status = "Spam" if label == 1 else "Not Spam"
    print(f"  {status}: {email}")

Step 1: Text Vectorization

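Machine learning models need numbers, not text. We convert each email into a numerical feature vector, using either raw word counts (scikit-learn's CountVectorizer) or TF-IDF weights (TfidfVectorizer), which down-weight words that appear in many emails: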

vectorization.py
# Converting text to numbers
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "win money now",
    "meeting tomorrow",
    "free prize click"
]

# Create vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

print("Vocabulary (words):", vectorizer.get_feature_names_out())
print("\nVectorized emails (word counts):")
print(X.toarray())

print("\nEach row is an email; each column holds the count of one vocabulary word.")

Step 2: Train the Model

Use a classification algorithm such as Multinomial Naive Bayes. It is a common baseline for text because it models word counts directly and trains quickly, even on small datasets:

train_model.py
# Training the spam classifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data
emails = [
    "win money now",
    "meeting tomorrow",
    "free prize click",
    "project status update"
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Vectorize
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train model
model = MultinomialNB()
model.fit(X, labels)

print("Model trained successfully!")
print(f"Learned from {len(emails)} emails")

Step 3: Make Predictions

Use the trained model to classify new emails:

predictions.py
# Making predictions on new emails
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train model (a smaller training set than in Step 2, to keep the example short)
emails = ["win money", "meeting", "free prize"]
labels = [1, 0, 1]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB()
model.fit(X, labels)

# New emails to classify
new_emails = ["free money now", "team meeting friday"]
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)
probabilities = model.predict_proba(X_new)

print("Email Classifications:")
for email, pred, prob in zip(new_emails, predictions, probabilities):
    status = "SPAM" if pred == 1 else "NOT SPAM"
    spam_prob = prob[1]
    print(f"  '{email}' → {status} (spam probability: {spam_prob:.2f})")
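
New emails must always go through the same fitted vectorizer as the training data (note the call to `transform`, not `fit_transform`, above). One way to make that hard to get wrong is to bundle both steps with scikit-learn's `make_pipeline`. This is a sketch of that approach; the name `spam_filter` is just an illustrative choice.

```python
# Bundling vectorizer and classifier so raw text goes in directly
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win money", "meeting", "free prize"]
labels = [1, 0, 1]  # 1 = spam, 0 = not spam

# fit() vectorizes the emails and trains the model in one call
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

# predict() reuses the fitted vocabulary automatically
new_emails = ["free money now", "team meeting friday"]
predictions = spam_filter.predict(new_emails)

for email, pred in zip(new_emails, predictions):
    status = "SPAM" if pred == 1 else "NOT SPAM"
    print(f"  '{email}' → {status}")
```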

Step 4: Evaluate Performance

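Measure how well your classifier performs on emails it did not see during training. Accuracy is the fraction of all emails classified correctly, precision is the fraction of emails flagged as spam that really are spam, and recall is the fraction of actual spam that gets caught: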

evaluation.py
# Evaluating the spam classifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

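# Training data
train_emails = [
    "win money now",
    "meeting tomorrow",
    "free prize click",
    "project status update"
]
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Test data: held-out emails the model did not see during training.
# Each one shares at least one word with the training vocabulary;
# words the vectorizer has never seen (like "here") are ignored.
test_emails = ["click here", "project update"]
test_labels = [1, 0]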

# Train
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)
model = MultinomialNB()
model.fit(X_train, train_labels)

# Test
X_test = vectorizer.transform(test_emails)
predictions = model.predict(X_test)

# Evaluate
accuracy = accuracy_score(test_labels, predictions)
precision = precision_score(test_labels, predictions)
recall = recall_score(test_labels, predictions)

print("Model Performance:")
print(f"  Accuracy: {accuracy:.2f}")
print(f"  Precision: {precision:.2f}")
print(f"  Recall: {recall:.2f}")
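
A confusion matrix shows where these numbers come from. The sketch below uses made-up labels and predictions (they are not output from the classifier above) so the counts are easy to check by hand.

```python
# Relating precision and recall to the confusion matrix
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions (1 = spam, 0 = not spam)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(f"True negatives:  {tn}")
print(f"False positives: {fp}  (legitimate emails flagged as spam)")
print(f"False negatives: {fn}  (spam that slipped through)")
print(f"True positives:  {tp}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
```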

Project Challenges

As you build this classifier, you'll encounter:

Text Preprocessing: Handling different cases, punctuation, stop words
Feature Engineering: Choosing between CountVectorizer, TF-IDF, or word embeddings
Model Selection: Trying different algorithms (Naive Bayes, SVM, Random Forest)
Imbalanced Data: Handling cases where one class greatly outnumbers the other (for example, far more legitimate emails than spam)
False Positives: Important emails incorrectly marked as spam
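
One possible way to reduce false positives is to flag an email only when the model is quite confident, instead of using the default decision that picks the more likely class. The sketch below assumes a hypothetical cutoff, `SPAM_THRESHOLD = 0.8`; a real value would have to be tuned on held-out data, and a stricter cutoff lets more spam through.

```python
# Requiring higher confidence before marking an email as spam
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win money now",
    "meeting tomorrow",
    "free prize click",
    "project status update",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

SPAM_THRESHOLD = 0.8  # hypothetical cutoff, chosen only for illustration

new_emails = ["free prize money", "free prize tomorrow"]

# Column of predict_proba that holds the spam (label 1) probability
spam_column = list(spam_filter.classes_).index(1)
spam_probs = spam_filter.predict_proba(new_emails)[:, spam_column]

default = spam_filter.predict(new_emails)
flagged = [int(p >= SPAM_THRESHOLD) for p in spam_probs]

for email, prob, d, f in zip(new_emails, spam_probs, default, flagged):
    print(f"  '{email}': p(spam)={prob:.2f}, default={d}, thresholded={f}")
```

The second email mixes spam-like words with a word seen only in legitimate mail, so the default rule flags it while the stricter cutoff lets it through.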

💡 Project Tips

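Start with a simple CountVectorizer and Naive Bayes. Once that works, experiment with TF-IDF and try different models. CountVectorizer already lowercases text and drops punctuation by default, so extra preprocessing mostly means steps such as stop-word removal. Change one thing at a time and re-evaluate after each change so you know what actually helped.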
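
As a rough sketch of that experiment loop, the code below fits a few candidate pipelines on the same toy data and scores each on the same held-out emails. The particular combinations and the `candidates` dictionary are illustrative assumptions, and with only two test emails the scores say very little; the point is the structure, which carries over to a real dataset.

```python
# Comparing a few vectorizer/model combinations on the same split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_emails = [
    "win money now",
    "meeting tomorrow",
    "free prize click",
    "project status update",
]
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

test_emails = ["click here", "project update"]
test_labels = [1, 0]

# Hypothetical candidates; swap in whatever you want to try
candidates = {
    "counts + Naive Bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "TF-IDF + Naive Bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "TF-IDF + linear SVM": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

scores = {}
for name, pipeline in candidates.items():
    pipeline.fit(train_emails, train_labels)
    # score() returns accuracy on the held-out emails
    scores[name] = pipeline.score(test_emails, test_labels)
    print(f"  {name}: accuracy {scores[name]:.2f}")
```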
