Chapter 7: Classification Models / Lesson 35

Spam Classifier Project

🎯 Project: Spam Email Classifier

In this project, you'll build a spam email classifier that can distinguish between spam and legitimate emails. This is a classic binary classification problem that demonstrates text processing, feature extraction, and model training.

You'll work through the complete ML pipeline: data preparation, text vectorization, model training, and evaluation.

Project Goal

Build a classifier that can automatically identify spam emails. The model should learn patterns from example emails and correctly classify new emails as spam or not spam.

project_overview.py
# Spam Classifier Project Overview

# Training emails
emails = [
    "Win money now! Click here!",      # Spam
    "Meeting at 3pm tomorrow",         # Not spam
    "Free prize! Claim now!",          # Spam
    "Project update: Status report"    # Not spam
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

print("Training Data:")
for email, label in zip(emails, labels):
    status = "Spam" if label == 1 else "Not Spam"
    print(f"  {status}: {email}")

Step 1: Text Vectorization

Machine learning models need numbers, not text. We convert emails into numerical features using techniques like CountVectorizer or TF-IDF:

vectorization.py
# Converting text to numbers
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "win money now",
    "meeting tomorrow",
    "free prize click"
]

# Create vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

print("Vocabulary (words):", vectorizer.get_feature_names_out())
print("\nVectorized emails (word counts):")
print(X.toarray())
print("\nEach row is an email, each column is a word count.")
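
The code above uses raw word counts. If you want to try the TF-IDF weighting mentioned earlier, a minimal sketch is below: it swaps CountVectorizer for scikit-learn's TfidfVectorizer, which downweights words that appear in many emails. The data is the same toy set as above, and the filename is just a suggestion.

tfidf_vectorization.py
# TF-IDF weighting (illustrative alternative to CountVectorizer)
from sklearn.feature_extraction.text import TfidfVectorizer

emails = [
    "win money now",
    "meeting tomorrow",
    "free prize click"
]

# TF-IDF gives each word a weight instead of a raw count,
# downweighting words that appear in many emails
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

print("Vocabulary (words):", vectorizer.get_feature_names_out())
print("\nTF-IDF weights (one row per email):")
print(X.toarray().round(2))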

Step 2: Train the Model

Use a classification algorithm like Naive Bayes, which works well with text data:

train_model.py
# Training the spam classifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data
emails = [
    "win money now",
    "meeting tomorrow",
    "free prize click",
    "project status update"
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Vectorize
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train model
model = MultinomialNB()
model.fit(X, labels)

print("Model trained successfully!")
print(f"Learned from {len(emails)} emails")

Step 3: Make Predictions

Use the trained model to classify new emails:

predictions.py
# Making predictions on new emails
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train model (same as before)
emails = ["win money", "meeting", "free prize"]
labels = [1, 0, 1]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB()
model.fit(X, labels)

# New emails to classify
new_emails = ["free money now", "team meeting friday"]
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)
probabilities = model.predict_proba(X_new)

print("Email Classifications:")
for email, pred, prob in zip(new_emails, predictions, probabilities):
    status = "SPAM" if pred == 1 else "NOT SPAM"
    spam_prob = prob[1]
    print(f"  '{email}' → {status} (spam probability: {spam_prob:.2f})")

Step 4: Evaluate Performance

Measure how well your classifier performs using metrics like accuracy, precision, and recall:

evaluation.py
# Evaluating the spam classifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Training data
train_emails = ["win money", "meeting", "free prize"]
train_labels = [1, 0, 1]

# Test data
test_emails = ["click here", "project update"]
test_labels = [1, 0]

# Train
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)
model = MultinomialNB()
model.fit(X_train, train_labels)

# Test
X_test = vectorizer.transform(test_emails)
predictions = model.predict(X_test)

# Evaluate
accuracy = accuracy_score(test_labels, predictions)
precision = precision_score(test_labels, predictions)
recall = recall_score(test_labels, predictions)

print("Model Performance:")
print(f"  Accuracy: {accuracy:.2f}")
print(f"  Precision: {precision:.2f}")
print(f"  Recall: {recall:.2f}")
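
Single-number scores can hide where the mistakes happen. If you want to see false positives and false negatives directly, one option is scikit-learn's confusion_matrix. The sketch below reuses the tiny training and test sets from evaluation.py; the filename is just a suggestion.

confusion_matrix_check.py
# Counting false positives and false negatives (illustrative)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# Same toy data as evaluation.py
train_emails = ["win money", "meeting", "free prize"]
train_labels = [1, 0, 1]
test_emails = ["click here", "project update"]
test_labels = [1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)
model = MultinomialNB()
model.fit(X_train, train_labels)

predictions = model.predict(vectorizer.transform(test_emails))

# Rows are true labels, columns are predictions:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(test_labels, predictions, labels=[0, 1]))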

Project Challenges

As you build this classifier, you'll encounter:

  • Text Preprocessing: Handling different cases, punctuation, and stop words (see the sketch after this list)
  • Feature Engineering: Choosing between CountVectorizer, TF-IDF, or word embeddings
  • Model Selection: Trying different algorithms (Naive Bayes, SVM, Random Forest)
  • Imbalanced Data: Handling cases where one class (spam or not spam) far outnumbers the other
  • False Positives: Important emails incorrectly marked as spam
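
One simple way to approach the text preprocessing challenge is to normalize each email before vectorizing it. The sketch below is illustrative: the preprocess helper is our own name, and it only lowercases text and strips punctuation. Stop-word removal can also be handled by the vectorizer itself (for example, CountVectorizer(stop_words="english")).

text_preprocessing.py
# Normalizing emails before vectorization (illustrative)
import string

def preprocess(text):
    """Lowercase the text and strip punctuation before vectorizing."""
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))

raw_emails = [
    "Win MONEY now!!! Click here!",
    "Meeting at 3pm tomorrow."
]

cleaned = [preprocess(email) for email in raw_emails]
for before, after in zip(raw_emails, cleaned):
    print(f"'{before}' -> '{after}'")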

💡 Project Tips

Start with a simple CountVectorizer and Naive Bayes. Once that works, experiment with TF-IDF, try different models, and add text preprocessing (lowercasing, removing punctuation). Iterate and improve!
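
As one way to iterate, you can wrap vectorization and the classifier in a scikit-learn Pipeline so the whole model trains and predicts as a single object. The sketch below is an illustration with a made-up toy dataset; swapping TfidfVectorizer for CountVectorizer, or MultinomialNB for another classifier, is then a one-line change.

spam_pipeline.py
# Chaining vectorization and classification (illustrative)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (replace with your own labeled emails)
emails = [
    "win money now click here",
    "meeting at 3pm tomorrow",
    "free prize claim now",
    "project update status report"
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# TfidfVectorizer lowercases text by default, which covers part of
# the preprocessing step
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", MultinomialNB())
])

pipeline.fit(emails, labels)

new_emails = ["free money click now", "status report for the project"]
print(pipeline.predict(new_emails))  # 1 = spam, 0 = not spam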

🎉

Lesson Complete!

Great work! Continue to the next lesson.
