🎯 Project: Spam Email Classifier
In this project, you'll build a spam email classifier that can distinguish between spam and legitimate emails. This is a classic binary classification problem that demonstrates text processing, feature extraction, and model training.
You'll work through the complete ML pipeline: data preparation, text vectorization, model training, and evaluation.
Project Goal
Build a classifier that can automatically identify spam emails. The model should learn patterns from example emails and correctly classify new emails as spam or not spam.
emails = [
    "Win money now! Click here!",
    "Meeting at 3pm tomorrow",
    "Free prize! Claim now!",
    "Project update: Status report"
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam
print("Training Data:")
for email, label in zip(emails, labels):
    status = "Spam" if label == 1 else "Not Spam"
    print(f" {status}: {email}")
Step 1: Text Vectorization
Machine learning models need numbers, not text. We convert emails into numerical features using techniques like bag-of-words counts (scikit-learn's CountVectorizer) or TF-IDF weighting:
from sklearn.feature_extraction.text import CountVectorizer
emails = [
    "win money now",
    "meeting tomorrow",
    "free prize click"
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
print("Vocabulary (words):", vectorizer.get_feature_names_out())
print("\nVectorized emails (word counts):")
print(X.toarray())
print("\nEach row is an email, each column is a word count.")
Step 2: Train the Model
Use a classification algorithm like Multinomial Naive Bayes, which is designed for count features and works well with text data:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
emails = [
    "win money now",
    "meeting tomorrow",
    "free prize click",
    "project status update"
]
labels = [1, 0, 1, 0]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB()
model.fit(X, labels)
print("Model trained successfully!")
print(f"Learned from {len(emails)} emails")
Step 3: Make Predictions
Use the trained model to classify new emails:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
emails = ["win money", "meeting", "free prize"]
labels = [1, 0, 1]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB()
model.fit(X, labels)
new_emails = ["free money now", "team meeting friday"]
# Use transform (not fit_transform) so new emails map onto the training vocabulary
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)
probabilities = model.predict_proba(X_new)
print("Email Classifications:")
for email, pred, prob in zip(new_emails, predictions, probabilities):
    status = "SPAM" if pred == 1 else "NOT SPAM"
    spam_prob = prob[1]  # columns follow model.classes_, so index 1 is spam
    print(f" '{email}' → {status} (spam probability: {spam_prob:.2f})")
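Keeping the vectorizer and the model in sync by hand is easy to get wrong. One possible way to avoid that is scikit-learn's `make_pipeline`, which bundles both steps into a single object. The sketch below uses the same toy data as Step 3; the name `spam_filter` is just an illustrative choice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win money", "meeting", "free prize"]
labels = [1, 0, 1]  # 1 = spam, 0 = not spam

# The pipeline vectorizes and classifies in one call, so new emails
# are always transformed with the vocabulary learned during fit
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

new_emails = ["free money now", "team meeting friday"]
predictions = spam_filter.predict(new_emails)

for email, pred in zip(new_emails, predictions):
    print(f"'{email}' -> {'SPAM' if pred == 1 else 'NOT SPAM'}")
```

The pipeline accepts raw strings directly, which also makes it simpler to save and reuse later.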
Step 4: Evaluate Performance
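Measure how well your classifier performs using metrics like accuracy (the share of all emails classified correctly), precision (the share of emails flagged as spam that really are spam), and recall (the share of actual spam that gets flagged):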
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score
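# Tiny illustrative dataset (1 = spam, 0 = not spam)
train_emails = ["win money", "meeting", "free prize"]
train_labels = [1, 0, 1]
# None of these test words appear in the training vocabulary, so the
# vectorizer ignores them and the model falls back on its class priors.
# A meaningful evaluation needs a much larger held-out test set.
test_emails = ["click here", "project update"]
test_labels = [1, 0]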
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_emails)
model = MultinomialNB()
model.fit(X_train, train_labels)
X_test = vectorizer.transform(test_emails)
predictions = model.predict(X_test)
accuracy = accuracy_score(test_labels, predictions)
precision = precision_score(test_labels, predictions)
recall = recall_score(test_labels, predictions)
print("Model Performance:")
print(f" Accuracy: {accuracy:.2f}")
print(f" Precision: {precision:.2f}")
print(f" Recall: {recall:.2f}")
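To make the three metrics concrete, the sketch below derives them by hand from a confusion matrix and checks the results against scikit-learn's own functions. The label lists are hypothetical values made up for illustration, not results from a real classifier.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels for eight test emails (1 = spam, 0 = not spam)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct / all
precision = tp / (tp + fp)                  # flagged emails that are really spam
recall = tp / (tp + fn)                     # spam emails that were caught

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Accuracy:  {accuracy:.3f} (sklearn: {accuracy_score(y_true, y_pred):.3f})")
print(f"Precision: {precision:.3f} (sklearn: {precision_score(y_true, y_pred):.3f})")
print(f"Recall:    {recall:.3f} (sklearn: {recall_score(y_true, y_pred):.3f})")
```

Here two of four spam emails are caught (recall 0.5) and one legitimate email is wrongly flagged, which lowers precision to 2/3.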
Project Challenges
As you build this classifier, you'll encounter:
- Text Preprocessing: Handling different cases, punctuation, stop words
- Feature Engineering: Choosing between CountVectorizer, TF-IDF, or word embeddings
- Model Selection: Trying different algorithms (Naive Bayes, SVM, Random Forest)
- Imbalanced Data: Handling cases where one class heavily outnumbers the other (for example, far more spam than non-spam emails)
- False Positives: Important emails incorrectly marked as spam
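For the false-positive challenge in the list above, one common approach is to flag an email only when the predicted spam probability clears a stricter threshold than the default 0.5. The sketch below assumes the toy data from Step 3; the threshold of 0.8 is a hypothetical value that would need tuning on real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win money", "meeting", "free prize"]
labels = [1, 0, 1]  # 1 = spam, 0 = not spam

spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

new_emails = ["win free money", "money"]
# Column 1 of predict_proba is the probability of the spam class
spam_probs = spam_filter.predict_proba(new_emails)[:, 1]

default_flags = (spam_probs >= 0.5).tolist()

strict_threshold = 0.8  # hypothetical value; tune it on validation data
strict_flags = (spam_probs >= strict_threshold).tolist()

for email, prob, default, strict in zip(new_emails, spam_probs, default_flags, strict_flags):
    print(f"'{email}': p(spam)={prob:.2f} default={default} strict={strict}")
```

Raising the threshold trades recall for precision: fewer legitimate emails are flagged, but more spam slips through.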
💡 Project Tips
Start with a simple CountVectorizer and Naive Bayes. Once that works, experiment with TF-IDF, try different models, and add text preprocessing (lowercasing, removing punctuation). Iterate and improve!
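As a rough sketch of that iteration loop, the code below combines TF-IDF, basic preprocessing (lowercasing and English stop-word removal), and three interchangeable classifiers. The `candidates` dictionary and the two new emails are illustrative assumptions, and with only four training emails the predictions show the mechanics rather than real model quality.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

emails = [
    "Win money now! Click here!",
    "Meeting at 3pm tomorrow",
    "Free prize! Claim now!",
    "Project update: Status report",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Hypothetical shortlist of models to compare
candidates = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(),
    "Linear SVM": LinearSVC(),
}

new_emails = ["Claim your free money", "Status of the project meeting"]

results = {}
for name, classifier in candidates.items():
    # The default tokenizer drops punctuation; lowercase and stop_words
    # handle case differences and common filler words
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    pipeline = make_pipeline(vectorizer, classifier)
    pipeline.fit(emails, labels)
    results[name] = pipeline.predict(new_emails).tolist()
    print(f"{name}: {results[name]}")
```

On a real dataset, compare the candidates with a held-out test set or cross-validation rather than a couple of hand-picked emails.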