Lesson 46: Natural Language Processing

What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a branch of artificial intelligence that enables machines to understand, interpret, and generate human language. NLP bridges the gap between human communication and computer understanding, allowing machines to process and analyze large amounts of natural language data.

NLP combines computational linguistics with machine learning to enable computers to perform tasks like sentiment analysis, language translation, text summarization, and question answering.

NLP Tasks Overview
# Common NLP Tasks:
# - Text Classification (spam detection, sentiment analysis)
# - Named Entity Recognition (finding names, locations, dates)
# - Machine Translation (Google Translate)
# - Text Summarization (creating summaries of long documents)
# - Question Answering (chatbots, virtual assistants)
# - Language Modeling (predicting next word)

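import re
from collections import Counter

text = "Natural Language Processing is amazing!"
# Extract lowercase alphabetic words so punctuation ("amazing!") is not kept
words = re.findall(r"[a-z]+", text.lower())
print("Words:", words)
print("Word count:", len(words))
# Count how often each word occurs
print("Word frequencies:", Counter(words))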
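As a rough illustration of the language modeling task listed above (predicting the next word), here is a minimal bigram sketch. The toy corpus and the helper name predict_next are made up for this example; real language models are trained on far larger corpora and use much richer context.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only
corpus = "the cat sat on the mat . the cat ate the fish ."
tokens = corpus.split()

# Count how often each word follows each other word (bigram counts)
following = defaultdict(Counter)
for current_word, next_word in zip(tokens, tokens[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the word seen most often after `word`, or None if `word` is unseen."""
    candidates = following.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

print("After 'the':", predict_next("the"))
```

In this corpus "cat" follows "the" twice while "mat" and "fish" follow it once each, so the sketch predicts "cat".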

Text Preprocessing

Before processing text with ML models, we need to preprocess it:

Tokenization: Splitting text into individual words or tokens
Lowercasing: Converting all text to lowercase for consistency
Removing Stop Words: Eliminating common words (the, is, and) that don't carry much meaning
Stemming/Lemmatization: Reducing words to a base form (running → run); stemming strips suffixes heuristically, while lemmatization maps each word to its dictionary form

Text Preprocessing Example
import re

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Keep only letters and whitespace (drops punctuation and digits)
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize (split into words)
    tokens = text.split()
    # Remove stop words (simplified)
    stop_words = {'the', 'is', 'a', 'an', 'and', 'or', 'but'}
    tokens = [word for word in tokens if word not in stop_words]
    return tokens

text = "Natural Language Processing is amazing! It helps computers understand text."
preprocessed = preprocess_text(text)
print("Preprocessed tokens:", preprocessed)
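The example above skips the stemming step from the list. The sketch below shows the idea with a naive suffix-stripping function. It is a toy stand-in, not the Porter algorithm or any library's stemmer, and the helper name naive_stem is hypothetical. It will mis-stem many real words (for example, "passes" becomes "pas").

```python
def naive_stem(word):
    """Toy suffix-stripping stemmer (hypothetical helper, not the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip a suffix when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run"
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

tokens = ["running", "jumped", "computers", "text"]
stems = [naive_stem(token) for token in tokens]
print("Stems:", stems)
```

In practice a tested stemmer or lemmatizer from an NLP library is a safer choice than hand-written rules like these.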

Word Embeddings

Word embeddings convert words into dense numerical vectors that capture semantic meaning. Similar words have similar vectors, enabling models to understand relationships between words.

Word2Vec: Learns word embeddings from large text corpora
GloVe: Global Vectors for word representation using co-occurrence statistics
FastText: Extends Word2Vec to handle subword information
Contextual Embeddings: Modern models such as BERT and GPT that produce context-aware embeddings
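To make "similar words have similar vectors" concrete, the sketch below compares hand-picked vectors with cosine similarity. The three-dimensional values are invented for illustration and do not come from any trained model. Real embeddings typically have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print("king vs queen:", round(sim_royal, 3))
print("king vs apple:", round(sim_fruit, 3))
```

With these made-up values, "king" and "queen" point in nearly the same direction, while "king" and "apple" do not.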

Simple Text Vectorization
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample texts
texts = ["I love machine learning", "Natural language processing is cool", "Deep learning is amazing"]

# Create tokenizer and fit on texts
# "<OOV>" is the placeholder token used for words not seen during fitting
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

# Convert texts to sequences
sequences = tokenizer.texts_to_sequences(texts)
print("Sequences:", sequences)

# Pad sequences to same length
padded = pad_sequences(sequences, maxlen=10, padding='post')
print("Padded sequences shape:", padded.shape)

NLP Models

Various models are used for NLP tasks:

RNNs/LSTMs: Process sequential text data, maintaining context
Transformers: Modern architecture (BERT, GPT) using attention mechanisms
CNN for Text: Apply convolutional layers to text sequences
Naive Bayes: Simple but effective for text classification
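As one example from the list above, here is a minimal multinomial Naive Bayes classifier written in plain Python with add-one (Laplace) smoothing. The training sentences, labels, and function names are invented for this sketch. A production system would normally use a library implementation and far more data.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set (illustrative only)
train = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

def train_naive_bayes(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Return the class with the highest log-probability for `text`."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior: fraction of training documents with this label
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities for unseen pairs
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(train)
print(predict("claim your free prize", *model))
print(predict("agenda for the team meeting", *model))
```

Working in log space keeps the product of many small probabilities from underflowing to zero.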

💡 Modern NLP

Transformer models like BERT and GPT have revolutionized NLP. BERT reads context in both directions (bidirectionally), which suits understanding tasks, while GPT generates human-like text one token at a time from left to right. These pre-trained models can be fine-tuned for specific tasks with relatively little labeled data.
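The attention mechanism behind these models can be sketched in a few lines. The version below is a simplified scaled dot-product self-attention in plain Python: it uses the token vectors directly as queries, keys, and values, whereas real Transformers apply learned projection matrices and multiple attention heads. The two-dimensional token vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: queries = keys = values = vectors."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(d)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors
        output = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(d)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors: the first two tokens are similar, the third is not
token_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(token_vectors)
print("Attention weights for token 0:", [round(w, 3) for w in weights[0]])
```

The first token puts more weight on the similar second token than on the dissimilar third one, which is how attention lets each token's representation draw on relevant context.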

Practical Applications

NLP powers many applications we use daily:

Sentiment Analysis: Analyzing product reviews, social media posts
Machine Translation: Google Translate, language apps
Chatbots: Customer service bots, virtual assistants
Text Summarization: News aggregators, research tools
Spam Detection: Email filters, content moderation

Exercise: Text Classification

In the exercise on the right, you'll build a simple text classification model using tokenization and a neural network. You'll preprocess text, convert it to sequences, and train a model to classify text into categories.

This hands-on exercise will help you understand the fundamental steps in an NLP pipeline.
