Lesson 7: Data Types in ML

Data Types in Machine Learning

Understanding data types is crucial in ML because different types require different processing techniques. ML algorithms expect data in specific formats, and knowing how to work with each type is essential for building effective models.

In this lesson, we'll explore the main data types you'll encounter in ML and how to work with them in Python.

Numerical Data Types

Numerical data can be continuous (any value within a range, such as 175.5) or discrete (countable values, usually whole numbers). Let's see examples:

numerical_types.py
# Numerical data types in ML

# Continuous numerical data (float)
# Can take any value in a range
continuous_data = [175.5, 68.2, 22.3, 99.99, 15.7]
print("Continuous (Float) Data:", continuous_data)
print("Examples: height, weight, temperature, price")
print("Type:", type(continuous_data[0]))

# Discrete numerical data (int)
# Whole numbers only
discrete_data = [1, 2, 3, 4, 5, 10, 20]
print("\nDiscrete (Integer) Data:", discrete_data)
print("Examples: count, age in whole years, number of items")
print("Type:", type(discrete_data[0]))

# Working with numerical data
print("\nNumerical Operations:")
print("  Mean:", sum(continuous_data) / len(continuous_data))
print("  Max:", max(continuous_data))
print("  Min:", min(continuous_data))
print("  Range:", max(continuous_data) - min(continuous_data))
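Continuous values are often grouped into ranges (binning) before modeling, which is also what Task 3 of the exercise asks for. Here is a minimal sketch; the bin edges, the labels, and the helper name `bin_age` are assumptions made up for illustration, not part of the lesson.

```python
# Hypothetical bin edges and labels, chosen only for illustration.
# Each entry is (low, high, label) and covers the half-open range [low, high).
AGE_BINS = [(0, 18, "child"), (18, 65, "adult"), (65, 200, "senior")]

def bin_age(age):
    """Return the label of the bin that contains the given age."""
    for low, high, label in AGE_BINS:
        if low <= age < high:
            return label
    raise ValueError(f"age out of range: {age}")

ages = [4.5, 17.9, 18.0, 42.3, 70.1]
age_groups = [bin_age(a) for a in ages]
print("Ages:", ages)
print("Groups:", age_groups)
```

Half-open ranges mean every age falls into exactly one group, so a boundary value such as 18.0 is never counted twice.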

Categorical Data Types

Categorical data represents categories or groups. It can be nominal (no order) or ordinal (has order):

categorical_types.py
# Categorical data types

# Nominal categorical (no inherent order)
colors = ["red", "blue", "green", "red", "blue"]
print("Nominal Categorical:", colors)
print("Examples: color, city, product category")
print("No meaningful order between categories")

# Ordinal categorical (has meaningful order)
sizes = ["small", "medium", "large", "small", "large"]
print("\nOrdinal Categorical:", sizes)
print("Examples: size (S, M, L), rating (1-5), education level")
print("Has meaningful order: small < medium < large")

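# Encoding categorical data for ML
# Most ML algorithms need numeric input, not text.
# Note: integer labels imply an order (red < blue < green) that nominal
# data does not have, so one-hot encoding is often safer for such data.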
print("\nEncoding for ML:")
print("  Colors need to be converted to numbers:")
color_map = {"red": 0, "blue": 1, "green": 2}
encoded_colors = [color_map[c] for c in colors]
print("  Original:", colors)
print("  Encoded:", encoded_colors)
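The mapping above assigns integers to colors, which quietly suggests that green (2) is "greater" than red (0). A common alternative for nominal data is one-hot encoding, while ordinal data can keep integer codes as long as they follow the real order. The sketch below uses plain Python; the variable names `one_hot` and `size_order` are illustrative assumptions.

```python
colors = ["red", "blue", "green", "red", "blue"]
sizes = ["small", "medium", "large", "small", "large"]

# One-hot encoding for nominal data: one 0/1 column per category,
# so no category is treated as larger than another.
categories = sorted(set(colors))  # fixed column order: blue, green, red
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Ordinal encoding for ordered data: the integers follow the real order.
size_order = {"small": 0, "medium": 1, "large": 2}
encoded_sizes = [size_order[s] for s in sizes]

print("Columns:", categories)
for color, row in zip(colors, one_hot):
    print(f"  {color:<6} -> {row}")
print("Sizes:", sizes)
print("Encoded sizes:", encoded_sizes)
```

Sorting the categories gives a stable column order, so the same color always maps to the same column between runs.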

Binary and Boolean Data

Binary data has only two possible values. It's common in classification problems:

binary_types.py
# Binary and boolean data types

# Boolean data (True/False)
boolean_data = [True, False, True, True, False]
print("Boolean Data:", boolean_data)
print("Examples: has_feature, is_active, passed_test")

# Binary data (0/1)
binary_data = [1, 0, 1, 1, 0]
print("\nBinary Data (0/1):", binary_data)
print("Examples: yes/no, on/off, success/failure")

# Converting boolean to binary
binary_from_bool = [1 if b else 0 for b in boolean_data]
print("\nBoolean to Binary:")
print("  Original:", boolean_data)
print("  Converted:", binary_from_bool)

print("\nUse Cases:")
print("  - Binary classification (spam/not spam)")
print("  - Feature flags (has_feature: yes/no)")
print("  - Decision outcomes (approved/rejected)")
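In Python, `bool` is a subclass of `int`, so `True` behaves like 1 and `False` like 0 in arithmetic. That makes it easy to count the positive class in a binary label column, as this small sketch shows (the names `positive_count` and `positive_rate` are just illustrative).

```python
boolean_data = [True, False, True, True, False]

# int() converts each boolean explicitly to 0 or 1.
binary_from_bool = [int(b) for b in boolean_data]

# Because True counts as 1, sum() gives the number of positive examples.
positive_count = sum(boolean_data)
positive_rate = positive_count / len(boolean_data)

print("Binary:", binary_from_bool)
print("Positive examples:", positive_count)
print("Positive rate:", positive_rate)
```

Checking the positive rate early is a quick way to spot an imbalanced binary classification dataset.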

Text Data

Text data requires special processing because most ML algorithms operate on numbers, so words must first be converted into a numeric form:

text_types.py
# Text data types in ML

# Raw text data
text_data = [
    "I love machine learning",
    "Python is great for data science",
    "ML models need good data"
]
print("Text Data:")
for i, text in enumerate(text_data, 1):
    print(f"  {i}. {text}")

print("\nText Processing for ML:")
print("  1. Tokenization: Split into words")
tokens = [text.split() for text in text_data]
print("     Tokens:", tokens[0])

print("\n  2. Vocabulary: Unique words")
vocabulary = set()
for text in text_data:
    vocabulary.update(text.lower().split())
print("     Vocabulary size:", len(vocabulary))
print("     Words:", sorted(vocabulary))

print("\n  3. Encoding: Convert to numbers")
print("     Each word gets a unique number")
print("     Example: 'machine' → 1, 'learning' → 2, etc.")

print("\nUse Cases:")
print("  - Sentiment analysis")
print("  - Spam detection")
print("  - Language translation")
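Step 3 above only describes the encoding, so here is one way it could look in plain Python. This is a sketch under simple assumptions: whitespace tokenization, lowercasing, and indices that start at 0 in alphabetical order (the numbering itself is arbitrary). The helper `bag_of_words` is a hypothetical name.

```python
from collections import Counter

text_data = [
    "I love machine learning",
    "Python is great for data science",
    "ML models need good data",
]

# Lowercase and split on whitespace so "ML" and "ml" count as the same word.
tokenized = [text.lower().split() for text in text_data]

# Give every unique word a stable integer index.
vocabulary = sorted({word for tokens in tokenized for word in tokens})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Encoding 1: each sentence becomes a sequence of word indices.
encoded = [[word_to_index[w] for w in tokens] for tokens in tokenized]

def bag_of_words(tokens):
    """Count how often each vocabulary word occurs in one sentence."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Encoding 2: each sentence becomes a fixed-length vector of word counts.
bow = [bag_of_words(tokens) for tokens in tokenized]

print("Word index:", word_to_index)
print("Index sequence for sentence 1:", encoded[0])
print("Bag-of-words for sentence 1:", bow[0])
```

Index sequences keep word order but vary in length, while bag-of-words vectors all have the same length but discard word order.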

Time Series Data

Time series data has a temporal component: values are recorded in order and change over time.

time_series_types.py
# Time series data types

# Time series: values over time
time_points = ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05"]
sales = [100, 120, 115, 140, 130]

print("Time Series Data:")
print("  Time | Sales")
print("  " + "-" * 20)
for time, value in zip(time_points, sales):
    print(f"  {time} | {value}")

print("\nTime Series Characteristics:")
print("  - Ordered by time")
print("  - Values often depend on previous values")
print("  - May have trends and patterns")

print("\nUse Cases:")
print("  - Stock price prediction")
print("  - Weather forecasting")
print("  - Sales forecasting")
print("  - Energy consumption prediction")
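Because each value tends to depend on earlier ones, time series are often turned into features built from the past. The sketch below shows two simple versions on the sales data: a lag-1 pairing (previous value, current value) and a moving average. The window size of 3 and the helper name `moving_average` are assumptions for illustration.

```python
sales = [100, 120, 115, 140, 130]

# Lag-1 feature: pair each value with the value that came before it.
# The first month has no predecessor, so it is skipped.
lagged = [(sales[i - 1], sales[i]) for i in range(1, len(sales))]

def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

smoothed = moving_average(sales, 3)

print("Previous -> Current:")
for prev, curr in lagged:
    print(f"  {prev} -> {curr}")
print("3-month moving average:", [round(v, 2) for v in smoothed])
```

Note that the order of the data must be preserved here; shuffling the rows, which is harmless for many other data types, would destroy the lag relationship.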

Why Data Types Matter

Understanding data types is crucial because:

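Algorithm Selection: Some algorithms suit certain data types better than others, so the type of your data narrows down which models make sense
Preprocessing: Each type requires its own preprocessing steps, such as encoding categories or tokenizing text
Feature Engineering: Knowing the type helps you create better features, such as age groups from continuous ages
Model Performance: Correctly handled data types usually lead to better results, while a mis-encoded feature (for example, a nominal category stored as an ordered integer) can mislead the model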

Always check your data types before building models!
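A quick type check can be as simple as the following sketch, which assumes the dataset is stored as a dictionary of columns. The function `describe_column` and its labels are hypothetical, and the rules are a rough heuristic: a column of integers could still be a categorical code, so the result should be treated as a first guess.

```python
def describe_column(values):
    """Guess a coarse data type for a list of values (heuristic only)."""
    # bool is checked first because bool is a subclass of int in Python.
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "discrete numerical"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "continuous numerical"
    if all(isinstance(v, str) for v in values):
        return "categorical or text"
    return "mixed"

# Small made-up dataset, one list per column.
dataset = {
    "height": [175.5, 68.2, 22.3],
    "count": [1, 2, 3],
    "color": ["red", "blue", "green"],
    "is_active": [True, False, True],
}

column_types = {name: describe_column(values) for name, values in dataset.items()}
for name, kind in column_types.items():
    print(f"{name}: {kind}")
```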

Exercise: Work with Different Data Types

Complete the following tasks:

Task 1: Convert categorical data (colors) to numerical using label encoding
Task 2: Convert boolean data to binary (0/1) format
Task 3: Create age groups by binning continuous age data
Task 4: Print the encoded/binned results

Write your code to convert and transform the different data types!
