Data Types in Machine Learning
Understanding data types is crucial in ML because different types require different processing techniques. ML algorithms expect data in specific formats, and knowing how to work with each type is essential for building effective models.
In this lesson, we'll explore the main data types you'll encounter in ML and how to work with them in Python.
Numerical Data Types
Numerical data can be continuous (any value within a range, usually stored as floats) or discrete (countable values, usually stored as integers). Let's see examples:
continuous_data = [175.5, 68.2, 22.3, 99.99, 15.7]
print("Continuous (Float) Data:", continuous_data)
print("Examples: height, weight, temperature, price")
print("Type:", type(continuous_data[0]))
discrete_data = [1, 2, 3, 4, 5, 10, 20]
print("\nDiscrete (Integer) Data:", discrete_data)
print("Examples: count, age, number of items")
print("Type:", type(discrete_data[0]))
print("\nNumerical Operations:")
print(" Mean:", sum(continuous_data) / len(continuous_data))
print(" Max:", max(continuous_data))
print(" Min:", min(continuous_data))
print(" Range:", max(continuous_data) - min(continuous_data))
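Numerical features often sit on very different scales (a height in centimetres next to a price, for example), so a common preprocessing step is to rescale them. The sketch below shows min-max scaling in plain Python. The helper name `min_max_scale` is an illustrative assumption, not something defined elsewhere in this lesson.

```python
# Hypothetical helper for illustration: rescale values into the range [0, 1].
def min_max_scale(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

continuous_data = [175.5, 68.2, 22.3, 99.99, 15.7]
scaled = min_max_scale(continuous_data)

print("Original:", continuous_data)
print("Scaled:  ", [round(s, 3) for s in scaled])
```

After scaling, the smallest value maps to 0 and the largest to 1, with everything else in between.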
Categorical Data Types
Categorical data represents categories or groups. It can be nominal (no order) or ordinal (has order):
colors = ["red", "blue", "green", "red", "blue"]
print("Nominal Categorical:", colors)
print("Examples: color, city, product category")
print("No meaningful order between categories")
sizes = ["small", "medium", "large", "small", "large"]
print("\nOrdinal Categorical:", sizes)
print("Examples: size (S, M, L), rating (1-5), education level")
print("Has meaningful order: small < medium < large")
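print("\nEncoding for ML:")
print(" Colors need to be converted to numbers:")
# Label encoding: the integers are arbitrary IDs. For nominal data they do not
# imply an order, although some models may treat them as if they did.
color_map = {"red": 0, "blue": 1, "green": 2}
encoded_colors = [color_map[c] for c in colors]
print(" Original:", colors)
print(" Encoded:", encoded_colors)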
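The nominal/ordinal distinction usually changes how you encode. As a rough sketch, one-hot encoding gives each nominal category its own 0/1 column so no order is implied, while an ordinal mapping keeps the real order. The variable names `categories`, `one_hot`, and `size_order` are illustrative choices.

```python
colors = ["red", "blue", "green", "red", "blue"]
sizes = ["small", "medium", "large", "small", "large"]

# One-hot encoding for nominal data: one 0/1 column per category.
categories = sorted(set(colors))  # column order: blue, green, red
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

print("Columns:", categories)
for color, row in zip(colors, one_hot):
    print(f" {color:<6} -> {row}")

# Ordinal encoding: the integers follow the meaningful order of the sizes.
size_order = {"small": 0, "medium": 1, "large": 2}
encoded_sizes = [size_order[s] for s in sizes]
print("Encoded sizes:", encoded_sizes)
```

Each one-hot row contains exactly one 1, and comparing encoded sizes with `<` now matches small < medium < large.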
Binary and Boolean Data
Binary data has only two possible values. It's common in classification problems:
boolean_data = [True, False, True, True, False]
print("Boolean Data:", boolean_data)
print("Examples: has_feature, is_active, passed_test")
binary_data = [1, 0, 1, 1, 0]
print("\nBinary Data (0/1):", binary_data)
print("Examples: yes/no, on/off, success/failure")
binary_from_bool = [1 if b else 0 for b in boolean_data]
print("\nBoolean to Binary:")
print(" Original:", boolean_data)
print(" Converted:", binary_from_bool)
print("\nUse Cases:")
print(" - Binary classification (spam/not spam)")
print(" - Feature flags (has_feature: yes/no)")
print(" - Decision outcomes (approved/rejected)")
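In Python, `bool` is a subclass of `int`, so booleans can be converted with `int()` or even summed directly. The short sketch below uses that to count positive cases; `positive_count` and `positive_rate` are names chosen for illustration.

```python
boolean_data = [True, False, True, True, False]

# int() converts each boolean explicitly: True -> 1, False -> 0.
binary_data = [int(b) for b in boolean_data]
print("Binary:", binary_data)

# Because True behaves like 1, sum() counts the True values.
positive_count = sum(boolean_data)
positive_rate = positive_count / len(boolean_data)
print("Positive count:", positive_count)
print("Positive rate:", positive_rate)
```

Checking the share of positives like this is a quick way to see whether a binary target is balanced.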
Text Data
Text data requires special processing because most ML algorithms operate on numbers, not on raw strings:
text_data = [
    "I love machine learning",
    "Python is great for data science",
    "ML models need good data"
]
print("Text Data:")
for i, text in enumerate(text_data, 1):
    print(f" {i}. {text}")
print("\nText Processing for ML:")
print(" 1. Tokenization: Split into words")
tokens = [text.split() for text in text_data]
print(" Tokens:", tokens[0])
print("\n 2. Vocabulary: Unique words")
vocabulary = set()
for text in text_data:
    vocabulary.update(text.lower().split())
print(" Vocabulary size:", len(vocabulary))
print(" Words:", sorted(vocabulary))
print("\n 3. Encoding: Convert to numbers")
print(" Each word gets a unique number")
print(" Example: 'machine' → 1, 'learning' → 2, etc.")
print("\nUse Cases:")
print(" - Sentiment analysis")
print(" - Spam detection")
print(" - Language translation")
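The encoding step described above can be sketched in plain Python. This is a minimal illustration, assuming lowercase whitespace tokenization and an alphabetically sorted vocabulary. The specific index assigned to each word is arbitrary, and `word_to_index` and `bag_of_words` are hypothetical names.

```python
text_data = [
    "I love machine learning",
    "Python is great for data science",
    "ML models need good data",
]

# Tokenize: lowercase each sentence and split on whitespace.
tokenized = [text.lower().split() for text in text_data]

# Build a sorted vocabulary and give each word a unique index.
vocabulary = sorted({word for tokens in tokenized for word in tokens})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Integer encoding: replace each word with its index.
encoded = [[word_to_index[w] for w in tokens] for tokens in tokenized]
print("Encoded sentence 1:", encoded[0])

# Bag of words: count how often each vocabulary word occurs in a sentence.
def bag_of_words(tokens):
    counts = [0] * len(vocabulary)
    for w in tokens:
        counts[word_to_index[w]] += 1
    return counts

bow = [bag_of_words(tokens) for tokens in tokenized]
print("Bag of words 1:", bow[0])
```

Integer encoding keeps word order but produces sequences of different lengths. Bag of words discards order but gives every sentence a vector of the same length.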
Time Series Data
Time series data has a temporal component. Observations are recorded in time order, and that order matters:
time_points = ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05"]
sales = [100, 120, 115, 140, 130]
print("Time Series Data:")
print(" Time | Sales")
print(" " + "-" * 20)
for time, value in zip(time_points, sales):
    print(f" {time} | {value}")
print("\nTime Series Characteristics:")
print(" - Ordered by time")
print(" - Values often depend on previous values (autocorrelation)")
print(" - May have trends and patterns")
print("\nUse Cases:")
print(" - Stock price prediction")
print(" - Weather forecasting")
print(" - Sales forecasting")
print(" - Energy consumption prediction")
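Because each value tends to relate to the ones before it, two common ways to prepare a time series are lag features (pairing each value with the previous one) and moving averages (smoothing short-term noise). The sketch below is one simple way to compute both. The window size of 3 is an arbitrary choice for illustration.

```python
sales = [100, 120, 115, 140, 130]

# Lag-1 pairs: (previous value, current value).
# The first month has no previous value, so there are len(sales) - 1 pairs.
lag_pairs = list(zip(sales[:-1], sales[1:]))
print("Lag-1 pairs:", lag_pairs)

# Moving average over a sliding window of 3 consecutive months.
window = 3
moving_avg = [
    sum(sales[i:i + window]) / window
    for i in range(len(sales) - window + 1)
]
print("Moving average:", [round(m, 2) for m in moving_avg])
```

Note that neither step shuffles the data. With time series, the original order has to be preserved.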
Why Data Types Matter
Data types affect every stage of an ML project:
- Algorithm Selection: Different algorithms work better with different data types
- Preprocessing: Each type requires specific preprocessing steps
- Feature Engineering: Knowing the type helps create better features
- Model Performance: Properly handled data types lead to better results
Always check your data types before building models!
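One way to check is sketched below. The helper `describe_column` is hypothetical and only makes a coarse guess from Python types. It cannot tell text from categorical data or nominal from ordinal, so treat its answer as a starting point.

```python
def describe_column(values):
    """Hypothetical helper: guess a coarse data type for a list of values."""
    if not values:
        return "empty"
    # Check bool first, because bool is a subclass of int in Python.
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if any(isinstance(v, bool) for v in values):
        return "mixed"
    if all(isinstance(v, int) for v in values):
        return "discrete numerical"
    if all(isinstance(v, (int, float)) for v in values):
        return "continuous numerical"
    if all(isinstance(v, str) for v in values):
        return "text or categorical"
    return "mixed"

columns = {
    "height": [175.5, 68.2, 22.3],
    "count": [1, 2, 3],
    "is_active": [True, False, True],
    "color": ["red", "blue", "green"],
}
for name, values in columns.items():
    print(f" {name}: {describe_column(values)}")
```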
Exercise: Work with Different Data Types
Complete the following tasks:
- Task 1: Convert categorical data (colors) to numerical using label encoding
- Task 2: Convert boolean data to binary (0/1) format
- Task 3: Create age groups by binning continuous age data
- Task 4: Print the encoded/binned results
Write your code to convert and transform the different data types!
💡 Learning Tip
Practice is essential. Try modifying the code examples, experiment with different parameters, and see how changes affect the results. Hands-on experience is the best teacher!