Python Machine Learning — Getting Started

Machine learning lets computers learn from data and make predictions without being explicitly programmed. This tutorial covers the fundamentals with scikit-learn.

Learning Objectives

Understand the ML workflow
Build regression and classification models
Evaluate model performance with proper metrics
Avoid common ML pitfalls

What is Machine Learning?

Traditional Programming:
  Input + Rules ──────► Output
  (data + if/else)     (result)

Machine Learning:
  Input + Output ──────► Rules
  (data + labels)       (learned model)

Example: to identify spam emails, you can either write the rules yourself or let a model learn them:

Traditional: if "free money" in email: spam = True
ML: Show the model 10,000 emails labeled "spam" or "not spam", and it learns the patterns itself.
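The contrast above can be sketched in a few lines. This is a minimal illustration, not a production spam filter: the six emails and their labels are invented for the example, and the choice of a bag-of-words vectorizer with naive Bayes is an assumption, one of many models that could learn from labeled text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, invented for illustration only
emails = [
    "free money waiting for you",
    "claim your free prize now",
    "win money fast with this offer",
    "meeting moved to 3pm tomorrow",
    "please review the attached report",
    "lunch on friday with the team",
]
labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]


def rule_based(email):
    """Traditional approach: one hand-written rule."""
    return "spam" if "free money" in email else "not spam"


# ML approach: count the words in each email, then learn which words signal spam
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(emails, labels)

new_email = "claim your prize money now"
rule_result = rule_based(new_email)
ml_result = spam_model.predict([new_email])[0]

print(f"Rule-based: {rule_result}")
print(f"Learned model: {ml_result}")
```

The hand-written rule misses this email because the exact phrase "free money" never appears, while the learned model picks up on words it saw only in spam. With so little data the model is fragile, which is why the tutorial's example assumes thousands of labeled emails.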

The ML Workflow

1. Collect Data
       │
       ▼
2. Clean & Prepare Data
       │
       ▼
3. Split into Train/Test Sets
       │
       ▼
4. Choose a Model
       │
       ▼
5. Train the Model
       │
       ▼
6. Evaluate Performance
       │
       ▼
7. Tune & Improve
       │
       ▼
8. Deploy
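As a rough sketch, the eight steps can map onto scikit-learn calls as shown here. The dataset (iris, bundled with scikit-learn), the decision tree, the `max_depth` grid, and the use of `joblib` with an in-memory buffer for the "deploy" step are all assumptions chosen to keep the example short; a real project would save the model to a file or a model registry.

```python
import io

import joblib
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2: collect and prepare data (a bundled dataset that is already clean)
X, y = load_iris(return_X_y=True)

# Step 3: split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 4: choose a model
model = DecisionTreeClassifier(random_state=42)

# Steps 5 and 7: train and tune with cross-validation on the training set only
search = GridSearchCV(model, {"max_depth": [2, 3, 4]}, cv=5)
search.fit(X_train, y_train)

# Step 6: evaluate the tuned model once on the held-out test set
test_accuracy = search.score(X_test, y_test)
print(f"Best params: {search.best_params_}")
print(f"Test accuracy: {test_accuracy:.2%}")

# Step 8: "deploy" by serializing the fitted model and loading it back
buffer = io.BytesIO()
joblib.dump(search.best_estimator_, buffer)
buffer.seek(0)
restored = joblib.load(buffer)
print(f"Restored model predicts: {restored.predict(X_test[:3])}")
```

Tuning happens inside cross-validation on the training data, so the test set is touched only once, for the final score.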

Regression: Predicting Numbers

Regression predicts continuous values (price, temperature, salary).

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Sample data: house size vs price
# X = features (size in sq ft), y = target (price)
X = np.array([[800], [1000], [1200], [1400], [1600],
              [1800], [2000], [2200], [2400], [2600]])
y = np.array([150000, 180000, 210000, 250000, 280000,
              310000, 350000, 380000, 420000, 450000])

# Step 1: Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 3: Make predictions
y_pred = model.predict(X_test)

# Step 4: Evaluate (the test set holds only 2 samples here, so treat the scores as illustrative)
mse = mean_squared_error(y_test, y_pred)
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"MSE: {mse:.0f}")
print(f"RMSE: {np.sqrt(mse):.0f}")

# Step 5: Predict new values
new_house = np.array([[1500]])
predicted_price = model.predict(new_house)
print(f"Predicted price for 1500 sqft: ${predicted_price[0]:,.0f}")

Understanding the Metrics

Metric	What it Measures	Good Value
R² Score	Fraction of the target's variance the model explains	Close to 1.0
MSE	Average squared error (in squared target units)	Close to 0
RMSE	Square root of MSE, in the target's original units	Small relative to the target's scale
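To make the table concrete, the three metrics can be computed by hand with NumPy and checked against scikit-learn. The three prices and predictions below are hypothetical numbers picked so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true prices and predictions, invented for illustration
y_true = np.array([200000.0, 300000.0, 400000.0])
y_hat = np.array([210000.0, 290000.0, 420000.0])

errors = y_true - y_hat

# MSE: mean of the squared errors
mse = np.mean(errors ** 2)

# RMSE: square root of MSE, back in the target's units (dollars here)
rmse = np.sqrt(mse)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE:  {mse:.0f}")    # 200000000
print(f"RMSE: {rmse:.0f}")   # 14142
print(f"R²:   {r2:.3f}")     # 0.970

# The hand-computed values match scikit-learn's implementations
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

An RMSE of about $14,000 on prices between $200,000 and $400,000 is easier to interpret than an MSE of 200,000,000, which is why RMSE is usually the number to report.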

Classification: Predicting Categories

Classification predicts discrete labels (spam/not spam, cat/dog/bird).

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load iris dataset (150 flowers, 4 measurements, 3 species)
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names

print(f"Features: {feature_names}")
print(f"Classes: {target_names}")
print(f"Samples: {len(X)}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

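# Scale features (important for distance- and gradient-based algorithms such as SVM and KNN;
# tree-based models like Random Forest do not need it, but it is harmless here)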
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_scaled, y_train)

# Predict
y_pred = clf.predict(X_test_scaled)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))

# Feature importance (which measurements matter most?)
importances = clf.feature_importances_
for name, importance in zip(feature_names, importances):
    print(f"  {name}: {importance:.3f}")
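The classification example imports `confusion_matrix`, which shows where the errors fall rather than just how many there are. The sketch below repeats the same split and classifier so that it runs on its own; it skips scaling, on the assumption that a random forest does not need it.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are the true species, columns are the predicted species
cm = confusion_matrix(y_test, y_pred)
for name, row in zip(iris.target_names, cm):
    print(f"{name:>12}: {row}")
```

Counts on the diagonal are correct predictions. Any off-diagonal count tells you which species was mistaken for which.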

Common ML Algorithms

Algorithm	Type	Best For	Complexity
Linear Regression	Regression	Simple relationships	Low
Logistic Regression	Classification	Binary outcomes	Low
Decision Trees	Both	Interpretable models	Medium
Random Forest	Both	General purpose	Medium
SVM	Both	High-dimensional data	High
K-Nearest Neighbors	Both	Small datasets	Low
K-Means	Clustering	Grouping data	Medium
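One practical way to choose among the supervised algorithms in the table is to cross-validate several of them on the same data. The sketch below does this on the iris dataset. The hyperparameters are scikit-learn defaults apart from `max_iter` and the random seeds, and wrapping the scale-sensitive models in a scaling pipeline is an assumption about reasonable practice, not a rule from the table.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Scale-sensitive models get a StandardScaler in front; tree-based models do not
candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name:<20} {scores.mean():.2%} (+/- {scores.std():.2%})")
```

On a small, clean dataset like iris the scores will be close together, which is itself a useful result: when simple models perform about as well as complex ones, the simple model is usually the better choice.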

Avoiding Common Pitfalls

1. Data Leakage

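# BAD: Using test data for training decisions
scaler = StandardScaler()
scaler.fit(X)  # Fits on ALL data, including the rows that end up in the test set
X_train, X_test = train_test_split(X, random_state=42)
# The scaler has already seen test data!

# GOOD: Split first, then fit
X_train, X_test = train_test_split(X, random_state=42)
scaler = StandardScaler()
scaler.fit(X_train)  # Only fits on training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Reuse the training statistics on test data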
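The same leak can creep back in during cross-validation if the scaler is fitted once on the full training set before the folds are made. One common way to avoid it, sketched here, is to put the scaler and the model in a scikit-learn pipeline so the scaler is refitted inside every fold. The SVM classifier and the iris dataset are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler only on the data passed to fit()
leak_free = make_pipeline(StandardScaler(), SVC())

# In each CV fold the scaler sees only that fold's training portion
cv_scores = cross_val_score(leak_free, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.2%}")

# Final fit on the training set, then a single evaluation on the test set
leak_free.fit(X_train, y_train)
test_score = leak_free.score(X_test, y_test)
print(f"Test accuracy: {test_score:.2%}")

# The fitted scaler's statistics come from the training rows alone
fitted_scaler = leak_free.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))
```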

2. Overfitting

# Overfitting: model memorizes training data but fails on new data
from sklearn.model_selection import cross_val_score

# Use cross-validation to detect overfitting
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(f"CV Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")

# If training accuracy is much higher than CV accuracy, the model is likely overfitting
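To see the gap between training and cross-validated accuracy, here is a small sketch on deliberately noisy synthetic data. The dataset parameters (300 samples, 20 features, 20% flipped labels) and the two decision trees are assumptions picked to make overfitting easy to observe.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise, so memorizing the training set does not generalize
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=3, flip_y=0.2, random_state=42
)

# An unconstrained tree keeps splitting until it fits every training sample
deep_tree = DecisionTreeClassifier(random_state=42)
deep_train = deep_tree.fit(X, y).score(X, y)
deep_cv = cross_val_score(deep_tree, X, y, cv=5).mean()

# Limiting depth is one simple way to restrict what the tree can memorize
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow_train = shallow_tree.fit(X, y).score(X, y)
shallow_cv = cross_val_score(shallow_tree, X, y, cv=5).mean()

print(f"Deep tree:    train {deep_train:.2%}, CV {deep_cv:.2%}")
print(f"Shallow tree: train {shallow_train:.2%}, CV {shallow_cv:.2%}")
```

The deep tree reaches perfect training accuracy while its cross-validated accuracy is clearly lower, which is the signature of overfitting described above.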

3. Not Enough Data

# Always check your dataset size
print(f"Samples: {len(X)}")
print(f"Features: {X.shape[1]}")
print(f"Samples per class: {np.bincount(y)}")
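Beyond counting samples, a learning curve gives a rough answer to whether more data would help. The sketch below uses scikit-learn's `learning_curve` on the iris dataset. The random forest and the five training-set fractions are assumptions for the illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)

# Train on growing fractions of the data and score each with 5-fold CV.
# shuffle=True matters here because iris is stored sorted by species.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X,
    y,
    train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0],
    cv=5,
    shuffle=True,
    random_state=42,
)

# Average over the folds to get one validation score per training size
for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:>4} training samples -> validation accuracy {score:.2%}")
```

If validation accuracy is still climbing at the largest size, collecting more data is likely to help. If it has flattened out, a different model or better features are probably a better investment.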

Key Takeaways

Always split data into train/test BEFORE fitting any preprocessing step (scalers, encoders, imputers)
Scale features for algorithms sensitive to magnitude (SVM, KNN)
Use cross-validation for robust evaluation
Start simple, add complexity only when needed
Check for class imbalance in classification problems
Feature engineering often matters more than algorithm choice
An R² close to 1.0 on held-out data indicates a good regression fit
Accuracy alone is misleading for imbalanced datasets
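The last takeaway is easy to demonstrate. In the sketch below, a baseline that always predicts the majority class scores 95% accuracy on hypothetical labels with a 95/5 split, yet it never finds a single positive case. The labels and the placeholder feature column are invented for the example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1)
y_true = np.array([0] * 95 + [1] * 5)
X_placeholder = np.zeros((100, 1))  # features are ignored by this baseline

# A "model" that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_true)
y_hat = baseline.predict(X_placeholder)

accuracy = accuracy_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)  # share of true positives that were found

print(f"Accuracy: {accuracy:.2%}")        # 95.00%
print(f"Recall (positives): {recall:.2%}")  # 0.00%
```

Comparing any classifier against a baseline like this, and reporting per-class metrics such as recall alongside accuracy, helps catch a model that has only learned the class proportions.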
