Python Machine Learning — Getting Started
Machine learning lets computers learn from data and make predictions without being explicitly programmed. This tutorial covers the fundamentals with scikit-learn.
Learning Objectives
- Understand the ML workflow
- Build regression and classification models
- Evaluate model performance with proper metrics
- Avoid common ML pitfalls
What is Machine Learning?
Traditional Programming:
Input + Rules ──────► Output
(data + if/else) (result)
Machine Learning:
Input + Output ──────► Rules
(data + labels) (learned model)
Example: Instead of writing rules to identify spam emails:
- Traditional: if "free money" in email: spam = True
- ML: Show the model 10,000 emails labeled "spam" or "not spam", and it learns the patterns itself.
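The contrast above can be sketched in a few lines. This is a minimal illustration, not a realistic spam filter: the six emails and their labels are made up, and the choice of `CountVectorizer` with `MultinomialNB` is an assumption made for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


# Traditional approach: a hand-written rule
def rule_based_is_spam(email: str) -> bool:
    return "free money" in email.lower()


# ML approach: learn word patterns from labeled examples.
# These emails and labels are invented for illustration only.
emails = [
    "Claim your free money now",
    "Win a free prize today",
    "Free money waiting for you",
    "Meeting moved to 3pm",
    "Lunch tomorrow with the team",
    "Project report attached",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

# Turn each email into word counts, then fit a Naive Bayes classifier
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(emails, labels)

# The exact phrase "free money" is absent, so the rule misses this email
new_email = "Get your free prize money"
print("Rule says spam: ", rule_based_is_spam(new_email))
print("Model says spam:", bool(spam_model.predict([new_email])[0]))
```

The hand-written rule only matches the exact phrase it was given, while the learned model weighs every word it saw in the labeled examples.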
The ML Workflow
1. Collect Data
│
▼
2. Clean & Prepare Data
│
▼
3. Split into Train/Test Sets
│
▼
4. Choose a Model
│
▼
5. Train the Model
│
▼
6. Evaluate Performance
│
▼
7. Tune & Improve
│
▼
8. Deploy
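The eight steps above map onto scikit-learn roughly as follows. This is a sketch under stated assumptions: the built-in wine dataset stands in for real data collection, logistic regression is an arbitrary model choice, and "deploy" is reduced to serializing the fitted model with `pickle`.

```python
import pickle

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Collect data (a built-in dataset stands in for real collection)
X, y = load_wine(return_X_y=True)

# 2-3. Prepare and split. Scaling lives inside the pipeline below,
# so it is only ever fitted on training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 4. Choose a model
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5 and 7. Train and tune: try a few regularization strengths with 5-fold CV
search = GridSearchCV(
    pipeline, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=5
)
search.fit(X_train, y_train)

# 6. Evaluate on the held-out test set
test_accuracy = search.score(X_test, y_test)
print(f"Best C: {search.best_params_['logisticregression__C']}")
print(f"Test accuracy: {test_accuracy:.2%}")

# 8. Deploy (simplified): serialize the fitted model and load it back
model_bytes = pickle.dumps(search.best_estimator_)
restored = pickle.loads(model_bytes)
print("Restored model predicts:", restored.predict(X_test[:3]))
```

In practice, tuning (step 7) is usually done with cross-validation on the training set, as here, so the test set is touched only once for the final evaluation.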
Regression: Predicting Numbers
Regression predicts continuous values (price, temperature, salary).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Sample data: house size vs price
# X = features (size in sq ft), y = target (price)
X = np.array([[800], [1000], [1200], [1400], [1600],
              [1800], [2000], [2200], [2400], [2600]])
y = np.array([150000, 180000, 210000, 250000, 280000,
              310000, 350000, 380000, 420000, 450000])
# Step 1: Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Step 2: Create and train model
model = LinearRegression()
model.fit(X_train, y_train)
# Step 3: Make predictions
y_pred = model.predict(X_test)
# Step 4: Evaluate (the test set holds only 2 houses here, so treat these scores as illustrative)
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.0f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.0f}")
# Step 5: Predict new values
new_house = np.array([[1500]])
predicted_price = model.predict(new_house)
print(f"Predicted price for 1500 sqft: ${predicted_price[0]:,.0f}")
Understanding the Metrics
| Metric | What it Measures | Good Value |
|---|---|---|
| R² Score | Proportion of the target's variance explained by the model | Close to 1.0 |
| MSE | Mean of the squared prediction errors | Close to 0 |
| RMSE | Square root of MSE; typical error size in the target's units | Close to 0 |
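To make the table concrete, the three metrics can be computed by hand with NumPy and checked against scikit-learn. The three prices and predictions below are made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true prices and predictions, for illustration only
y_true = np.array([200000.0, 300000.0, 400000.0])
y_hat = np.array([210000.0, 290000.0, 410000.0])

errors = y_true - y_hat

# MSE: mean of the squared errors
mse = np.mean(errors ** 2)

# RMSE: square root of MSE, back in dollars
rmse = np.sqrt(mse)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R²:   {r2:.3f}")

# The hand-computed values agree with scikit-learn's functions
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

Every prediction here is off by $10,000, so the RMSE is 10,000, while the MSE (100,000,000) is in squared dollars and harder to read directly.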
Classification: Predicting Categories
Classification predicts discrete labels (spam/not spam, cat/dog/bird).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load iris dataset (150 flowers, 4 measurements, 3 species)
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print(f"Features: {feature_names}")
print(f"Classes: {target_names}")
print(f"Samples: {len(X)}")
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
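# Scale features (important for distance- and gradient-based algorithms such as
# SVM and KNN; random forests do not require it, but it does no harm here)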
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_scaled, y_train)
# Predict
y_pred = clf.predict(X_test_scaled)
# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))
# Feature importance (which measurements matter most?)
importances = clf.feature_importances_
for name, importance in zip(feature_names, importances):
    print(f" {name}: {importance:.3f}")
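The example above imports `confusion_matrix` without using it. A minimal sketch of how it could be applied to the same kind of model follows; the block repeats the data loading and training so that it runs on its own, and it skips scaling because random forests do not need it.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1, 2])

print("Columns (predicted):", list(iris.target_names))
for name, row in zip(iris.target_names, cm):
    print(f"{name:>10} (true): {row}")
```

Counts on the diagonal are correct predictions; any off-diagonal count shows which pair of species the model confuses.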
Common ML Algorithms
| Algorithm | Type | Best For | Complexity |
|---|---|---|---|
| Linear Regression | Regression | Simple relationships | Low |
| Logistic Regression | Classification | Binary outcomes | Low |
| Decision Trees | Both | Interpretable models | Medium |
| Random Forest | Both | General purpose | Medium |
| SVM | Both | High-dimensional data | High |
| K-Nearest Neighbors | Both | Small datasets | Low |
| K-Means | Clustering | Grouping data | Medium |
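One way to choose among the classifiers in the table is to compare them under the same cross-validation setup. The sketch below does this on the iris dataset; the dataset, the default hyperparameters, and the 5-fold setting are all assumptions, so the ranking it prints should not be read as general guidance.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Scale-sensitive models are wrapped in a pipeline with a scaler
candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "K-Nearest Neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier()
    ),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name:<20} {scores.mean():.2%} (+/- {scores.std():.2%})")
```

On a small, clean dataset like iris the scores tend to be close together, which is itself a reason to prefer the simplest model.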
Avoiding Common Pitfalls
1. Data Leakage
# BAD: Using test data for training decisions
scaler.fit(X) # Fits on ALL data including test
X_train, X_test = train_test_split(X)
# The scaler has already seen test data!
# GOOD: Split first, then fit
X_train, X_test = train_test_split(X)
scaler.fit(X_train) # Only fits on training data
X_train_scaled = scaler.transform(X_train) # Transform training data
X_test_scaled = scaler.transform(X_test) # Reuse training statistics on test data
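A common way to make the "split first, then fit" rule hard to break is to put the scaler and the model into a single scikit-learn `Pipeline`. The sketch below assumes the iris dataset and a KNN classifier purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is a pipeline step, so it is refitted on the training part
# of every cross-validation fold and never sees the validation part.
leak_free = make_pipeline(StandardScaler(), KNeighborsClassifier())
cv_scores = cross_val_score(leak_free, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.2%}")

# Final fit on the full training set, then one evaluation on the test set
leak_free.fit(X_train, y_train)
test_score = leak_free.score(X_test, y_test)
print(f"Test accuracy: {test_score:.2%}")

# The scaler's statistics come from the training rows only
fitted_scaler = leak_free.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))
```

Calling `fit` or `score` on the pipeline applies every step in order, so there is no separate scaler object that could accidentally be fitted on the wrong data.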
2. Overfitting
# Overfitting: model memorizes training data but fails on new data
from sklearn.model_selection import cross_val_score
# Use cross-validation to detect overfitting
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(f"CV Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")
# If training accuracy >> CV accuracy, model is overfitting
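The comparison described above can be seen directly on a model that is prone to memorizing. The sketch below uses an unconstrained decision tree on synthetic data with deliberately noisy labels; the dataset parameters and the `max_depth=3` alternative are assumptions chosen to make the gap visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with about 20% of labels randomly reassigned (label noise)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

# An unconstrained tree can keep splitting until it memorizes every sample
deep_tree = DecisionTreeClassifier(random_state=0)
deep_tree.fit(X, y)
train_acc = deep_tree.score(X, y)
cv_acc = cross_val_score(deep_tree, X, y, cv=5).mean()
print(f"Deep tree    - train: {train_acc:.2%}, CV: {cv_acc:.2%}")

# Limiting depth is one simple way to restrict what the tree can memorize
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow_tree.fit(X, y)
shallow_train_acc = shallow_tree.score(X, y)
shallow_cv_acc = cross_val_score(shallow_tree, X, y, cv=5).mean()
print(f"Shallow tree - train: {shallow_train_acc:.2%}, CV: {shallow_cv_acc:.2%}")
```

The deep tree scores perfectly on the data it was trained on but clearly lower under cross-validation, which is the signature of overfitting. The shallow tree's two scores can be compared in the same way.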
3. Not Enough Data
# Always check your dataset size
print(f"Samples: {len(X)}")
print(f"Features: {X.shape[1]}")
print(f"Samples per class: {np.bincount(y)}")
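Beyond counting samples, a learning curve gives a rough indication of whether more data would help: if the validation score is still rising at the largest training size, additional data is likely to improve the model. The sketch below uses scikit-learn's `learning_curve` on the iris dataset; the model and the training-size grid are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)

# Train on growing subsets of the data and score each with 5-fold CV.
# shuffle=True matters here because the iris rows are ordered by class.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    shuffle=True,
    random_state=42,
)

for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{n:>4} training samples -> CV accuracy {score:.2%}")
```

A curve that has flattened suggests the model has enough data for this problem and that further gains would have to come from better features or a different model.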
Key Takeaways
- Always split data into train/test BEFORE fitting any preprocessing step (scalers, encoders, imputers)
- Scale features for algorithms sensitive to magnitude (SVM, KNN)
- Use cross-validation for robust evaluation
- Start simple, add complexity only when needed
- Check for class imbalance in classification problems
- Feature engineering often matters more than algorithm choice
- R² close to 1.0 on held-out test data indicates a good regression fit
- Accuracy alone is misleading for imbalanced datasets
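The last takeaway is easy to demonstrate. In the sketch below, a baseline that always predicts the majority class is scored on synthetic data where only 5% of samples are positive; the data and the 5% ratio are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Synthetic imbalanced data: 50 positives, 950 negatives
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 50 + [0] * 950)

# A "model" that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)
recall = recall_score(y, y_pred)
balanced = balanced_accuracy_score(y, y_pred)

print(f"Accuracy:          {acc:.2%}")
print(f"Recall (positive): {recall:.2%}")
print(f"Balanced accuracy: {balanced:.2%}")
```

The baseline reaches 95% accuracy without finding a single positive sample: its recall is 0% and its balanced accuracy is 50%. Recall, precision, or balanced accuracy expose what plain accuracy hides.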