Python Machine Learning — Getting Started

Machine learning lets computers learn patterns from data and make predictions without hand-written rules for every case. This tutorial covers the fundamentals using scikit-learn.

Learning Objectives

  • Understand the ML workflow
  • Build regression and classification models
  • Evaluate model performance with proper metrics
  • Avoid common ML pitfalls

What is Machine Learning?

Traditional Programming:
  Input + Rules ──────► Output
  (data + if/else)     (result)

Machine Learning:
  Input + Output ──────► Rules
  (data + labels)       (learned model)

Example: Instead of writing rules to identify spam emails:

  • Traditional: if "free money" in email: spam = True
  • ML: Show the model 10,000 emails labeled "spam" or "not spam", and it learns the patterns itself.
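
To make the contrast concrete, here is a minimal sketch that puts a hand-written rule next to a learned classifier. The six emails and their labels are made up for illustration, and CountVectorizer with MultinomialNB is just one reasonable choice of text model, not something this lesson prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (a real model needs far more examples)
emails = [
    "free money claim your prize now",
    "win free cash today",
    "cheap pills free shipping",
    "meeting moved to friday",
    "lunch tomorrow with the team",
    "project report attached",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam


def rule_based_is_spam(text):
    """Traditional approach: a rule written by hand."""
    return "free money" in text


# ML approach: word patterns are learned from the labeled examples
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(emails, labels)

new_email = "claim your free prize"
print(f"Rule says spam:  {rule_based_is_spam(new_email)}")
print(f"Model says spam: {bool(spam_model.predict([new_email])[0])}")
```

The rule misses the new email because the exact phrase "free money" is absent, while the model can flag it from individual words it has only seen in spam examples.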

The ML Workflow

1. Collect Data
       │
       ▼
2. Clean & Prepare Data
       │
       ▼
3. Split into Train/Test Sets
       │
       ▼
4. Choose a Model
       │
       ▼
5. Train the Model
       │
       ▼
6. Evaluate Performance
       │
       ▼
7. Tune & Improve
       │
       ▼
8. Deploy
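
The first six steps above can be sketched in a few lines. This is an illustrative outline rather than a complete project: the built-in wine dataset stands in for real data collection, and the scaler plus logistic regression are assumed choices.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Collect data (a built-in dataset stands in for real collection)
X, y = load_wine(return_X_y=True)

# 2. Clean & prepare: this dataset has no missing values, so the only
#    preparation is feature scaling, handled inside the pipeline below

# 3. Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 4. Choose a model (scaler + classifier bundled together)
workflow_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5. Train the model
workflow_model.fit(X_train, y_train)

# 6. Evaluate performance on data the model has not seen
test_accuracy = workflow_model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2%}")

# 7. Tune & improve and 8. Deploy are omitted from this sketch
```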

Regression: Predicting Numbers

Regression predicts continuous values (price, temperature, salary).

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Sample data: house size vs price
# X = features (size in sq ft), y = target (price)
X = np.array([[800], [1000], [1200], [1400], [1600],
              [1800], [2000], [2200], [2400], [2600]])
y = np.array([150000, 180000, 210000, 250000, 280000,
              310000, 350000, 380000, 420000, 450000])

# Step 1: Split data (80% train, 20% test)
# With only 10 samples the test set holds just 2 houses, so the scores below are noisy
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 3: Make predictions
y_pred = model.predict(X_test)

# Step 4: Evaluate
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.0f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.0f}")

# Step 5: Predict new values
new_house = np.array([[1500]])
predicted_price = model.predict(new_house)
print(f"Predicted price for 1500 sqft: ${predicted_price[0]:,.0f}")
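
The "rules" a linear regression learns are just a slope and an intercept. The sketch below reads them from the fitted model and reproduces a prediction by hand. For simplicity it fits on all ten houses rather than on a train split, and the name line_model is only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same house data as above: size in sq ft vs price
X = np.array([[800], [1000], [1200], [1400], [1600],
              [1800], [2000], [2200], [2400], [2600]])
y = np.array([150000, 180000, 210000, 250000, 280000,
              310000, 350000, 380000, 420000, 450000])

line_model = LinearRegression().fit(X, y)

slope = line_model.coef_[0]         # price change per extra sq ft
intercept = line_model.intercept_   # predicted price at size 0

print(f"price ≈ {slope:.2f} * size + {intercept:,.0f}")

# A prediction is nothing more than plugging the size into that line
manual_prediction = slope * 1500 + intercept
library_prediction = line_model.predict(np.array([[1500]]))[0]
print(f"By hand: ${manual_prediction:,.0f}")
print(f"predict: ${library_prediction:,.0f}")
```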

Understanding the Metrics

Metric      What it Measures                               Good Value
R² Score    Share of target variance the model explains    Close to 1.0
MSE         Average squared error                          Close to 0
RMSE        Square root of MSE, in the target's units      Close to 0 (relative to the target's scale)
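
To see what these metrics compute, the sketch below derives each one with plain NumPy and checks the result against scikit-learn. The four prices and predictions are made-up numbers chosen for easy arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true prices and model predictions
y_true = np.array([200000.0, 250000.0, 300000.0, 350000.0])
y_hat = np.array([210000.0, 240000.0, 310000.0, 330000.0])

errors = y_true - y_hat

# MSE: mean of the squared errors
mse = np.mean(errors ** 2)

# RMSE: square root of MSE, back in the target's units (dollars here)
rmse = np.sqrt(mse)

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R²:   {r2:.3f}")

# The hand-computed values agree with scikit-learn's functions
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```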

Classification: Predicting Categories

Classification predicts discrete labels (spam/not spam, cat/dog/bird).

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load iris dataset (150 flowers, 4 measurements, 3 species)
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names

print(f"Features: {feature_names}")
print(f"Classes: {target_names}")
print(f"Samples: {len(X)}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features (matters for distance- and gradient-based models such as SVM or KNN;
# a random forest does not need it, but it is harmless here)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_scaled, y_train)

# Predict
y_pred = clf.predict(X_test_scaled)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))

# Feature importance (which measurements matter most?)
importances = clf.feature_importances_
for name, importance in zip(feature_names, importances):
    print(f"  {name}: {importance:.3f}")
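
The classification report summarizes per-class scores, while a confusion matrix shows which classes get mistaken for which. The following is a small self-contained sketch on the same iris split. It skips scaling, on the assumption that a random forest does not need it.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Rows are the true species, columns are the predicted species.
# Passing labels fixes the row/column order to classes 0, 1, 2.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1, 2])

print("Rows = true species, columns = predicted species")
for name, row in zip(iris.target_names, cm):
    print(f"{name:>12}: {row}")
```

Counts on the diagonal are correct predictions; anything off the diagonal is a flower of one species labeled as another.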

Common ML Algorithms

Algorithm              Type            Best For                Complexity
Linear Regression      Regression      Simple relationships    Low
Logistic Regression    Classification  Binary outcomes         Low
Decision Trees         Both            Interpretable models    Medium
Random Forest          Both            General purpose         Medium
SVM                    Both            High-dimensional data   High
K-Nearest Neighbors    Both            Small datasets          Low
K-Means                Clustering      Grouping data           Medium
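
Because scikit-learn estimators share the same fit/predict interface, trying several of the classifiers from this table takes only a loop. The sketch below uses default settings on the iris dataset, so treat the scores as a quick comparison rather than a benchmark. Scaling is added for the models that are sensitive to feature magnitude.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Mean 5-fold cross-validated accuracy for each candidate
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name:<20} {scores.mean():.2%} (+/- {scores.std():.2%})")
```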

Avoiding Common Pitfalls

1. Data Leakage

# BAD: Using test data for training decisions
scaler.fit(X)  # Fits on ALL data including test
X_train, X_test = train_test_split(X)
# The scaler has already seen test data!

# GOOD: Split first, then fit the scaler on the training set only
X_train, X_test = train_test_split(X)
X_train_scaled = scaler.fit_transform(X_train)  # Learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)  # Reuse those statistics on the test data
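
One way to make this mistake hard to commit is to bundle the scaler and the model in a Pipeline, so the scaler is re-fitted on the training portion of every cross-validation fold. A minimal sketch, assuming iris data and a KNN classifier as stand-ins:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaler and model travel together; fit() only ever sees what it is given
leak_free = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Inside each fold, the scaler is fitted on that fold's training part only
cv_scores = cross_val_score(leak_free, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.2%}")

# After the final fit, the scaler's statistics come from X_train alone
leak_free.fit(X_train, y_train)
fitted_scaler = leak_free.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))
print(f"Test accuracy: {leak_free.score(X_test, y_test):.2%}")
```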

2. Overfitting

# Overfitting: model memorizes training data but fails on new data
from sklearn.model_selection import cross_val_score

# Use cross-validation to detect overfitting
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(f"CV Accuracy: {scores.mean():.2%} (+/- {scores.std():.2%})")

# If training accuracy >> CV accuracy, model is overfitting
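
The comparison described above can be sketched by measuring training accuracy and cross-validated accuracy for the same estimator. The two decision trees here are illustrative choices. Iris is an easy dataset, so the gap may be small, but the pattern is the same on harder problems.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def train_vs_cv(estimator):
    """Return (training accuracy, mean 5-fold CV accuracy) for an estimator."""
    cv_acc = cross_val_score(estimator, X_train, y_train, cv=5).mean()
    train_acc = estimator.fit(X_train, y_train).score(X_train, y_train)
    return train_acc, cv_acc


# An unrestricted tree can keep splitting until it memorizes the training set
deep_train, deep_cv = train_vs_cv(DecisionTreeClassifier(random_state=42))

# Limiting depth is one simple way to restrain the model
shallow_train, shallow_cv = train_vs_cv(
    DecisionTreeClassifier(max_depth=2, random_state=42)
)

print(f"Deep tree:    train {deep_train:.2%}, CV {deep_cv:.2%}")
print(f"Shallow tree: train {shallow_train:.2%}, CV {shallow_cv:.2%}")
```

A large gap between the training and CV numbers for a model is the warning sign; similar numbers suggest the model generalizes about as well as it fits.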

3. Not Enough Data

# Always check your dataset size
print(f"Samples: {len(X)}")
print(f"Features: {X.shape[1]}")
print(f"Samples per class: {np.bincount(y)}")  # np.bincount needs non-negative integer labels
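
With a small dataset, a plain random split can leave a class under-represented in the test set. One common remedy is the stratify parameter of train_test_split, which keeps class proportions the same in both parts. A short sketch on iris:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print(f"Train samples per class: {train_counts}")
print(f"Test samples per class:  {test_counts}")
```

Iris has 50 flowers per species, so a stratified 80/20 split puts 40 of each species in the training set and 10 of each in the test set.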

Key Takeaways

  1. Always split data into train/test BEFORE fitting any preprocessing (scalers, encoders, imputers)
  2. Scale features for algorithms sensitive to magnitude (SVM, KNN)
  3. Use cross-validation for robust evaluation
  4. Start simple, add complexity only when needed
  5. Check for class imbalance in classification problems
  6. Feature engineering often matters more than algorithm choice
  7. R² close to 1.0 on held-out data suggests a good regression fit
  8. Accuracy alone is misleading for imbalanced datasets
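
The last point is easy to demonstrate. In the sketch below, a baseline that always predicts the majority class scores high accuracy on made-up labels with a 95/5 split, even though it never finds a single positive case. The data and the DummyClassifier baseline are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_imbalanced = np.array([0] * 95 + [1] * 5)
X_placeholder = np.zeros((100, 1))  # features are irrelevant to this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_imbalanced)
y_guess = baseline.predict(X_placeholder)

accuracy = accuracy_score(y_imbalanced, y_guess)
recall = recall_score(y_imbalanced, y_guess)
balanced = balanced_accuracy_score(y_imbalanced, y_guess)

print(f"Accuracy:          {accuracy:.2%}")   # looks impressive
print(f"Recall (class 1):  {recall:.2%}")     # no positives found
print(f"Balanced accuracy: {balanced:.2%}")   # no better than chance
```

Metrics such as recall, precision, or balanced accuracy expose what plain accuracy hides here.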
