Why Python for Data Science?
Python has become the dominant language for data science and machine learning, with over 80% of data scientists reportedly using it as their primary tool. Understanding why helps you appreciate what you're learning.
Python vs Other Languages
| Language | Data Science Use | Pros | Cons |
|---|---|---|---|
| Python | Dominant (80%+) | Readable, vast libraries, fast prototyping | Slower than compiled languages |
| R | Statistics (15%) | Built for stats, great viz | Less general-purpose |
| SQL | Data querying (universal) | Declarative, database native | Not for ML/modeling |
| Julia | Scientific computing | Fast, mathematical | Smaller ecosystem |
| Scala/Java | Big data (Spark) | Fast, JVM ecosystem | Verbose, steep learning curve |
The Python Data Science Stack
Your Data Science Workflow
├── Data Storage → SQL, CSV, Parquet, HDF5
├── Data Loading → pandas, SQLAlchemy, pyarrow
├── Numerical Computing → NumPy, SciPy
├── Data Manipulation → pandas, polars
├── Visualization → matplotlib, seaborn, plotly
├── Statistics → scipy.stats, statsmodels, pingouin
├── Machine Learning → scikit-learn, XGBoost, LightGBM
├── Deep Learning → PyTorch, TensorFlow/Keras
├── NLP → transformers, spaCy, NLTK
└── Deployment → FastAPI, Docker, MLflow
Setting Up Your Environment
Option 1: Anaconda (Recommended for Beginners)
# 1. Download Anaconda from https://anaconda.com
# 2. Install following the installer prompts
# 3. Open Anaconda Navigator or use the terminal
# Create a dedicated data science environment
conda create -n dsenv python=3.11 numpy pandas matplotlib seaborn scikit-learn jupyter
conda activate dsenv
# Launch Jupyter Lab
jupyter lab
Option 2: pip + venv (Lightweight)
# Create virtual environment
python -m venv venv
# Activate (Linux/Mac)
source venv/bin/activate
# Activate (Windows)
venv\Scripts\activate
# Install core packages
pip install numpy pandas matplotlib seaborn scikit-learn scipy statsmodels jupyter jupyterlab
# Save dependencies
pip freeze > requirements.txt
# Install from requirements
pip install -r requirements.txt
Verify Your Setup
# Run this in a Jupyter notebook or Python script to verify everything works
import sys
print(f"Python version: {sys.version}")
import numpy as np
print(f"NumPy: {np.__version__}")
import pandas as pd
print(f"Pandas: {pd.__version__}")
import matplotlib
print(f"Matplotlib: {matplotlib.__version__}")
import seaborn as sns
print(f"Seaborn: {sns.__version__}")
import sklearn
print(f"Scikit-learn: {sklearn.__version__}")
import scipy
print(f"SciPy: {scipy.__version__}")
print("\n✅ All packages loaded successfully!")
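The verification script above stops at the first missing import. As a complementary sketch, the standard-library `importlib.metadata` module can report every package in one pass, including the ones that are absent. The names `CORE_PACKAGES` and `report_versions` are made up for this illustration; the package list assumes the distribution names used in the install commands earlier.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as installed with pip/conda (note: "scikit-learn", not "sklearn").
CORE_PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn", "scikit-learn", "scipy"]


def report_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in report_versions(CORE_PACKAGES).items():
    print(f"{name:<14} {ver if ver is not None else 'NOT INSTALLED'}")
```

Because nothing is imported, this check is quick and does not fail halfway through, though it only confirms that a package is installed, not that it imports cleanly.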
Python Basics — Everything in One Place
Variables and Data Types
# Python is DYNAMICALLY TYPED — no type declarations needed
# The type is determined by the value assigned
# --- Numeric Types ---
integer_val = 42 # int: whole numbers, arbitrary precision
float_val = 3.14159 # float: 64-bit IEEE 754 binary floating point (double precision)
complex_val = 3 + 4j # complex: real + imaginary
# Integer tricks
big_number = 1_000_000 # underscores for readability = 1000000
binary = 0b1010 # binary literal = 10
octal = 0o17 # octal literal = 15
hexadecimal = 0xFF # hex literal = 255
# Float gotcha — CRITICAL for data science
print(0.1 + 0.2) # 0.30000000000000004 ← NOT 0.3 !
print(0.1 + 0.2 == 0.3) # False ← dangerous for comparisons
print(abs(0.1 + 0.2 - 0.3) < 1e-10) # True ← compare within a tolerance (math.isclose does this for you)
# Use decimal for financial calculations
from decimal import Decimal
price = Decimal("0.10") + Decimal("0.20")
print(price) # 0.30 ← exact
# --- String ---
name = "Alice" # double or single quotes, both fine
path = r"C:\Users\alice\data" # raw string: backslash is literal
multiline = """Line 1
Line 2
Line 3"""
# f-strings (Python 3.6+) — the BEST way to format
score = 92.456789
print(f"Score: {score:.2f}%") # Score: 92.46%
print(f"Score: {score:>10.2f}") # right-aligned, width 10
print(f"{'LEFT':<10}|{'RIGHT':>10}") # alignment
print(f"{1_000_000:,}") # 1,000,000 (thousands separator)
# String methods (immutable — all return new strings)
s = " Hello, Data Science! "
print(s.strip()) # "Hello, Data Science!" (remove whitespace)
print(s.lower()) # " hello, data science! "
print(s.upper()) # " HELLO, DATA SCIENCE! "
print(s.replace("Data", "Python"))
print("Data" in s) # True (membership test)
print(s.split(",")) # [' Hello', ' Data Science! ']
print(",".join(["a","b","c"])) # "a,b,c"
# String slicing [start:stop:step]
text = "Python"
print(text[0]) # P
print(text[-1]) # n
print(text[1:4]) # yth
print(text[::-1]) # nohtyP (reversed)
print(text[::2]) # Pto (every 2nd)
# --- Boolean ---
is_valid = True
print(True and False) # False
print(True or False) # True
print(not True) # False
# Comparison operators
print(5 > 3) # True
print(5 == 5) # True
print(5 != 3) # True
print(1 < 2 < 3) # True (chained comparison!)
print(0 <= score <= 100) # True
# Truthy / Falsy — IMPORTANT in Python
# Falsy: False, 0, 0.0, 0j, "", [], {}, (), set(), None
# Truthy: everything else
print(bool(0)) # False
print(bool("")) # False
print(bool([])) # False
print(bool(None)) # False
print(bool(42)) # True
print(bool("hi")) # True
# --- None ---
result = None
print(result is None) # True (use 'is', not '==')
print(result == None) # True but PEP8 says use 'is'
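The float gotcha shown earlier in this section is usually handled with the standard library's `math.isclose` rather than a hand-picked threshold. The sketch below is one way to use it; the variable names are only for illustration.

```python
import math

a = 0.1 + 0.2
b = 0.3

exact_equal = (a == b)                  # False: binary floats cannot represent 0.1 exactly
close_default = math.isclose(a, b)      # True: default relative tolerance is 1e-09

# Near zero, a relative tolerance alone is not enough, because the
# allowed difference scales with the magnitude of the inputs.
near_zero_rel = math.isclose(1e-12, 0.0)                 # False
near_zero_abs = math.isclose(1e-12, 0.0, abs_tol=1e-9)   # True

print(exact_equal, close_default, near_zero_rel, near_zero_abs)
```

When comparing values that may be close to zero, passing an explicit `abs_tol` is the safer choice.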
Control Flow
# if / elif / else
temperature = 28
if temperature > 35:
    status = "hot"
elif temperature > 25:
    status = "warm"
elif temperature > 15:
    status = "mild"
else:
    status = "cold"
print(f"It's {status} at {temperature}°C")
# Ternary operator (one-line if/else)
grade = "Pass" if score >= 50 else "Fail"
# match statement (Python 3.10+)
day = "Monday"
match day:
    case "Saturday" | "Sunday":
        print("Weekend!")
    case "Monday":
        print("Start of the work week")
    case _:
        print(f"Weekday: {day}")
# ─── for loops ─────────────────────────────────────────────
# Range — the most common loop pattern
for i in range(5):  # 0, 1, 2, 3, 4
    print(i, end=" ")
for i in range(1, 11):  # 1 to 10
    print(i, end=" ")
for i in range(0, 20, 3):  # 0, 3, 6, ..., 18
    print(i, end=" ")
# Iterate over collections
fruits = ["apple", "banana", "cherry", "date"]
for fruit in fruits:
    print(fruit)
# enumerate: get index AND value
for i, fruit in enumerate(fruits):
    print(f"{i}: {fruit}")
for i, fruit in enumerate(fruits, start=1):  # start from 1
    print(f"{i}. {fruit}")
# zip — parallel iteration
names = ["Alice", "Bob", "Carol"]
scores = [92, 87, 95]
grades = ["A", "B", "A"]
for name, score, grade in zip(names, scores, grades):
    print(f"{name}: {score} ({grade})")
# zip with dict
data_dict = dict(zip(names, scores)) # {'Alice':92,'Bob':87,'Carol':95}
# break, continue, else
for n in range(1, 20):
    if n % 2 == 0:
        continue  # skip even numbers
    if n > 13:
        break  # stop once n exceeds 13
    print(n, end=" ")  # 1 3 5 7 9 11 13
# for...else: else runs ONLY if the loop completed without break
target = 99
for n in range(1, 10):
    if n == target:
        print(f"Found {target}")
        break
else:
    print(f"{target} was not found in range 1-9")
# while loop
count = 0
total = 0
while count < 10:
    total += count
    count += 1
print(f"Sum 0..9 = {total}") # 45
# ─── Comprehensions — Python's superpower ──────────────────
# List comprehension
squares = [x**2 for x in range(10)]
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
evens = [x for x in range(20) if x % 2 == 0]
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
# Nested comprehension
matrix = [[i*j for j in range(1, 4)] for i in range(1, 4)]
# [[1,2,3], [2,4,6], [3,6,9]]
# Dict comprehension
word_lengths = {word: len(word) for word in ["python", "data", "science"]}
# {'python': 6, 'data': 4, 'science': 7}
squares_dict = {x: x**2 for x in range(1, 6) if x % 2 != 0}
# {1:1, 3:9, 5:25}
# Set comprehension
unique_lengths = {len(w) for w in ["hello", "world", "python", "hi"]}
# {2, 5, 6}
# Generator expression — lazy evaluation (memory efficient!)
total = sum(x**2 for x in range(1_000_000)) # never stores list in memory
# Compare: list vs generator
import sys
list_comp = [x**2 for x in range(100_000)]
gen_exp = (x**2 for x in range(100_000))
print(f"List size: {sys.getsizeof(list_comp):,} bytes")
print(f"Generator size: {sys.getsizeof(gen_exp):,} bytes")
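# The generator object stays tiny (on the order of 100-200 bytes, depending on the Python version),
# while the list takes roughly 800 KB before counting the integer objects it holds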
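Generator expressions have a function counterpart: any function containing `yield` produces values lazily in the same way. The sketch below uses a hypothetical `running_total` helper (not part of the original material) to show that values are computed on demand and that a generator can be consumed only once.

```python
def running_total(values):
    """Yield the cumulative sum after each value, one item at a time."""
    total = 0
    for v in values:
        total += v
        yield total


gen = running_total(range(1, 6))   # nothing is computed yet
first = next(gen)                  # computes only the first value: 1
rest = list(gen)                   # consumes the remainder: [3, 6, 10, 15]
exhausted = list(gen)              # []: the generator is now empty

print(first, rest, exhausted)
```

If the results are needed more than once, materialize them with `list(...)` or call the generator function again.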
Functions
# Basic function with type hints and docstring
def calculate_bmi(weight_kg: float, height_m: float) -> float:
    """
    Calculate Body Mass Index (BMI).
    Args:
        weight_kg: Weight in kilograms
        height_m: Height in meters
    Returns:
        BMI value (float)
    Raises:
        ValueError: If weight or height are not positive
    Example:
        >>> calculate_bmi(70, 1.75)
        22.857142857142858
    """
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("Weight and height must be positive numbers")
    return weight_kg / (height_m ** 2)
bmi = calculate_bmi(70, 1.75)
print(f"BMI: {bmi:.1f}")
# Default arguments — evaluated ONCE at function definition
def greet(name: str, greeting: str = "Hello") -> str:
    return f"{greeting}, {name}!"
print(greet("Alice"))  # Hello, Alice!
print(greet("Bob", greeting="Hi"))  # Hi, Bob!
# NEVER use mutable defaults: a common Python gotcha!
def wrong_append(item, lst=[]):  # BAD: lst is shared across calls!
    lst.append(item)
    return lst
def correct_append(item, lst=None):  # GOOD
    if lst is None:
        lst = []
    lst.append(item)
    return lst
# *args — variable positional arguments
def sum_all(*numbers: float) -> float:
    """Accept any number of positional arguments."""
    return sum(numbers)
print(sum_all(1, 2, 3))  # 6
print(sum_all(1, 2, 3, 4, 5, 6))  # 21
# **kwargs: variable keyword arguments
def create_profile(**kwargs) -> dict:
    """Build a user profile from keyword arguments."""
    defaults = {"role": "user", "active": True}
    defaults.update(kwargs)
    return defaults
profile = create_profile(name="Alice", age=28, role="admin")
print(profile)
# Combining *args and **kwargs
def log_event(event_type: str, *args, **kwargs):
    print(f"EVENT: {event_type}")
    print(f" positional: {args}")
    print(f" keyword: {kwargs}")
log_event("LOGIN", "user123", timestamp="2024-01-15", ip="192.168.1.1")
# Unpacking when calling
coords = (3, 4)
def distance(x, y):
    return (x**2 + y**2)**0.5
print(distance(*coords)) # unpack tuple into args
config = {"weight_kg": 70, "height_m": 1.75}
print(calculate_bmi(**config)) # unpack dict into kwargs
# Lambda — anonymous one-liner functions
square = lambda x: x**2
add = lambda x, y: x + y
clamp = lambda x, lo, hi: max(lo, min(x, hi))
# Lambdas shine with higher-order functions
data = [("Alice", 92), ("Bob", 78), ("Carol", 95), ("Dave", 88)]
sorted_by_score = sorted(data, key=lambda x: x[1], reverse=True)
print(sorted_by_score) # [('Carol',95),('Alice',92),('Dave',88),('Bob',78)]
# map, filter, reduce
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
squared = list(map(lambda x: x**2, numbers))
evens = list(filter(lambda x: x % 2 == 0, numbers))
from functools import reduce
total = reduce(lambda acc, x: acc + x, numbers) # 55
# Closures — functions that remember their enclosing scope
def make_adder(n: int):
    """Return a function that adds n to its argument."""
    def adder(x: int) -> int:
        return x + n  # n is captured from outer scope
    return adder
add5 = make_adder(5)
add10 = make_adder(10)
print(add5(3)) # 8
print(add10(3)) # 13
# Decorators — wrap functions with additional behavior
import functools
import time
def timer(func):
    """Decorator that prints execution time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"⏱ {func.__name__} completed in {elapsed:.4f}s")
        return result
    return wrapper
@timer
def compute_sum(n):
    return sum(range(n))
compute_sum(10_000_000)
# Memoization — cache expensive function results
from functools import lru_cache
@lru_cache(maxsize=None)
def fibonacci(n: int) -> int:
    """Compute nth Fibonacci number with memoization."""
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
print(fibonacci(100)) # 354224848179261915075 (instant!)
print(f"Cache info: {fibonacci.cache_info()}")
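The mutable-default gotcha described earlier in this section is easier to believe once you see it happen. The sketch below repeats the two functions from that example and calls each twice; the result variable names are only for illustration.

```python
def wrong_append(item, lst=[]):      # the default list is created once, at definition time
    lst.append(item)
    return lst


def correct_append(item, lst=None):  # a fresh list is created on every call
    if lst is None:
        lst = []
    lst.append(item)
    return lst


first_wrong = wrong_append(1)
second_wrong = wrong_append(2)
shared = first_wrong is second_wrong   # True: both names point at the same default list

first_ok = correct_append(1)
second_ok = correct_append(2)

print(first_wrong, second_wrong, shared)   # [1, 2] [1, 2] True
print(first_ok, second_ok)                 # [1] [2]
```

Note that `first_wrong` also shows `[1, 2]`: the second call mutated the very list that the first call returned.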
Data Structures Deep Dive
# ── Lists ─────────────────────────────────────────────────
scores = [85, 92, 78, 95, 88, 73, 91]
# Modification
scores.append(96) # add to end
scores.insert(0, 70) # insert at index 0
scores.extend([80, 82]) # add multiple elements
scores.remove(73) # remove first occurrence of value
popped = scores.pop() # remove and return last (or specify index)
scores.sort(reverse=True) # sort in place
sorted_copy = sorted(scores) # new sorted list, original unchanged
scores.reverse() # reverse in place
# Slicing — [start:stop:step] (start inclusive, stop exclusive)
print(scores[1:4]) # elements at index 1, 2, 3
print(scores[:3]) # first 3
print(scores[-3:]) # last 3
print(scores[::2]) # every 2nd element
print(scores[::-1]) # reversed copy
# List operations
print(len(scores)) # length
print(sum(scores)) # sum
print(min(scores), max(scores))
print(scores.count(95)) # how many times 95 appears
print(scores.index(95)) # first position of 95
print(95 in scores) # membership: True/False
# Useful patterns
flat = [x for sublist in [[1,2],[3,4],[5,6]] for x in sublist] # flatten
unique = list(dict.fromkeys(scores)) # deduplicate preserving order
chunks = [scores[i:i+3] for i in range(0, len(scores), 3)] # chunking
# ── Tuples — immutable lists ───────────────────────────────
dimensions = (1920, 1080) # width, height
width, height = dimensions # unpacking
# Extended unpacking
first, *middle, last = [1, 2, 3, 4, 5]
# first=1, middle=[2,3,4], last=5
# Named tuples — tuples with named fields
from collections import namedtuple
Student = namedtuple('Student', ['name', 'gpa', 'year'])
alice = Student('Alice', 3.9, 'Junior')
print(alice.name) # Alice
print(alice[0]) # Alice (index still works)
print(alice._asdict()) # {'name': 'Alice', 'gpa': 3.9, 'year': 'Junior'} (a plain dict since Python 3.8; handy for serialization)
# ── Dictionaries ──────────────────────────────────────────
student = {
    "name": "Alice",
    "gpa": 3.9,
    "courses": ["Stats", "ML", "Python"],
    "active": True
}
# Access
print(student["name"]) # Alice (KeyError if missing)
print(student.get("gpa")) # 3.9
print(student.get("age", "N/A")) # N/A (safe default)
# Modification
student["email"] = "alice@univ.edu" # add new key
student.update({"gpa": 4.0, "year": 3}) # update multiple
# Iteration
for key in student:
    print(f"{key}: {student[key]}")
for key, value in student.items():  # items() gives (key, value) pairs
    print(f" {key!r}: {value!r}")
# Dict comprehension
squared = {x: x**2 for x in range(1, 6)} # {1:1, 2:4, 3:9, 4:16, 5:25}
# defaultdict — no KeyError for missing keys
from collections import defaultdict
word_count = defaultdict(int)
text = "to be or not to be that is the question"
for word in text.split():
    word_count[word] += 1
# Counter — specialized dict for frequency counting
from collections import Counter
freq = Counter(text.split())
print(freq.most_common(3)) # [('to',2),('be',2),('or',1)]
# Merge dicts (Python 3.9+)
defaults = {"theme": "dark", "lang": "en"}
user_prefs = {"lang": "fr", "timezone": "UTC"}
config = defaults | user_prefs # {'theme':'dark','lang':'fr','timezone':'UTC'}
# ── Sets ──────────────────────────────────────────────────
programming_languages = {"Python", "R", "Julia", "Scala"}
data_tools = {"Python", "SQL", "Excel", "Tableau"}
union = programming_languages | data_tools # all tools
intersection = programming_languages & data_tools # in both
difference = programming_languages - data_tools # only in first
sym_diff = programming_languages ^ data_tools # in one but not both
print(f"All tools: {union}")
print(f"In both: {intersection}") # {Python}
# Fast membership testing
large_set = set(range(1_000_000))
print(999_999 in large_set) # True; a hash lookup is O(1) on average
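The `defaultdict(int)` counting pattern shown earlier extends naturally to grouping when the factory is `list`. The sketch below groups some made-up enrollment records by course; the data and names are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical (student, course) pairs, not taken from any real dataset.
enrollments = [
    ("Alice", "Stats"),
    ("Bob", "ML"),
    ("Alice", "ML"),
    ("Carol", "Stats"),
]

by_course = defaultdict(list)
for student_name, course in enrollments:
    # A missing key is created with an empty list automatically.
    by_course[course].append(student_name)

grouped = dict(by_course)   # convert back to a plain dict for display or serialization
print(grouped)              # {'Stats': ['Alice', 'Carol'], 'ML': ['Bob', 'Alice']}
```

This is the same idea pandas applies at scale with `groupby`, which makes it a useful mental model before moving on to DataFrames.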
Error Handling — Production-Quality Code
import logging
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s'
)
logger = logging.getLogger(__name__)
# Basic try/except/else/finally
def safe_divide(a: float, b: float) -> float:
    """Divide a by b with error handling."""
    try:
        result = a / b
    except ZeroDivisionError:
        logger.warning(f"Attempted division by zero: {a} / {b}")
        return float('inf')
    except TypeError as e:
        logger.error(f"Type error in division: {e}")
        raise
    else:
        # runs ONLY if no exception was raised
        logger.debug(f"Division successful: {a}/{b} = {result}")
        return result
    finally:
        # ALWAYS runs; use for cleanup
        pass
print(safe_divide(10, 2)) # 5.0
print(safe_divide(10, 0)) # inf (with warning)
# Custom exceptions
class DataValidationError(Exception):
    """Raised when data fails validation checks."""
    def __init__(self, field: str, value, message: str):
        self.field = field
        self.value = value
        super().__init__(f"Field '{field}': {message} (got {value!r})")
class MissingFieldError(DataValidationError):
    """Raised when a required field is missing."""
    def __init__(self, field: str):
        super().__init__(field, None, "Required field is missing")
def validate_dataset_record(record: dict) -> dict:
    """Validate a single data record."""
    required_fields = ["id", "value", "timestamp"]
    for field in required_fields:
        if field not in record:
            raise MissingFieldError(field)
    if not isinstance(record["value"], (int, float)):
        raise DataValidationError("value", record["value"], "Must be numeric")
    if record["value"] < 0:
        raise DataValidationError("value", record["value"], "Must be non-negative")
    return record
# Usage
records = [
    {"id": 1, "value": 42.5, "timestamp": "2024-01-15"},
    {"id": 2, "value": -5.0, "timestamp": "2024-01-16"},  # will fail
    {"id": 3, "timestamp": "2024-01-17"},  # missing value
]
valid_records = []
for record in records:
    try:
        validated = validate_dataset_record(record)
        valid_records.append(validated)
    except MissingFieldError as e:
        print(f"⚠️ Missing field in record: {e}")
    except DataValidationError as e:
        print(f"❌ Validation failed: {e}")
print(f"\n{len(valid_records)} valid records out of {len(records)}")
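In the usage loop above, `MissingFieldError` is caught before `DataValidationError`, and that order matters because the first matching `except` clause wins. The sketch below uses simplified stand-ins for the two exception classes (no custom `__init__`) and two hypothetical helper functions to show what happens when the order is swapped.

```python
class DataValidationError(Exception):
    """Simplified stand-in for the base validation error."""


class MissingFieldError(DataValidationError):
    """Simplified stand-in for the missing-field subclass."""


def classify_specific_first(exc):
    """Subclass is listed first, so it gets its own handler."""
    try:
        raise exc
    except MissingFieldError:
        return "missing"
    except DataValidationError:
        return "invalid"


def classify_base_first(exc):
    """Base class is listed first, so it also swallows the subclass."""
    try:
        raise exc
    except DataValidationError:
        return "invalid"
    except MissingFieldError:   # unreachable: the base class already matched
        return "missing"


print(classify_specific_first(MissingFieldError("id")))   # missing
print(classify_base_first(MissingFieldError("id")))       # invalid
```

A reasonable rule of thumb is to order handlers from most specific to most general.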
Jupyter Notebook Productivity Tips
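# Magic commands (Jupyter-specific)
# Keep comments on their own lines: some magics treat trailing text as an argument
# Time a single expression (runs it repeatedly and reports statistics)
%timeit sum(range(1000))
# Time a single run (expensive_function is a placeholder for your own function)
%time expensive_function()
# Time an entire cell (%%timeit must be the first line of its cell)
%%timeit
import numpy as np
np.sum(np.arange(1000))
# Show plots in the notebook
%matplotlib inline
# Auto-reload modules before executing code
%load_ext autoreload
%autoreload 2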
# Shell commands in Jupyter
!pip list | grep numpy
!ls -la data/
# Display utilities
from IPython.display import display, HTML, Markdown
display(HTML("<h2 style='color:blue'>Formatted Output</h2>"))
import pandas as pd
pd.set_option('display.max_columns', 50) # show more columns
pd.set_option('display.float_format', '{:.3f}'.format) # format floats
Practice Exercises
Exercise 1: Write a function normalize(data) that takes a list of numbers and returns a new list scaled to [0, 1].
Exercise 2: Use a dict comprehension to create a word-frequency map from a sentence.
Exercise 3: Write a decorator retry(n) that retries a function up to n times on exception.
# Exercise 1 solution
def normalize(data: list) -> list:
    """Min-max normalize a list of numbers to [0, 1]."""
    min_val, max_val = min(data), max(data)
    if max_val == min_val:
        return [0.0] * len(data)
    return [(x - min_val) / (max_val - min_val) for x in data]
print(normalize([10, 20, 30, 40, 50]))
# [0.0, 0.25, 0.5, 0.75, 1.0]
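Exercise 2 has no solution listed, so here is one possible sketch. It assumes that words are separated by whitespace and that case should be ignored; the sample sentence is reused from the `Counter` example earlier.

```python
# Exercise 2: one possible solution
sentence = "to be or not to be that is the question"
words = sentence.lower().split()

# Iterate over the unique words so each key is computed only once.
word_freq = {word: words.count(word) for word in set(words)}

print(word_freq["to"], word_freq["be"], word_freq["question"])   # 2 2 1
```

Since `words.count` rescans the whole list for every unique word, this is fine for a sentence but slow for large texts, where `collections.Counter` is the practical choice.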
# Exercise 3 solution
import time
import functools
def retry(n: int, delay: float = 0.0):
    """Decorator that retries function up to n times on exception."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(1, n + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    print(f"Attempt {attempt}/{n} failed: {e}")
                    if attempt < n and delay > 0:
                        time.sleep(delay)
            raise last_exception
        return wrapper
    return decorator
@retry(3, delay=0.1)
def unstable_api_call(url: str):
    import random
    if random.random() < 0.7:  # 70% failure rate
        raise ConnectionError(f"Failed to connect to {url}")
    return f"Success: {url}"
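Because `unstable_api_call` fails at random, it is awkward to use for checking that the decorator behaves as intended. The sketch below swaps in a deterministic, hypothetical `flaky` function that fails exactly twice, together with a simplified copy of `retry` (no delay, no printing) so the block runs on its own.

```python
import functools


def retry(n):
    """Simplified retry decorator: same idea as the solution above, without the delay."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for _ in range(n):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
            raise last_exception
        return wrapper
    return decorator


calls = {"count": 0}   # mutable counter shared with the function below


@retry(3)
def flaky():
    """Hypothetical function that fails on the first two calls, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky()
print(result, calls["count"])   # ok 3
```

With `@retry(2)` instead, the same function would exhaust its attempts and re-raise the last `ConnectionError`.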
Key Takeaways
- Python is dynamically typed: variables hold references, not values
- f-strings are the best way to format strings in Python 3.6+
- Floating-point arithmetic is not exact: never compare floats with ==
- Comprehensions are Pythonic and fast: prefer them over explicit loops for transformations
- Mutable default arguments are a common bug: always use None as default for lists/dicts
- Decorators add behavior to functions without modifying them: use for logging, timing, caching
- Always write docstrings and type hints: your future self will thank you
- Custom exceptions make error handling self-documenting and precise
What's Next
→ NumPy: Fast Array Computing
→ Pandas: Data Manipulation
→ Data Visualization
→ Machine Learning with Scikit-Learn