Python for Data Science — Complete Introduction

Why Python for Data Science?

Python is the most widely used language in data science and machine learning; by some survey estimates, over 80% of data scientists use it as their primary tool. Knowing why helps you see how the pieces you are about to learn fit together.

Python vs Other Languages

Language	Data Science Use	Pros	Cons
Python	Dominant (80%+)	Readable, vast libraries, fast prototyping	Slower than compiled languages
R	Statistics (15%)	Built for stats, great viz	Less general-purpose
SQL	Data querying (universal)	Declarative, database native	Not for ML/modeling
Julia	Scientific computing	Fast, mathematical	Smaller ecosystem
Scala/Java	Big data (Spark)	Fast, JVM ecosystem	Verbose, steep learning curve

The Python Data Science Stack

Your Data Science Workflow
├── Data Storage        → SQL, CSV, Parquet, HDF5
├── Data Loading        → pandas, SQLAlchemy, pyarrow
├── Numerical Computing → NumPy, SciPy
├── Data Manipulation   → pandas, polars
├── Visualization       → matplotlib, seaborn, plotly
├── Statistics          → scipy.stats, statsmodels, pingouin
├── Machine Learning    → scikit-learn, XGBoost, LightGBM
├── Deep Learning       → PyTorch, TensorFlow/Keras
├── NLP                 → transformers, spaCy, NLTK
└── Deployment          → FastAPI, Docker, MLflow

Setting Up Your Environment

Option 1: Anaconda (Recommended for Beginners)

# 1. Download Anaconda from https://anaconda.com
# 2. Install following the installer prompts
# 3. Open Anaconda Navigator or use the terminal

# Create a dedicated data science environment
conda create -n dsenv python=3.11 numpy pandas matplotlib seaborn scikit-learn jupyter
conda activate dsenv

# Launch Jupyter Lab
jupyter lab

Option 2: pip + venv (Lightweight)

# Create virtual environment
python -m venv venv

# Activate (Linux/Mac)
source venv/bin/activate

# Activate (Windows)
venv\Scripts\activate

# Install core packages
pip install numpy pandas matplotlib seaborn scikit-learn scipy statsmodels jupyter jupyterlab

# Save dependencies
pip freeze > requirements.txt

# Install from requirements
pip install -r requirements.txt

Verify Your Setup

# Run this in a Jupyter notebook or Python script to verify everything works

import sys
print(f"Python version: {sys.version}")

import numpy as np
print(f"NumPy: {np.__version__}")

import pandas as pd
print(f"Pandas: {pd.__version__}")

import matplotlib
print(f"Matplotlib: {matplotlib.__version__}")

import seaborn as sns
print(f"Seaborn: {sns.__version__}")

import sklearn
print(f"Scikit-learn: {sklearn.__version__}")

import scipy
print(f"SciPy: {scipy.__version__}")

print("\n✅ All packages loaded successfully!")
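
The script above stops at the first package that fails to import. A minimal sketch of a more forgiving check, assuming the same package list as the install commands; `check_packages` is a hypothetical helper name used only for illustration:

```python
import importlib


def check_packages(names):
    """Return {name: version string, or None if the import failed}."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Not every module defines __version__, so fall back to "unknown"
            report[name] = getattr(module, "__version__", "unknown")
    return report


packages = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn", "scipy"]
for name, version in check_packages(packages).items():
    status = version if version is not None else "NOT INSTALLED"
    print(f"{name:<12} {status}")
```

This way a single missing package shows up as one line in the report instead of a traceback.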

Python Basics — Everything in One Place

Variables and Data Types

# Python is DYNAMICALLY TYPED — no type declarations needed
# The type is determined by the value assigned

# --- Numeric Types ---
integer_val = 42               # int: whole numbers, arbitrary precision
float_val   = 3.14159          # float: 64-bit IEEE 754 binary floating point (double)
complex_val = 3 + 4j           # complex: real + imaginary

# Integer tricks
big_number = 1_000_000         # underscores for readability = 1000000
binary     = 0b1010            # binary literal = 10
octal      = 0o17              # octal literal  = 15
hexadecimal = 0xFF             # hex literal    = 255

# Float gotcha — CRITICAL for data science
print(0.1 + 0.2)               # 0.30000000000000004  ← NOT 0.3 !
print(0.1 + 0.2 == 0.3)        # False  ← dangerous for comparisons
print(abs(0.1 + 0.2 - 0.3) < 1e-10)  # True ← compare with a tolerance (or use math.isclose)

# Use decimal for financial calculations
from decimal import Decimal
price = Decimal("0.10") + Decimal("0.20")
print(price)  # 0.30  ← exact

# --- String ---
name = "Alice"                  # double or single quotes, both fine
path = r"C:\Users\alice\data"   # raw string: backslash is literal
multiline = """Line 1
Line 2
Line 3"""

# f-strings (Python 3.6+) — the BEST way to format
score = 92.456789
print(f"Score: {score:.2f}%")        # Score: 92.46%
print(f"Score: {score:>10.2f}")      # right-aligned, width 10
print(f"{'LEFT':<10}|{'RIGHT':>10}") # alignment
print(f"{1_000_000:,}")              # 1,000,000 (thousands separator)

# String methods (immutable — all return new strings)
s = "  Hello, Data Science!  "
print(s.strip())           # "Hello, Data Science!"  (remove whitespace)
print(s.lower())           # "  hello, data science!  "
print(s.upper())           # "  HELLO, DATA SCIENCE!  "
print(s.replace("Data", "Python"))
print("Data" in s)         # True (membership test)
print(s.split(","))        # ['  Hello', ' Data Science!  ']
print(",".join(["a","b","c"]))  # "a,b,c"

# String slicing [start:stop:step]
text = "Python"
print(text[0])             # P
print(text[-1])            # n
print(text[1:4])           # yth
print(text[::-1])          # nohtyP  (reversed)
print(text[::2])           # Pto     (every 2nd)

# --- Boolean ---
is_valid = True
print(True and False)      # False
print(True or False)       # True
print(not True)            # False

# Comparison operators
print(5 > 3)               # True
print(5 == 5)              # True
print(5 != 3)              # True
print(1 < 2 < 3)           # True  (chained comparison!)
print(0 <= score <= 100)   # True

# Truthy / Falsy — IMPORTANT in Python
# Falsy: False, 0, 0.0, 0j, "", [], {}, (), set(), None
# Truthy: everything else
print(bool(0))    # False
print(bool(""))   # False
print(bool([]))   # False
print(bool(None)) # False
print(bool(42))   # True
print(bool("hi")) # True

# --- None ---
result = None
print(result is None)   # True  (use 'is', not '==')
print(result == None)   # True, but PEP 8 recommends 'is' for None checks
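
The fixed `1e-10` tolerance used in the float comparison earlier in this section works for values near 1, but not for very large or very small numbers. A short sketch of the standard-library alternative, `math.isclose`, which applies a relative tolerance by default (the sample values are illustrative):

```python
import math

# Relative tolerance (default rel_tol=1e-09) scales with the operands
print(math.isclose(0.1 + 0.2, 0.3))            # True

# A fixed absolute tolerance breaks down for large magnitudes
a = 1e16
b = a + 2.0                                    # the next representable float
print(abs(a - b) < 1e-10)                      # False
print(math.isclose(a, b))                      # True

# Near zero a relative tolerance is never satisfied, so pass abs_tol
print(math.isclose(1e-12, 0.0))                # False
print(math.isclose(1e-12, 0.0, abs_tol=1e-9))  # True
```

For whole arrays, NumPy provides the analogous `np.isclose` and `np.allclose`.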

Control Flow

# if / elif / else
temperature = 28

if temperature > 35:
    status = "hot"
elif temperature > 25:
    status = "warm"
elif temperature > 15:
    status = "mild"
else:
    status = "cold"

print(f"It's {status} at {temperature}°C")

# Ternary operator (one-line if/else)
grade = "Pass" if score >= 50 else "Fail"

# match statement (Python 3.10+)
day = "Monday"
match day:
    case "Saturday" | "Sunday":
        print("Weekend!")
    case "Monday":
        print("Start of the work week")
    case _:
        print(f"Weekday: {day}")

# ─── for loops ─────────────────────────────────────────────

# Range — the most common loop pattern
for i in range(5):        # 0, 1, 2, 3, 4
    print(i, end=" ")

for i in range(1, 11):    # 1 to 10
    print(i, end=" ")

for i in range(0, 20, 3): # 0, 3, 6, ..., 18
    print(i, end=" ")

# Iterate over collections
fruits = ["apple", "banana", "cherry", "date"]

for fruit in fruits:
    print(fruit)

# enumerate — get index AND value
for i, fruit in enumerate(fruits):
    print(f"{i}: {fruit}")

for i, fruit in enumerate(fruits, start=1):  # start from 1
    print(f"{i}. {fruit}")

# zip — parallel iteration
names  = ["Alice", "Bob", "Carol"]
scores = [92, 87, 95]
grades = ["A", "B", "A"]

for name, score, grade in zip(names, scores, grades):
    print(f"{name}: {score} ({grade})")

# zip with dict
data_dict = dict(zip(names, scores))  # {'Alice':92,'Bob':87,'Carol':95}

# break, continue, else
for n in range(1, 20):
    if n % 2 == 0:
        continue          # skip even numbers
    if n > 13:
        break             # stop once n exceeds 13
    print(n, end=" ")     # 1 3 5 7 9 11 13

# for...else — else runs ONLY if loop completed without break
target = 99
for n in range(1, 10):
    if n == target:
        print(f"Found {target}")
        break
else:
    print(f"{target} was not found in range 1-9")

# while loop
count = 0
total = 0
while count < 10:
    total += count
    count += 1
print(f"Sum 0..9 = {total}")   # 45

# ─── Comprehensions — Python's superpower ──────────────────

# List comprehension
squares = [x**2 for x in range(10)]
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

evens = [x for x in range(20) if x % 2 == 0]
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

# Nested comprehension
matrix = [[i*j for j in range(1, 4)] for i in range(1, 4)]
# [[1,2,3], [2,4,6], [3,6,9]]

# Dict comprehension
word_lengths = {word: len(word) for word in ["python", "data", "science"]}
# {'python': 6, 'data': 4, 'science': 7}

squares_dict = {x: x**2 for x in range(1, 6) if x % 2 != 0}
# {1:1, 3:9, 5:25}

# Set comprehension
unique_lengths = {len(w) for w in ["hello", "world", "python", "hi"]}
# {2, 5, 6}

# Generator expression — lazy evaluation (memory efficient!)
total = sum(x**2 for x in range(1_000_000))  # never stores list in memory

# Compare: list vs generator
import sys
list_comp = [x**2 for x in range(100_000)]
gen_exp   = (x**2 for x in range(100_000))
print(f"List size:      {sys.getsizeof(list_comp):,} bytes")
print(f"Generator size: {sys.getsizeof(gen_exp):,} bytes")
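# Generator: roughly 100-200 bytes depending on the Python version, vs ~800 KB for the list
# (getsizeof counts only the list's own pointer array, not the int objects it refers to)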
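
The generator expression above has a function counterpart: a generator function that uses `yield`. A minimal sketch, assuming you want to process a long sequence in fixed-size batches without building every batch up front; `in_batches` is a hypothetical helper defined here (Python 3.12 ships a similar `itertools.batched`):

```python
from typing import Iterable, Iterator, List


def in_batches(items: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` items, one batch at a time."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:            # final, possibly shorter, batch
        yield batch


# Only one batch exists in memory at any moment
for batch in in_batches(range(10), size=4):
    print(batch, sum(batch))
# [0, 1, 2, 3] 6
# [4, 5, 6, 7] 22
# [8, 9] 17
```

The same pattern is a common way to stream a large file or query result through a pipeline in pieces.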

Functions

# Basic function with type hints and docstring
def calculate_bmi(weight_kg: float, height_m: float) -> float:
    """
    Calculate Body Mass Index (BMI).

    Args:
        weight_kg: Weight in kilograms
        height_m: Height in meters

    Returns:
        BMI value (float)

    Raises:
        ValueError: If weight or height are not positive

    Example:
        >>> calculate_bmi(70, 1.75)
        22.857142857142858
    """
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("Weight and height must be positive numbers")
    return weight_kg / (height_m ** 2)

bmi = calculate_bmi(70, 1.75)
print(f"BMI: {bmi:.1f}")

# Default arguments — evaluated ONCE at function definition
def greet(name: str, greeting: str = "Hello") -> str:
    return f"{greeting}, {name}!"

print(greet("Alice"))                  # Hello, Alice!
print(greet("Bob", greeting="Hi"))     # Hi, Bob!

# NEVER use mutable defaults — common Python gotcha!
def wrong_append(item, lst=[]):   # BAD: lst is shared across calls!
    lst.append(item)
    return lst

def correct_append(item, lst=None):  # GOOD
    if lst is None:
        lst = []
    lst.append(item)
    return lst

# *args — variable positional arguments
def sum_all(*numbers: float) -> float:
    """Accept any number of positional arguments."""
    return sum(numbers)

print(sum_all(1, 2, 3))              # 6
print(sum_all(1, 2, 3, 4, 5, 6))    # 21

# **kwargs — variable keyword arguments
def create_profile(**kwargs) -> dict:
    """Build a user profile from keyword arguments."""
    defaults = {"role": "user", "active": True}
    defaults.update(kwargs)
    return defaults

profile = create_profile(name="Alice", age=28, role="admin")
print(profile)

# Combining *args and **kwargs
def log_event(event_type: str, *args, **kwargs):
    print(f"EVENT: {event_type}")
    print(f"  positional: {args}")
    print(f"  keyword:    {kwargs}")

log_event("LOGIN", "user123", timestamp="2024-01-15", ip="192.168.1.1")

# Unpacking when calling
coords = (3, 4)
def distance(x, y):
    return (x**2 + y**2)**0.5

print(distance(*coords))             # unpack tuple into args

config = {"weight_kg": 70, "height_m": 1.75}
print(calculate_bmi(**config))       # unpack dict into kwargs

# Lambda — anonymous one-liner functions
square  = lambda x: x**2
add     = lambda x, y: x + y
clamp   = lambda x, lo, hi: max(lo, min(x, hi))

# Lambdas shine with higher-order functions
data = [("Alice", 92), ("Bob", 78), ("Carol", 95), ("Dave", 88)]
sorted_by_score = sorted(data, key=lambda x: x[1], reverse=True)
print(sorted_by_score)   # [('Carol',95),('Alice',92),('Dave',88),('Bob',78)]

# map, filter, reduce
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
squared = list(map(lambda x: x**2, numbers))
evens   = list(filter(lambda x: x % 2 == 0, numbers))

from functools import reduce
total = reduce(lambda acc, x: acc + x, numbers)   # 55

# Closures — functions that remember their enclosing scope
def make_adder(n: int):
    """Return a function that adds n to its argument."""
    def adder(x: int) -> int:
        return x + n    # n is captured from outer scope
    return adder

add5  = make_adder(5)
add10 = make_adder(10)
print(add5(3))    # 8
print(add10(3))   # 13

# Decorators — wrap functions with additional behavior
import functools
import time

def timer(func):
    """Decorator that prints execution time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"⏱  {func.__name__} completed in {elapsed:.4f}s")
        return result
    return wrapper

@timer
def compute_sum(n):
    return sum(range(n))

compute_sum(10_000_000)

# Memoization — cache expensive function results
from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n: int) -> int:
    """Compute nth Fibonacci number with memoization."""
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(100))   # 354224848179261915075  (instant!)
print(f"Cache info: {fibonacci.cache_info()}")
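
The mutable-default warning earlier in this section defines both variants but never calls them. A short sketch that runs each one twice to make the shared state visible (the functions are repeated so the example runs on its own):

```python
def wrong_append(item, lst=[]):      # the default list is created once
    lst.append(item)
    return lst


def correct_append(item, lst=None):  # a fresh list is created per call
    if lst is None:
        lst = []
    lst.append(item)
    return lst


print(wrong_append(1))     # [1]
print(wrong_append(2))     # [1, 2]  <- state leaked from the first call
print(correct_append(1))   # [1]
print(correct_append(2))   # [2]

# The shared list lives on the function object itself
print(wrong_append.__defaults__)   # ([1, 2],)
```

Inspecting `__defaults__` shows why the bug happens: the default object is stored once on the function and reused by every call that omits the argument.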

Data Structures Deep Dive

# ── Lists ─────────────────────────────────────────────────
scores = [85, 92, 78, 95, 88, 73, 91]

# Modification
scores.append(96)            # add to end
scores.insert(0, 70)         # insert at index 0
scores.extend([80, 82])      # add multiple elements
scores.remove(73)            # remove first occurrence of value
popped = scores.pop()        # remove and return last (or specify index)
scores.sort(reverse=True)    # sort in place
sorted_copy = sorted(scores) # new sorted list, original unchanged
scores.reverse()             # reverse in place

# Slicing — [start:stop:step]  (start inclusive, stop exclusive)
print(scores[1:4])           # elements at index 1, 2, 3
print(scores[:3])            # first 3
print(scores[-3:])           # last 3
print(scores[::2])           # every 2nd element
print(scores[::-1])          # reversed copy

# List operations
print(len(scores))           # length
print(sum(scores))           # sum
print(min(scores), max(scores))
print(scores.count(95))      # how many times 95 appears
print(scores.index(95))      # first position of 95
print(95 in scores)          # membership: True/False

# Useful patterns
flat = [x for sublist in [[1,2],[3,4],[5,6]] for x in sublist]  # flatten
unique = list(dict.fromkeys(scores))  # deduplicate preserving order
chunks = [scores[i:i+3] for i in range(0, len(scores), 3)]  # chunking

# ── Tuples: immutable sequences ───────────────────────────
dimensions = (1920, 1080)      # width, height
width, height = dimensions     # unpacking

# Extended unpacking
first, *middle, last = [1, 2, 3, 4, 5]
# first=1, middle=[2,3,4], last=5

# Named tuples — tuples with named fields
from collections import namedtuple
Student = namedtuple('Student', ['name', 'gpa', 'year'])
alice = Student('Alice', 3.9, 'Junior')
print(alice.name)      # Alice
print(alice[0])        # Alice (index still works)
print(alice._asdict()) # plain dict (an OrderedDict before Python 3.8), handy for serialization

# ── Dictionaries ──────────────────────────────────────────
student = {
    "name": "Alice",
    "gpa": 3.9,
    "courses": ["Stats", "ML", "Python"],
    "active": True
}

# Access
print(student["name"])                    # Alice (KeyError if missing)
print(student.get("gpa"))                 # 3.9
print(student.get("age", "N/A"))          # N/A (safe default)

# Modification
student["email"] = "alice@univ.edu"       # add new key
student.update({"gpa": 4.0, "year": 3})  # update multiple

# Iteration
for key in student:
    print(f"{key}: {student[key]}")

for key, value in student.items():        # items() gives (key, value) pairs
    print(f"  {key!r}: {value!r}")

# Dict comprehension
squared = {x: x**2 for x in range(1, 6)}  # {1:1, 2:4, 3:9, 4:16, 5:25}

# defaultdict — no KeyError for missing keys
from collections import defaultdict

word_count = defaultdict(int)
text = "to be or not to be that is the question"
for word in text.split():
    word_count[word] += 1

# Counter — specialized dict for frequency counting
from collections import Counter
freq = Counter(text.split())
print(freq.most_common(3))   # [('to',2),('be',2),('or',1)]

# Merge dicts (Python 3.9+)
defaults = {"theme": "dark", "lang": "en"}
user_prefs = {"lang": "fr", "timezone": "UTC"}
config = defaults | user_prefs   # {'theme':'dark','lang':'fr','timezone':'UTC'}

# ── Sets ──────────────────────────────────────────────────
programming_languages = {"Python", "R", "Julia", "Scala"}
data_tools = {"Python", "SQL", "Excel", "Tableau"}

union        = programming_languages | data_tools    # all tools
intersection = programming_languages & data_tools    # in both
difference   = programming_languages - data_tools    # only in first
sym_diff     = programming_languages ^ data_tools    # in one but not both

print(f"All tools: {union}")
print(f"In both:   {intersection}")   # {'Python'}

# Fast membership testing
large_set = set(range(1_000_000))
print(999_999 in large_set)  # True; a hash lookup, O(1) on average
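
To see what the constant-time set lookup above means in practice, here is a rough timing sketch that compares it with a list scan. The collection size and repetition count are arbitrary choices, and the absolute numbers will vary by machine:

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1          # worst case for the list: the last element

# Each lookup is executed 200 times; timeit returns total seconds
list_time = timeit.timeit(lambda: target in as_list, number=200)
set_time = timeit.timeit(lambda: target in as_set, number=200)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
print(f"set lookup was ~{list_time / set_time:,.0f}x faster in this run")
```

The list has to compare elements one by one, so its cost grows with the list length, while the set hashes the target and jumps straight to its bucket.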

Error Handling — Production-Quality Code

import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s'
)
logger = logging.getLogger(__name__)

# Basic try/except/else/finally
def safe_divide(a: float, b: float) -> float:
    """Divide a by b with error handling."""
    try:
        result = a / b
    except ZeroDivisionError:
        logger.warning(f"Attempted division by zero: {a} / {b}")
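        # Sentinel choice: any zero divisor yields +inf here (even -10/0 or 0/0);
        # re-raise or return float('nan') if that distinction matters
        return float('inf')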
    except TypeError as e:
        logger.error(f"Type error in division: {e}")
        raise
    else:
        # runs ONLY if no exception was raised
        logger.debug(f"Division successful: {a}/{b} = {result}")
        return result
    finally:
        # ALWAYS runs — use for cleanup
        pass

print(safe_divide(10, 2))    # 5.0
print(safe_divide(10, 0))    # inf (with warning)

# Custom exceptions
class DataValidationError(Exception):
    """Raised when data fails validation checks."""
    def __init__(self, field: str, value, message: str):
        self.field = field
        self.value = value
        super().__init__(f"Field '{field}': {message} (got {value!r})")

class MissingFieldError(DataValidationError):
    """Raised when a required field is missing."""
    def __init__(self, field: str):
        super().__init__(field, None, "Required field is missing")

def validate_dataset_record(record: dict) -> dict:
    """Validate a single data record."""
    required_fields = ["id", "value", "timestamp"]

    for field in required_fields:
        if field not in record:
            raise MissingFieldError(field)

    if not isinstance(record["value"], (int, float)):
        raise DataValidationError("value", record["value"], "Must be numeric")

    if record["value"] < 0:
        raise DataValidationError("value", record["value"], "Must be non-negative")

    return record

# Usage
records = [
    {"id": 1, "value": 42.5, "timestamp": "2024-01-15"},
    {"id": 2, "value": -5.0, "timestamp": "2024-01-16"},   # will fail
    {"id": 3, "timestamp": "2024-01-17"},                   # missing value
]

valid_records = []
for record in records:
    try:
        validated = validate_dataset_record(record)
        valid_records.append(validated)
    except MissingFieldError as e:
        print(f"⚠️  Missing field in record: {e}")
    except DataValidationError as e:
        print(f"❌ Validation failed: {e}")

print(f"\n{len(valid_records)} valid records out of {len(records)}")
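
The custom exceptions above store `field` and `value` as attributes, but the usage loop only prints the message. A minimal sketch of putting those attributes to work by grouping rejected values per field; `check_value` is a simplified, hypothetical validator and the sample records are made up for illustration:

```python
from collections import defaultdict


class DataValidationError(Exception):
    """Same shape as the exception defined above: keeps field and value."""
    def __init__(self, field: str, value, message: str):
        self.field = field
        self.value = value
        super().__init__(f"Field '{field}': {message} (got {value!r})")


def check_value(record: dict) -> dict:
    """Simplified, hypothetical validator: only checks the 'value' field."""
    value = record.get("value")
    if not isinstance(value, (int, float)):
        raise DataValidationError("value", value, "Must be numeric")
    if value < 0:
        raise DataValidationError("value", value, "Must be non-negative")
    return record


records = [
    {"id": 1, "value": 42.5},
    {"id": 2, "value": -5.0},
    {"id": 3, "value": "n/a"},
]

# Group rejected values by the field that failed, using the exception's attributes
rejected = defaultdict(list)
for record in records:
    try:
        check_value(record)
    except DataValidationError as e:
        rejected[e.field].append((record["id"], e.value))

print(dict(rejected))   # {'value': [(2, -5.0), (3, 'n/a')]}
```

Structured attributes like these make it straightforward to build a data-quality report without parsing error strings.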

Jupyter Notebook Productivity Tips

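# Magic commands (Jupyter/IPython only). Keep comments on their own lines:
# some magics treat trailing text as arguments and fail.

# Time a single expression (runs it many times and reports the mean)
%timeit sum(range(1000))

# Time a single run (expensive_function is a placeholder for your own code)
%time expensive_function()

# Time an entire cell: %%timeit must be the first line of its own cell
%%timeit
import numpy as np
np.sum(np.arange(1000))

# Show plots inside the notebook
%matplotlib inline

# Reload imported modules automatically before each cell runs
%load_ext autoreload
%autoreload 2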

# Shell commands in Jupyter
!pip list | grep numpy
!ls -la data/

# Display utilities
from IPython.display import display, HTML, Markdown
display(HTML("<h2 style='color:blue'>Formatted Output</h2>"))

import pandas as pd
pd.set_option('display.max_columns', 50)     # show more columns
pd.set_option('display.float_format', '{:.3f}'.format)  # format floats
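
The `%timeit` and `%time` magics only work inside IPython or Jupyter. In a plain Python script, the standard-library `timeit` and `time` modules give roughly the same measurements; a minimal sketch (the repetition counts are arbitrary choices):

```python
import time
import timeit

# Roughly what %timeit does: run the statement many times, repeat, keep the best
runs = timeit.repeat("sum(range(1000))", number=10_000, repeat=5)
best_per_call = min(runs) / 10_000
print(f"best of 5: {best_per_call * 1e6:.2f} µs per call")

# Roughly what %time does: a single timed run
start = time.perf_counter()
sum(range(1_000_000))
elapsed = time.perf_counter() - start
print(f"single run: {elapsed:.4f}s")
```

Taking the minimum of several repeats reduces noise from other processes on the machine.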

Practice Exercises

Exercise 1: Write a function normalize(data) that takes a list of numbers and returns a new list scaled to [0, 1].

Exercise 2: Use a dict comprehension to create a word-frequency map from a sentence.

Exercise 3: Write a decorator retry(n) that retries a function up to n times on exception.

# Exercise 1 solution
def normalize(data: list) -> list:
    """Min-max normalize a list of numbers to [0, 1]."""
    min_val, max_val = min(data), max(data)
    if max_val == min_val:
        return [0.0] * len(data)
    return [(x - min_val) / (max_val - min_val) for x in data]

print(normalize([10, 20, 30, 40, 50]))
# [0.0, 0.25, 0.5, 0.75, 1.0]
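
Solutions are given for Exercises 1 and 3 only. One possible solution to Exercise 2, offered as a sketch rather than the only answer (the sample sentence is made up):

```python
# Exercise 2: one possible solution
sentence = "the quick brown fox jumps over the lazy dog the end"
words = sentence.lower().split()

# set(words) visits each distinct word once; words.count(w) rescans the list,
# so this is fine for a sentence but slow for a large corpus
word_freq = {w: words.count(w) for w in set(words)}

print(word_freq["the"])   # 3
print(word_freq["fox"])   # 1
```

For larger inputs, `collections.Counter(words)` produces the same mapping in a single pass.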

# Exercise 3 solution
import time
import functools

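def retry(n: int, delay: float = 0.0):
    """Decorator that retries function up to n times on exception."""
    if n < 1:
        # Without this guard, n <= 0 would skip the loop and hit `raise None`
        raise ValueError("n must be at least 1")
    def decorator(func):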
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(1, n + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    print(f"Attempt {attempt}/{n} failed: {e}")
                    if attempt < n and delay > 0:
                        time.sleep(delay)
            raise last_exception
        return wrapper
    return decorator

@retry(3, delay=0.1)
def unstable_api_call(url: str):
    import random
    if random.random() < 0.7:    # 70% failure rate
        raise ConnectionError(f"Failed to connect to {url}")
    return f"Success: {url}"
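
Because `unstable_api_call` fails at random, its behavior is hard to check by eye. A deterministic sketch of the same retry loop in action, assuming a hypothetical `flaky` function that fails exactly twice before succeeding (the decorator is repeated so the example runs on its own):

```python
import functools
import time


def retry(n: int, delay: float = 0.0):
    """The same retry loop as above, repeated so this example runs alone."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(1, n + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    print(f"Attempt {attempt}/{n} failed: {e}")
                    if attempt < n and delay > 0:
                        time.sleep(delay)
            raise last_exception
        return wrapper
    return decorator


calls = {"count": 0}   # hypothetical call counter for the demo


@retry(3)
def flaky():
    """Fails on the first two calls, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError(f"call {calls['count']} failed")
    return "ok"


print(flaky())          # two failure messages, then: ok


@retry(2)
def always_fails():
    raise ValueError("nope")


try:
    always_fails()
except ValueError as e:
    print(f"Gave up after 2 attempts: {e}")
```

The second case shows that once the attempts are exhausted, the last exception propagates to the caller unchanged.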

Key Takeaways

Python is dynamically typed — variables hold references, not values
f-strings are the best way to format strings in Python 3.6+
Floating-point arithmetic is not exact — never compare floats with ==
Comprehensions are Pythonic and fast — prefer them over explicit loops for transformations
Mutable default arguments are a common bug — always use None as default for lists/dicts
Decorators add behavior to functions without modifying them — use for logging, timing, caching
Always write docstrings and type hints — your future self will thank you
Custom exceptions make error handling self-documenting and precise

What's Next

→ NumPy: Fast Array Computing
→ Pandas: Data Manipulation
→ Data Visualization
→ Machine Learning with Scikit-Learn