Web Scraping & API Data Collection

Why This Matters

Data scientists are often estimated to spend 60-80% of their time collecting and cleaning data. APIs and web scraping are two of the main ways to obtain real-world data that doesn't come in neat CSV files.

Definition: Application Programming Interface (API)

A structured set of protocols and conventions that allows software applications to communicate with each other. In data science, APIs typically provide a RESTful HTTP interface to a server's database, returning data in JSON format. APIs abstract away the underlying database schema and provide a stable, documented contract for data access.

The data collection hierarchy: APIs are always preferred over web scraping when available. APIs provide structured, documented, rate-limited access with stable interfaces. Web scraping is fragile (breaks when HTML changes), potentially unethical (may violate ToS), and requires parsing unstructured data. Only scrape when no API exists and the data is publicly available.

Web Fundamentals

How the Web Works

Architecture Diagram

Client (Your Python Script)              Server (Website/API)
        |                                      |
        |  1. HTTP Request (GET /api/data)     |
        | -----------------------------------> |
        |                                      |
        |  2. Server processes request         |
        |                                      |
        |  3. HTTP Response (200 OK + JSON)    |
        | <----------------------------------- |
        |                                      |
        |  4. Parse response                   |

HTTP Methods

Method	Purpose	Body	Use Case
GET	Retrieve data	No	Reading API data, web pages
POST	Create/submit data	Yes	Submitting forms, creating records
PUT	Update existing data	Yes	Updating a record
DELETE	Remove data	Usually no	Deleting a record

Definition: Representational State Transfer (REST)

An architectural style for distributed hypermedia systems. REST APIs use standard HTTP methods (GET, POST, PUT, DELETE) to operate on resources identified by URLs. Key constraints include statelessness (each request contains all information needed), uniform interface (consistent URL patterns), and resource-based addressing (every entity has a unique URI).

Status Codes

2xx Success:
  200 OK                 Request successful
  201 Created            New resource created
  204 No Content         Success, no response body

3xx Redirection:
  301 Moved Permanently  URL has changed
  304 Not Modified       Cached version is current

4xx Client Error:
  400 Bad Request        Invalid parameters
  401 Unauthorized       Authentication required
  403 Forbidden          No permission
  404 Not Found          Resource doesn't exist
  429 Too Many Requests  Rate limit exceeded

5xx Server Error:
  500 Internal Server Error  Unexpected server-side failure
  503 Service Unavailable    Server overloaded or down for maintenance

The 429 status code is your friend: When you receive a 429 Too Many Requests, the server is telling you to slow down. Many APIs include a Retry-After header specifying how long to wait, either as a number of seconds or as an HTTP date. Always honor it: ignoring rate limits can get your IP address banned, sometimes permanently.
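As a minimal sketch of honoring that header, the helper below turns a Retry-After value into a wait time in seconds. The function name `retry_after_seconds` and its `default` fallback are illustrative assumptions, not part of any library.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, default=1.0, now=None):
    """Convert a Retry-After header value into a non-negative wait in seconds.

    The header may hold a number of seconds ("120") or an HTTP date
    ("Wed, 21 Oct 2015 07:28:00 GMT"). `default` is a hypothetical
    fallback used when the header is missing or cannot be parsed.
    """
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if retry_at.tzinfo is None:
        # Treat a date without zone information as UTC
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())


print(retry_after_seconds("120"))             # 120.0
print(retry_after_seconds(None, default=2.0)) # 2.0
```

With requests, this would typically be used as `time.sleep(retry_after_seconds(response.headers.get("Retry-After")))` after a 429 response.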

Working with APIs

What is an API?

An API (Application Programming Interface) is a structured way to request data from a server. Most modern APIs use REST (Representational State Transfer) and return JSON.

REST API Concepts

Definition: API Endpoint

A specific URL path where an API resource can be accessed. For example, https://api.example.com/v1/users/42 is an endpoint that returns the user with ID 42. Endpoints follow hierarchical URL patterns that map to resources (nouns) and use HTTP methods to indicate operations (verbs).

Base URL: https://api.example.com/v1

Endpoint Structure:
  GET    /users           -> List all users
  GET    /users/42        -> Get user 42
  POST   /users           -> Create new user
  PUT    /users/42        -> Update user 42
  DELETE /users/42        -> Delete user 42

Query Parameters:
  GET /users?page=2&limit=10&sort=name
       |        |       |         |
       |        |       |         +-- Sort by name
       |        |       +-- Return 10 results
       |        +-- Second page
       +-- Endpoint
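The endpoint and query-parameter structure above can be sketched with the standard library alone. The `build_url` helper is a hypothetical name introduced for illustration, and the base URL is the placeholder from the diagram, not a real service.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://api.example.com/v1"  # placeholder base URL from the diagram


def build_url(endpoint, **query):
    """Join the base URL, an endpoint path, and optional query parameters."""
    url = f"{BASE_URL}/{endpoint.lstrip('/')}"
    if query:
        url += "?" + urlencode(query)
    return url


url = build_url("/users", page=2, limit=10, sort="name")
print(url)  # https://api.example.com/v1/users?page=2&limit=10&sort=name

# Going the other way: split a URL back into its path and parameters
parts = urlsplit(url)
print(parts.path)             # /v1/users
print(parse_qs(parts.query))  # {'page': ['2'], 'limit': ['10'], 'sort': ['name']}
```

In practice, requests does this encoding for you when you pass a `params` dictionary.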

The requests Library

import requests
import json

# Basic GET request
response = requests.get("https://jsonplaceholder.typicode.com/posts")

# Check status code
print(f"Status: {response.status_code}")  # 200
print(f"Content-Type: {response.headers['Content-Type']}")

# Parse JSON response
data = response.json()
print(f"Number of posts: {len(data)}")
print(f"First post: {data[0]['title'][:50]}...")

Query Parameters

# Parameters as dictionary (recommended)
params = {
    'userId': 1,
    '_limit': 5
}

response = requests.get(
    "https://jsonplaceholder.typicode.com/posts",
    params=params
)

# URL becomes: /posts?userId=1&_limit=5
print(f"URL: {response.url}")
print(f"Posts by user 1: {len(response.json())}")

Headers and Authentication

Definition: Bearer Token Authentication

An authentication scheme where the client includes a token string in the Authorization HTTP header with the prefix "Bearer ". The token is typically a JWT (JSON Web Token) or an opaque string issued by the server after initial authentication. The server validates the token on each request, eliminating the need to transmit credentials with every call.

# Custom headers
headers = {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Accept': 'application/json',  # A GET sends no body, so request JSON via Accept
    'User-Agent': 'DataScienceBot/1.0'
}

response = requests.get(
    "https://api.example.com/data",
    headers=headers
)

# API Key in query parameter (alternative)
params = {'api_key': 'YOUR_KEY', 'query': 'python'}
response = requests.get("https://api.example.com/search", params=params)

Error Handling and Retries

Definition: Exponential Backoff

A retry strategy where the wait time between successive retry attempts grows exponentially: 1s, 2s, 4s, 8s, ... This prevents overwhelming a struggling server (thundering herd problem) while still making progress. The formula for wait time at attempt $k$ is: $t_k = \text{base} \cdot 2^k \cdot \text{uniform}(0,1)$ , where the random jitter prevents synchronized retries from multiple clients.

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def fetch_with_retry(url, params=None, max_retries=3):
    """Fetch URL with automatic retry on failure."""
    session = requests.Session()
    
    # Configure retry strategy
    retry = Retry(
        total=max_retries,
        backoff_factor=1,  # Exponential backoff: the delay roughly doubles with each retry
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    
    try:
        response = session.get(url, params=params, timeout=10)
        response.raise_for_status()  # Raise exception for 4xx/5xx
        return response.json()
    except requests.exceptions.Timeout:
        print("Request timed out")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error: {e}")
    except requests.exceptions.ConnectionError:
        print("Connection failed")
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    
    return None

# Usage
data = fetch_with_retry("https://jsonplaceholder.typicode.com/posts/1")

Why exponential backoff works: If a server is overloaded and returns 503, immediately retrying would add to the load. Exponential backoff gives the server time to recover. The randomized jitter is critical — without it, 100 clients retrying simultaneously would all retry at exactly the same time, creating a synchronized burst that overwhelms the server again.
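To make the jitter concrete, here is a small sketch of the wait-time formula $t_k = \text{base} \cdot 2^k \cdot \text{uniform}(0,1)$. The `backoff_delay` name and the `cap` parameter are assumptions added for illustration; the cap is not part of the formula and only keeps the upper bound from growing without limit.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Wait time for retry number `attempt` (0-based): base * 2**attempt * U(0, 1).

    `cap` is an added assumption that limits the upper bound of the wait.
    """
    upper = min(cap, base * (2 ** attempt))
    return upper * rng.random()


rng = random.Random(42)  # seeded only so the demo is repeatable
for attempt in range(5):
    upper = min(60.0, 1.0 * 2 ** attempt)
    delay = backoff_delay(attempt, rng=rng)
    print(f"attempt {attempt}: wait {delay:.2f}s (upper bound {upper:.0f}s)")
```

Because each client draws its own random fraction of the upper bound, retries from many clients spread out over the interval instead of landing at the same instant.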

Pagination

def fetch_all_pages(base_url, params=None):
    """Handle page-based pagination."""
    all_data = []
    page = 1
    
    while True:
        request_params = (params or {}).copy()
        request_params['page'] = page
        
        response = requests.get(base_url, params=request_params)
        
        if response.status_code != 200:
            break
        
        page_data = response.json()
        
        if not page_data:  # Empty page = no more data
            break
        
        all_data.extend(page_data)
        page += 1
        
        # Rate limiting: don't overwhelm the server
        time.sleep(0.5)
    
    return all_data

# Cursor-based pagination (used by Twitter, GitHub)
def fetch_cursor_paginated(url, cursor=None):
    """Handle cursor-based pagination."""
    params = {'limit': 100}
    if cursor:
        params['cursor'] = cursor
    
    response = requests.get(url, params=params)
    data = response.json()
    
    return data['results'], data.get('next_cursor')

Definition: Cursor-Based Pagination

A pagination method where the API returns an opaque cursor token that points to the next page of results. Unlike offset-based pagination (?page=3), cursor-based pagination is stable even when new records are inserted between requests — offset-based pagination can skip or duplicate records when the underlying data changes.
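The difference can be seen in a small in-memory simulation. This is only a sketch: the helper names are hypothetical, and the cursor here is simply the last id seen, whereas real APIs usually return an opaque token.

```python
def page_by_offset(records, page, size):
    """Page-based access: slice the list by position."""
    start = (page - 1) * size
    return records[start:start + size]


def page_by_cursor(records, cursor, size):
    """Cursor-based access: return items older than the cursor (ids below it)."""
    items = [r for r in records if cursor is None or r["id"] < cursor][:size]
    next_cursor = items[-1]["id"] if len(items) == size else None
    return items, next_cursor


# Page-based: read page 1, a new record arrives, then read page 2
records = [{"id": i} for i in range(6, 0, -1)]  # newest first: 6, 5, ..., 1
offset_seen = [r["id"] for r in page_by_offset(records, page=1, size=2)]
records.insert(0, {"id": 7})
offset_seen += [r["id"] for r in page_by_offset(records, page=2, size=2)]
print(offset_seen)  # [6, 5, 5, 4] -> id 5 is returned twice

# Cursor-based: same scenario, no duplicate
records = [{"id": i} for i in range(6, 0, -1)]
items, cursor = page_by_cursor(records, cursor=None, size=2)
cursor_seen = [r["id"] for r in items]
records.insert(0, {"id": 7})
items, cursor = page_by_cursor(records, cursor=cursor, size=2)
cursor_seen += [r["id"] for r in items]
print(cursor_seen)  # [6, 5, 4, 3]
```

The page-based reader sees record 5 twice because the insert shifted every position by one, while the cursor-based reader resumes from the last record it actually saw.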

Web Scraping with BeautifulSoup

When to Scrape vs Use API

Use API When	Scrape When
API exists and is documented	No API available
Data is structured	Data is in HTML pages
Rate limits are reasonable	Need public data only
Authentication is simple	Simple public pages

HTML Basics

<html>
<head><title>Page Title</title></head>
<body>
  <h1 class="title">Main Heading</h1>
  <div id="content">
    <p>First paragraph</p>
    <p class="highlight">Second paragraph</p>
    <ul>
      <li><a href="/page1">Link 1</a></li>
      <li><a href="/page2">Link 2</a></li>
    </ul>
  </div>
</body>
</html>

BeautifulSoup Basics

from bs4 import BeautifulSoup
import requests

# Fetch and parse a page
url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find elements
quotes = soup.find_all('div', class_='quote')
print(f"Found {len(quotes)} quotes")

for quote in quotes[:3]:
    text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    print(f'"{text[:60]}..." — {author}')

CSS Selectors

Definition: CSS Selector

A pattern used to select and style HTML elements. In web scraping, CSS selectors provide a concise syntax for targeting specific elements: div.quote > span selects <span> elements that are direct children of <div class="quote">. The select() method returns all matching elements, while select_one() returns only the first match.

# CSS selectors for precise targeting
soup.select('div.quote')           # All divs with class "quote"
soup.select('div.quote > span')    # Direct children
soup.select('a[href^="/page"]')    # Links starting with "/page"
soup.select('p.highlight')         # p with class "highlight"
soup.select('#content > p')        # p elements that are direct children of id="content"
soup.select('ul li:nth-child(2)')  # Second li in each ul

Extracting Data

def scrape_quotes():
    """Scrape all quotes from quotes.toscrape.com."""
    quotes = []
    page = 1
    
    while True:
        url = f"https://quotes.toscrape.com/page/{page}/"
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        
        page_quotes = soup.find_all('div', class_='quote')
        if not page_quotes:
            break
        
        for q in page_quotes:
            quotes.append({
                'text': q.find('span', class_='text').get_text(),
                'author': q.find('small', class_='author').get_text(),
                'tags': [tag.get_text() for tag in q.find_all('a', class_='tag')]
            })
        
        page += 1
        time.sleep(1)  # Be polite
    
    return quotes

quotes = scrape_quotes()
print(f"Total quotes scraped: {len(quotes)}")

Polite scraping checklist: Before scraping any site: (1) Check robots.txt at the root URL, (2) Set a descriptive User-Agent header with your contact info, (3) Add delays between requests (at least 1 second), (4) Only scrape publicly accessible pages, (5) Cache responses to avoid re-fetching. Violating these can result in IP bans or legal action.

Ethical Web Scraping

Rules to Follow

1. Check robots.txt first:
   https://example.com/robots.txt

   User-agent: *
   Disallow: /private/
   Crawl-delay: 10  # Wait 10 seconds between requests

2. Respect rate limits:
   - Add delays between requests (time.sleep)
   - Don't send more than 1 request per second
   - Use sessions to reuse connections

3. Identify yourself:
   - Set a descriptive User-Agent header
   - Include your email in the User-Agent
   - Example: "MyDataBot/1.0 (contact@example.com)"

4. Only scrape public data:
   - Don't bypass authentication
   - Don't scrape personal/private information
   - Respect terms of service

5. Cache responses:
   - Save responses to avoid re-fetching
   - Use If-Modified-Since headers

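Rule 5 can be sketched as a small on-disk cache. This is one possible approach under stated assumptions: the `cached_fetch` name, the `cache_dir` layout, and the `max_age` parameter are all hypothetical, and `fetch` stands in for whatever function performs the real request (for example `lambda u: requests.get(u, timeout=10).text`).

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="cache", max_age=3600):
    """Return the body for `url`, calling `fetch(url)` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One file per URL, named by a hash so any URL is a safe filename
    path = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".json")

    if path.exists():
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - entry["fetched_at"] < max_age:
            return entry["body"]  # fresh enough: no network request

    body = fetch(url)
    entry = {"url": url, "fetched_at": time.time(), "body": body}
    path.write_text(json.dumps(entry), encoding="utf-8")
    return body


# Demo with a fake fetcher so nothing touches the network
calls = []

def fake_fetch(url):
    calls.append(url)
    return "<html>hello</html>"

with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.com/", fake_fetch, cache_dir=tmp)
    second = cached_fetch("https://example.com/", fake_fetch, cache_dir=tmp)

print(first == second, len(calls))  # True 1
```

The stored timestamp is also what you would send back in an If-Modified-Since header if you wanted the server to confirm whether a stale entry has actually changed.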
Definition: robots.txt Protocol

A standard file placed at the root of a website that tells web crawlers which paths they may or may not fetch. The file uses a simple syntax: User-agent: * applies to all crawlers, Disallow: /path/ blocks access to specific paths, and Crawl-delay: N (a widely used but non-standard extension) requests a minimum number of seconds between requests. While robots.txt is voluntary (not enforced technically), ignoring it is considered unethical and may add to your legal exposure, for example under the U.S. Computer Fraud and Abuse Act.
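Python's standard library can read these rules for you. The sketch below parses an inline copy of the example file with `urllib.robotparser`; the bot name and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse text directly, no network needed

agent = "MyDataBot/1.0"  # placeholder bot name
print(parser.can_fetch(agent, "https://example.com/quotes/"))    # True
print(parser.can_fetch(agent, "https://example.com/private/x"))  # False
print(parser.crawl_delay(agent))                                 # 10
```

Against a live site you would instead call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()`, then check `can_fetch` before each request.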

Complete Data Collection Pipeline

import requests
import pandas as pd
import json
import time
from datetime import datetime

class DataCollector:
    """Complete data collection pipeline from APIs."""
    
    def __init__(self, base_url, rate_limit=1.0):
        self.base_url = base_url
        self.rate_limit = rate_limit
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'DataSciencePipeline/1.0'
        })
        self.last_request_time = 0
    
    def _rate_limit(self):
        """Enforce rate limiting."""
        elapsed = time.time() - self.last_request_time
        if elapsed < self.rate_limit:
            time.sleep(self.rate_limit - elapsed)
        self.last_request_time = time.time()
    
    def fetch(self, endpoint, params=None):
        """Fetch data from API endpoint."""
        self._rate_limit()
        
        url = f"{self.base_url}/{endpoint}"
        try:
            response = self.session.get(url, params=params, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None
    
    def fetch_paginated(self, endpoint, params=None, max_pages=10, page_param='page'):
        """Fetch up to max_pages pages of data.

        page_param is the name of the API's page-number query parameter.
        """
        all_data = []
        params = dict(params or {})  # Copy so the caller's dict is not mutated
        page = 1
        
        while page <= max_pages:
            params[page_param] = page
            data = self.fetch(endpoint, params)
            if not data:  # None (error) or empty page = stop
                break
            all_data.extend(data)
            print(f"  Fetched page {page}: {len(data)} records")
            page += 1
        
        return all_data
    
    def to_dataframe(self, data):
        """Convert collected data to DataFrame."""
        return pd.DataFrame(data)
    
    def save(self, data, filename, format='csv'):
        """Save collected data."""
        if format == 'csv':
            pd.DataFrame(data).to_csv(filename, index=False)
        elif format == 'json':
            with open(filename, 'w') as f:
                json.dump(data, f, indent=2, default=str)
        elif format == 'parquet':
            pd.DataFrame(data).to_parquet(filename, index=False)
        print(f"Saved {len(data)} records to {filename}")

# Usage
collector = DataCollector("https://jsonplaceholder.typicode.com")

# Fetch posts (JSONPlaceholder names its page parameter "_page")
posts = collector.fetch_paginated("posts", max_pages=5, page_param="_page")
df = collector.to_dataframe(posts)
print(f"\nPosts DataFrame: {df.shape}")
print(df.head())

# Save
collector.save(posts, "posts.csv")

Why use requests.Session(): A Session object keeps TCP connections open and reuses them for later requests to the same host. Without a session, each request opens and closes a new TCP connection (plus a TLS handshake for HTTPS). For many requests to the same API, reusing pooled connections can cut total time substantially; the exact saving depends on network latency and response size. A Session also persists headers and cookies across requests.

API Authentication Patterns

Pattern	How It Works	Example
API Key	Key in header or query param	`?api_key=abc123`
Bearer Token	Token in Authorization header	`Authorization: Bearer token`
OAuth 2.0	Delegated authorization flow	Google APIs, GitHub
Basic Auth	Username:Password base64 encoded	`Authorization: Basic dXNlcjpwYXNz`
HMAC	Signed requests with secret key	AWS APIs
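Two rows of the table can be illustrated with the standard library alone. The Basic Auth helper reproduces the header value shown in the table for `user:pass`. The HMAC helper is a hypothetical signing scheme invented for illustration: real providers such as AWS define their own canonical string and header names.

```python
import base64
import hashlib
import hmac


def basic_auth_header(username, password):
    """Build the Authorization header value for HTTP Basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def sign_request(secret, method, path, body=""):
    """Hypothetical scheme: HMAC-SHA256 over method, path and body joined by newlines."""
    message = "\n".join([method.upper(), path, body]).encode("utf-8")
    return hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def verify_signature(secret, method, path, body, signature):
    """Server-side check, using a constant-time comparison."""
    expected = sign_request(secret, method, path, body)
    return hmac.compare_digest(expected, signature)


print(basic_auth_header("user", "pass"))  # Basic dXNlcjpwYXNz

sig = sign_request("s3cret", "GET", "/v1/users")
print(verify_signature("s3cret", "GET", "/v1/users", "", sig))     # True
print(verify_signature("s3cret", "GET", "/v1/users/42", "", sig))  # False
```

Note that Basic Auth only encodes the credentials, it does not encrypt them, so it should be sent over HTTPS. With requests you would normally pass `auth=(username, password)` rather than building the header by hand.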

Definition: OAuth 2.0

An authorization framework that enables third-party applications to obtain limited access to a user's resources without exposing the user's credentials. The flow works by redirecting the user (the resource owner) to the authorization server, where they grant consent. The authorization server then issues an access token that the client uses to make API requests. Access tokens are typically short-lived and can often be renewed with a refresh token without re-authenticating the user.

Key Takeaways

Summary: Web Scraping & API Data

APIs are preferred over scraping — they provide structured, documented, and rate-limited access to data. Always check if an API exists before writing a scraper. APIs are stable contracts; scrapers break whenever HTML changes.
Always check robots.txt and respect rate limits. The Crawl-delay directive specifies minimum seconds between requests. Ignoring robots.txt is unethical and may have legal consequences.
Handle errors gracefully with retries and exponential backoff. The 429 Too Many Requests status code is a signal to slow down — always honor it. Use requests.Session() for connection pooling and HTTPAdapter for configurable retry strategies.
Paginate large results — don't try to fetch everything at once. Cursor-based pagination is more stable than offset-based when data changes between requests.
Cache API responses to avoid re-fetching. Use If-Modified-Since headers and store responses locally with timestamps.
Convert collected data to DataFrames for analysis. Use pd.DataFrame() directly on JSON responses, and save as Parquet for large datasets.
Identify yourself with a descriptive User-Agent string including your email. This allows server admins to contact you if your requests cause issues.

Practice Exercises

Public API Exploration: Use the GitHub API (api.github.com) to fetch your repositories and create a DataFrame of repo names, stars, and languages.
Multi-Source Collection: Collect data from 3 different public APIs (weather, news, crypto) and merge them into a single analysis.
Web Scraping Challenge: Scrape a table from Wikipedia (e.g., list of countries by GDP) and clean it into a usable DataFrame.
Rate-Limited Pipeline: Build a data collector that handles rate limits (429 errors), retries failed requests, and saves progress incrementally.
