Why This Matters
Data scientists are commonly estimated to spend 60-80% of their time collecting and cleaning data. APIs and web scraping are the primary methods for obtaining real-world data that doesn't come in neat CSV files.
Definition: Application Programming Interface (API)
A structured set of protocols and conventions that allows software applications to communicate with each other. In data science, APIs typically provide a RESTful HTTP interface to a server's database, returning data in JSON format. APIs abstract away the underlying database schema and provide a stable, documented contract for data access.
The data collection hierarchy: APIs are always preferred over web scraping when available. APIs provide structured, documented, rate-limited access with stable interfaces. Web scraping is fragile (breaks when HTML changes), potentially unethical (may violate ToS), and requires parsing unstructured data. Only scrape when no API exists and the data is publicly available.
Web Fundamentals
How the Web Works
Client (Your Python Script)                Server (Website/API)
        |                                          |
        |  1. HTTP Request (GET /api/data)         |
        | ---------------------------------------> |
        |                                          |
        |                2. Server processes request
        |                                          |
        |  3. HTTP Response (200 OK + JSON)        |
        | <--------------------------------------- |
        |                                          |
        |  4. Parse response                       |
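The four steps in the diagram can be reproduced on one machine with the standard library. This is a minimal sketch, not how a production API is served: `DemoHandler`, `demo_round_trip`, and the `/api/data` payload are hypothetical names invented for the illustration, and the server runs on a throwaway local port.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server side: answers GET /api/data with a small JSON body."""

    def do_GET(self):
        if self.path == "/api/data":
            body = json.dumps({"items": [1, 2, 3]}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # Silence the default per-request logging


def demo_round_trip(path="/api/data"):
    """Start a local server, send one GET request, and return what came back."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # Port 0: let the OS pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        conn.request("GET", path)        # 1. HTTP request
        response = conn.getresponse()    # 2-3. Server processes it and responds
        raw = response.read()
        status = response.status
        content_type = response.getheader("Content-Type")
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    data = json.loads(raw) if status == 200 else None  # 4. Parse response
    return status, content_type, data


status, content_type, data = demo_round_trip()
print(status, content_type, data)
```

In everyday work the `requests` library hides the connection handling shown here; the sketch only makes the request/response cycle visible.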
HTTP Methods
| Method | Purpose | Body | Use Case |
|---|---|---|---|
| GET | Retrieve data | No | Reading API data, web pages |
| POST | Create/submit data | Yes | Submitting forms, creating records |
| PUT | Update existing data | Yes | Updating a record |
| DELETE | Remove data | Usually no | Deleting a record |
Definition: Representational State Transfer (REST)
An architectural style for distributed hypermedia systems. REST APIs use standard HTTP methods (GET, POST, PUT, DELETE) to operate on resources identified by URLs. Key constraints include statelessness (each request contains all information needed), uniform interface (consistent URL patterns), and resource-based addressing (every entity has a unique URI).
Status Codes
2xx Success:
  200 OK                     Request successful
  201 Created                New resource created
  204 No Content             Success, no response body
3xx Redirection:
  301 Moved Permanently      URL has changed
  304 Not Modified           Cached version is current
4xx Client Error:
  400 Bad Request            Invalid parameters
  401 Unauthorized           Authentication required
  403 Forbidden              No permission
  404 Not Found              Resource doesn't exist
  429 Too Many Requests      Rate limit exceeded
5xx Server Error:
  500 Internal Server Error  Unexpected failure on the server
  503 Service Unavailable    Server overloaded or down for maintenance
The 429 status code is your friend: When you receive a 429 Too Many Requests, the server is telling you to slow down. Many APIs include a Retry-After header specifying how long to wait, either as a number of seconds or as an HTTP date. Always honor it: ignoring rate limits can get your API key revoked or your IP banned.
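Reading the Retry-After header could look roughly like the sketch below. The helper name `retry_after_seconds` and its `default` fallback are assumptions made for this example; it accepts any mapping of header names to values. A plain dict is case-sensitive, whereas `response.headers` from `requests` is not.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default=1.0, now=None):
    """Return how many seconds to wait, based on a Retry-After header.

    The header may hold a number of seconds ("30") or an HTTP date.
    Falls back to `default` when the header is missing or unparseable.
    """
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())


print(retry_after_seconds({"Retry-After": "30"}))  # 30.0
print(retry_after_seconds({}))                      # 1.0 (the default)
```

A caller would typically pass the result to `time.sleep()` before repeating the request.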
Working with APIs
What is an API?
An API (Application Programming Interface) is a structured way to request data from a server. Most modern APIs use REST (Representational State Transfer) and return JSON.
REST API Concepts
Definition: API Endpoint
A specific URL path where an API resource can be accessed. For example, https://api.example.com/v1/users/42 is an endpoint that returns the user with ID 42. Endpoints follow hierarchical URL patterns that map to resources (nouns) and use HTTP methods to indicate operations (verbs).
Base URL: https://api.example.com/v1
Endpoint Structure:
  GET    /users      -> List all users
  GET    /users/42   -> Get user 42
  POST   /users      -> Create new user
  PUT    /users/42   -> Update user 42
  DELETE /users/42   -> Delete user 42
Query Parameters:
  GET /users?page=2&limit=10&sort=name
      |      |      |        |
      |      |      |        +-- Sort by name
      |      |      +-- Return 10 results
      |      +-- Second page
      +-- Endpoint
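The base URL, endpoint, and query string above can be assembled with the standard library, as in this small sketch. `build_url` is a hypothetical helper written for this example, and `api.example.com` is the placeholder host from the listing, not a real service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.example.com/v1"  # Placeholder host from the listing above


def build_url(endpoint, **query):
    """Join the base URL, an endpoint path, and optional query parameters."""
    url = f"{BASE_URL}/{endpoint.lstrip('/')}"
    if query:
        url += "?" + urlencode(query)  # Handles escaping of special characters
    return url


url = build_url("users", page=2, limit=10, sort="name")
print(url)  # https://api.example.com/v1/users?page=2&limit=10&sort=name

# Splitting the URL again recovers the individual parameters (as lists of strings)
print(parse_qs(urlsplit(url).query))
```

When using `requests`, passing a `params` dictionary achieves the same result without building the string by hand.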
The requests Library
import requests
import json
# Basic GET request
response = requests.get("https://jsonplaceholder.typicode.com/posts")
# Check status code
print(f"Status: {response.status_code}") # 200
print(f"Content-Type: {response.headers['Content-Type']}")
# Parse JSON response
data = response.json()
print(f"Number of posts: {len(data)}")
print(f"First post: {data[0]['title'][:50]}...")
Query Parameters
# Parameters as dictionary (recommended)
params = {
    'userId': 1,
    '_limit': 5
}
response = requests.get(
    "https://jsonplaceholder.typicode.com/posts",
    params=params
)
# URL becomes: /posts?userId=1&_limit=5
print(f"URL: {response.url}")
print(f"Posts by user 1: {len(response.json())}")
Headers and Authentication
Definition: Bearer Token Authentication
An authentication scheme where the client includes a token string in the Authorization HTTP header with the prefix "Bearer ". The token is typically a JWT (JSON Web Token) or an opaque string issued by the server after initial authentication. The server validates the token on each request, eliminating the need to transmit credentials with every call.
# Custom headers
headers = {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json',
    'User-Agent': 'DataScienceBot/1.0'
}
response = requests.get(
    "https://api.example.com/data",
    headers=headers
)
# API Key in query parameter (alternative)
params = {'api_key': 'YOUR_KEY', 'query': 'python'}
response = requests.get("https://api.example.com/search", params=params)
Error Handling and Retries
Definition: Exponential Backoff
A retry strategy where the wait time between successive retry attempts grows exponentially: 1s, 2s, 4s, 8s, ... This prevents overwhelming a struggling server (thundering herd problem) while still making progress. The wait time at attempt $n$ (counting from 0) is $t_n = t_{\text{base}} \cdot 2^{n} + \text{jitter}$, where $t_{\text{base}}$ is the initial delay (1s in the sequence above) and the random jitter prevents synchronized retries from multiple clients.
import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def fetch_with_retry(url, params=None, max_retries=3):
    """Fetch URL with automatic retry on failure."""
    session = requests.Session()
    # Configure retry strategy
    retry = Retry(
        total=max_retries,
        backoff_factor=1,  # Exponential backoff; exact delays depend on the urllib3 version
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    try:
        response = session.get(url, params=params, timeout=10)
        response.raise_for_status()  # Raise exception for 4xx/5xx
        return response.json()
    except requests.exceptions.Timeout:
        print("Request timed out")
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error: {e}")
    except requests.exceptions.ConnectionError:
        print("Connection failed")
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
    return None

# Usage
data = fetch_with_retry("https://jsonplaceholder.typicode.com/posts/1")
Why exponential backoff works: If a server is overloaded and returns 503, immediately retrying would add to the load. Exponential backoff gives the server time to recover. The randomized jitter is critical — without it, 100 clients retrying simultaneously would all retry at exactly the same time, creating a synchronized burst that overwhelms the server again.
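To make the schedule concrete, here is a minimal sketch that computes the delays by hand: the base delay doubles on each attempt and a small random amount is added. The function name `backoff_delays` and the `cap` and `jitter` parameters are choices made for this illustration, not part of `requests` or `urllib3`.

```python
import random


def backoff_delays(max_retries=5, base=1.0, cap=60.0, jitter=0.5, rng=None):
    """Return the wait time (in seconds) before each retry attempt.

    delay(attempt) = min(cap, base * 2**attempt) + uniform(0, jitter)
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)   # 1, 2, 4, 8, ... up to the cap
        delays.append(delay + rng.uniform(0, jitter))
    return delays


# Without jitter the schedule is the plain doubling sequence
print(backoff_delays(max_retries=4, jitter=0))  # [1.0, 2.0, 4.0, 8.0]

# With jitter, two clients get slightly different schedules
print(backoff_delays(max_retries=4))
print(backoff_delays(max_retries=4))
```

A retry loop would call `time.sleep(delay)` with each value in turn. The cap keeps a long run of failures from producing multi-minute waits.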
Pagination
def fetch_all_pages(base_url, params=None):
    """Handle page-based pagination."""
    all_data = []
    page = 1
    while True:
        request_params = (params or {}).copy()
        request_params['page'] = page
        response = requests.get(base_url, params=request_params, timeout=10)
        if response.status_code != 200:
            break
        page_data = response.json()
        if not page_data:  # Empty page = no more data
            break
        all_data.extend(page_data)
        page += 1
        # Rate limiting: don't overwhelm the server
        time.sleep(0.5)
    return all_data

# Cursor-based pagination (used, for example, by Twitter's API and GitHub's GraphQL API)
def fetch_cursor_paginated(url, cursor=None):
    """Fetch one page of a cursor-paginated endpoint; return (results, next_cursor)."""
    params = {'limit': 100}
    if cursor:
        params['cursor'] = cursor
    response = requests.get(url, params=params, timeout=10)
    data = response.json()
    return data['results'], data.get('next_cursor')
Definition: Cursor-Based Pagination
A pagination method where the API returns an opaque cursor token that points to the next page of results. Unlike offset-based pagination (?page=3), cursor-based pagination is stable even when new records are inserted between requests — offset-based pagination can skip or duplicate records when the underlying data changes.
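Consuming a cursor-paginated source usually means looping until the server stops returning a cursor. The sketch below shows that loop against an in-memory stand-in, so it runs without a network. `fetch_all_cursor_pages`, `FAKE_PAGES`, and `fake_fetch_page` are hypothetical names, and the `(results, next_cursor)` return shape is an assumption that mirrors `fetch_cursor_paginated` above.

```python
def fetch_all_cursor_pages(fetch_page, max_pages=100):
    """Collect every record from a cursor-paginated source.

    fetch_page(cursor) must return (results, next_cursor);
    next_cursor is None (or empty) on the last page.
    """
    all_results = []
    cursor = None  # None asks for the first page
    for _ in range(max_pages):  # Hard limit guards against an endless cursor chain
        results, cursor = fetch_page(cursor)
        all_results.extend(results)
        if not cursor:
            break
    return all_results


# Hypothetical in-memory stand-in for an API: cursor -> (results, next_cursor)
FAKE_PAGES = {
    None: (["a", "b"], "cursor-1"),
    "cursor-1": (["c", "d"], "cursor-2"),
    "cursor-2": (["e"], None),
}


def fake_fetch_page(cursor):
    return FAKE_PAGES[cursor]


records = fetch_all_cursor_pages(fake_fetch_page)
print(records)  # ['a', 'b', 'c', 'd', 'e']
```

Against a real endpoint, `fetch_page` could be `lambda cursor: fetch_cursor_paginated(url, cursor)`, ideally with a short `time.sleep()` between pages.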
Web Scraping with BeautifulSoup
When to Scrape vs Use API
| Use API When | Scrape When |
|---|---|
| API exists and is documented | No API available |
| Data is structured | Data is in HTML pages |
| Rate limits are reasonable | Need public data only |
| Authentication is simple | Simple public pages |
HTML Basics
<html>
  <head><title>Page Title</title></head>
  <body>
    <h1 class="title">Main Heading</h1>
    <div id="content">
      <p>First paragraph</p>
      <p class="highlight">Second paragraph</p>
      <ul>
        <li><a href="/page1">Link 1</a></li>
        <li><a href="/page2">Link 2</a></li>
      </ul>
    </div>
  </body>
</html>
BeautifulSoup Basics
from bs4 import BeautifulSoup
import requests
# Fetch and parse a page
url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Find elements
quotes = soup.find_all('div', class_='quote')
print(f"Found {len(quotes)} quotes")
for quote in quotes[:3]:
    text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    print(f'"{text[:60]}..." - {author}')
CSS Selectors
Definition: CSS Selector
A pattern used to select and style HTML elements. In web scraping, CSS selectors provide a concise syntax for targeting specific elements: div.quote > span selects <span> elements that are direct children of <div class="quote">. The select() method returns all matching elements, while select_one() returns only the first match.
# CSS selectors for precise targeting
soup.select('div.quote') # All divs with class "quote"
soup.select('div.quote > span') # Direct children
soup.select('a[href^="/page"]') # Links starting with "/page"
soup.select('p.highlight') # p with class "highlight"
soup.select('#content > p') # p inside id="content"
soup.select('ul li:nth-child(2)') # Second li in each ul
Extracting Data
import time

def scrape_quotes():
    """Scrape all quotes from quotes.toscrape.com."""
    quotes = []
    page = 1
    while True:
        url = f"https://quotes.toscrape.com/page/{page}/"
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        page_quotes = soup.find_all('div', class_='quote')
        if not page_quotes:  # No quotes on this page = past the last page
            break
        for q in page_quotes:
            quotes.append({
                'text': q.find('span', class_='text').get_text(),
                'author': q.find('small', class_='author').get_text(),
                'tags': [tag.get_text() for tag in q.find_all('a', class_='tag')]
            })
        page += 1
        time.sleep(1)  # Be polite
    return quotes

quotes = scrape_quotes()
print(f"Total quotes scraped: {len(quotes)}")
Polite scraping checklist: Before scraping any site: (1) Check robots.txt at the root URL, (2) Set a descriptive User-Agent header with your contact info, (3) Add delays between requests (at least 1 second), (4) Only scrape publicly accessible pages, (5) Cache responses to avoid re-fetching. Violating these can result in IP bans or legal action.
Ethical Web Scraping
Rules to Follow
1. Check robots.txt first:
   https://example.com/robots.txt
     User-agent: *
     Disallow: /private/
     Crawl-delay: 10   # Wait 10 seconds between requests
2. Respect rate limits:
   - Add delays between requests (time.sleep)
   - Don't send more than 1 request per second
   - Use sessions to reuse connections
3. Identify yourself:
   - Set a descriptive User-Agent header
   - Include your email in the User-Agent
   - Example: "MyDataBot/1.0 (contact@example.com)"
4. Only scrape public data:
   - Don't bypass authentication
   - Don't scrape personal/private information
   - Respect terms of service
5. Cache responses:
   - Save responses to avoid re-fetching
   - Use If-Modified-Since headers
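Rule 5 can be as simple as saving each page to disk under a name derived from its URL. The sketch below is one possible shape, not a full HTTP cache: `fetch_cached` is a hypothetical helper, the `fetch` argument is any function that takes a URL and returns text, and the demo uses a fake fetcher so that it runs offline.

```python
import hashlib
import tempfile
from pathlib import Path


def fetch_cached(url, fetch, cache_dir):
    """Return the page text for `url`, calling fetch(url) only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length file name
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(url)
    path.write_text(text, encoding="utf-8")
    return text


# Demo with a fake fetcher that counts how often it is called
call_count = 0

def fake_fetch(url):
    global call_count
    call_count += 1
    return f"<html><body>Content of {url}</body></html>"

with tempfile.TemporaryDirectory() as tmp:
    first = fetch_cached("https://example.com/page1", fake_fetch, tmp)
    second = fetch_cached("https://example.com/page1", fake_fetch, tmp)
    print(first == second, call_count)  # True 1
```

With `requests`, the fetcher could be `lambda u: requests.get(u, timeout=10).text`. This sketch never expires entries, so a real pipeline would also store a timestamp or delete old files.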
Definition: robots.txt Protocol
A standard file placed at the root of a website that tells web crawlers which paths they may or may not fetch. The file uses a simple syntax: User-agent: * applies to all crawlers, Disallow: /path/ blocks access to specific paths, and Crawl-delay: N (a non-standard extension that not every crawler honors) requests a minimum number of seconds between requests. robots.txt is voluntary (not enforced technically), but ignoring it is considered unethical and may add legal risk, for example under a site's terms of service or, in the United States, the Computer Fraud and Abuse Act.
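Python's standard library can evaluate these rules. The sketch below parses the example file from the rules list directly from a string, so no request is made. `ROBOTS_TXT`, `load_rules`, and the `MyDataBot/1.0` agent string are names chosen for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules, matching the robots.txt shown earlier in this section
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
agent = "MyDataBot/1.0"

print(rules.can_fetch(agent, "https://example.com/public/page.html"))   # True
print(rules.can_fetch(agent, "https://example.com/private/data.html"))  # False
print(rules.crawl_delay(agent))                                         # 10
```

For a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` downloads the file instead of parsing a string.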
Complete Data Collection Pipeline
import requests
import pandas as pd
import json
import time

class DataCollector:
    """Complete data collection pipeline from APIs."""

    def __init__(self, base_url, rate_limit=1.0):
        self.base_url = base_url.rstrip('/')
        self.rate_limit = rate_limit  # Minimum seconds between requests
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'DataSciencePipeline/1.0'
        })
        self.last_request_time = 0

    def _rate_limit(self):
        """Enforce rate limiting."""
        elapsed = time.time() - self.last_request_time
        if elapsed < self.rate_limit:
            time.sleep(self.rate_limit - elapsed)
        self.last_request_time = time.time()

    def fetch(self, endpoint, params=None):
        """Fetch data from API endpoint."""
        self._rate_limit()
        url = f"{self.base_url}/{endpoint}"
        try:
            response = self.session.get(url, params=params, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            print(f"Error fetching {url}: {e}")
            return None

    def fetch_paginated(self, endpoint, params=None, max_pages=10, page_param='page'):
        """Fetch up to max_pages pages of data."""
        all_data = []
        params = dict(params or {})  # Copy so the caller's dict is not mutated
        for page in range(1, max_pages + 1):
            params[page_param] = page
            data = self.fetch(endpoint, params)
            if not data:  # None (request failed) or empty page: stop
                break
            all_data.extend(data)
            print(f"  Fetched page {page}: {len(data)} records")
        return all_data

    def to_dataframe(self, data):
        """Convert collected data to DataFrame."""
        return pd.DataFrame(data)

    def save(self, data, filename, format='csv'):
        """Save collected data."""
        if format == 'csv':
            pd.DataFrame(data).to_csv(filename, index=False)
        elif format == 'json':
            with open(filename, 'w') as f:
                json.dump(data, f, indent=2, default=str)
        elif format == 'parquet':
            pd.DataFrame(data).to_parquet(filename, index=False)
        else:
            raise ValueError(f"Unsupported format: {format}")
        print(f"Saved {len(data)} records to {filename}")

# Usage
collector = DataCollector("https://jsonplaceholder.typicode.com")
# Fetch posts. The page parameter name varies by API; JSONPlaceholder is
# assumed here to use "_page" (the json-server convention).
posts = collector.fetch_paginated("posts", max_pages=5, page_param="_page")
df = collector.to_dataframe(posts)
print(f"\nPosts DataFrame: {df.shape}")
print(df.head())
# Save
collector.save(posts, "posts.csv")
Why use requests.Session(): A Session object keeps TCP connections open and reuses them for later requests to the same host. Without a session, each request opens and closes a new TCP connection (plus a TLS handshake for HTTPS). For 100 requests to the same API, a Session can reduce total time noticeably by reusing the connection pool. The 30-50% range is a rough guide, and the actual saving depends on network latency and handshake cost.
API Authentication Patterns
| Pattern | How It Works | Example |
|---|---|---|
| API Key | Key in header or query param | ?api_key=abc123 |
| Bearer Token | Token in Authorization header | Authorization: Bearer token |
| OAuth 2.0 | Delegated authorization flow | Google APIs, GitHub |
| Basic Auth | Username:Password base64 encoded | Authorization: Basic dXNlcjpwYXNz |
| HMAC | Signed requests with secret key | AWS APIs |
Definition: OAuth 2.0
An authorization framework that enables third-party applications to obtain limited access to a user's resources without exposing credentials. The flow works by redirecting the user (the resource owner) to the provider's authorization server, where they grant consent. The authorization server then issues an access token that the client uses to make API requests. Access tokens are typically short-lived and can be renewed with a refresh token without asking the user to log in again.
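Two rows of the table above, Basic Auth and HMAC, can be illustrated with the standard library alone. This is a sketch of the general idea: `basic_auth_header`, `hmac_signature`, and `verify_signature` are hypothetical helpers, the secret and message are made up, and real services such as AWS define their own, more involved signing procedure.

```python
import base64
import hashlib
import hmac


def basic_auth_header(username, password):
    """Build the Authorization header for HTTP Basic Auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def hmac_signature(secret, message):
    """Sign a message with a shared secret using HMAC-SHA256 (hex digest)."""
    return hmac.new(secret.encode("utf-8"), message.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_signature(secret, message, signature):
    """Check a signature using a constant-time comparison."""
    return hmac.compare_digest(hmac_signature(secret, message), signature)


print(basic_auth_header("user", "pass"))  # {'Authorization': 'Basic dXNlcjpwYXNz'}

# Hypothetical request description to sign: method, path, and a timestamp
message = "GET\n/v1/users\n2024-01-01T00:00:00Z"
signature = hmac_signature("my-secret-key", message)
print(verify_signature("my-secret-key", message, signature))        # True
print(verify_signature("my-secret-key", message + "x", signature))  # False
```

Base64 is an encoding, not encryption, so Basic Auth is only safe over HTTPS. With `requests`, passing `auth=(username, password)` builds the same header.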
Key Takeaways
Summary: Web Scraping & API Data
- APIs are preferred over scraping: they provide structured, documented, and rate-limited access to data. Always check if an API exists before writing a scraper. APIs are stable contracts; scrapers break whenever HTML changes.
- Always check robots.txt and respect rate limits. The Crawl-delay directive specifies minimum seconds between requests. Ignoring robots.txt is unethical and may have legal consequences.
- Handle errors gracefully with retries and exponential backoff. The 429 Too Many Requests status code is a signal to slow down, so always honor it. Use requests.Session() for connection pooling and HTTPAdapter for configurable retry strategies.
- Paginate large results instead of trying to fetch everything at once. Cursor-based pagination is more stable than offset-based when data changes between requests.
- Cache API responses to avoid re-fetching. Use If-Modified-Since headers and store responses locally with timestamps.
- Convert collected data to DataFrames for analysis. Use pd.DataFrame() directly on JSON responses, and save as Parquet for large datasets.
- Identify yourself with a descriptive User-Agent string including your email. This allows server admins to contact you if your requests cause issues.
Practice Exercises
1. Public API Exploration: Use the GitHub API (api.github.com) to fetch your repositories and create a DataFrame of repo names, stars, and languages.
2. Multi-Source Collection: Collect data from 3 different public APIs (weather, news, crypto) and merge them into a single analysis.
3. Web Scraping Challenge: Scrape a table from Wikipedia (e.g., list of countries by GDP) and clean it into a usable DataFrame.
4. Rate-Limited Pipeline: Build a data collector that handles rate limits (429 errors), retries failed requests, and saves progress incrementally.