Python Web Scraping — BeautifulSoup & Beyond
Web scraping is the automated extraction of data from web pages. Common uses include data collection, price monitoring, and research.
Learning Objectives
- Parse HTML with BeautifulSoup
- Handle pagination and dynamic content
- Respect robots.txt and rate limits
- Store scraped data efficiently
BeautifulSoup Basics
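from bs4 import BeautifulSoup
import requests
# Fetch the page; the timeout keeps a stalled server from hanging the script
response = requests.get('https://example.com', timeout=10)
response.raise_for_status()  # Stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')
# Find elements
h1 = soup.find('h1')  # find() returns None when nothing matches
title = h1.text if h1 else None
links = soup.find_all('a')
for link in links:
    print(link.get('href'), link.text)
# CSS selectors
items = soup.select('div.product > h2.title')  # h2.title directly inside div.product
prices = soup.select('.price')  # Every element with class "price"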
Structured Scraping
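def scrape_products(url):
    """Return a list of product dicts scraped from one listing page."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    products = []
    for card in soup.select('.product-card'):
        # select_one() returns None when a field is missing, so this assumes
        # every card contains all four elements
        product = {
            'name': card.select_one('.title').text.strip(),
            # Strip whitespace first, then the leading currency symbol
            'price': float(card.select_one('.price').text.strip().lstrip('$')),
            'rating': float(card.select_one('.rating').text),
            'url': card.select_one('a')['href']
        }
        products.append(product)
    return products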
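Price text on real pages is often messier than a bare "$19.99": it can carry thousands separators, surrounding whitespace, or no number at all. The following is a minimal sketch of a more forgiving parser, assuming US-style formatting (comma as thousands separator, dot as decimal point). The name parse_price is hypothetical and not part of BeautifulSoup.

```python
import re
from typing import Optional

# Matches the first run of digits with an optional decimal part.
_PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(text: str) -> Optional[float]:
    """Extract a price from text such as '$1,299.00'.

    Hypothetical helper: assumes commas are thousands separators.
    Returns None when the text contains no number.
    """
    cleaned = text.replace(",", "").strip()
    match = _PRICE_RE.search(cleaned)
    return float(match.group()) if match else None
```

Inside scrape_products, the 'price' entry could then call parse_price(card.select_one('.price').text) and decide separately how to handle a None result.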
Handling Pagination
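import time
def scrape_all_pages(base_url):
    """Collect products from ?page=1, ?page=2, ... until a page has none."""
    all_products = []
    page = 1
    while True:
        url = f"{base_url}?page={page}"
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        products = soup.select('.product')
        if not products:
            break  # An empty page means we are past the last one
        for product in products:
            # parse_product() is not defined in this section; it is expected
            # to turn one product element into a dict
            all_products.append(parse_product(product))
        page += 1
        time.sleep(1)  # Be polite
    return all_products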
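Pausing between requests is one part of polite scraping; checking robots.txt is the other. The sketch below uses the standard library's urllib.robotparser. The sample rules, the helper names build_robots_parser and is_allowed, and the "my-scraper" user agent are assumptions made for illustration. The rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; a real site publishes its own at /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str,
               user_agent: str = "my-scraper") -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)
```

Against a live site, the parser would typically be pointed at the file with parser.set_url('https://example.com/robots.txt') followed by parser.read(). When the file declares a Crawl-delay, parser.crawl_delay(user_agent) returns it, and that value could replace the fixed one-second sleep.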
Key Takeaways
- Use BeautifulSoup for HTML parsing
- Always add delays between requests
- Check robots.txt before scraping
- Use CSS selectors for precise element targeting
- Store data as JSON or CSV for easy analysis
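As a sketch of the last takeaway, a list of product dicts like the one returned by scrape_products can be written out with the standard library alone. The function names save_json and save_csv are hypothetical, and the CSV version assumes every dict has the same keys as the first one.

```python
import csv
import json


def save_json(products, path):
    """Write the product dicts to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(products, f, indent=2, ensure_ascii=False)


def save_csv(products, path):
    """Write the product dicts to a CSV file with a header row.

    Assumes all dicts share the keys of the first one.
    """
    if not products:
        return  # Nothing to write, so no file is created
    fieldnames = list(products[0].keys())
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(products)
```

JSON preserves types such as floats and nested values, while CSV opens directly in spreadsheets but reads every value back as a string.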