What Is Web Scraping?
Web scraping is the automated process of extracting data from websites. Instead of manually copying information from pages, you write a program that fetches the HTML, parses it and pulls out exactly the data you need — then saves it in a structured format like CSV, JSON or a database.
Common real-world use cases include:
- Price monitoring — track competitor prices across e-commerce sites
- Lead generation — collect business emails and contact details
- Content aggregation — pull articles, listings or job postings
- Research datasets — collect data for analysis or machine learning
- SEO monitoring — track rankings, backlinks and page changes
Before you scrape: Always check the site's robots.txt (e.g. https://example.com/robots.txt) and Terms of Service. Scraping personal data without consent may violate GDPR or other privacy laws. Only scrape publicly available, non-personal data for legitimate purposes.
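Python's standard library can do the robots.txt check for you: urllib.robotparser parses the rules and answers per-URL queries. A minimal sketch, fed invented rules instead of a live robots.txt so it runs offline:

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules — fed inline here (hypothetical rules for illustration)
# instead of fetching https://example.com/robots.txt over the network
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyScraper", "https://example.com/products"))      # True — allowed
print(rp.can_fetch("MyScraper", "https://example.com/private/data"))  # False — disallowed
```

Against a live site you would instead call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()`, then query `can_fetch()` the same way.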
Choosing the Right Scraping Library
Python has a rich ecosystem of scraping libraries. Here is how they compare so you can pick the right one for your project:
| Library | Best For | JS Support | Speed | Difficulty |
|---|---|---|---|---|
| requests + BeautifulSoup | Static HTML pages, beginners | ❌ No | Fast | Easy |
| httpx | Async scraping, HTTP/2 | ❌ No | Very Fast | Easy |
| Scrapy | Large-scale crawlers, spiders | ❌ No | Very Fast | Medium |
| Playwright | JS-rendered SPAs, login flows | ✅ Yes | Slower | Medium |
| Selenium | Legacy JS scraping, browser automation | ✅ Yes | Slow | Medium |
For this tutorial we start with requests + BeautifulSoup (enough for most static sites) and then cover Playwright for JavaScript-heavy pages.
Step 1 — Install and Setup
# Core scraping stack
pip install requests beautifulsoup4 lxml fake-useragent
# For JavaScript-rendered pages
pip install playwright
playwright install chromium
# For async scraping
pip install httpx
# DNS validation (used in email scraping)
pip install dnspython
Step 2 — Your First Scraper
Let us build a scraper step by step. We will fetch a web page, inspect its HTML structure and extract specific data using BeautifulSoup's powerful CSS selector syntax.
Fetch the page HTML
Use requests.get(url) to download the raw HTML. Always set a User-Agent header — without it many sites return a 403 error or serve a bot-detection page.
Parse with BeautifulSoup
Pass the HTML to BeautifulSoup(html, "lxml"). This creates a parse tree you can navigate like a Python object. Always use the lxml parser — it is faster and more lenient than the built-in html.parser.
Find elements with CSS selectors
Use soup.select("css-selector") to find elements. Right-click any element in your browser and choose "Inspect" to see its HTML and figure out the right selector.
Extract and clean the text
Use .get_text(strip=True) to get clean text from any element, or tag["attribute"] to get attribute values like href or src.
import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the page
url = "https://books.toscrape.com/"  # A free, legal scraping practice site
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # Raise error if request failed

# Step 2: Parse HTML
soup = BeautifulSoup(response.text, "lxml")

# Step 3: Find all book elements
books = soup.select("article.product_pod")
print(f"Found {len(books)} books on this page\n")

# Step 4: Extract data from each book
for book in books[:5]:  # First 5 only
    title = book.select_one("h3 a")["title"]
    price = book.select_one("p.price_color").get_text(strip=True)
    rating = book.select_one("p.star-rating")["class"][1]  # "Three", "Four" etc.
    avail = book.select_one("p.availability").get_text(strip=True)
    print(f"📚 {title}")
    print(f"   💰 {price}  ⭐ {rating}  📦 {avail}\n")
Step 3 — CSS Selectors Cheat Sheet
CSS selectors are the most powerful way to target elements in BeautifulSoup. Here are the patterns you will use 90% of the time:
# ── By tag name ────────────────────────────────────
soup.select("h1") # All <h1> tags
soup.select_one("title") # First <title> tag
# ── By class ───────────────────────────────────────
soup.select(".product-card") # All elements with class="product-card"
soup.select("div.container") # <div> elements with class="container"
# ── By ID ──────────────────────────────────────────
soup.select_one("#main-content") # Element with id="main-content"
# ── By attribute ───────────────────────────────────
soup.select("a[href]") # All <a> tags with an href attribute
soup.select("a[href^='https']") # href starting with "https"
soup.select("img[src*='.jpg']") # src containing ".jpg"
# ── Nested selectors ───────────────────────────────
soup.select("table tr td") # <td> inside <tr> inside <table>
soup.select("ul.menu > li") # Direct <li> children of <ul.menu>
# ── Extracting data ────────────────────────────────
el.get_text(strip=True) # Clean inner text
el.get_text(separator=" ") # Text with spaces between tags
el["href"] # Value of href attribute
el.get("src", "") # Safe attribute access (no KeyError)
el.find_parent("div") # Navigate up to parent <div>
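To try these selectors without fetching anything, you can run them against a small inline snippet. The markup below is invented purely for illustration:

```python
from bs4 import BeautifulSoup

# A tiny invented page to exercise the selectors above
html = """
<div id="main-content">
  <ul class="menu">
    <li><a href="https://example.com/a">Link A</a></li>
    <li><a href="/relative/b">Link B</a></li>
  </ul>
  <img src="photo.jpg">
</div>
"""
soup = BeautifulSoup(html, "html.parser")  # built-in parser avoids the lxml dependency for this offline demo

print(soup.select_one("#main-content")["id"])   # main-content
print(len(soup.select("ul.menu > li")))          # 2 — direct <li> children
print(len(soup.select("a[href^='https']")))      # 1 — only the absolute https link
print(soup.select_one("img").get("src", ""))     # photo.jpg
```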
Step 4 — Handling Pagination
Most real-world scraping targets span multiple pages. There are two common patterns: next-page links (follow the "Next" button) and numbered URLs (increment a page number in the URL). Here is how to handle both:
import requests
import time
import random
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://books.toscrape.com/"
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def scrape_all_pages():
    all_books = []
    current_url = BASE_URL
    while current_url:
        print(f"📄 Scraping: {current_url}")
        response = requests.get(current_url, headers=HEADERS, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")
        # Extract books from this page
        for book in soup.select("article.product_pod"):
            all_books.append({
                "title": book.select_one("h3 a")["title"],
                "price": book.select_one("p.price_color").get_text(strip=True),
            })
        # Find the "next" page link
        next_btn = soup.select_one("li.next a")
        if next_btn:
            # Build absolute URL from relative href
            current_url = urljoin(current_url, next_btn["href"])
            time.sleep(random.uniform(1.0, 2.5))  # Polite delay
        else:
            current_url = None  # No more pages — stop loop
    print(f"\n✅ Total books scraped: {len(all_books)}")
    return all_books

books = scrape_all_pages()
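The second pattern, numbered URLs, needs no "Next" link at all: you format a page number into a URL template and stop when a page 404s or comes back empty. A sketch of the URL-building half; the template below mirrors books.toscrape.com's real catalogue URLs, but adapt it to your target:

```python
def page_urls(template: str, start: int, stop: int) -> list[str]:
    """Build the list of page URLs for a numbered-pagination site."""
    return [template.format(n=n) for n in range(start, stop + 1)]

urls = page_urls("https://books.toscrape.com/catalogue/page-{n}.html", 1, 3)
for url in urls:
    print(url)
    # ...fetch, parse, and sleep here, exactly as in scrape_all_pages().
    # In practice, stop early when a request returns 404 or a page has no items.
```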
Step 5 — Exporting Data to CSV
Once you have your data in a Python list of dictionaries, exporting to CSV takes just a few lines. CSV is the universal format for scraped data — it opens in Excel, Google Sheets and imports into every database and CRM.
import csv
from datetime import datetime

def save_to_csv(data: list, filename: str = None):
    if not data:
        print("No data to save.")
        return
    if not filename:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"scraped_data_{timestamp}.csv"
    fieldnames = list(data[0].keys())
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)
    print(f"💾 Saved {len(data)} rows to {filename}")

# Example data
results = [
    {"title": "Book One", "price": "£12.99", "rating": "Four"},
    {"title": "Book Two", "price": "£9.49", "rating": "Three"},
]
save_to_csv(results, "books.csv")
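The same list of dictionaries exports to JSON just as easily, which helps when your data is nested or headed for an API. A sketch mirroring save_to_csv (the save_to_json helper is our own, not part of any library):

```python
import json

def save_to_json(data: list, filename: str) -> None:
    """Write scraped rows to a JSON file (UTF-8, human-readable)."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    print(f"💾 Saved {len(data)} rows to {filename}")

save_to_json([
    {"title": "Book One", "price": "£12.99", "rating": "Four"},
    {"title": "Book Two", "price": "£9.49", "rating": "Three"},
], "books.json")
```

`ensure_ascii=False` keeps characters like £ readable instead of escaping them to `\u00a3`.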
Step 6 — Scraping JavaScript Pages with Playwright
Many modern websites are JavaScript-rendered SPAs — the HTML you get from requests is an empty shell, and the actual content is loaded by JavaScript after the page loads. For these you need a real browser. Playwright is the modern choice — it is generally faster and more reliable than Selenium.
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_with_playwright(url: str) -> str:
    """Fetch fully rendered HTML from a JavaScript-heavy page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # headless=False to see the browser
        page = browser.new_page()
        # Set a real user agent
        page.set_extra_http_headers({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })
        page.goto(url, wait_until="networkidle")  # Wait for all JS to finish
        # Optional: wait for a specific element to appear
        page.wait_for_selector(".product-list", timeout=10000)
        html = page.content()  # Get fully rendered HTML
        browser.close()
        return html

# Use exactly like requests — pass the HTML to BeautifulSoup
html = scrape_with_playwright("https://example-spa-site.com/products")
soup = BeautifulSoup(html, "lxml")

# Extract data as normal
products = soup.select(".product-card")
print(f"Found {len(products)} products")
Playwright — Useful Interactions
# Click a button
page.click("button.load-more")
# Fill and submit a form
page.fill("input[name='search']", "python books")
page.press("input[name='search']", "Enter")
# Scroll to bottom to trigger lazy loading
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_timeout(2000) # Wait 2 seconds for content to load
# Take a screenshot for debugging
page.screenshot(path="screenshot.png", full_page=True)
# Handle login before scraping
page.goto("https://example.com/login")
page.fill("#email", "user@example.com")
page.fill("#password", "mypassword")
page.click("button[type='submit']")
page.wait_for_url("**/dashboard") # Wait for redirect after login
Step 7 — Avoiding Blocks and Detection
Websites use various techniques to detect and block scrapers. Here is a complete anti-detection toolkit:
import requests
import time
import random
from fake_useragent import UserAgent

ua = UserAgent()

def get_headers() -> dict:
    """Generate realistic browser headers for each request."""
    return {
        "User-Agent": ua.random,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }

PROXIES = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
]

def polite_get(url: str, use_proxy: bool = False) -> requests.Response:
    """Fetch a URL with anti-detection measures."""
    # Random human-like delay
    time.sleep(random.uniform(1.5, 4.0))
    kwargs = {"headers": get_headers(), "timeout": 15}
    if use_proxy and PROXIES:
        proxy = random.choice(PROXIES)
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    return requests.get(url, **kwargs)

# For Cloudflare-protected sites:
# pip install cloudscraper
# import cloudscraper
# scraper = cloudscraper.create_scraper()
# response = scraper.get(url)
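One more layer worth adding: retries with exponential backoff, so a transient 429 or 503 does not kill the whole run. A generic sketch; the helper accepts any zero-argument function, so nothing here touches the network:

```python
import time
import random

def retry_with_backoff(fn, retries: int = 3, base_delay: float = 1.0):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # Out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Retry {attempt + 1}/{retries} after {delay:.1f}s")
            time.sleep(delay)

# Usage with the polite_get helper above (not executed here):
#   response = retry_with_backoff(lambda: polite_get(url))
```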
Step 8 — Full Reusable Scraper Template
Here is a production-ready scraper template that combines everything — robust fetching, BeautifulSoup parsing, error handling, rate limiting and CSV export — ready to adapt for any target site:
import requests, csv, time, random, os
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from datetime import datetime
from fake_useragent import UserAgent

ua = UserAgent()

# ── CONFIG ───────────────────────────────────────────
START_URL = "https://books.toscrape.com/"
OUTPUT_FILE = f"output/books_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
MAX_PAGES = 50     # Limit pages scraped
DELAY_MIN = 1.0    # Min seconds between requests
DELAY_MAX = 3.0    # Max seconds between requests
# ─────────────────────────────────────────────────────

def fetch(url: str) -> BeautifulSoup | None:
    try:
        time.sleep(random.uniform(DELAY_MIN, DELAY_MAX))
        r = requests.get(url,
                         headers={"User-Agent": ua.random},
                         timeout=15)
        r.raise_for_status()
        return BeautifulSoup(r.text, "lxml")
    except Exception as e:
        print(f"  ⚠️ Failed: {url} — {e}")
        return None

def parse_page(soup: BeautifulSoup) -> list:
    """Extract structured data from one page — CUSTOMIZE THIS."""
    items = []
    for book in soup.select("article.product_pod"):
        items.append({
            "title": book.select_one("h3 a")["title"],
            "price": book.select_one("p.price_color").get_text(strip=True),
            "rating": book.select_one("p.star-rating")["class"][1],
            "scraped_at": datetime.now().isoformat(),
        })
    return items

def run():
    os.makedirs("output", exist_ok=True)
    all_data = []
    current_url = START_URL
    page_count = 0
    while current_url and page_count < MAX_PAGES:
        page_count += 1
        print(f"[{page_count}] {current_url}")
        soup = fetch(current_url)
        if not soup:
            break
        items = parse_page(soup)
        all_data.extend(items)
        print(f"  ✅ Extracted {len(items)} items (total: {len(all_data)})")
        next_btn = soup.select_one("li.next a")
        current_url = urljoin(current_url, next_btn["href"]) if next_btn else None
    # Save to CSV
    if all_data:
        with open(OUTPUT_FILE, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(all_data[0].keys()))
            writer.writeheader()
            writer.writerows(all_data)
        print(f"\n💾 Saved {len(all_data)} rows → {OUTPUT_FILE}")

if __name__ == "__main__":
    run()
Best Practices Summary
- Always build absolute URLs from relative hrefs — use `urljoin(base_url, href)`
- Add delays between requests — use `time.sleep(random.uniform(1, 3))` to avoid rate limiting
- Rotate user agents — use the `fake-useragent` library for a different UA on each request
- Cache raw HTML locally — save fetched HTML to disk so you can rerun parsing without re-crawling
- Use `raise_for_status()` — always check that the request succeeded before parsing
- Handle `None` selectors gracefully — use `el.select_one(".cls")` and check `if el:` before accessing `.get_text()`
- Respect `robots.txt` — use Python's built-in `urllib.robotparser` to check before scraping
- Use a `Session` object — `requests.Session()` reuses connections and cookies across requests, improving speed
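The caching tip deserves a sketch of its own: key each cache file on a hash of the URL and hit the network only on a miss. The cached_fetch helper and cache directory name below are our own inventions; the fetch function is injected, so the pattern works with requests, httpx or Playwright alike:

```python
import hashlib
from pathlib import Path

def cached_fetch(url: str, fetch_fn, cache_dir: str = "html_cache") -> str:
    """Return HTML for url, reading from a local disk cache when possible."""
    cache = Path(cache_dir)
    cache.mkdir(exist_ok=True)
    # One file per URL, named by a hash so any URL is a valid filename
    path = cache / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")  # Cache hit: no network call
    html = fetch_fn(url)  # Cache miss: e.g. requests.get(url).text
    path.write_text(html, encoding="utf-8")
    return html

# Usage with requests:
#   html = cached_fetch(url, lambda u: requests.get(u, timeout=15).text)
```

Rerunning your parser is now free: only the first run touches the site, and every later run reads the saved HTML from disk.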