What Is Web Scraping?

Web scraping is the automated process of extracting data from websites. Instead of manually copying information from pages, you write a program that fetches the HTML, parses it and pulls out exactly the data you need — then saves it in a structured format like CSV, JSON or a database.

Common real-world use cases include:

  • Price monitoring — track competitor prices across e-commerce sites
  • Lead generation — collect business emails and contact details
  • Content aggregation — pull articles, listings or job postings
  • Research datasets — collect data for analysis or machine learning
  • SEO monitoring — track rankings, backlinks and page changes
⚠️ Before you scrape: Always check the site's robots.txt (e.g. https://example.com/robots.txt) and its Terms of Service. Scraping personal data without consent may violate the GDPR or other privacy laws. Only scrape publicly available, non-personal data for legitimate purposes.
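You can check robots.txt rules programmatically with Python's built-in urllib.robotparser. A minimal sketch — the sample rules and the bot name are made up for illustration; in practice you would point the parser at the live file:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Against a real site you would do:
#   rp.set_url("https://example.com/robots.txt")
#   rp.read()
# Here we parse sample rules inline to show the logic:
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

print(rp.can_fetch("MyScraperBot/1.0", "https://example.com/products"))     # True
print(rp.can_fetch("MyScraperBot/1.0", "https://example.com/admin/users"))  # False
```

Call can_fetch() before every new section of a site you crawl; if it returns False, skip that URL.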

Choosing the Right Scraping Library

Python has a rich ecosystem of scraping libraries. Here is how they compare so you can pick the right one for your project:

Library                    Best For                                 JS Support   Speed       Difficulty
requests + BeautifulSoup   Static HTML pages, beginners             ❌ No        Fast        Easy
httpx                      Async scraping, HTTP/2                   ❌ No        Very Fast   Easy
Scrapy                     Large-scale crawlers, spiders            ❌ No        Very Fast   Medium
Playwright                 JS-rendered SPAs, login flows            ✅ Yes       Slower      Medium
Selenium                   Legacy JS scraping, browser automation   ✅ Yes       Slow        Medium

For this tutorial we start with requests + BeautifulSoup — enough for most static sites — and then cover Playwright for JavaScript-heavy pages.

Step 1 — Install and Setup

🐍 Bash — Install Libraries
# Core scraping stack
pip install requests beautifulsoup4 lxml fake-useragent

# For JavaScript-rendered pages
pip install playwright
playwright install chromium

# For async scraping
pip install httpx

# DNS validation (used in email scraping)
pip install dnspython

Step 2 — Your First Scraper

Let us build a scraper step by step. We will fetch a web page, inspect its HTML structure and extract specific data using BeautifulSoup's powerful CSS selector syntax.

1. Fetch the page HTML

   Use requests.get(url) to download the raw HTML. Always set a User-Agent header — without it many sites return a 403 error or serve a bot-detection page.

2. Parse with BeautifulSoup

   Pass the HTML to BeautifulSoup(html, "lxml"). This creates a parse tree you can navigate like a Python object. Prefer the lxml parser — it is faster and more lenient than the built-in html.parser.

3. Find elements with CSS selectors

   Use soup.select("css-selector") to find elements. Right-click any element in your browser and choose "Inspect" to see its HTML and work out the right selector.

4. Extract and clean the text

   Use .get_text(strip=True) to get clean text from any element, or tag["attribute"] to get attribute values like href or src.

🐍 Python — Basic Scraper
import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the page
url = "https://books.toscrape.com/"  # A free, legal scraping practice site
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # Raise error if request failed

# Step 2: Parse HTML
soup = BeautifulSoup(response.text, "lxml")

# Step 3: Find all book elements
books = soup.select("article.product_pod")

print(f"Found {len(books)} books on this page\n")

# Step 4: Extract data from each book
for book in books[:5]:  # First 5 only
    title  = book.select_one("h3 a")["title"]
    price  = book.select_one("p.price_color").get_text(strip=True)
    rating = book.select_one("p.star-rating")["class"][1]  # "Three", "Four" etc.
    avail  = book.select_one("p.availability").get_text(strip=True)

    print(f"📚 {title}")
    print(f"   💰 {price}  ⭐ {rating}  📦 {avail}\n")

Step 3 — CSS Selectors Cheat Sheet

CSS selectors are the most powerful way to target elements in BeautifulSoup. Here are the patterns you will use most of the time:

🐍 Python — BeautifulSoup Selector Patterns
# ── By tag name ────────────────────────────────────
soup.select("h1")             # All <h1> tags
soup.select_one("title")      # First <title> tag

# ── By class ───────────────────────────────────────
soup.select(".product-card")  # All elements with class="product-card"
soup.select("div.container") # <div> elements with class="container"

# ── By ID ──────────────────────────────────────────
soup.select_one("#main-content") # Element with id="main-content"

# ── By attribute ───────────────────────────────────
soup.select("a[href]")          # All <a> tags with an href attribute
soup.select("a[href^='https']") # href starting with "https"
soup.select("img[src*='.jpg']") # src containing ".jpg"

# ── Nested selectors ───────────────────────────────
soup.select("table tr td")       # <td> inside <tr> inside <table>
soup.select("ul.menu > li")      # Direct <li> children of <ul.menu>

# ── Extracting data ────────────────────────────────
el.get_text(strip=True)         # Clean inner text
el.get_text(separator=" ")      # Text with spaces between tags
el["href"]                       # Value of href attribute
el.get("src", "")               # Safe attribute access (no KeyError)
el.find_parent("div")           # Navigate up to parent <div>

Step 4 — Handling Pagination

Most real-world scraping targets span multiple pages. There are two common patterns: next-page links (follow the "Next" button) and numbered URLs (increment a page number in the URL). Here is how to handle both:

🐍 Python — Pagination Scraper
import requests
import time
import random
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://books.toscrape.com/"
HEADERS  = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def scrape_all_pages():
    all_books   = []
    current_url = BASE_URL

    while current_url:
        print(f"📄 Scraping: {current_url}")
        response = requests.get(current_url, headers=HEADERS, timeout=10)
        soup     = BeautifulSoup(response.text, "lxml")

        # Extract books from this page
        for book in soup.select("article.product_pod"):
            all_books.append({
                "title": book.select_one("h3 a")["title"],
                "price": book.select_one("p.price_color").get_text(strip=True),
            })

        # Find the "next" page link
        next_btn = soup.select_one("li.next a")
        if next_btn:
            # Build absolute URL from relative href
            current_url = urljoin(current_url, next_btn["href"])
            time.sleep(random.uniform(1.0, 2.5))  # Polite delay
        else:
            current_url = None  # No more pages — stop loop

    print(f"\n✅ Total books scraped: {len(all_books)}")
    return all_books

books = scrape_all_pages()
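The second pattern — numbered URLs — is even simpler: increment a page number until the site stops responding. A sketch against the same practice site, whose catalogue pages follow a page-N.html pattern (check your target's URL scheme in the browser first):

```python
import time
import random
import requests
from bs4 import BeautifulSoup

BASE    = "https://books.toscrape.com/catalogue/page-{}.html"
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def scrape_numbered_pages(max_pages: int = 3) -> list:
    titles = []
    for page_num in range(1, max_pages + 1):
        url      = BASE.format(page_num)
        response = requests.get(url, headers=HEADERS, timeout=10)
        if response.status_code == 404:  # Past the last page — stop
            break
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        for book in soup.select("article.product_pod"):
            titles.append(book.select_one("h3 a")["title"])
        time.sleep(random.uniform(1.0, 2.5))  # Polite delay
    return titles

# titles = scrape_numbered_pages(max_pages=2)
```

The 404 check doubles as the stop condition, so you do not need to know the total page count in advance.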

Step 5 — Exporting Data to CSV

Once you have your data in a Python list of dictionaries, exporting to CSV takes just a few lines. CSV is the universal format for scraped data — it opens in Excel and Google Sheets and can be imported into virtually any database or CRM.

🐍 Python — Save to CSV
import csv
from datetime import datetime

def save_to_csv(data: list, filename: str = None):
    if not data:
        print("No data to save.")
        return

    if not filename:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename  = f"scraped_data_{timestamp}.csv"

    fieldnames = list(data[0].keys())

    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)

    print(f"💾 Saved {len(data)} rows to {filename}")

# Example data
results = [
    {"title": "Book One", "price": "£12.99", "rating": "Four"},
    {"title": "Book Two", "price": "£9.49",  "rating": "Three"},
]
save_to_csv(results, "books.csv")

🕷️ Ready to scrape emails specifically?

Read our dedicated Email Scraping Automation guide — covers regex extraction, MX validation, proxy rotation and full pipeline automation.


Step 6 — Scraping JavaScript Pages with Playwright

Many modern websites are JavaScript-rendered SPAs — the HTML you get from requests is an empty shell, and the actual content is loaded by JavaScript after the page loads. For these you need a real browser. Playwright is the modern choice — it is faster and more reliable than Selenium.

🐍 Python — Playwright Scraper
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_with_playwright(url: str) -> str:
    """Fetch fully rendered HTML from a JavaScript-heavy page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # headless=False to see the browser
        page    = browser.new_page()

        # Set a real user agent
        page.set_extra_http_headers({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })

        page.goto(url, wait_until="networkidle")  # Wait until network traffic goes quiet

        # Optional: wait for a specific element to appear
        page.wait_for_selector(".product-list", timeout=10000)

        html = page.content()  # Get fully rendered HTML
        browser.close()
        return html

# Use exactly like requests — pass html to BeautifulSoup
html = scrape_with_playwright("https://example-spa-site.com/products")
soup = BeautifulSoup(html, "lxml")

# Extract data as normal
products = soup.select(".product-card")
print(f"Found {len(products)} products")

Playwright — Useful Interactions

🐍 Python — Playwright Interactions
# Click a button
page.click("button.load-more")

# Fill and submit a form
page.fill("input[name='search']", "python books")
page.press("input[name='search']", "Enter")

# Scroll to bottom to trigger lazy loading
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_timeout(2000)  # Wait 2 seconds for content to load

# Take a screenshot for debugging
page.screenshot(path="screenshot.png", full_page=True)

# Handle login before scraping
page.goto("https://example.com/login")
page.fill("#email", "user@example.com")
page.fill("#password", "mypassword")
page.click("button[type='submit']")
page.wait_for_url("**/dashboard")  # Wait for redirect after login

Step 7 — Avoiding Blocks and Detection

Websites use various techniques to detect and block scrapers. Here is a complete anti-detection toolkit:

🐍 Python — Anti-Detection Setup
import requests
import time
import random
from fake_useragent import UserAgent

ua = UserAgent()

def get_headers() -> dict:
    """Generate realistic browser headers for each request."""
    return {
        "User-Agent":      ua.random,
        "Accept":          "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT":             "1",
        "Connection":      "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }

PROXIES = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
]

def polite_get(url: str, use_proxy: bool = False) -> requests.Response:
    """Fetch a URL with anti-detection measures."""
    # Random human-like delay
    time.sleep(random.uniform(1.5, 4.0))

    kwargs = {"headers": get_headers(), "timeout": 15}

    if use_proxy and PROXIES:
        proxy = random.choice(PROXIES)
        kwargs["proxies"] = {"http": proxy, "https": proxy}

    return requests.get(url, **kwargs)

# For Cloudflare-protected sites:
# pip install cloudscraper
# import cloudscraper
# scraper = cloudscraper.create_scraper()
# response = scraper.get(url)

Step 8 — Full Reusable Scraper Template

Here is a production-ready scraper template that combines everything — robust fetching, BeautifulSoup parsing, error handling, rate limiting and CSV export — ready to adapt for any target site:

🐍 Python — Production Scraper Template
import requests, csv, time, random, os
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from datetime import datetime
from fake_useragent import UserAgent

ua = UserAgent()

# ── CONFIG ───────────────────────────────────────────
START_URL   = "https://books.toscrape.com/"
OUTPUT_FILE = f"output/books_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
MAX_PAGES   = 50     # Limit pages scraped
DELAY_MIN   = 1.0   # Min seconds between requests
DELAY_MAX   = 3.0   # Max seconds between requests
# ─────────────────────────────────────────────────────

def fetch(url: str) -> BeautifulSoup | None:
    try:
        time.sleep(random.uniform(DELAY_MIN, DELAY_MAX))
        r = requests.get(url,
            headers={"User-Agent": ua.random},
            timeout=15)
        r.raise_for_status()
        return BeautifulSoup(r.text, "lxml")
    except Exception as e:
        print(f"  ⚠️ Failed: {url} — {e}")
        return None

def parse_page(soup: BeautifulSoup) -> list:
    """Extract structured data from one page — CUSTOMIZE THIS."""
    items = []
    for book in soup.select("article.product_pod"):
        items.append({
            "title":  book.select_one("h3 a")["title"],
            "price":  book.select_one("p.price_color").get_text(strip=True),
            "rating": book.select_one("p.star-rating")["class"][1],
            "scraped_at": datetime.now().isoformat()
        })
    return items

def run():
    os.makedirs("output", exist_ok=True)
    all_data    = []
    current_url = START_URL
    page_count  = 0

    while current_url and page_count < MAX_PAGES:
        page_count += 1
        print(f"[{page_count}] {current_url}")

        soup = fetch(current_url)
        if not soup:
            break

        items = parse_page(soup)
        all_data.extend(items)
        print(f"  ✅ Extracted {len(items)} items (total: {len(all_data)})")

        next_btn    = soup.select_one("li.next a")
        current_url = urljoin(current_url, next_btn["href"]) if next_btn else None

    # Save to CSV
    if all_data:
        with open(OUTPUT_FILE, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(all_data[0].keys()))
            writer.writeheader()
            writer.writerows(all_data)
        print(f"\n💾 Saved {len(all_data)} rows → {OUTPUT_FILE}")

if __name__ == "__main__":
    run()

⏰ Schedule your scraper to run automatically

Use our free Cron Generator to schedule your Python scraper on any Linux server — daily, hourly or weekly, no syntax memorizing needed.


Best Practices Summary

  • Always build absolute URLs from relative hrefs — use urljoin(base_url, href)
  • Add delays between requests — use time.sleep(random.uniform(1, 3)) to avoid rate limiting
  • Rotate user agents — use the fake-useragent library for a different UA on each request
  • Cache raw HTML locally — save fetched HTML to disk so you can rerun parsing without re-crawling
  • Use raise_for_status() — always check that the request succeeded before parsing
  • Handle missing elements gracefully — select_one() returns None when nothing matches, so assign el = soup.select_one(".cls") and check if el: before calling .get_text()
  • Respect robots.txt — use Python's built-in urllib.robotparser to check before scraping
  • Use a Session object — requests.Session() reuses connections and cookies across requests, improving speed
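The caching and Session points combine naturally into one helper. A sketch — the cache directory name and the hashing scheme are arbitrary choices:

```python
import os
import hashlib
import requests

session = requests.Session()  # Reuses TCP connections and cookies across requests
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})

CACHE_DIR = "html_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

def cached_get(url: str) -> str:
    """Return page HTML, reading from the local cache when available."""
    key  = hashlib.sha256(url.encode()).hexdigest()  # Stable filename per URL
    path = os.path.join(CACHE_DIR, f"{key}.html")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()  # Cache hit — no network request
    response = session.get(url, timeout=15)
    response.raise_for_status()
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)  # Cache miss — fetch once, save to disk
    return response.text
```

With this in place you can tweak your parsing code and rerun it as often as you like without hitting the target site again.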