What Is Web Scraping?
Web scraping is the automated process of extracting data from websites. Instead of manually copying information from pages, you write a program that fetches the HTML, parses it and pulls out exactly the data you need — then saves it in a structured format like CSV, JSON or a database.
Common real-world use cases include:
- Price monitoring — track competitor prices across e-commerce sites
- Lead generation — collect business emails and contact details
- Content aggregation — pull articles, listings or job postings
- Research datasets — collect data for analysis or machine learning
- SEO monitoring — track rankings, backlinks and page changes
Before you scrape: Always check the site's robots.txt (e.g. https://example.com/robots.txt) and Terms of Service. Scraping personal data without consent may violate GDPR or other privacy laws. Only scrape publicly available, non-personal data for legitimate purposes.
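Python's standard library can do the robots.txt check for you: urllib.robotparser parses the rules and answers per-URL queries. A minimal sketch, fed invented rules instead of a live robots.txt so it runs offline:

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules — fed inline here (hypothetical rules for illustration)
# instead of fetching https://example.com/robots.txt over the network
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyScraper", "https://example.com/products"))      # True — allowed
print(rp.can_fetch("MyScraper", "https://example.com/private/data"))  # False — disallowed
```

Against a live site you would instead call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()`, then query `can_fetch()` the same way.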
Choosing the Right Scraping Library
Python has a rich ecosystem of scraping libraries. Here is how they compare so you can pick the right one for your project:
| Library | Best For | JS Support | Speed | Difficulty |
|---|---|---|---|---|
| requests + BeautifulSoup | Static HTML pages, beginners | ❌ No | Fast | Easy |
| httpx | Async scraping, HTTP/2 | ❌ No | Very Fast | Easy |
| Scrapy | Large-scale crawlers, spiders | ❌ No | Very Fast | Medium |
| Playwright | JS-rendered SPAs, login flows | ✅ Yes | Slower | Medium |
| Selenium | Legacy JS scraping, browser automation | ✅ Yes | Slow | Medium |
For this tutorial we start with requests + BeautifulSoup (enough for most static sites) and then cover Playwright for JavaScript-heavy pages.
Step 1 — Install and Setup
# Core scraping stack
pip install requests beautifulsoup4 lxml fake-useragent
# For JavaScript-rendered pages
pip install playwright
playwright install chromium
# For async scraping
pip install httpx
# DNS validation (used in email scraping)
pip install dnspython
Step 2 — Your First Scraper
Let us build a scraper step by step. We will fetch a web page, inspect its HTML structure and extract specific data using BeautifulSoup's powerful CSS selector syntax.
Fetch the page HTML
Use requests.get(url) to download the raw HTML. Always set a User-Agent header — without it many sites return a 403 error or serve a bot-detection page.
Parse with BeautifulSoup
Pass the HTML to BeautifulSoup(html, "lxml"). This creates a parse tree you can navigate like a Python object. Always use the lxml parser — it is faster and more lenient than the built-in html.parser.
Find elements with CSS selectors
Use soup.select("css-selector") to find elements. Right-click any element in your browser and choose "Inspect" to see its HTML and figure out the right selector.
Extract and clean the text
Use .get_text(strip=True) to get clean text from any element, or tag["attribute"] to get attribute values like href or src.
import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the page
url = "https://books.toscrape.com/"  # A free, legal scraping practice site
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # Raise error if request failed

# Step 2: Parse HTML
soup = BeautifulSoup(response.text, "lxml")

# Step 3: Find all book elements
books = soup.select("article.product_pod")
print(f"Found {len(books)} books on this page\n")

# Step 4: Extract data from each book
for book in books[:5]:  # First 5 only
    title = book.select_one("h3 a")["title"]
    price = book.select_one("p.price_color").get_text(strip=True)
    rating = book.select_one("p.star-rating")["class"][1]  # "Three", "Four" etc.
    avail = book.select_one("p.availability").get_text(strip=True)
    print(f"📚 {title}")
    print(f"   💰 {price}  ⭐ {rating}  📦 {avail}\n")
Step 3 — CSS Selectors Cheat Sheet
CSS selectors are the most powerful way to target elements in BeautifulSoup. Here are the patterns you will use 90% of the time:
# ── By tag name ────────────────────────────────────
soup.select("h1") # All <h1> tags
soup.select_one("title") # First <title> tag
# ── By class ───────────────────────────────────────
soup.select(".product-card") # All elements with class="product-card"
soup.select("div.container") # <div> elements with class="container"
# ── By ID ──────────────────────────────────────────
soup.select_one("#main-content") # Element with id="main-content"
# ── By attribute ───────────────────────────────────
soup.select("a[href]") # All <a> tags with an href attribute
soup.select("a[href^='https']") # href starting with "https"
soup.select("img[src*='.jpg']") # src containing ".jpg"
# ── Nested selectors ───────────────────────────────
soup.select("table tr td") # <td> inside <tr> inside <table>
soup.select("ul.menu > li") # Direct <li> children of <ul.menu>
# ── Extracting data ────────────────────────────────
el.get_text(strip=True) # Clean inner text
el.get_text(separator=" ") # Text with spaces between tags
el["href"] # Value of href attribute
el.get("src", "") # Safe attribute access (no KeyError)
el.find_parent("div") # Navigate up to parent <div>
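To try these selectors without fetching anything, you can run them against a small inline snippet. The markup below is invented purely for illustration:

```python
from bs4 import BeautifulSoup

# A tiny invented page to exercise the selectors above
html = """
<div id="main-content">
  <ul class="menu">
    <li><a href="https://example.com/a">Link A</a></li>
    <li><a href="/relative/b">Link B</a></li>
  </ul>
  <img src="photo.jpg">
</div>
"""
soup = BeautifulSoup(html, "html.parser")  # built-in parser avoids the lxml dependency for this offline demo

print(soup.select_one("#main-content")["id"])   # main-content
print(len(soup.select("ul.menu > li")))          # 2 — direct <li> children
print(len(soup.select("a[href^='https']")))      # 1 — only the absolute https link
print(soup.select_one("img").get("src", ""))     # photo.jpg
```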
Step 4 — Handling Pagination
Most real-world scraping targets span multiple pages. There are two common patterns: next-page links (follow the "Next" button) and numbered URLs (increment a page number in the URL). Here is how to handle both:
import requests
import time
import random
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://books.toscrape.com/"
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def scrape_all_pages():
    all_books = []
    current_url = BASE_URL
    while current_url:
        print(f"📄 Scraping: {current_url}")
        response = requests.get(current_url, headers=HEADERS, timeout=10)
        soup = BeautifulSoup(response.text, "lxml")
        # Extract books from this page
        for book in soup.select("article.product_pod"):
            all_books.append({
                "title": book.select_one("h3 a")["title"],
                "price": book.select_one("p.price_color").get_text(strip=True),
            })
        # Find the "next" page link
        next_btn = soup.select_one("li.next a")
        if next_btn:
            # Build absolute URL from relative href
            current_url = urljoin(current_url, next_btn["href"])
            time.sleep(random.uniform(1.0, 2.5))  # Polite delay
        else:
            current_url = None  # No more pages — stop loop
    print(f"\n✅ Total books scraped: {len(all_books)}")
    return all_books

books = scrape_all_pages()
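The second pattern, numbered URLs, needs no "Next" link at all: you format a page number into a URL template and stop when a page 404s or comes back empty. A sketch of the URL-building half; the template below mirrors books.toscrape.com's real catalogue URLs, but adapt it to your target:

```python
def page_urls(template: str, start: int, stop: int) -> list[str]:
    """Build the list of page URLs for a numbered-pagination site."""
    return [template.format(n=n) for n in range(start, stop + 1)]

urls = page_urls("https://books.toscrape.com/catalogue/page-{n}.html", 1, 3)
for url in urls:
    print(url)
    # ...fetch, parse, and sleep here, exactly as in scrape_all_pages().
    # In practice, stop early when a request returns 404 or a page has no items.
```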
Step 5 — Exporting Data to CSV
Once you have your data in a Python list of dictionaries, exporting to CSV takes just a few lines. CSV is the universal format for scraped data — it opens in Excel, Google Sheets and imports into every database and CRM.
import csv
from datetime import datetime

def save_to_csv(data: list, filename: str = None):
    if not data:
        print("No data to save.")
        return
    if not filename:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"scraped_data_{timestamp}.csv"
    fieldnames = list(data[0].keys())
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)
    print(f"💾 Saved {len(data)} rows to {filename}")

# Example data
results = [
    {"title": "Book One", "price": "£12.99", "rating": "Four"},
    {"title": "Book Two", "price": "£9.49", "rating": "Three"},
]
save_to_csv(results, "books.csv")
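The same list of dictionaries exports to JSON just as easily, which helps when your data is nested or headed for an API. A sketch mirroring save_to_csv (the save_to_json helper is our own, not part of any library):

```python
import json

def save_to_json(data: list, filename: str) -> None:
    """Write scraped rows to a JSON file (UTF-8, human-readable)."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    print(f"💾 Saved {len(data)} rows to {filename}")

save_to_json([
    {"title": "Book One", "price": "£12.99", "rating": "Four"},
    {"title": "Book Two", "price": "£9.49", "rating": "Three"},
], "books.json")
```

`ensure_ascii=False` keeps characters like £ readable instead of escaping them to `\u00a3`.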
Step 6 — Scraping JavaScript Pages with Playwright
Many modern websites are JavaScript-rendered SPAs — the HTML you get from requests is an empty shell, and the actual content is loaded by JavaScript after the page loads. For these you need a real browser. Playwright is the modern choice — it is generally faster and more reliable than Selenium.
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_with_playwright(url: str) -> str:
    """Fetch fully rendered HTML from a JavaScript-heavy page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # headless=False to see the browser
        page = browser.new_page()
        # Set a real user agent
        page.set_extra_http_headers({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
        })
        page.goto(url, wait_until="networkidle")  # Wait for all JS to finish
        # Optional: wait for a specific element to appear
        page.wait_for_selector(".product-list", timeout=10000)
        html = page.content()  # Get fully rendered HTML
        browser.close()
        return html

# Use exactly like requests — pass the HTML to BeautifulSoup
html = scrape_with_playwright("https://example-spa-site.com/products")
soup = BeautifulSoup(html, "lxml")

# Extract data as normal
products = soup.select(".product-card")
print(f"Found {len(products)} products")
Playwright — Useful Interactions
# Click a button
page.click("button.load-more")
# Fill and submit a form
page.fill("input[name='search']", "python books")
page.press("input[name='search']", "Enter")
# Scroll to bottom to trigger lazy loading
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_timeout(2000) # Wait 2 seconds for content to load
# Take a screenshot for debugging
page.screenshot(path="screenshot.png", full_page=True)
# Handle login before scraping
page.goto("https://example.com/login")
page.fill("#email", "user@example.com")
page.fill("#password", "mypassword")
page.click("button[type='submit']")
page.wait_for_url("**/dashboard") # Wait for redirect after login
Step 7 — Avoiding Blocks and Detection
Websites use various techniques to detect and block scrapers. Here is a complete anti-detection toolkit:
import requests
import time
import random
from fake_useragent import UserAgent

ua = UserAgent()

def get_headers() -> dict:
    """Generate realistic browser headers for each request."""
    return {
        "User-Agent": ua.random,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }

PROXIES = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
]

def polite_get(url: str, use_proxy: bool = False) -> requests.Response:
    """Fetch a URL with anti-detection measures."""
    # Random human-like delay
    time.sleep(random.uniform(1.5, 4.0))
    kwargs = {"headers": get_headers(), "timeout": 15}
    if use_proxy and PROXIES:
        proxy = random.choice(PROXIES)
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    return requests.get(url, **kwargs)

# For Cloudflare-protected sites:
# pip install cloudscraper
# import cloudscraper
# scraper = cloudscraper.create_scraper()
# response = scraper.get(url)
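One more layer worth adding: retries with exponential backoff, so a transient 429 or 503 does not kill the whole run. A generic sketch; the helper accepts any zero-argument function, so nothing here touches the network:

```python
import time
import random

def retry_with_backoff(fn, retries: int = 3, base_delay: float = 1.0):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # Out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Retry {attempt + 1}/{retries} after {delay:.1f}s")
            time.sleep(delay)

# Usage with the polite_get helper above (not executed here):
#   response = retry_with_backoff(lambda: polite_get(url))
```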
Step 8 — Full Reusable Scraper Template
Here is a production-ready scraper template that combines everything — robust fetching, BeautifulSoup parsing, error handling, rate limiting and CSV export — ready to adapt for any target site:
import requests, csv, time, random, os
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from datetime import datetime
from fake_useragent import UserAgent

ua = UserAgent()

# ── CONFIG ───────────────────────────────────────────
START_URL = "https://books.toscrape.com/"
OUTPUT_FILE = f"output/books_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
MAX_PAGES = 50     # Limit pages scraped
DELAY_MIN = 1.0    # Min seconds between requests
DELAY_MAX = 3.0    # Max seconds between requests
# ─────────────────────────────────────────────────────

def fetch(url: str) -> BeautifulSoup | None:
    try:
        time.sleep(random.uniform(DELAY_MIN, DELAY_MAX))
        r = requests.get(url,
                         headers={"User-Agent": ua.random},
                         timeout=15)
        r.raise_for_status()
        return BeautifulSoup(r.text, "lxml")
    except Exception as e:
        print(f"  ⚠️ Failed: {url} — {e}")
        return None

def parse_page(soup: BeautifulSoup) -> list:
    """Extract structured data from one page — CUSTOMIZE THIS."""
    items = []
    for book in soup.select("article.product_pod"):
        items.append({
            "title": book.select_one("h3 a")["title"],
            "price": book.select_one("p.price_color").get_text(strip=True),
            "rating": book.select_one("p.star-rating")["class"][1],
            "scraped_at": datetime.now().isoformat(),
        })
    return items

def run():
    os.makedirs("output", exist_ok=True)
    all_data = []
    current_url = START_URL
    page_count = 0
    while current_url and page_count < MAX_PAGES:
        page_count += 1
        print(f"[{page_count}] {current_url}")
        soup = fetch(current_url)
        if not soup:
            break
        items = parse_page(soup)
        all_data.extend(items)
        print(f"  ✅ Extracted {len(items)} items (total: {len(all_data)})")
        next_btn = soup.select_one("li.next a")
        current_url = urljoin(current_url, next_btn["href"]) if next_btn else None
    # Save to CSV
    if all_data:
        with open(OUTPUT_FILE, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(all_data[0].keys()))
            writer.writeheader()
            writer.writerows(all_data)
        print(f"\n💾 Saved {len(all_data)} rows → {OUTPUT_FILE}")

if __name__ == "__main__":
    run()
Best Practices Summary
- Always build absolute URLs from relative hrefs — use `urljoin(base_url, href)`
- Add delays between requests — use `time.sleep(random.uniform(1, 3))` to avoid rate limiting
- Rotate user agents — use the `fake-useragent` library for a different UA on each request
- Cache raw HTML locally — save fetched HTML to disk so you can rerun parsing without re-crawling
- Use `raise_for_status()` — always check that the request succeeded before parsing
- Handle `None` selectors gracefully — use `el.select_one(".cls")` and check `if el:` before accessing `.get_text()`
- Respect `robots.txt` — use Python's built-in `urllib.robotparser` to check before scraping
- Use a `Session` object — `requests.Session()` reuses connections and cookies across requests, improving speed
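The caching tip deserves a sketch of its own: key each cache file on a hash of the URL and hit the network only on a miss. The cached_fetch helper and cache directory name below are our own inventions; the fetch function is injected, so the pattern works with requests, httpx or Playwright alike:

```python
import hashlib
from pathlib import Path

def cached_fetch(url: str, fetch_fn, cache_dir: str = "html_cache") -> str:
    """Return HTML for url, reading from a local disk cache when possible."""
    cache = Path(cache_dir)
    cache.mkdir(exist_ok=True)
    # One file per URL, named by a hash so any URL is a valid filename
    path = cache / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")  # Cache hit: no network call
    html = fetch_fn(url)  # Cache miss: e.g. requests.get(url).text
    path.write_text(html, encoding="utf-8")
    return html

# Usage with requests:
#   html = cached_fetch(url, lambda u: requests.get(u, timeout=15).text)
```

Rerunning your parser is now free: only the first run touches the site, and every later run reads the saved HTML from disk.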