Scrape Content from Dynamic Websites
Scraping from dynamic websites means extracting data from pages where content is loaded or updated using JavaScript after the initial HTML is delivered. Unlike static websites, dynamic ones require tools that can execute JavaScript to access the fully rendered content. A simple requests + BeautifulSoup setup can't handle this, since requests only fetches the raw HTML and BeautifulSoup only parses what it is given. Tools like Selenium or Playwright drive a real browser, execute JavaScript, and allow scraping of dynamically generated elements.
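To see the gap concretely, here is a minimal sketch (the URL and class name are placeholders, not the page used later) of what a plain requests + BeautifulSoup scraper receives: only the server's initial HTML, with none of the JavaScript-rendered elements.
import requests
from bs4 import BeautifulSoup

# Placeholder URL for illustration: imagine a page that renders its listings with JavaScript
response = requests.get("https://example.com/js-rendered-listings")

# Only the HTML returned by the server is parsed here, before any JavaScript runs,
# so elements injected client-side will not appear in the parsed tree at all
soup = BeautifulSoup(response.text, "html.parser")
print(soup.find_all("a", class_="job-listing"))  # Typically prints [] on a dynamic page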
Let's understand dynamic website scraping through an example project.
In this project, we are automating the process of extracting job profile links from the "Top Jobs by Designation" page on Naukri Gulf. Using Selenium, we load the webpage, wait for specific elements to load, and fetch its HTML content. BeautifulSoup then parses the HTML to extract and display the top 10 job profiles listed on the page. We'll start by installing selenium:
Install Selenium
Before using Selenium, we need to install it in our Python environment. This can be done using the pip package manager.
Run the following command in a notebook or terminal:
pip install selenium
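The walkthrough below also relies on BeautifulSoup for parsing, so install it as well if it is not already in your environment:
pip install beautifulsoup4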
Download the WebDriver
The WebDriver is a browser-specific driver that Selenium uses to control the browser. For Chrome:
- Download the ChromeDriver version compatible with your browser from Chromedriver Downloads.
- Extract it and note the path where it's saved. You'll need this path in the code.
- Only after completing these installations should you proceed to import the libraries and write the rest of the script.
You can copy and paste the following links into your browser to find the driver version that matches your browser:
- Chrome: https://developer.chrome.com/docs/chromedriver/downloads
- Firefox: https://github.com/mozilla/geckodriver/releases
- Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
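As a side note, if you are on Selenium 4.6 or newer, the bundled Selenium Manager can usually download a matching driver for you, so the manual download can often be skipped; a minimal sketch, assuming such a version is installed:
from selenium import webdriver

# Selenium Manager (bundled with Selenium 4.6+) resolves a matching ChromeDriver automatically,
# so no explicit driver path is needed in this case
driver = webdriver.Chrome()
driver.quit()
The rest of this article keeps the explicit driver path so it also works with older Selenium versions.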
Step 1: Importing Required Libraries
- selenium is used for browser automation to load the webpage and interact with its dynamic content.
- BeautifulSoup from bs4 is used to parse and extract specific elements from the loaded HTML.
- time is imported so we can add fixed delays if needed, though it is not used directly in this script.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
Explanation:
- selenium.webdriver: Controls the browser for automation.
- Service and Options: Manage ChromeDriver and configure browser options.
- By, WebDriverWait, expected_conditions: Help locate elements and wait for them to load dynamically.
- BeautifulSoup: Parses HTML to extract specific content.
- time: Allows us to add delays if necessary.
Step 2: Setting Up Chrome Options
Here, Chrome options are defined to customize the browser behavior.
What this does:
- --headless: Runs Chrome without opening a visible browser window, making it faster and more resource-efficient.
- --disable-gpu: Disables GPU acceleration, which is unnecessary for scraping.
- user-agent: Makes the request look like it comes from a real user's browser, reducing the chance of being blocked.
chrome_options = Options()
chrome_options.add_argument("--headless") # Run without GUI
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.6778.265 Safari/537.36"
)
Step 3: Specifying the Path to ChromeDriver
The Selenium WebDriver requires a browser driver (ChromeDriver in this case) to interface with the browser. This step sets the path to the ChromeDriver executable.
service = Service(r"C:\Users\GFG19702\Downloads\chromedriver-win64\chromedriver.exe") # Update the path
Step 4: Initialize WebDriver
We create a WebDriver instance with the configured options and service.
What this does:
- Launches the Chrome browser with the specified configurations.
- Ensures that it operates headlessly (in the background) and uses the provided user-agent.
driver = webdriver.Chrome(service=service, options=chrome_options)
Step 5: Open the Target Webpage
We navigate to the webpage containing the job profiles.
What this does:
- The get() method loads the webpage.
- This is the initial step to interact with the target page.
url = "https://www.naukrigulf.com/top-jobs-by-designation"
driver.get(url)
Step 6: Wait for Dynamic Content to Load
Dynamic content might take time to load. We ensure the target elements are present before proceeding.
What this does:
- WebDriverWait: Waits for up to 30 seconds.
- presence_of_element_located: Ensures an element with the class name soft-link is present in the DOM.
- This step ensures the JavaScript-rendered content is fully loaded before extraction.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CLASS_NAME, "soft-link"))
)
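For comparison, a fixed delay with time.sleep() (the reason time is imported at the top) is a rough alternative, but it either wastes time or fails when the page loads more slowly than expected, which is why the explicit wait above is preferred.
import time

time.sleep(10)  # Naive alternative: always waits 10 seconds, regardless of when the content actually appears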
Step 7: Get the Page Source
We retrieve the HTML content of the loaded webpage.
What this does:
- page_source: Gets the fully rendered HTML after JavaScript execution.
html = driver.page_source
Step 8: Parsing the HTML with BeautifulSoup
The HTML source is passed to BeautifulSoup, which converts it into a structured format, making it easy to extract specific data.
soup = BeautifulSoup(html, "html.parser")
Step 9: Extracting Job Profiles
Using BeautifulSoup, the section containing job profiles (links with class soft-link darker) is identified. The top 10 job profiles are then extracted and printed.
job_profiles_section = soup.find_all('a', class_='soft-link darker')
print("Top Job Profiles:")
for i, job in enumerate(job_profiles_section[:10], start=1):  # Limit to top 10
    print(f"{i}. {job.text.strip()}")
Output: the heading "Top Job Profiles:" followed by a numbered list of up to 10 job profile titles (the exact titles depend on the live page).
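Since the goal stated at the beginning is to extract job profile links, you can also pull each anchor's URL alongside its title; a small sketch reusing the same parsed results (whether the href values are relative or absolute depends on the page's markup):
for i, job in enumerate(job_profiles_section[:10], start=1):
    link = job.get("href", "")  # href may be relative to the site root, depending on the page
    print(f"{i}. {job.text.strip()} -> {link}")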
Step 10: Closing the WebDriver
To ensure resources are freed, the WebDriver is closed at the end of the process.
driver.quit()
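In a longer script it is safer to put the quit() call in a finally block so it runs even if something fails mid-scrape, which is the pattern the combined code below follows; a minimal sketch:
try:
    ...  # scraping steps from above
finally:
    driver.quit()  # Runs even if an exception occurred during scraping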
Combined Code Example:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
# Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless") # Run without GUI
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.6778.265 Safari/537.36"
)
# Path to Chromedriver
service = Service(r"C:\Users\GFG19702\Downloads\chromedriver-win64\chromedriver.exe") # Update the path
# Initialize WebDriver before the try block so that `driver` is always defined
# when the finally clause calls driver.quit()
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    # Open the Naukri Gulf "Top Jobs by Designation" page
    url = "https://www.naukrigulf.com/top-jobs-by-designation"
    driver.get(url)

    # Wait for the required section to load (with the correct class name)
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CLASS_NAME, "soft-link"))
    )

    # Get the page source
    html = driver.page_source

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")

    # Locate the anchors containing job titles
    job_profiles_section = soup.find_all('a', class_='soft-link darker')

    # Extract and print top job profiles
    print("Top Job Profiles:")
    for i, job in enumerate(job_profiles_section[:10], start=1):  # Limit to top 10
        print(f"{i}. {job.text.strip()}")
finally:
    # Close the WebDriver to free resources
    driver.quit()
Output: the same numbered list of up to 10 job profile titles printed to the console, as in Step 9.