Scrape Content from Dynamic Websites
Scraping from dynamic websites means extracting data from pages where content is loaded or updated using JavaScript after the initial HTML is delivered. Unlike static websites, dynamic ones require tools that can execute JavaScript to access the fully rendered content. A simple requests + BeautifulSoup setup can't handle this, since requests only fetches the raw HTML and BeautifulSoup only parses what it is given. Tools like Selenium or Playwright drive a real browser, execute JavaScript, and allow scraping of dynamically generated elements.
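To see the gap concretely, here is a minimal sketch (the URL and class name are placeholders, not the page used later) of what a plain requests + BeautifulSoup scraper receives: only the server's initial HTML, with none of the JavaScript-rendered elements.
import requests
from bs4 import BeautifulSoup

# Placeholder URL for illustration: imagine a page that renders its listings with JavaScript
response = requests.get("https://example.com/js-rendered-listings")

# Only the HTML returned by the server is parsed here, before any JavaScript runs,
# so elements injected client-side will not appear in the parsed tree at all
soup = BeautifulSoup(response.text, "html.parser")
print(soup.find_all("a", class_="job-listing"))  # Typically prints [] on a dynamic page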
Let's understand dynamic website scraping through an example project.
In this project, we are automating the process of extracting job profile links from the "Top Jobs by Designation" page on Naukri Gulf. Using Selenium, we load the webpage, wait for specific elements to load, and fetch its HTML content. BeautifulSoup then parses the HTML to extract and display the top 10 job profiles listed on the page. We'll start by installing selenium:
Install Selenium
Before using Selenium, we need to install it in our Python environment. This can be done using the pip package manager.
Run the following command in a notebook or terminal:
pip install selenium
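The walkthrough below also relies on BeautifulSoup for parsing, so install it as well if it is not already in your environment:
pip install beautifulsoup4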
Download the WebDriver
The WebDriver is a browser-specific driver that Selenium uses to control the browser. For Chrome:
- Download the ChromeDriver version compatible with your browser from Chromedriver Downloads.
- Extract it and note the path where it's saved. You'll need this path in the code.
- Only after completing these installations should you proceed to import the libraries and write the rest of the script.
You can copy and paste the following links into your browser to find the driver version that matches your browser:
- Chrome: https://developer.chrome.com/docs/chromedriver/downloads
- Firefox: https://github.com/mozilla/geckodriver/releases
- Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/
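As a side note, if you are on Selenium 4.6 or newer, the bundled Selenium Manager can usually download a matching driver for you, so the manual download can often be skipped; a minimal sketch, assuming such a version is installed:
from selenium import webdriver

# Selenium Manager (bundled with Selenium 4.6+) resolves a matching ChromeDriver automatically,
# so no explicit driver path is needed in this case
driver = webdriver.Chrome()
driver.quit()
The rest of this article keeps the explicit driver path so it also works with older Selenium versions.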
Step 1: Importing Required Libraries
- selenium is used for browser automation to load the webpage and interact with its dynamic content.
- BeautifulSoup from bs4 is used to parse and extract specific elements from the loaded HTML.
- time is imported so we can add fixed delays if needed, though it is not used directly in this script.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
Explanation:
- selenium.webdriver: Controls the browser for automation.
- Service and Options: Manage ChromeDriver and configure browser options.
- By, WebDriverWait, expected_conditions: Help locate elements and wait for them to load dynamically.
- BeautifulSoup: Parses HTML to extract specific content.
- time: Allows us to add delays if necessary.
Step 2: Setting Up Chrome Options
Here, Chrome options are defined to customize the browser behavior.
What this does:
- --headless: Runs Chrome without opening a visible browser window, making it faster and more resource-efficient.
- --disable-gpu: Disables GPU acceleration, which is unnecessary for scraping.
- user-agent: Makes the request look like it comes from a real user's browser, reducing the chance of being blocked.
chrome_options = Options()
chrome_options.add_argument("--headless") # Run without GUI
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.6778.265 Safari/537.36"
)
Step 3: Specifying the Path to ChromeDriver
The Selenium WebDriver requires a browser driver (ChromeDriver in this case) to interface with the browser. This step sets the path to the ChromeDriver executable.
service = Service(r"C:\Users\GFG19702\Downloads\chromedriver-win64\chromedriver.exe") # Update the path
Step 4: Initialize WebDriver
We create a WebDriver instance with the configured options and service.
What this does:
- Launches the Chrome browser with the specified configurations.
- Ensures that it operates headlessly (in the background) and uses the provided user-agent.
driver = webdriver.Chrome(service=service, options=chrome_options)
Step 5: Open the Target Webpage
We navigate to the webpage containing the job profiles.
What this does:
- The get() method loads the webpage.
- This is the initial step to interact with the target page.
url = "https://www.naukrigulf.com/top-jobs-by-designation"
driver.get(url)
Step 6: Wait for Dynamic Content to Load
Dynamic content might take time to load. We ensure the target elements are present before proceeding.
What this does:
- WebDriverWait: Waits for up to 30 seconds.
- presence_of_element_located: Ensures an element with the class name soft-link is present in the DOM.
- This step ensures the JavaScript-rendered content is fully loaded before extraction.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CLASS_NAME, "soft-link"))
)
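For comparison, a fixed delay with time.sleep() (the reason time is imported at the top) is a rough alternative, but it either wastes time or fails when the page loads more slowly than expected, which is why the explicit wait above is preferred.
import time

time.sleep(10)  # Naive alternative: always waits 10 seconds, regardless of when the content actually appears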
Step 7: Get the Page Source
We retrieve the HTML content of the loaded webpage.
What this does:
- page_source: Gets the fully rendered HTML after JavaScript execution.
html = driver.page_source
Step 8: Parsing the HTML with BeautifulSoup
The HTML source is passed to BeautifulSoup, which converts it into a structured format, making it easy to extract specific data.
soup = BeautifulSoup(html, "html.parser")
Step 9: Extracting Job Profiles
Using BeautifulSoup, the section containing job profiles (links with class soft-link darker) is identified. The top 10 job profiles are then extracted and printed.
job_profiles_section = soup.find_all('a', class_='soft-link darker')
print("Top Job Profiles:")
for i, job in enumerate(job_profiles_section[:10], start=1):  # Limit to top 10
    print(f"{i}. {job.text.strip()}")
Output: the heading "Top Job Profiles:" followed by a numbered list of up to 10 job profile titles (the exact titles depend on the live page).
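Since the goal stated at the beginning is to extract job profile links, you can also pull each anchor's URL alongside its title; a small sketch reusing the same parsed results (whether the href values are relative or absolute depends on the page's markup):
for i, job in enumerate(job_profiles_section[:10], start=1):
    link = job.get("href", "")  # href may be relative to the site root, depending on the page
    print(f"{i}. {job.text.strip()} -> {link}")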
Step 10: Closing the WebDriver
To ensure resources are freed, the WebDriver is closed at the end of the process.
driver.quit()
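In a longer script it is safer to put the quit() call in a finally block so it runs even if something fails mid-scrape, which is the pattern the combined code below follows; a minimal sketch:
try:
    ...  # scraping steps from above
finally:
    driver.quit()  # Runs even if an exception occurred during scraping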
Combined Code Example:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
# Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless") # Run without GUI
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.6778.265 Safari/537.36"
)
# Path to Chromedriver
service = Service(r"C:\Users\GFG19702\Downloads\chromedriver-win64\chromedriver.exe") # Update the path
# Initialize WebDriver before the try block so that `driver` is always defined
# when the finally clause calls driver.quit()
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    # Open the Naukri Gulf "Top Jobs by Designation" page
    url = "https://www.naukrigulf.com/top-jobs-by-designation"
    driver.get(url)

    # Wait for the required section to load (with the correct class name)
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CLASS_NAME, "soft-link"))
    )

    # Get the page source
    html = driver.page_source

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")

    # Locate the anchors containing job titles
    job_profiles_section = soup.find_all('a', class_='soft-link darker')

    # Extract and print top job profiles
    print("Top Job Profiles:")
    for i, job in enumerate(job_profiles_section[:10], start=1):  # Limit to top 10
        print(f"{i}. {job.text.strip()}")
finally:
    # Close the WebDriver to free resources
    driver.quit()
Output: the same numbered list of up to 10 job profile titles printed to the console, as in Step 9.