Blog

Selenium introduction: Web scraping tools in practice

Dominik Vach
CTO, Co-founder
Tutorial
July 19, 2023

In our data-driven world, the ability to extract and analyze information from the web is an essential skill for many roles, ranging from data scientists to software developers. There's a whole universe of data out there waiting to be harvested! This practical guide is designed to equip you with the ability to harness this information effectively using Selenium, a powerful tool for web scraping. It is the fourth article in the series “Web scraping tools in practice”.

Introduction to Web Scraping and Selenium

Web scraping, or web harvesting, involves extracting data from websites. This data could be anything - blog posts, articles, prices of goods, customer reviews, and more. Web scraping allows us to transform the vast ocean of web data into structured datasets that we can analyze and utilize.

Selenium is a web testing library used to automate browser activities. While it's predominantly used for testing purposes in the software industry, it's also an extremely effective tool for web data extraction. Originally developed in 2004, it has stood the test of time, evolving into a robust and versatile framework for various web-based applications.

Selenium web scraping

How is Selenium different from Scrapy, Beautiful Soup, and Puppeteer?

Before we dive into details, let's briefly discuss how it's different from other popular web scraping tools, such as Scrapy, Beautiful Soup, and Puppeteer.

Scrapy and Beautiful Soup, both Python-based, are fantastic tools for static web scraping. Scrapy fetches the HTML content of web pages for you, while Beautiful Soup parses HTML you've already fetched, letting you extract the data you need and discard the rest. However, neither executes JavaScript, so dynamically rendered content is invisible to them - and that's where Selenium shines.
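For contrast, here is roughly what static scraping looks like with requests and Beautiful Soup (a minimal sketch; it only sees the HTML the server returns, before any JavaScript runs):

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML of the page (no JavaScript is executed)
response = requests.get("https://www.forloop.ai/blog")
soup = BeautifulSoup(response.text, "html.parser")

# Parse whatever is present in the static markup
for heading in soup.find_all("h4"):
    print(heading.get_text(strip=True))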

Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium browsers. It also interacts with the webpage like a user and can handle JavaScript. While Puppeteer is similar to Selenium, it lacks Selenium's widespread community support, compatibility with various programming languages, and comprehensive testing capabilities.

Selenium interacts with a webpage just like a human user would - clicking buttons, scrolling through pages, filling out forms, and more. It can handle dynamic content loaded by JavaScript, which is commonly found on modern websites. This makes it a powerful tool when the data you need is embedded in interactive elements or only loads under certain conditions.
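To make that concrete, here is a small sketch of those user-like interactions (the URL and the field name "q" are placeholders, not a real page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("https://example.com/search")  # placeholder URL

# Type a query into a form field and submit it, like a user would
search_box = driver.find_element(By.NAME, "q")  # hypothetical field name
search_box.send_keys("web scraping", Keys.ENTER)

# Scroll to the bottom of the page to trigger lazy-loaded content
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

driver.quit()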

In summary, while all these tools have their place, Selenium stands out for its versatility and ability to handle dynamic web content.

Installing and Setting Up Selenium

Now, let's get to the exciting part—installing and setting up Selenium for your Python environment. Here are the steps:

  1. Python: Make sure you have Python installed. You can download it from the official Python website.
  2. Selenium: Install Selenium by running the following command in your command prompt or terminal: pip install selenium
  3. WebDriver: Selenium requires a driver to interface with your chosen browser. Firefox requires geckodriver, which needs to be installed before running your script. Chrome requires chromedriver. You can download these from the official Selenium documentation. Make sure to add the location of your driver to your system's PATH.

Note: Personally, I work with Ubuntu 22.04 and Firefox, and to make it work I needed to follow the steps in this tutorial.
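If the driver isn't on your PATH, you can also point Selenium at the binary explicitly via the Service class (Selenium 4+; the path below is just an example):

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Tell Selenium where geckodriver lives instead of relying on PATH
service = Service(executable_path="/usr/local/bin/geckodriver")  # example path
driver = webdriver.Firefox(service=service)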

That's it! You are now ready to scrape the web with Selenium.

Navigating and Interacting with a website using Selenium

To understand how Selenium works, let's use it to navigate and interact with a website. In this case, we'll use the blog section of the forloop.ai website.

Let's start by importing the necessary Selenium functions and opening a webpage:

from selenium import webdriver

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()

# Go to the forloop.ai blog page
driver.get("https://www.forloop.ai/blog")

At this point, your Python script will open a new Firefox window and navigate to the provided URL (see the picture below).

First lines of code working with Selenium.

Interacting with the page is just as easy. Let's say we want to click the first article on the page. We could do it like this:

# Locating elements requires the By helper
from selenium.webdriver.common.by import By

# Find the first article link
first_article = driver.find_element(By.CLASS_NAME, "article-item")

# Click the article link
first_article.click()

This script will find the first element on the page with the class name 'article-item' and click it, taking you to the article page.

Locating and extracting all articles from Forloop’s website

Now that we know how to interact with a webpage, let's extract some useful information. We'll fetch all article titles and their corresponding URLs from the forloop.ai blog page.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()

# Go to the forloop.ai blog page
driver.get("<https://www.forloop.ai/blog>")

# Find all article items on the page
articles = driver.find_elements(By.CSS_SELECTOR, "div.article-item")

# Extract the title, tag, and date from each article
for item in articles:
    title = item.find_element(By.CSS_SELECTOR, "h4").text
    tag = item.find_element(By.CSS_SELECTOR, "div.text-white").text
    date = item.find_element(By.CSS_SELECTOR, "div.blog-post-date").text
    print(f'Title: {title}\nTag: {tag}\nDate: {date}\n---')

driver.quit()

This script will print the title, tag, and date of each article on the page. Amazing, isn't it? With just a few lines of Python code and Selenium, you can start extracting valuable data from the web.

Final results of the web scraping using Selenium
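If you want to keep the results, you can write them to a CSV file instead of printing them. A minimal sketch that continues the script above (run it before driver.quit(); the file name is arbitrary):

import csv

# Collect each article's fields into rows
rows = []
for item in articles:
    rows.append([
        item.find_element(By.CSS_SELECTOR, "h4").text,
        item.find_element(By.CSS_SELECTOR, "div.text-white").text,
        item.find_element(By.CSS_SELECTOR, "div.blog-post-date").text,
    ])

# Write the rows out with a header line
with open("articles.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "tag", "date"])
    writer.writerows(rows)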

Handling Common Challenges and Errors in Web Scraping

Web scraping is an art, and like any art, it requires practice. You'll likely encounter challenges and errors along the way. Here are some common issues and tips on how to handle them:

  1. Website Structure Changes: Websites change all the time, which can break your scraping code. Make sure to handle exceptions and regularly update your code to accommodate changes in website structure (see the retry sketch after this list).
  2. Dynamic Content: As we discussed, some content is loaded dynamically with JavaScript. Selenium can handle this, but you may need to use explicit waits to ensure that the content has loaded before you try to extract it.
  3. CAPTCHAs and Login Walls: Some websites have measures in place to prevent automated interaction, such as CAPTCHAs or mandatory login. In these cases, you might need to consider other methods of data collection.
  4. Legal and Ethical Considerations: Remember that not all data is free to scrape. Respect website terms of use, robots.txt files, and relevant laws.
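For the first point, a simple retry wrapper can keep transient failures from killing your run (a sketch; the retry count and back-off values are arbitrary choices):

import time

from selenium.common.exceptions import WebDriverException

def get_with_retries(driver, url, attempts=3):
    # Try to load the page, backing off after each failure
    for attempt in range(attempts):
        try:
            driver.get(url)
            return True
        except WebDriverException:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s between attempts
    return False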

Remember, every problem is an opportunity to learn more about web scraping and improve your skills!

5 Hot Tips & Tricks for Supercharging Your Web Scraping with Selenium

Running in Headless Mode

To run Firefox in headless mode, import Options from selenium.webdriver.firefox.options and add the --headless argument (the --headless=new variant is specific to Chrome). Here's how to do it:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Firefox(options=options)
driver.get("<https://www.forloop.ai/blog>")

driver.quit()

This script initiates a Firefox driver in headless mode and navigates to https://www.forloop.ai/blog.

Employ Explicit Waits

Explicit waits are a great way to ensure that your code only executes once the necessary web elements have fully loaded. This is particularly useful when dealing with websites that have a lot of dynamic content.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.forloop.ai/blog")

try:
    # Wait up to 10 seconds for the first article item to appear
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.article-item"))
    )
finally:
    driver.quit()

Combining Selenium with BeautifulSoup

Selenium is great for interacting with the webpage and dealing with dynamic content, but it can be a bit slow when parsing the HTML. If you're dealing with static content, consider using Selenium to load the page and BeautifulSoup to parse it:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://www.forloop.ai/blog")

# Hand the fully rendered HTML to BeautifulSoup for fast parsing
soup = BeautifulSoup(driver.page_source, "html.parser")
articles = soup.select("div.article-item")

driver.quit()

Implement Error Handling

Web scraping can often be unpredictable due to various factors such as website structure changes, network errors, etc. Implementing robust error handling can help your scraper recover from unexpected issues and continue running:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get("https://www.forloop.ai/blog")

try:
    element = driver.find_element(By.CSS_SELECTOR, "div.article-title")
except NoSuchElementException:
    print("Element not found")
finally:
    driver.quit()

Use Page Object Model (POM)

For large-scale projects, consider using the Page Object Model (POM). It is a design pattern that enhances test maintenance and reduces code duplication. Each webpage is represented as a class, and the web elements are represented as variables on the class.

from selenium import webdriver
from selenium.webdriver.common.by import By


class BlogPage:
    def __init__(self, driver):
        self.driver = driver
        self.articles = self.driver.find_elements(By.CSS_SELECTOR, 'div.article-item')

driver = webdriver.Firefox()
driver.get("https://www.forloop.ai/blog")

blog_page = BlogPage(driver)
print(len(blog_page.articles))  # prints the number of articles

Armed with these tips, tricks, and snippets of code, you're now better equipped to tackle more complex web scraping tasks with Selenium. As always, remember to scrape ethically and responsibly. Happy scraping!

Summary

In this comprehensive guide, we delved into the captivating world of web scraping with Selenium, a versatile and powerful tool for data extraction. Selenium stands out for its ability to interact with webpages just like a human user would. It's ideal for dealing with dynamic, JavaScript-heavy sites that other tools like BeautifulSoup or Scrapy may struggle with.

That said, Selenium isn't without its quirks. It can be slower than other scraping tools, given its focus on browser automation. It also requires a fair amount of coding, which may not appeal to everyone.

Yet, for those who are not afraid to get their hands dirty with some coding, Selenium provides an impressive level of control and flexibility. It excels at handling complex web scraping tasks, such as form submission, cookie handling, or AJAX request manipulation, that go beyond simple HTML parsing. Coupled with Python, it is a great choice for a wide range of use cases. Whether you're looking to gather data for data science projects, automate testing for your web application, or monitor competitors’ websites, Selenium can do it all.
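Cookie handling, for example, takes just a couple of calls (a minimal sketch; the cookie name and value are made up):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.forloop.ai/blog")

# Read all cookies the site has set for this session
for cookie in driver.get_cookies():
    print(cookie["name"], cookie["value"])

# Add a cookie of our own (name and value are made up)
driver.add_cookie({"name": "example_cookie", "value": "example_value"})

driver.quit()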

We hope this guide provided you with a solid foundation and some hot tricks for your web scraping endeavors. As with all things in tech, the best way to learn is by doing, so we encourage you to roll up your sleeves and start scraping.

Want to dive deeper? Have questions or experiences to share? Join us on Forloop's Slack. Our vibrant community of developers and data enthusiasts is always there to help and inspire.

Finally, don’t forget to check out our other articles and tutorials. We regularly share practical guides, tech insights, and tips just like this one. Subscribe to stay updated, and let’s keep learning together.