
A web scraping quick guide with a hands-on tutorial

Dominik Vach
CTO, Co-founder
Tutorial
January 25, 2023

Are you tired of manually collecting data from the internet? Are you looking for a more efficient way to gather information and make data-driven decisions? Web scraping may be the solution you've been searching for. The ability to access and analyze vast amounts of data is more valuable than ever, and web scraping is one of the most effective ways to do that. In this article, we will explore the basics of web scraping and walk through a sample use case.

What exactly is web scraping?

Web scraping, also known as web harvesting or web data extraction, is the process of extracting data from websites. The technique has been around for quite some time, and it has become increasingly popular in recent years as more and more companies look for ways to gain insights from the vast amount of data available on the internet.

So, how does it work? Essentially, web scraping involves using software or scripts to open a website and collect specific information automatically. This can include everything from text and images to links and pricing information. The information is then saved and can be analyzed to gain insights and make data-driven decisions.
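To make the idea concrete, here is a minimal sketch that uses only Python's standard library. The HTML snippet and the `LinkCollector` class name are assumptions made up for illustration; a real scraper would first download the page from a website.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this from a website.
SAMPLE_HTML = """
<html><body>
  <a href="/blog/first-post">First post</a>
  <a href="/blog/second-post">Second post</a>
  <p>No link here</p>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

The script opens the document, picks out only the pieces it was told to look for (here, link targets), and leaves everything else behind. Libraries such as Beautiful Soup offer a more convenient interface for the same job.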

Web scraping might be used to extract a list of articles from a website.

One of the major benefits of web scraping is that it allows companies to access data that would otherwise be difficult to obtain. For example, a company may use web scraping to collect information about competitors' prices, product offerings, or marketing strategies. This information can then be used to inform pricing decisions, product development, and marketing campaigns. Web scraping can also be used for more creative purposes. For example, a company may collect data on social media sentiment to gauge public opinion on a particular topic, or collect data on weather patterns to better predict demand for its products.

Note that not all websites allow web scraping, and some have terms of service that explicitly prohibit it. Always check a website's terms of service before scraping its data.
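Terms of service have to be read by a person, but many sites also publish a robots.txt file that states which paths automated clients may visit. As a complementary check (an addition here, not a substitute for reading the terms), Python's standard `urllib.robotparser` module can evaluate such rules. The rules and the example.com URLs below are made up for illustration and inlined so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) reports whether the rules permit the request.
allowed_blog = parser.can_fetch("*", "https://example.com/blog")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")
print(allowed_blog, allowed_private)
```

Against a live site, you would instead call `parser.set_url(...)` with the site's robots.txt address followed by `parser.read()`.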

What are the possible ways to web scrape a website?

With so many different ways to web scrape a website, it can sometimes be difficult to know where to start. Let’s explore the most popular methods for web scraping, focusing on online scraping services, programming frameworks, libraries, and APIs.

Web scraping might be a good solution in some cases.

Online scraping services

First, let's talk about online scraping services. These services, such as Simplescraper and Parsehub, offer a user-friendly interface and provide the data in a structured format for easy use. They are a great web scraping solution for businesses and individuals who need to extract data from websites quickly and cost-effectively. Their most common limitation is that they struggle with large amounts of data and with content on dynamic websites. The best-fitted user persona for this solution is therefore a small business owner or marketer who lacks technical expertise. A sample use case would be collecting the prices of competitors' products from their websites.

Programming frameworks and libraries

Next, we have programming frameworks and libraries, such as Scrapy, Beautiful Soup, and Selenium. These tools allow developers to customize the scraping process and extract specific data from a website. For developers with technical expertise, this can be a powerful solution. However, it can also require a significant investment of time and resources. It is the way to go for developers, engineers, and growth hackers, especially those who want a lot of freedom and customization. A sample use case would be extracting data from a real estate website to build a predictive model.

APIs

Lastly, we have APIs. An API (Application Programming Interface) allows developers to access data from a website in a structured format. This can be a more efficient and reliable way to access data, especially when the website owner controls the data and can ensure that it is accurate and up to date. However, not all websites offer APIs, and the data available through an API may be limited. The best-fitted user persona for this solution is a data analyst or data scientist who needs to control the data's structure. A good use case might be using social media APIs to gather posts for sentiment analysis.
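As a rough illustration of why structured API responses are convenient, the sketch below parses a JSON payload directly into Python objects, with no HTML parsing involved. The payload shape and field names are invented for this example; a real API defines its own schema, authentication, and rate limits.

```python
import json

# Hypothetical API response body; real APIs document their own schema.
API_RESPONSE = """
{
  "posts": [
    {"id": 1, "text": "Great product!", "likes": 12},
    {"id": 2, "text": "Not what I expected.", "likes": 3}
  ]
}
"""

payload = json.loads(API_RESPONSE)

# Fields are already named and typed, so selecting them is a plain lookup.
texts = [post["text"] for post in payload["posts"]]
total_likes = sum(post["likes"] for post in payload["posts"])

print(texts)
print(total_likes)
```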

Let’s web scrape a sample website

We have already discussed the basics of web scraping, so let’s put them into practice. In the following section, we will guide you through a quick process of extracting information from a website. For this purpose, we will use Python and Beautiful Soup to scrape articles from https://www.forloop.ai/blog. The end result will be a DataFrame containing the extracted articles.

Step 1: Install the required libraries

First, we need to install the required libraries. Open up your terminal and type in the following command to install Beautiful Soup:

pip install beautifulsoup4

You will also need the requests library, which allows us to send HTTP requests in Python, and pandas, which we will use to build the DataFrame:

pip install requests pandas

Step 2: Import the libraries and send an HTTP request

We will begin our web scraping by importing the libraries we need for this tutorial.

import requests
from bs4 import BeautifulSoup
import pandas as pd

Next, we need to send an HTTP request to the website we want to scrape. We will use the requests library to do this.

# Make a request to the website
url = 'https://www.forloop.ai/blog'
response = requests.get(url)
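The request above assumes the server answers promptly and successfully. A slightly more defensive variant, sketched below, sets a timeout and raises an exception on HTTP error codes. The helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration; `session` is expected to be a `requests.Session()` or anything with a compatible `get` method.

```python
def fetch_html(session, url, timeout=10.0):
    """Return the body of `url` as text, raising on HTTP errors.

    `session` is expected to behave like requests.Session: its get()
    must accept a `timeout` keyword and return an object that offers
    raise_for_status() and a `text` attribute.
    """
    response = session.get(url, timeout=timeout)
    # Turns 4xx/5xx responses into exceptions instead of silently
    # parsing an error page.
    response.raise_for_status()
    return response.text


# Typical usage (performs a real network request, so it is left commented out):
#
#   import requests
#   html = fetch_html(requests.Session(), "https://www.forloop.ai/blog")
```

Passing the session in as an argument keeps the helper easy to try out with a stand-in object, without touching the network.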

Step 3: Parse the HTML content and extract data

Once we have the HTML content, we can use Beautiful Soup to parse it.

# Parse the HTML content of the website
soup = BeautifulSoup(response.content, 'html.parser')

Now that we have the HTML content, we can use Beautiful Soup to extract the data we want. In this case, we want to extract the articles from the website. We can do this by finding the appropriate HTML tags and classes.

# Find all the article elements on the page
articles = soup.find_all("div", class_="article-item")

Step 4: Create a DataFrame

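Once we have located the article elements, we can collect each article's title, date, and link into a list of dictionaries. This list will become the rows of our DataFrame.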

# Create an empty list to store the data
data = []

# Iterate through each article element
for article in articles:
    title = article.find('h4').text
    date = article.find(class_='blog-post-date').text
    link = article.find('a')['href']

    # Append the data as a dictionary to the list
    data.append({'title': title, 'date': date, 'link': link})
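The loop above assumes every article element contains a heading, a date, and a link. If one of them is missing, `find` returns `None` and the attribute access raises an error. The sketch below shows one way to guard against that and to turn relative links into absolute URLs with `urllib.parse.urljoin`. The helper names `text_or_none` and `extract_article` and the inline HTML are illustrative assumptions; the markup merely mirrors the class names used in this tutorial and may not match the live page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.forloop.ai/blog"

# Hypothetical markup; the second item has no date on purpose.
SAMPLE_HTML = """
<div class="article-item">
  <a href="/blog/first-post"><h4> First post </h4></a>
  <div class="blog-post-date">January 25, 2023</div>
</div>
<div class="article-item">
  <a href="/blog/second-post"><h4>Second post</h4></a>
</div>
"""


def text_or_none(element):
    """Return the stripped text of an element, or None if it is missing."""
    return element.get_text(strip=True) if element is not None else None


def extract_article(article, base_url):
    """Build one record, tolerating missing sub-elements."""
    link_tag = article.find("a", href=True)
    return {
        "title": text_or_none(article.find("h4")),
        "date": text_or_none(article.find(class_="blog-post-date")),
        # urljoin resolves relative hrefs against the page URL
        "link": urljoin(base_url, link_tag["href"]) if link_tag else None,
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [
    extract_article(article, BASE_URL)
    for article in soup.find_all("div", class_="article-item")
]
print(records)
```

Missing fields show up as `None` in the records, so one incomplete article no longer stops the whole run.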

Step 5: View the data

We’re almost done with the web scraping. We can now create the DataFrame from the collected list and view our data using the print() function.

# Create a DataFrame from the list of dictionaries
df = pd.DataFrame(data)

# Print DataFrame
print(df)
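Printing is fine for a quick look, but as mentioned earlier, scraped information is usually saved so it can be analyzed later. One common way to do that with pandas is `DataFrame.to_csv`. The records and the file name `articles.csv` below are placeholders, not real scraped data.

```python
import pandas as pd

# Placeholder records standing in for the scraped `data` list.
data = [
    {"title": "First post", "date": "January 25, 2023", "link": "/blog/first-post"},
    {"title": "Second post", "date": "February 1, 2023", "link": "/blog/second-post"},
]

df = pd.DataFrame(data)

# With no path given, to_csv returns the CSV as a string;
# index=False leaves out the 0, 1, 2... row labels.
csv_text = df.to_csv(index=False)
print(csv_text)

# To write a file instead (the file name is just an example):
# df.to_csv("articles.csv", index=False)
```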

The output looks as follows:

Web scraping output using Beautiful Soup

And that's it! You've successfully scraped the articles from https://www.forloop.ai/blog and stored them in a DataFrame. This is just the tip of the iceberg when it comes to web scraping; there are many other things you can do with Beautiful Soup and Python. Please note that some websites have restrictions and terms of use, so it's always good to read them before scraping.

Conclusion

We covered the fundamentals of web scraping, the main approaches, and a hands-on tutorial. There are multiple ways to scrape a website, and the right one depends on your technical skills, the website you're scraping, and the data you're looking to extract. Before scraping a website, always check its terms of service, since web scraping is often not allowed.

If these topics interest you, please get in touch with us or join our Slack channel. In addition, every Tuesday at 12:00 CET we host a free webinar on external data, where we share our experiences and discuss many challenging topics.