Scrapy introduction: A web scraping tool in practice

Dominik Vach
CTO, Co-founder
July 7, 2023

Are you a budding web developer, a savvy data scientist, or a curious technology enthusiast interested in diving into the world of web scraping? If so, this guide is tailored just for you. In this comprehensive tutorial, we'll introduce you to Scrapy, an open-source web crawling framework that will help you navigate web scraping tasks like a pro.

Web scraping, the automated method of extracting large amounts of data from websites, is a crucial skill in today's data-driven world. Whether you're extracting customer reviews for sentiment analysis or mining e-commerce sites for competitive analysis, web scraping has countless applications. One tool that makes this task much more manageable is Scrapy.

Let's begin our journey toward mastering this fast and powerful web scraping tool.

What is Scrapy?

Scrapy is a free and open-source web-crawling framework written in Python. Originally designed for web scraping, it can also extract data using APIs or as a general-purpose web crawler.

A standout feature of Scrapy is its speed. Unlike other tools that send a new request after the previous one has been handled, it uses an asynchronous networking library, allowing it to handle multiple requests concurrently. This makes it faster and more efficient, especially when dealing with large-scale scraping tasks.

Installing Scrapy

To start using Scrapy, we need to install it. But before that, make sure you have Python and pip installed. Once you've confirmed that, open your terminal or command prompt and type the following command:

pip install scrapy

If the installation is successful, you can confirm by typing:

scrapy version

You should see the installed version displayed. For us it’s 2.9.0.

Scrapy Architecture

One of Scrapy's strengths lies in its well-thought-out architecture, which comprises several components working together to scrape web pages, making the tool highly customizable and flexible.

  • Scrapy Engine: This is the main part of the Scrapy architecture. It controls the data flow between all other components and triggers events when certain actions occur.
  • Scheduler: This component receives requests from the Scrapy engine and queues them for later execution.
  • Downloader: After a request has been scheduled, it is sent to the downloader, which fetches the page and generates a response.
  • Spiders: These are custom classes where you define how a site (or a group of sites) should be scraped, including how to perform the crawl and how to parse the data.
  • Item Pipeline: Once the spiders have scraped the data, the item pipeline processes it. You can define several pipelines to perform various processing tasks like data cleaning or storing the data in a database.

The data flow in Scrapy happens as follows: the engine gets the initial requests from the spiders, sends them to the scheduler, and asks for the next request to send to the downloader. Once a page is downloaded, the response is sent back to the spider that issued the request, which parses it. The parsed items are sent to the item pipeline, while any follow-up requests go back to the scheduler.

Scrapy architecture

Creating a Scrapy Project and Building a Spider

Now that we have Scrapy installed and understand its architecture, it's time to get our hands dirty. The first step is to create a new Scrapy project. Open your terminal or command prompt and navigate to the directory where you want to store your project. Then type:

scrapy startproject forloop

"forloop" is the name of your project. You can choose any name that suits your preference.

Now let's create our first spider. But first, what is a spider? In Scrapy, a spider is a class that defines how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e., follow links) and how to extract structured data from their pages. Essentially, it's where you define your scraping rules.

To create a spider, navigate to the spiders directory in your project folder:

cd forloop/forloop/spiders

Then, you can create a spider using the genspider command followed by the name of the spider and the domain (without www or https) you wish to scrape:

scrapy genspider myspider <domain>

This command generates a spider named myspider that is scoped to the domain you provide.

Your spider will look something like this:

import scrapy

class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['<domain>']
    start_urls = ['<start URL>']

    def parse(self, response):
        pass
You can define how the spider should download and extract data in the parse method.

The Scrapy Shell

The Scrapy shell is an interactive shell where you can try and debug your scraping code quickly without running the spider. It's a helpful tool for testing your XPath or CSS expressions to extract data.

To start the Scrapy shell, use the shell command followed by a URL you are interested in:

scrapy shell '<URL>'

In the shell, you can try extracting data using the response object:

response.css('title::text').get()
This command will extract the title of the page. In our case, that's Blog :: Forloop.

To exit the shell, simply type:

exit()
Extracting and Storing Data

When it comes to working with Scrapy, the most crucial tasks involve extracting data from web pages and storing it in a usable format. Here's how you can do it, along with clear steps on where and how you can run your code.

Setting up your Spider

First, you need to create a spider. A Scrapy spider is a Python class that you define, and it tells Scrapy how to traverse a site and extract the data.

Let's create a simple spider for a blog site. Create a Python file inside the spiders directory and put the following code into it:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['<blog URL>']

    def parse(self, response):
        items = response.css('div.article-item')  # Select each div with class "article-item"
        for item in items:
            title = item.css('h4::text').get()  # Extract the title text
            link = item.css('a::attr(href)').get()  # Extract the link
            image = item.css('img::attr(src)').get()  # Extract the image source
            date = item.css('').get()  # Extract the date (selector is site-specific; fill it in for your page)

            yield {
                'title': title,
                'link': link,
                'image': image,
                'date': date,
            }
This simple BlogSpider scrapes the title, link, image, and date of each article on the blog's listing page.

Extracting Data

In the parse method above, we're using CSS selectors to extract data. Scrapy uses selectors to extract the data that you need from web pages. In this case, div.article-item is a CSS selector that matches each article container on the page.

When the parse method is called, it returns an iterable of Requests and/or items; this is where the data extraction happens. The line yield {'title': title, 'link': link, 'image': image, 'date': date} emits each article's fields as a separate item.

Storing Data

Once the data is extracted, you can store it in a file. By default, Scrapy provides support to export the scraped data in various formats such as JSON, XML, and CSV. To store the scraped data, you need to run the crawl command followed by the spider name and the desired output format. This command needs to be run from the root of your project directory, where the scrapy.cfg file is located.

For example, to store the data in JSON format, open your command line, navigate to your project's root directory, and run:

scrapy crawl blogspider -o result.json

This command will run the spider named 'blogspider', and the -o option specifies the name and format of the file where the scraped data will be stored, in this case, result.json.
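Alternatively, you can configure the export in code via the FEEDS setting (available since Scrapy 2.1), so you don't have to pass -o on every run. A sketch, assuming it goes in your project's settings.py (the overwrite flag requires Scrapy 2.4+):

```python
# settings.py — feed export configured in code instead of on the command line
FEEDS = {
    "result.json": {
        "format": "json",
        "overwrite": True,  # start fresh on each run instead of appending
    },
}
```

With this in place, a plain `scrapy crawl blogspider` writes result.json automatically.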

And there you have it! You've just scraped your first website using Scrapy and stored the data in a JSON file. Now you can open up result.json in your project directory. The results should look as follows.

result.json was automatically generated after the website was scraped.

Practical tips & tricks using Scrapy

1. Parallel Processing

Scrapy handles parallel requests out of the box. To control the number of concurrent requests, modify the CONCURRENT_REQUESTS setting in your project's settings. For example, CONCURRENT_REQUESTS = 20 allows up to 20 simultaneous requests.
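For reference, here is a sketch of the related throttling knobs in settings.py. The values are illustrative, not recommendations; tune them to the site you're crawling:

```python
# settings.py — concurrency and politeness knobs (illustrative values)
CONCURRENT_REQUESTS = 20             # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap per target domain
DOWNLOAD_DELAY = 0.25                # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True          # adapt the delay to server response times
```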

2. Middleware Customization

Create your own middleware to handle custom scenarios. For instance, you can design a middleware to rotate user agents. The middleware might look like this:

import random

class RandomUserAgentMiddleware(object):
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(self.user_agents))

Don't forget to update the USER_AGENTS and DOWNLOADER_MIDDLEWARES settings in your project.
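For example, assuming the middleware above lives in your project's middlewares.py (the module path and user-agent strings below are assumptions; adjust them to your project), the registration might look like this:

```python
# settings.py — register the custom middleware (module path is an assumption)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]
DOWNLOADER_MIDDLEWARES = {
    "forloop.middlewares.RandomUserAgentMiddleware": 400,
}
```

The number 400 is the middleware's priority; lower numbers run closer to the engine, higher numbers closer to the downloader.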

3. Robust Error Handling

Implement checks in your parsing function to handle potential errors. For instance, to handle missing fields in the item:

def parse(self, response):
    item = MyItem()
    # .get() returns the given default instead of None when nothing matches
    item['field'] = response.css('div.field::text').get(default='Missing')
    return item

4. Broad Crawling

Use the CrawlSpider class when you need to follow links and scrape data across an entire site. For example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['<domain>']
    start_urls = ['<start URL>']

    rules = (
        Rule(LinkExtractor(), callback='parse_item'),
    )

    def parse_item(self, response):
        # parsing code here
        pass

5. Link Extractors

Utilize the LinkExtractor to follow links in a page. For instance, to follow all links to product pages you could use:

Rule(LinkExtractor(allow='/product/'), callback='parse_item')

6. Item Loaders

Use item loaders to simplify the extraction process. They provide input and output processors to clean up your data. For example:

from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join

def parse(self, response):
    # Product is your scrapy.Item subclass
    l = ItemLoader(item=Product(), response=response)
    l.default_output_processor = TakeFirst()
    l.add_css('name', 'div.product_name::text')
    # MapCompose applies each function in turn to every extracted value
    l.add_css('price', 'p.price::text', MapCompose(str.strip, lambda p: p.replace(',', '')))
    l.add_css('description', 'div.description::text', Join())
    return l.load_item()

7. Respecting Robots.txt

To instruct Scrapy to respect the rules in robots.txt, set the ROBOTSTXT_OBEY setting to True in your project's settings.
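Projects generated with scrapy startproject already enable this in their settings template; the setting itself is a single line:

```python
# settings.py
ROBOTSTXT_OBEY = True  # fetch and respect robots.txt before crawling each site
```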

With these practical tips and examples, you'll be more equipped to handle the challenges you might face in your scraping journey with Scrapy.


Scrapy is a powerful and versatile tool for web scraping, providing robust and efficient methods to extract data from the web. Whether you're a beginner starting your journey into web scraping or an experienced developer seeking to enhance your skills, this guide provided a comprehensive introduction to Scrapy's fundamental concepts and practical tips to enhance your scraping techniques.

We walked through the installation process, discussed the architecture of Scrapy, and took a step-by-step journey through creating a Scrapy project and building a spider. We dove into the power of Scrapy shell for testing, explored the extraction and storage of data, and offered valuable tips for best practices.

In addition, we delved into some unofficial tips and tricks for using Scrapy, demonstrating how it can be leveraged for growth hacking and practical development scenarios, with short examples to illustrate the concepts. These tips are intended to inspire and guide you in extracting the most out of this dynamic framework.

Remember, web scraping should always be done responsibly, respecting the website's policies and user privacy. As you continue to experiment with Scrapy, you'll discover that it's a highly customizable tool that can handle complex scraping tasks, and its potential is limited only by your creativity.

We hope this guide has been a useful introduction to Scrapy. Now it's over to you to put these concepts into practice and start your web scraping journey. If you have any questions or want to share your experiences, join our Slack. Keep exploring, keep learning, and most importantly, happy scraping!