Puppeteer Python API introduction: A web scraping tools in practice

Dominik Vach
CTO, Co-founder
Tutorial
June 16, 2023

Certain skills add a feather to your cap in the bustling world of web development and data analysis. One such skill is web scraping, the art of extracting useful information from the vast expanse of the internet. Puppeteer, a powerful Node.js library, can be a game-changer in your data extraction tasks. This article introduces you to the Puppeteer Python API and gives you hands-on experience with a real-world web scraping example. Whether you're a programmer, data analyst, IT student, or web developer with basic Python knowledge, this guide will equip you with the practical knowledge needed to use Puppeteer efficiently.

This is the second article in the series “A web scraping tools in practice”. In the previous one, we covered the basics of Beautiful Soup.

Introduction to Puppeteer and Web Scraping

Web scraping is a technique for extracting large amounts of data from websites, which can then be saved to a local file or a database on your computer for later use. There are many solutions on the market that allow you to scrape web data. One of them is Puppeteer.

Puppeteer is a Node.js library providing a high-level API to control headless Chrome or Chromium browsers. While it's native to JavaScript, it has also been ported to other languages. The unofficial Python port is called Pyppeteer, and it gives Python developers a chance to leverage Puppeteer's powerful features. Puppeteer can automate a variety of tasks, including generating screenshots and PDFs of pages, testing browser compatibility, automating form submission, UI testing, keyboard input, and much more. It's especially effective in web scraping because it drives a real browser and simulates the actions of a real user, allowing you to scrape even the most JavaScript-heavy websites.

Core Features of Puppeteer

Puppeteer has many powerful features, making it one of the most versatile web scraping tools. Here are some of its core features:

  1. Generating Screenshots and PDFs: Puppeteer can generate screenshots and PDFs of web pages.
  2. Crawling SPAs (Single-Page Applications): Puppeteer can render and crawl single-page applications, enabling the scraping of modern, dynamic websites.
  3. Automated form submission: Puppeteer can automate form submissions and also create an automated testing environment.
  4. UI testing: Puppeteer provides a headless browser testing environment for user interface testing, which is critical in ensuring that users do not experience unexpected issues.
  5. Network throttling: Puppeteer can throttle network and CPU, which allows you to understand how your site performs under different conditions.
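The first feature in the list can be sketched with Pyppeteer as follows. This is a minimal illustration, not a definitive implementation: the function name capture(), the stem argument, and the output file names are assumptions made up for this example, while goto(), screenshot(), and pdf() are the Pyppeteer page methods.

```python
import asyncio


async def capture(page, url, stem):
    """Save a full-page screenshot and a PDF of `url` as <stem>.png and <stem>.pdf."""
    await page.goto(url)
    # 'fullPage' captures the whole scrollable page, not only the visible viewport.
    await page.screenshot({'path': stem + '.png', 'fullPage': True})
    # PDF generation only works while Chromium runs in headless mode.
    await page.pdf({'path': stem + '.pdf', 'format': 'A4'})
    return stem + '.png', stem + '.pdf'


async def main():
    # Imported here so that capture() can be tried without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch()
    try:
        page = await browser.newPage()
        # 'pptr-home' is a hypothetical output name chosen for this sketch.
        return await capture(page, 'https://pptr.dev', 'pptr-home')
    finally:
        await browser.close()
```

Calling asyncio.run(main()) would launch the browser and write both files to the current directory.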

Now that we've covered the basics, let's delve into some practical examples of web scraping with Puppeteer.

Puppeteer website, source: pptr.dev

Puppeteer vs Beautiful Soup: Choosing Your Web Scraping Tool

When it comes to web scraping in Python, Beautiful Soup is another popular tool in the arsenal of many developers. But how does it compare to Puppeteer? Let's examine both tools in depth to help you make an informed choice.

Beautiful Soup

Beautiful Soup is a Python library designed for web scraping purposes to extract data from HTML and XML documents. It creates a parse tree from page source code that can be used to extract data in a hierarchical and readable manner.

Beautiful Soup Pros:

  1. Simplicity and Ease of Use: Beautiful Soup's straightforward syntax and methods make it a very beginner-friendly tool.
  2. Great for Static Sites: Beautiful Soup is fantastic for extracting data from static websites, i.e., sites where the HTML markup doesn't change with JavaScript.
  3. Integration with Other Libraries: Beautiful Soup can be used in conjunction with other Python libraries like pandas, enabling powerful data manipulation and analysis possibilities.

Beautiful Soup Cons:

  1. Limited JavaScript Support: Beautiful Soup alone cannot render JavaScript and hence might not be able to scrape data from dynamic websites that heavily rely on JavaScript. It's often used with tools like Selenium or requests-HTML to overcome this.
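To see why a parser alone misses JavaScript-generated content, consider the sketch below. It deliberately uses html.parser from Python's standard library as a stand-in for Beautiful Soup so that it runs without extra installs. The point is the same for any tool that only parses the page source: markup that a script would insert at runtime is not there to be found. The page source and the class name are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page source: the list stays empty until the script runs in a browser.
PAGE_SOURCE = """
<html><body>
  <h1>Blog</h1>
  <ul id="posts"></ul>
  <script>
    document.getElementById('posts').innerHTML = '<li>First post</li>';
  </script>
</body></html>
"""


class StaticPageParser(HTMLParser):
    """Counts <li> elements and collects <h1> text found in the raw markup."""

    def __init__(self):
        super().__init__()
        self.li_count = 0
        self.headings = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.li_count += 1
        elif tag == 'h1':
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data.strip())


parser = StaticPageParser()
parser.feed(PAGE_SOURCE)
print(parser.headings)  # the static heading is found
print(parser.li_count)  # the <li> inside the script was never rendered, so it is not counted
```

A headless browser would execute the script first, so the list item would exist in the DOM by the time you query it.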

Puppeteer

As we've discussed, Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium browsers.

Puppeteer Pros:

  1. Support for JavaScript-heavy Sites: Puppeteer shines when it comes to scraping dynamic websites. It can interact with web pages just like a real user and can scrape practically any site, irrespective of how JavaScript-heavy it is.
  2. Automating Browser Tasks: Puppeteer can do more than just web scraping. You can automate several browser-related tasks like form submission, UI testing, generating screenshots, and more.
  3. Headless Browsing: Puppeteer operates in a headless mode, which makes it a lot faster and resource-efficient.

Puppeteer Cons:

  1. Learning Curve: Puppeteer, while powerful, can be slightly more challenging to pick up initially, especially for those unfamiliar with JavaScript and Node.js.
  2. Performance: Puppeteer can be a little resource-intensive, particularly when handling numerous pages concurrently.

Making the Choice

The choice between Puppeteer and Beautiful Soup primarily depends on your project needs and familiarity with the languages involved.

Beautiful Soup is a straightforward, effective choice if you're dealing with static websites and prefer working in Python. However, if the website is JavaScript-heavy and dynamically generates content, Puppeteer might be a better tool for the job due to its ability to interact with JavaScript and mimic a real user's actions.

In conclusion, there's no "one size fits all" tool. Both Puppeteer and Beautiful Soup have their strengths and are excellent in their own right. Assess your needs, experiment with both, and pick the one that works best for your project.

Practical Example of Web Scraping with Puppeteer

Here's a step-by-step guide on how to use the Puppeteer Python API (Pyppeteer) to extract article titles, links, and dates from the blog posts on https://www.forloop.ai/blog.

Step 1: Install Puppeteer Python API (Pyppeteer)

First, make sure you have Python and pip (the Python package manager) installed in your environment. If not, you can download Python from python.org.

Next, install Pyppeteer using pip by running the following command in your terminal:

pip install pyppeteer

Step 2: Import Required Libraries

Create a new Python script and import the necessary libraries:

import asyncio
from pyppeteer import launch

asyncio is a library to write single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources. We will use it to manage our asynchronous tasks.
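If coroutines are new to you, the following standalone sketch shows the idea with no browser involved. The coroutine name fetch_label() and the delays are made up for illustration, and asyncio.sleep() stands in for waiting on I/O such as a page load.

```python
import asyncio


async def fetch_label(label, delay):
    """Pretend to wait on I/O for `delay` seconds, then return `label`."""
    await asyncio.sleep(delay)
    return label


async def main():
    # Both coroutines wait at the same time; gather() returns results
    # in the order the coroutines were passed in, not in completion order.
    return await asyncio.gather(
        fetch_label('first', 0.02),
        fetch_label('second', 0.01),
    )


results = asyncio.run(main())
print(results)
```

Pyppeteer's methods are coroutines of the same kind, which is why nearly every call in the steps below is prefixed with await.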

Step 3: Create an Asynchronous Function to Scrape Data

Define an asynchronous function called scrape_data() that will take care of launching the browser, opening a new page, and navigating to the blog.

async def scrape_data():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('https://www.forloop.ai/blog')

Step 4: Scrape the Blog Titles, Links, and Dates

Now, add the code to scrape the blog titles, links, and dates. The page.evaluate() method is used to execute JavaScript code in the context of the page.

    titles = await page.evaluate('''() => {
        return Array.from(document.querySelectorAll('.blog-card .title')).map(elem => elem.innerText);
    }''')

    links = await page.evaluate('''() => {
        return Array.from(document.querySelectorAll('.blog-card a')).map(elem => elem.href);
    }''')

    dates = await page.evaluate('''() => {
        return Array.from(document.querySelectorAll('.blog-card .date')).map(elem => elem.innerText);
    }''')

This code extracts the text of each blog title and date element, and the URL (the href property) of each link.
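Three separate lists can drift out of alignment, for example if a card contains more than one link or has no date. A possible variation, sketched here under the assumption that each post sits in its own .blog-card element as the selectors above suggest, collects one record per card in a single evaluate() call. The names EXTRACT_CARDS_JS, scrape_cards(), and complete_cards() are hypothetical.

```python
# JavaScript run inside the page: one object per card, null for anything missing.
EXTRACT_CARDS_JS = '''() => {
    const text = (root, selector) => {
        const node = root.querySelector(selector);
        return node ? node.innerText.trim() : null;
    };
    return Array.from(document.querySelectorAll('.blog-card')).map(card => {
        const anchor = card.querySelector('a');
        return {
            title: text(card, '.title'),
            link: anchor ? anchor.href : null,
            date: text(card, '.date'),
        };
    });
}'''


async def scrape_cards(page):
    """Return one dict per blog card with 'title', 'link' and 'date' keys."""
    return await page.evaluate(EXTRACT_CARDS_JS)


def complete_cards(cards):
    """Keep only the cards for which both a title and a link were found."""
    return [card for card in cards if card.get('title') and card.get('link')]
```

Inside scrape_data(), after page.goto(), this would be used as cards = complete_cards(await scrape_cards(page)).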

Step 5: Close the Browser and Return the Data

Close the browser once the data has been extracted, and return the scraped data:

    await browser.close()
    return titles, links, dates

Step 6: Run the Asynchronous Function

Finally, run the scrape_data() coroutine with asyncio.run(), which creates an event loop, runs the coroutine to completion, and then closes the loop:

data = asyncio.run(scrape_data())
print(data)

This will print the scraped blog titles, links, and dates.

To run the script, execute the Python file in your terminal.
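For reference, the steps above can be assembled into a single file along these lines. Treat it as a sketch rather than the canonical version: the collect_js() helper and the BLOG_URL constant are conveniences introduced here to avoid repeating the same JavaScript three times, and the try/finally is added so the browser is closed even if a step fails. The selectors are the ones used in Step 4.

```python
import asyncio

BLOG_URL = 'https://www.forloop.ai/blog'


def collect_js(selector, prop):
    """Build a JavaScript function returning `prop` of every element matching `selector`."""
    return (
        f"() => Array.from(document.querySelectorAll('{selector}'))"
        f".map(elem => elem.{prop})"
    )


async def scrape_data(url=BLOG_URL):
    # Imported here so that collect_js() can be tried without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch()
    try:
        page = await browser.newPage()
        await page.goto(url)
        titles = await page.evaluate(collect_js('.blog-card .title', 'innerText'))
        links = await page.evaluate(collect_js('.blog-card a', 'href'))
        dates = await page.evaluate(collect_js('.blog-card .date', 'innerText'))
    finally:
        # Close the browser even when navigation or extraction raises.
        await browser.close()
    return titles, links, dates
```

Adding print(asyncio.run(scrape_data())) at the bottom of the file runs the scraper when the script is executed.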

Tips & Tricks for Using Puppeteer in Production

While Puppeteer is powerful and flexible, there are some best practices you should follow to ensure smooth operation, particularly in production environments:

  1. Manage memory wisely: Puppeteer can be memory-intensive, especially when dealing with numerous browser instances. Consider limiting the number of concurrent sessions and ensure that every launched browser is properly closed once it's done.
  2. Handle navigation correctly: Use the waitUntil option in page.goto() to make sure the page has finished loading before you proceed. Setting waitUntil to 'networkidle0' or 'networkidle2' waits until there have been no more than 0 or 2 network connections, respectively, for at least 500 ms.
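  3. Error Handling: Always wrap your Puppeteer code in a try/except block (try/catch in JavaScript) to handle errors gracefully, and close pages and browsers in a finally clause so that a failure doesn't leave Chromium running.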
  4. Use headless mode in production: Running Puppeteer in headless mode (without a visible UI) is more resource-efficient and is generally recommended in a production environment.
  5. Take advantage of browser contexts: Puppeteer's 'browser contexts' are isolated environments within a browser instance. You can use different contexts to simulate multiple tabs or users, without the overhead of launching multiple browsers.
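Several of these tips can be combined in one pattern. The sketch below is one way to do it with Pyppeteer, not the only one: the function names, the limit of 3 concurrent pages, and the 30-second timeout are assumptions chosen for illustration. It caps concurrency with asyncio.Semaphore, waits for 'networkidle2', reports failures instead of crashing, and always closes what it opened.

```python
import asyncio


async def fetch_title(browser, url):
    """Open one page, wait for the network to go quiet, and return the page title."""
    page = await browser.newPage()
    try:
        await page.goto(url, {'waitUntil': 'networkidle2', 'timeout': 30000})
        return await page.title()
    finally:
        # Close the tab whether or not navigation succeeded.
        await page.close()


async def fetch_titles(browser, urls, max_concurrent=3):
    """Fetch titles for all URLs with at most `max_concurrent` pages open at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        async with semaphore:
            try:
                return url, await fetch_title(browser, url)
            except Exception as error:
                # Report the failure for this URL and keep the other tasks running.
                return url, f'failed: {error}'

    return await asyncio.gather(*(guarded(url) for url in urls))


async def main(urls):
    # Imported here so the helpers above can be tried without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch(headless=True)
    try:
        return await fetch_titles(browser, urls)
    finally:
        await browser.close()
```

A call such as asyncio.run(main(['https://www.forloop.ai/blog'])) would return a list of (url, title) pairs, with an error message in place of the title for any URL that failed.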

Conclusion

Puppeteer is a powerful tool that allows you to explore the vast world of web scraping in Python. It offers many possibilities, from simple data extraction to more complex tasks like automated testing and form submission.

As you’ve seen in our hands-on example, extracting data from a webpage is quite straightforward with Puppeteer. It's all about understanding the webpage's structure and using Puppeteer’s API to navigate and extract the necessary data.

It's important to remember that while web scraping is a powerful tool, it's also your responsibility to use it ethically and respect the terms of service of the websites you are scraping.

By now, you should understand how to use Puppeteer for your web scraping tasks. It’s time to apply what you’ve learned. Start your journey into web scraping by creating your own project, experimenting with different websites, and seeing what amazing data you can uncover.

For more detailed information, refer to the official Puppeteer documentation. It's an excellent resource, with comprehensive explanations of Puppeteer's API and more complex examples.

Keep exploring, keep learning, and most importantly, keep scraping!