
Web Scraping with Node.js and Puppeteer

Web scraping is the technique of extracting information from websites with scripts. It has many uses, such as collecting data from sites that provide no API and comparing prices across e-commerce platforms.

(A quick note: Screen scraping can violate the terms of service of many sites. This post is for educational purposes only. If you are considering building a screen-scraping application, make sure to check the terms of service of the site before running it.)

In this tutorial, we will be using JavaScript (Node.js) and the headless browser module, Puppeteer, to automatically extract episode data and download links from a podcast’s page on Podbean.com.

Step 0: Setup

This tutorial assumes you have a fair knowledge of HTML, the DOM, and JavaScript (Node.js).

  1. Install Node.js and npm, if you haven’t already.
  2. Install Puppeteer with npm install puppeteer --save

Step 1: Accessing a podcast’s page

Copy and paste the following code into a JS file. We’ll call it scrape.js.
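A minimal sketch of what scrape.js could look like at this stage is shown below. It only opens the page and reports its title. The function signature (an already-launched browser plus a URL) and the choice to pass the podcast URL as a command-line argument are assumptions made for this sketch.

```javascript
// scrape.js (sketch): open a podcast page with Puppeteer.

// Opens `url` in a new tab of an already-launched Puppeteer browser
// and resolves with the page title. Nothing is extracted yet.
async function scrapeEpisodeLinks(browser, url) {
  const page = await browser.newPage();
  // Wait until network activity has mostly settled before continuing.
  await page.goto(url, { waitUntil: "networkidle2" });
  return page.title();
}

async function main(url) {
  // Required here so the function above can be exercised with a stub browser.
  const puppeteer = require("puppeteer");
  // headless: false makes the browser window visible while the script runs.
  const browser = await puppeteer.launch({ headless: false });
  try {
    const title = await scrapeEpisodeLinks(browser, url);
    console.log("Loaded page:", title);
  } finally {
    await browser.close();
  }
}

// Assumption: the podcast page URL is supplied as the first argument.
if (process.argv[2]) {
  main(process.argv[2]).catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
}
```

With this sketch, the podcast page URL is appended to the run command: node scrape.js followed by the URL.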

Run this program in the command line: node scrape.js
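When it runs, a Chrome browser window opens and loads the podcast page. (The window is only visible if Puppeteer is launched with headless: false; by default it runs headless.)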

Step 2: Developing our algorithm using Developer Tools for inspection

Open the podcast’s page in your browser, and then open “Developer Tools.”

You’ll notice that:

  1. All episodes are in a table element with class ‘items’
  2. Each episode has its title in an anchor element with class ‘listen-now’
  3. Each episode has its release date in a span element with class ‘datetime’
  4. Each episode has a link that points to a page where you can download that episode. The link is in the href attribute of an anchor element with class ‘download’
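Before writing the extraction code, it can help to confirm these observations from a script. The sketch below turns the four observations into CSS selectors and counts how many elements match each one. The names SELECTORS and countMatches are hypothetical, and `page` is assumed to be a Puppeteer page that has already loaded the podcast page.

```javascript
// CSS selectors derived from the four observations above.
const SELECTORS = {
  episodeTable: "table.items",
  title: "a.listen-now",
  releaseDate: "span.datetime",
  downloadPage: "a.download",
};

// Counts how many elements on `page` match each selector.
// `page` is assumed to be a Puppeteer Page that has finished loading.
async function countMatches(page, selectors = SELECTORS) {
  const counts = {};
  for (const [name, selector] of Object.entries(selectors)) {
    // $$eval passes every matching element to the callback as an array.
    counts[name] = await page.$$eval(selector, (elements) => elements.length);
  }
  return counts;
}
```

If the title, release date, and download counts roughly agree, the selectors are probably picking out one of each per episode. A count of zero means the selector needs another look in Developer Tools.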

Step 3: Extracting episode information

Now, let’s update the scrapeEpisodeLinks function with the algorithm:
const puppeteer = require("puppeteer"); // import the puppeteer module
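One way to write the updated function is sketched below. It assumes that each episode occupies one row (tr) of the ‘items’ table and that the function receives an already-launched Puppeteer browser plus the podcast URL. The signature and the field names of the returned objects are illustrative choices.

```javascript
// Opens the podcast page and returns one object per episode row.
// `browser` is assumed to be the result of puppeteer.launch().
async function scrapeEpisodeLinks(browser, url) {
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: "networkidle2" });
    // The callback runs inside the page, so it may only use DOM APIs
    // and must return plain, serializable values.
    return await page.$$eval("table.items tr", (rows) =>
      rows
        .map((row) => {
          const title = row.querySelector("a.listen-now");
          const date = row.querySelector("span.datetime");
          const download = row.querySelector("a.download");
          // Skip rows (such as headers) that do not describe an episode.
          if (!title || !download) return null;
          return {
            title: title.textContent.trim(),
            released: date ? date.textContent.trim() : null,
            downloadPage: download.href,
          };
        })
        .filter((episode) => episode !== null)
    );
  } finally {
    await page.close();
  }
}
```

Each returned object holds the episode title, its release date as displayed on the page, and the URL of its download page.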

Run this code, and you should see output in your terminal like this:

Step 4: Extracting the download link

You’ll notice that the URL we extracted in Step 3 is not the actual link to the episode’s audio file, but a link to that episode’s download page. So we need to visit each episode’s download page and extract the audio link from it.

The audio URL for the episode can be found in the href attribute of an anchor element with class ‘download-btn’.

We will create a new function to handle that:
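A minimal sketch of such a function follows. The name getDownloadLink and its parameters are hypothetical. It assumes an already-launched Puppeteer browser and that the download page contains exactly the ‘download-btn’ anchor described above.

```javascript
// Visits an episode's download page and returns the audio file URL.
// `browser` is assumed to be the result of puppeteer.launch().
async function getDownloadLink(browser, downloadPageUrl) {
  const page = await browser.newPage();
  try {
    await page.goto(downloadPageUrl, { waitUntil: "networkidle2" });
    // $eval runs the callback on the first match and throws if there is none.
    return await page.$eval("a.download-btn", (anchor) => anchor.href);
  } finally {
    // Close the tab even when extraction fails, so tabs do not pile up.
    await page.close();
  }
}
```

Because $eval rejects when the selector matches nothing, a caller can wrap this function in try/catch to skip episodes whose download page has a different layout.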

Step 5: Tying it all together

Now that we have created a function to extract the actual download link, we can use it in our main code:
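The combining step can be sketched as a small helper that walks the episode list and attaches the audio URL to each entry. The name addAudioUrls, the audioUrl field, and the downloadPage field it reads are assumptions of this sketch. The function that resolves a download page to an audio URL is passed in as an argument.

```javascript
// Visits each episode's download page in turn and attaches the audio URL.
// `resolveAudioUrl` is any async function that maps a download-page URL
// to the audio file URL (for example, the function written in Step 4).
async function addAudioUrls(episodes, resolveAudioUrl) {
  const results = [];
  for (const episode of episodes) {
    // Sequential on purpose: one page at a time keeps load on the site low.
    const audioUrl = await resolveAudioUrl(episode.downloadPage);
    results.push({ ...episode, audioUrl }); // copy, leaving the input untouched
  }
  return results;
}

// Possible wiring inside the main function, where `browser` is a launched
// Puppeteer browser and `getDownloadLink` stands for the Step 4 function:
//
//   const episodes = await scrapeEpisodeLinks(browser, url);
//   const complete = await addAudioUrls(episodes, (link) =>
//     getDownloadLink(browser, link)
//   );
//   console.log(complete);
```

Fetching the pages one after another is slower than using Promise.all, but it avoids opening dozens of tabs at once and is gentler on the site being scraped.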

Running this code will give us our final result:

Note: The information presented in this blog post/tutorial is for educational and informational purposes only.


Kevin de Youngster is a CS major at Ashesi University with 2 years of experience in coding. He has experience with Python, web development (CSS, HTML, JS), and Node.js. Having a keen interest in how software is used to enhance other industries, he has interned with a number of companies and is currently working at Chalkboard Education. When he isn't immersed in building projects, he spends his time watching nature videos and making illustrations.

