Web scraping is the selective extraction of data from a website or another online source. It is closely related to web crawling, which follows hyperlinks from page to page, typically to gather information for indexing.
In one developer survey, 65% of respondents said they use JavaScript, making it arguably the most widely used programming language in the world; nearly every website relies on JavaScript for its interactive features.
In this article, you will learn how to crawl and scrape websites in JavaScript using Node.js.
🔑 Key Takeaways
- Node.js lets JavaScript run outside the browser, which was previously the only place it could run.
- The example request targets https://techjury.net/ and includes an HTTP header that mimics a genuine browser request.
- Run the code in VS Code's terminal by typing node webscraper.js, substituting whatever name you gave your JavaScript file.
- Remember that websites evolve, so a method that works today may lose efficacy over time.
How To Crawl And Scrape Websites In JavaScript?
Web scraping and crawling will be easy with this beginner-friendly guide. However, before you start scraping, there are a few requirements you must meet.
✅ Pro Tip Web scraping fetches data and grants control but can be tricky at scale. Enter Bright Data's Scraping Browser, an advanced tool that uses an automated browser to surpass traditional limits.
These requirements are the following:
- Node.js – This lets JavaScript run outside the browser. Node.js is built on Chrome's V8 engine; before runtimes like it, JavaScript could only run inside a web browser. You can go to the Node.js site and download the latest version.
- Code editor or IDE – This is where you will write your JavaScript code. VS Code is recommended for simplicity, but you can use any code editor you are familiar with.
Make sure to test first to see if those two are installed properly. You can do this in your Command Prompt on Windows or Terminal on Mac.
Just type in the command node -v, and you should get a version number back.
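For example, a successful check prints a version string (the exact version number will vary with your installation):

```
node -v
v20.11.0
```

If the command prints an error instead of a version number, Node.js is not installed correctly.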
Now you can create your own JavaScript scraper by following the steps below:
🗒️ Note To demonstrate the whole process, this guide will scrape our website.
1. Create A Folder For Your Project
💡 Did You Know? JavaScript has a strong 89.77% market share, making it the frontrunner in Document Standards, particularly in the US.
You must create a folder to save your JavaScript files for scraping. To do that, follow these steps:
1. Create a folder anywhere. This folder will serve as the workspace for your web scraper project. You can name it Webscraper-project.
2. Launch your VS Code.
3. Click on Open Folder, then choose the folder you created.
4. To create your first JavaScript file, click on the symbol for New File. You can name it whatever you want.
Make sure to add the .js extension (example: webscraper.js). This file should open automatically.
5. To make it easier, use the built-in Terminal of VS Code. Click on Terminal on the top options, then click on New Terminal.
Your workspace should now show the new file and an open terminal.
2. Install The Libraries For Scraping
A library is a collection of pre-written code that makes programming faster and more efficient. Node.js has many libraries for scraping.
🎉 Fun Fact Node.js hosts CORS Anywhere, a popular open-source module. It redirects requests to a local proxy that fetches data from the target origin, delivering it to the browser with the correct headers. This free tool is popular among developers and site owners but requires continuous Node.js operation.
Here are some of the most popular ones:
- Cheerio – parses HTML and lets you extract data from its elements.
- Axios – makes HTTP requests, which makes it well suited for crawling.
- Puppeteer – drives a real Chrome instance as a headless browser, meaning it has no visible interface. You can use Puppeteer to automate browsing activities in an actual Chrome engine (see the short sketch after this list).
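To illustrate the difference, here is a minimal Puppeteer sketch that loads a page in headless Chrome and prints its title; the URL is only a placeholder:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless Chrome instance (no visible window)
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate like a real browser, so JavaScript-rendered content loads too
  await page.goto('https://example.com/');
  console.log(await page.title());

  await browser.close();
})();
```

This guide sticks with Cheerio and Axios, which are lighter, because the target pages here are static HTML.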
The following steps will use Cheerio and Axios. To install them, type this command in your VS Code terminal:
```
npm install cheerio axios
```
After this, you can move on to the actual code.
3. Send An HTTP Request
The first thing that your scraper code has to do is import Cheerio and Axios to access their functionalities. This will be the beginning of the code:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');
```
Let’s say we’re planning to scrape information about the names and descriptions of TechJury authors.
To do this, your program has to send an HTTP request to the target website, as shown below:
```javascript
async function scrapeData() {
  try {
    const url = 'https://techjury.net/'; // Replace with the URL of the website you want to scrape
    const headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'
    };
    const response = await axios.get(url, { headers });
    const $ = cheerio.load(response.data);
    const authors = [];
```
The request URL is set to https://techjury.net/. An HTTP header with a user-agent string is also added so the request looks like it came from a real browser.
The authors constant starts as an empty array that will hold the scraped results. The next step is to pinpoint the relevant data within the HTML content.
4. Select Elements In HTML
Go to TechJury's home page and get a feel for its layout. The section listing the authors and their descriptions is what you need to scrape.
To fully analyze the page's structure, you need to open your browser's DevTools.
Right-click on a blank space and select the Inspect option. DevTools should open to the Elements panel, where you will see the HTML structure of the website as rendered by the browser.
Remember that you are trying to get the authors’ names and descriptions. You need to find the CSS selectors containing the necessary information to do this.
CSS selectors identify the elements that style rules apply to (like a specific font for a specific section). Cheerio uses these same selectors to pick elements and extract their data.
At the time of writing, the author names carry the class “username,” and the author descriptions are in the class “description.” Both sit inside the “card__author” container.
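Because the markup may have changed since then, here is a simplified, illustrative sketch of that structure rather than the page's exact HTML:

```html
<!-- Illustrative only; the class names match those described above -->
<div class="card__author">
  <span class="username">Author Name</span>
  <p class="description">Short author bio goes here.</p>
</div>
```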
5. Scrape The Relevant Data
Now that you have identified the necessary selectors, it is time to add the code that extracts that data.
Here is a sample code:
```javascript
    // Use Cheerio selectors to target the elements containing author information
    $('.card__author').each((index, element) => {
      const authorName = $(element).find('.username').text();
      const authorDescription = $(element).find('.description').text();
      authors.push({ name: authorName, description: authorDescription });
    });
```
Within each card__author container, the code selects the username and description elements and extracts their text.
authors.push adds each result to the authors array. Now, all you have to do is organize the collected data.
6. Convert The Results To JSON
The next piece of code in your web scraping program will convert the extracted data into JSON format, which comes in handy when transferring data through APIs.
💡 Did You Know? JSON is a lightweight format that is readable by both humans and machines and works seamlessly with many programming languages, including Python.
```javascript
    // Convert the scraped data to JSON
    const jsonData = JSON.stringify(authors, null, 2);
    console.log('Scraped data (JSON):\n', jsonData);
```
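If you also want to save the results to disk, a minimal sketch using Node's built-in fs module can go right after the console.log line; the filename authors.json is just an example:

```javascript
const fs = require('fs');

// Write the JSON string to a file; 'authors.json' is an example name
fs.writeFileSync('authors.json', jsonData);
```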
Here’s another piece for catching errors:
```javascript
  } catch (error) {
    console.error('Error:', error);
  }
```
The final code should surface any errors it runs into. You may encounter errors on sites with anti-scraping measures, and Cheerio alone cannot see content rendered by JavaScript, as on single-page application (SPA) sites like Facebook or Google; a headless browser such as Puppeteer is the usual workaround there.
7. Run The Code
It is now time to run the code. However, make sure to review the overall code before running it.
Here is what it should look like:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeData() {
  try {
    const url = 'https://techjury.net/'; // Replace with the URL of the website you want to scrape
    const headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'
    };
    const response = await axios.get(url, { headers });
    const $ = cheerio.load(response.data);
    const authors = [];

    // Use Cheerio selectors to target the elements containing author information
    $('.card__author').each((index, element) => {
      const authorName = $(element).find('.username').text();
      const authorDescription = $(element).find('.description').text();
      authors.push({ name: authorName, description: authorDescription });
    });

    // Convert the scraped data to JSON
    const jsonData = JSON.stringify(authors, null, 2);
    console.log('Scraped data (JSON):\n', jsonData);
  } catch (error) {
    console.error('Error:', error);
  }
}

scrapeData();
```
You can run the code in VS Code's built-in terminal. Just type node webscraper.js, substituting the name of your JavaScript file if you chose a different one.
The result should be a JSON array of authors printed to the terminal.
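Shown with placeholder values (the live page will produce real names and descriptions), the output takes roughly this shape:

```
Scraped data (JSON):
 [
  {
    "name": "Author Name",
    "description": "Short author bio..."
  },
  ...
]
```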
With that, your first web scraping project in JavaScript is complete.
Conclusion
Websites change constantly, so remember that the same method may not work in the future.
The steps above should give you an idea of how web scraping and crawling work in JavaScript.
🗒️ Related Articles Check out these articles to discover how to gather useful website data for research, analysis, or automation. They cover web scraping techniques, tools, and ethical considerations.
FAQs
How do I scrape an entire website?
To scrape an entire website, you can write a Node.js crawler that downloads each page's HTML and follows the site's internal links until every page has been visited.
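As a rough sketch of that idea, a simple Axios-and-Cheerio crawler might look like this; the start URL and page limit are placeholders, and a real crawler should also respect robots.txt and rate limits:

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

// Illustrative starting point and safety limit; adjust for your target site
const startUrl = 'https://example.com/';
const maxPages = 20;

async function crawlSite() {
  const visited = new Set();
  const queue = [startUrl];

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    try {
      const response = await axios.get(url);
      const $ = cheerio.load(response.data);

      // Collect every internal link and enqueue the ones not yet seen
      $('a[href]').each((index, element) => {
        const link = new URL($(element).attr('href'), url).href;
        if (link.startsWith(startUrl) && !visited.has(link)) {
          queue.push(link);
        }
      });

      console.log('Crawled:', url);
    } catch (error) {
      console.error('Failed to fetch', url, error.message);
    }
  }
}

crawlSite();
```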
Is web scraping easy?
It is both easy and hard. Paid web scraping services make it easy; it can be hard when the target site has strong anti-scraping measures.
Darko founded WhatToBecome.com, a comprehensive career guidance platform for beginners in various popular fields. With a focus on remote working scenarios, workplace technology, emerging trends, and common challenges, Darko shares his valuable experiences and insights with our readers here on Techjury. Through his informative articles, Darko equips readers with the necessary knowledge and wisdom to thrive in their professional endeavors.