Web scraping is the selective extraction of data from a website or another online source. It is closely related to web crawling, which follows hyperlinks from page to page, typically to gather information for indexing.
In one developer survey, 65% of respondents said they use JavaScript, making it arguably the most widely used programming language in the world; nearly every website relies on JavaScript for its interactive features.
In this article, you will learn how to crawl and scrape websites in JavaScript using Node.js.
🔑 Key Takeaways
- Node.js lets JavaScript run outside the browser, which was previously the only place it could run.
- The example request targets https://techjury.net/ and includes an HTTP header that mimics a genuine browser request.
- Run the code in VS Code's terminal by typing node webscraper.js, substituting whatever name you gave your JavaScript file.
- Remember that websites evolve, so a method that works today may lose efficacy over time.
How To Crawl And Scrape Websites In JavaScript?
Web scraping and crawling will be easy with this beginner-friendly guide. However, before you start scraping, there are a few requirements you must meet.
✅ Pro Tip Web scraping fetches data and grants control but can be tricky at scale. Enter Bright Data's Scraping Browser, an advanced tool that uses an automated browser to surpass traditional limits.
These requirements are the following:
- Node.js – This lets JavaScript run outside the browser. Node.js is built on Chrome's V8 engine; before runtimes like it, JavaScript could only run inside a web browser. You can go to the Node.js site and download the latest version.
- Code editor or IDE – This is where you will write your JavaScript code. VS Code is recommended for simplicity, but you can use any code editor you are familiar with.
Make sure to test first to see if those two are installed properly. You can do this in your Command Prompt on Windows or Terminal on Mac.
Just type in the command node -v, and you should get a version number back.
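For example, a successful check prints a version string (the exact version number will vary with your installation):

```
node -v
v20.11.0
```

If the command prints an error instead of a version number, Node.js is not installed correctly.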
Now you can create your own JavaScript scraper by following the steps below:
🗒️ Note To demonstrate the whole process, this guide will scrape our website.
1. Create A Folder For Your Project
💡 Did You Know? JavaScript has a strong 89.77% market share, making it the frontrunner in Document Standards, particularly in the US.
You must create a folder to save your JavaScript files for scraping. To do that, follow these steps:
1. Create a folder anywhere. This folder will serve as the workspace for your web scraper project. You can name it Webscraper-project.
2. Launch your VS Code.
3. Click on Open Folder, then choose the folder you created.
4. To create your first JavaScript file, click on the symbol for New File. You can name it whatever you want.
Make sure to add the .js extension (example: webscraper.js). This file should open automatically.
5. To make it easier, use the built-in Terminal of VS Code. Click on Terminal on the top options, then click on New Terminal.
Your workspace should now show the new file and an open terminal.
2. Install The Libraries For Scraping
A library is a collection of pre-written code that makes programming faster and more efficient. Node.js has many libraries for scraping.
🎉 Fun Fact Node.js hosts CORS Anywhere, a popular open-source module. It redirects requests to a local proxy that fetches data from the target origin, delivering it to the browser with the correct headers. This free tool is popular among developers and site owners but requires continuous Node.js operation.
Here are some of the most popular ones:
- Cheerio – parses HTML and lets you extract data from its elements.
- Axios – makes HTTP requests, which makes it well suited for crawling.
- Puppeteer – drives a real Chrome instance as a headless browser, meaning it has no visible interface. You can use Puppeteer to automate browsing activities in an actual Chrome engine (see the short sketch after this list).
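To illustrate the difference, here is a minimal Puppeteer sketch that loads a page in headless Chrome and prints its title; the URL is only a placeholder:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless Chrome instance (no visible window)
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate like a real browser, so JavaScript-rendered content loads too
  await page.goto('https://example.com/');
  console.log(await page.title());

  await browser.close();
})();
```

This guide sticks with Cheerio and Axios, which are lighter, because the target pages here are static HTML.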
The following steps will use Cheerio and Axios. To install them, type this command in your VS Code terminal:
```
npm install cheerio axios
```
After this, you can move on to the actual code.
3. Send An HTTP Request
The first thing that your scraper code has to do is import Cheerio and Axios to access their functionalities. This will be the beginning of the code:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');
```
Let’s say we’re planning to scrape information about the names and descriptions of TechJury authors.
To do this, your program has to send an HTTP request to the target website, as shown below:
```javascript
async function scrapeData() {
  try {
    const url = 'https://techjury.net/'; // Replace with the URL of the website you want to scrape
    const headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'
    };
    const response = await axios.get(url, { headers });
    const $ = cheerio.load(response.data);
    const authors = [];
```
The request URL is set to https://techjury.net/. An HTTP header with a user-agent string is also added so the request looks like it came from a real browser.
The authors constant starts as an empty array that will hold the scraped results. The next step is to pinpoint the relevant data within the HTML content.
4. Select Elements In HTML
Go to TechJury's home page and get a feel for its layout. The section listing the authors and their descriptions is what you need to scrape.
To fully analyze the page's structure, you need to open your browser's DevTools.
Right-click on a blank space and select the Inspect option. DevTools should open to the Elements panel, where you will see the HTML structure of the website as rendered by the browser.
Remember that you are trying to get the authors’ names and descriptions. You need to find the CSS selectors containing the necessary information to do this.
CSS selectors identify the elements that style rules apply to (like a specific font for a specific section). Cheerio uses these same selectors to pick elements and extract their data.
At the time of writing, the author names carry the class “username,” and the author descriptions are in the class “description.” Both sit inside the “card__author” container.
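Because the markup may have changed since then, here is a simplified, illustrative sketch of that structure rather than the page's exact HTML:

```html
<!-- Illustrative only; the class names match those described above -->
<div class="card__author">
  <span class="username">Author Name</span>
  <p class="description">Short author bio goes here.</p>
</div>
```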
5. Scrape The Relevant Data
Now that you have identified the necessary selectors, it is time to add the code that extracts that data.
Here is a sample code:
```javascript
    // Use Cheerio selectors to target the elements containing author information
    $('.card__author').each((index, element) => {
      const authorName = $(element).find('.username').text();
      const authorDescription = $(element).find('.description').text();
      authors.push({ name: authorName, description: authorDescription });
    });
```
Within each card__author container, the code selects the username and description elements and extracts their text.
authors.push adds each result to the authors array. Now, all you have to do is organize the collected data.
6. Convert The Results To JSON
The next piece of code in your web scraping program will convert the extracted data into JSON format, which comes in handy when transferring data through APIs.
💡 Did You Know? JSON is a lightweight format that is readable by both humans and machines and works seamlessly with many programming languages, including Python.
```javascript
    // Convert the scraped data to JSON
    const jsonData = JSON.stringify(authors, null, 2);
    console.log('Scraped data (JSON):\n', jsonData);
```
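If you also want to save the results to disk, a minimal sketch using Node's built-in fs module can go right after the console.log line; the filename authors.json is just an example:

```javascript
const fs = require('fs');

// Write the JSON string to a file; 'authors.json' is an example name
fs.writeFileSync('authors.json', jsonData);
```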
Here’s another piece for catching errors:
```javascript
  } catch (error) {
    console.error('Error:', error);
  }
```
The final code should surface any errors it runs into. You may encounter errors on sites with anti-scraping measures, and Cheerio alone cannot see content rendered by JavaScript, as on single-page application (SPA) sites like Facebook or Google; a headless browser such as Puppeteer is the usual workaround there.
7. Run The Code
It is now time to run the code. However, make sure to review the overall code before running it.
Here is what it should look like:
```javascript
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeData() {
  try {
    const url = 'https://techjury.net/'; // Replace with the URL of the website you want to scrape
    const headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'
    };
    const response = await axios.get(url, { headers });
    const $ = cheerio.load(response.data);
    const authors = [];

    // Use Cheerio selectors to target the elements containing author information
    $('.card__author').each((index, element) => {
      const authorName = $(element).find('.username').text();
      const authorDescription = $(element).find('.description').text();
      authors.push({ name: authorName, description: authorDescription });
    });

    // Convert the scraped data to JSON
    const jsonData = JSON.stringify(authors, null, 2);
    console.log('Scraped data (JSON):\n', jsonData);
  } catch (error) {
    console.error('Error:', error);
  }
}

scrapeData();
```
You can run the code in VS Code's built-in terminal. Just type node webscraper.js, substituting the name of your JavaScript file if you chose a different one.
The result should be a JSON array of authors printed to the terminal.
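Shown with placeholder values (the live page will produce real names and descriptions), the output takes roughly this shape:

```
Scraped data (JSON):
 [
  {
    "name": "Author Name",
    "description": "Short author bio..."
  },
  ...
]
```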
With that, your first web scraping project in JavaScript is complete.
Conclusion
Websites change constantly, so remember that the same method may not work in the future.
The steps above should give you an idea of how web scraping and crawling work in JavaScript.
🗒️ Related Articles Check out these articles to discover how to gather useful website data for research, analysis, or automation. They cover web scraping techniques, tools, and ethical considerations.
FAQs
How do I scrape an entire website?
To scrape an entire website, you can write a Node.js crawler that downloads each page's HTML and follows the site's internal links until every page has been visited.
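As a rough sketch of that idea, a simple Axios-and-Cheerio crawler might look like this; the start URL and page limit are placeholders, and a real crawler should also respect robots.txt and rate limits:

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

// Illustrative starting point and safety limit; adjust for your target site
const startUrl = 'https://example.com/';
const maxPages = 20;

async function crawlSite() {
  const visited = new Set();
  const queue = [startUrl];

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    try {
      const response = await axios.get(url);
      const $ = cheerio.load(response.data);

      // Collect every internal link and enqueue the ones not yet seen
      $('a[href]').each((index, element) => {
        const link = new URL($(element).attr('href'), url).href;
        if (link.startsWith(startUrl) && !visited.has(link)) {
          queue.push(link);
        }
      });

      console.log('Crawled:', url);
    } catch (error) {
      console.error('Failed to fetch', url, error.message);
    }
  }
}

crawlSite();
```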
Is web scraping easy?
It is both easy and hard. Paid web scraping services make it easy; it can be hard when the target site has strong anti-scraping measures.
Darko founded WhatToBecome.com, a comprehensive career guidance platform for beginners in various popular fields. With a focus on remote working scenarios, workplace technology, emerging trends, and common challenges, Darko shares his valuable experiences and insights with our readers here on Techjury. Through his informative articles, Darko equips readers with the necessary knowledge and wisdom to thrive in their professional endeavors.