How to Crawl a Site Without Getting Blocked?

Written by Antonello Zanini · Edited by April Grace Asgapo
Updated · Jan 24, 2024
When crawling a site, you are likely to get blocked. The reason is that the more pages your automated software visits, the more it exposes itself to the anti-bot systems adopted by the site. This increases the chances that they will identify and block its requests.
The question is, is there a way to avoid those blocks? Yes, of course, there is! In this article, you will see the most effective techniques you can implement in your web crawler to avoid getting blocked. 
But first, let's look at what web crawling is and why you need to worry about anti-bot solutions!

What Is Web Crawling?

Web crawling refers to the process of programmatically exploring the Internet to discover new pages. Search engines like Google use it to index all public pages available on the Web. Another popular use case of web crawling is web scraping. In this case, crawling is applied to a specific site to discover all pages of interest. For example, you might use it to find the URLs of all products in a particular category on an e-commerce platform. Learn more about the differences between web crawling vs. web scraping.
Usually, a web crawler is an automated script that requires only one or more URLs as input. To discover new pages, it keeps following new links until it finds all the pages of interest. As sites evolve over time, it is critical to perform web crawling frequently.
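As a reference, here is a minimal Python sketch of that idea, built on the requests and Beautiful Soup libraries. The starting URL, page limit, and same-domain filter are illustrative assumptions, not a production-ready crawler:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def crawl(start_url, max_pages=50):
    # Queue of URLs to visit and set of URLs already seen
    to_visit = [start_url]
    visited = set()
    domain = urlparse(start_url).netloc

    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)

        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")

        # Keep following links, restricting the crawl to the same site
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).netloc == domain and link not in visited:
                to_visit.append(link)

    return visited

# "https://example.com" is a placeholder starting URL
pages = crawl("https://example.com")
print(f"Discovered {len(pages)} pages")
```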

Why You May Get Blocked While Crawling a Site

“The world’s most valuable resource is no longer oil, but data” states a 2017 article from The Economist. This should not surprise you, as some of the most valuable companies in the world are tech giants whose core business is data. Now, everyone knows how valuable data is. 
Companies want to protect their data at all costs, even if it is publicly accessible on their site. Data is money, and you cannot give it away for free. That is why more and more websites are adopting anti-bot technologies. Their goal is to prevent bots from flooding a site with requests and performing malicious actions such as stealing data.
Note that a scraping script, whether working on a single page or visiting multiple pages, is automated software. In other words, it is a bot. Therefore, anti-bot measures can detect and block its requests, preventing it from accessing the site.
When targeting a single page, you can use specific workarounds to bypass the anti-scraping systems in place. However, when targeting numerous pages, as with web crawling, you need a different approach. The site can monitor how your script behaves across its pages, which gives it more chances to mark it as malicious. Plus, different pages can have different anti-bot measures, which makes everything even harder.
That is why you need to implement various general techniques to avoid getting blocked when crawling a site.

Top 5 Techniques to Avoid Blocks

Let's take a look at some of the most effective techniques you can adopt in your crawling logic to avoid getting blocked.

1. Use Real-World Headers

Anti-bot systems, both simple and advanced, focus on incoming requests. Specifically, they analyze the HTTP headers to determine whether the request is legitimate or not. How is this possible? Well, take a look at the example below.
This is the User-Agent header automatically set by a recent version of Chrome on Windows:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 
The above string contains information about the OS version, device type, and browser the request originates from.
In contrast, this is the default User-Agent set by requests, one of the most popular Python HTTP client libraries:
python-requests/2.31.0
As you can see, it is not difficult to tell which of the two requests comes from an automated script. 
User-Agent is typically the main header anti-bot solutions monitor, but they may inspect other headers as well. The easiest way to avoid getting blocked because of that is to set the real-world headers used by browsers. That way, your automated requests will appear to come from a browser.
Usually, HTTP client libraries allow you to use custom headers. Visit a site like HTTPBin to find out what headers your browser sets by default. Then, set them in your HTTP client.
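For instance, here is a hedged sketch of how you might set browser-like headers with requests in Python. The header values below are examples copied from a Chrome session, and https://example.com is a placeholder target; replace both with what your own browser and project actually use:

```python
import requests

# Example headers taken from a real Chrome session; check a service like
# HTTPBin to see the headers your own browser sends and copy those instead.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

# "https://example.com" is a placeholder target URL
response = requests.get("https://example.com", headers=headers, timeout=10)
print(response.status_code)
```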

2. Randomize Your Requests

When a request fails because of an anti-bot measure, you cannot expect to repeat it as is and get a different result. That is why it is so important to randomize requests in your code, especially after a failure. First, take into account that some requests will fail. So, implement retry logic with random timeouts. Second, you have to make sure that the new request will have different headers than the previous one. 
To make crawling requests more difficult to track, you should apply randomization logic to each request. For example, you could randomly pick the header values from a list of real-world values. Each automated request executed by your script will now appear as coming from a different device.
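Below is a minimal sketch of both ideas in Python with requests: a small, hypothetical pool of real-world User-Agent strings, a retry loop, and random delays between attempts. Extend the pool with values collected from real browsers:

```python
import random
import time
import requests

# Hypothetical pool of real-world User-Agent strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/117.0",
]

def fetch_with_retries(url, max_retries=3):
    for attempt in range(max_retries):
        # Pick a different User-Agent on every attempt
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        try:
            response = requests.get(url, headers=headers, timeout=10)
            if response.ok:
                return response
        except requests.RequestException:
            pass
        # Wait a random amount of time before retrying
        time.sleep(random.uniform(1, 5))
    return None
```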
Keep in mind that randomizing requests is only the first step toward avoiding blocks based on fingerprinting. More complex anti-bot technologies do not focus only on headers but also monitor the IP the requests come from. That is not something you can change in code, as it depends on the machine the crawling script runs on. The solution? A web proxy!

3. Use Premium Proxies

A web proxy acts as an intermediary between your application and the target site. When routing a request through a proxy server, the following happens:
  1. The request made by your application is intercepted by the proxy server
  2. The proxy server forwards the request to the destination server
  3. The destination server responds with the desired data to the proxy server
  4. The proxy server forwards the response back to your application 
In other words, the target site will see your requests as coming from the proxy server. In particular, its tracking system will see the IP and location of the proxy server, not yours. This is a great mechanism to protect privacy and ensure anonymity.
When it comes to web crawling, proxies are essential to visit several pages without exposing your IP. If you have access to a pool of proxies, you can distribute your requests over them to visit different pages in parallel. That will lead to improved performance. 
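As a simple illustration, this Python sketch routes each requests call through a randomly chosen proxy. The proxy URLs are placeholders; replace them with the endpoints and credentials your provider gives you:

```python
import random
import requests

# Placeholder proxy endpoints (e.g., "http://username:password@host:port")
PROXY_POOL = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]

def fetch_through_proxy(url):
    proxy = random.choice(PROXY_POOL)
    proxies = {"http": proxy, "https": proxy}
    # The target site will see the proxy's IP, not yours
    return requests.get(url, proxies=proxies, timeout=10)

response = fetch_through_proxy("https://example.com")
print(response.status_code)
```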
There are many proxy providers online, but not all of them are reliable. As a rule of thumb, stay away from free proxies: they are short-lived and raise data harvesting concerns. Trying all the premium proxy providers would take months and cost you a lot of money. Forget about that and go directly for the best solution, Bright Data.
Bright Data’s proxies are available in over 195 countries, offer IP rotation, and guarantee 99.9% uptime and success rate. Overall, that is one of the largest, fastest, and most effective proxy infrastructures on the market. Learn more about Bright Data's proxy services.

4. Avoid Honeypots

In the world of bot prevention, a honeypot is a trap intentionally left on a site to spot automated behavior. For example, it may be one or more invisible links. Human users visiting the site in a browser will not be able to see them. Thus, they will never click those links. However, a crawling script that parses the HTML content of web pages will treat them like any other link. When the crawler follows those links, it gets recognized as a bot and blocked. 
In some cases, the primary goal of honeypot traps is not to block bots. Developers might create fake websites or sections of an existing site, make them attractive to scrapers, and build advanced tracking systems. As a result, they can collect data about bots to study their behavior and train anti-bot solutions. 
As a general rule, steer clear of invisible or suspicious links and sites to avoid getting blocked.
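As a rough example of that rule, the Python sketch below uses Beautiful Soup to keep only links that are not hidden through the hidden attribute or common inline CSS tricks. This is a simplified heuristic, not a complete honeypot defense:

```python
from bs4 import BeautifulSoup

def visible_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Skip links marked with the HTML "hidden" attribute
        if anchor.has_attr("hidden"):
            continue
        # Skip links hidden via common inline CSS tricks
        style = (anchor.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            continue
        links.append(anchor["href"])
    return links
```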

5. Use a Headless Browser

Crawling scripts usually rely on an HTTP client to retrieve the HTML content of a page, which is then fed to an HTML parser. Next, you can use the API offered by the parsing library to get the data of interest from the DOM tree.
That process is quite different from what happens when a human user visits a webpage. In this case, the browser performs the HTTP request to the specified URL and then renders the HTML content returned by the server. 
So, parsers do not render HTML documents, while browsers do. Actually, only browsers can render HTML pages and execute their JavaScript. Anti-bot solutions exploit that to introduce JavaScript challenges that only browsers can overcome.
The solution? A headless browser! If you are not familiar with this technology, it is nothing more than a controllable browser that comes with no UI. Popular web browsers like Chrome and Firefox support headless mode. Libraries like Selenium or Playwright allow you to instruct a headless browser to perform specific actions on a page via code, simulating user interaction.
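For example, here is a minimal Playwright sketch in Python that loads a page in headless Chromium and reads its fully rendered HTML. The URL is a placeholder, and the snippet assumes Playwright and its browser binaries are installed ("pip install playwright" followed by "playwright install chromium"):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch Chromium with no visible UI
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    # At this point the page is fully rendered, JavaScript included
    html = page.content()
    print(html[:200])
    browser.close()
```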
The other benefit of using a headless browser is that you can forget about setting real-world headers. Also, you will not be stopped by sites that use JavaScript for data retrieval or DOM manipulation. 

Conclusion

In this article, you learned what web crawling is and why companies want to prevent it. You now know the best techniques to implement to avoid getting blocked.
Keep in mind that no matter how sophisticated your crawling logic is, a complete anti-bot solution like a WAF can still block you. The best way to get around it is Bright Data's Web Unlocker, an all-in-one solution with CAPTCHA solving, IP rotation, and JavaScript rendering capabilities.
Thanks for reading! We hope you found this article useful!
