Google Search is one of the world’s top search engines, handling almost 7 billion search queries daily.
It contains billions of pages, indexed on a cloud of clustered databases. This data can be valuable for a variety of purposes which we’ll cover in the following sections.
In this tutorial, you’ll learn how to scrape Google Search results. The guide covers how to use Python to parse the search results and extract the necessary data. Read on.
Types Of Data That Can Be Scraped On Google Search
The Google Search results page is rich in information. Many factors – search term, location, device, etc. – determine how it will look on the user’s screen.
What’s more, Google Search has evolved tremendously since the late 90s. If, at first, you could only navigate links, now the engine also provides different types of data such as images, videos, books, etc.
As such, the key elements you can get from scraping Google Search pages are:
1. Search Results
The search page lists results ordered by relevance. Each result has a link, a title, and a short description, if available. This data can be easily scraped.
2. Related Searches
Apart from the search results, Google Search also shows related search queries. This information can be handy for generating different yet relevant search term ideas.
3. Ranking
Another key element you can get when you scrape Google Search data is ranking, meaning the position of each result on the page.
It helps understand the relevance of the content and is one of the main metrics SEO specialists take into account when implementing their strategies.
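As a minimal sketch of how ranking can be recorded, the snippet below attaches a 1-based position to each result in the order it appeared on the page. The add_ranks helper and the sample entries are hypothetical and only illustrate the idea; the title and url keys mirror the fields used later in this tutorial.

```python
def add_ranks(results, start=1):
    """Return copies of the result dicts with a 1-based 'rank' field added.

    `results` is assumed to be ordered as it appeared on the results page.
    """
    return [
        dict(result, rank=position)
        for position, result in enumerate(results, start=start)
    ]


# Hypothetical sample data, not real search results.
sample = [
    {"title": "First result", "url": "https://example.com/a"},
    {"title": "Second result", "url": "https://example.com/b"},
]

ranked = add_ranks(sample)
for item in ranked:
    print(item["rank"], item["title"])
```

Because the helper returns copies, the original list stays untouched, and you can compare ranks between runs to see how positions change over time.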
4. Featured Snippets
For some specific and frequent search queries, Google Search also provides a short snippet. It’s a way of answering the question in a more efficient and dynamic way with the help of definitions, lists, tables, etc.
5. Filtered Search Results
Search results can also be filtered using types of results: Images, Videos, News, Maps, Books, Finance, etc. This feature can be useful for fine-tuning and narrowing down the information.
Why is This Data Important and Useful?
Scraping Google Search results can support a variety of revenue-generating processes, such as market research, competitor analysis, product development, search engine optimization (SEO), content creation, and ad campaign analysis.
By using the scraped data wisely, businesses can improve their operations and gain a competitive edge.
Scraping Google Search Results
Now that we have a bit more information about Google Search and its various features, let’s see how to scrape it.
In this section, you’ll learn how to set up a Google Search scraper in Python, using “Python” as the example search term.
You’ll begin with installing the necessary dependencies and setting up the environment. Afterward, you’ll explore more complex elements such as making HTTP requests, parsing, pagination, and more.
1. Setting up the environment
Scraping Google Search Results starts with setting up the environment. Here’s how you can do it:
- Visit the official website to download and install the latest version of Python.
- Run the following command in your terminal to install the necessary dependencies:
    pip install requests bs4
This command will install the requests and Beautiful Soup modules, which are necessary to proceed. The requests module is needed to send GET requests, whereas Beautiful Soup will come in handy when parsing the data.
2. Making HTTP requests
Start by making use of the requests library to send a GET request to Google Search. First, import the libraries:
    import requests
    from bs4 import BeautifulSoup
Then, you can use the get() method to make a request:
    search_term = "Python"
    response = requests.get("https://www.google.com/search?q={}".format(search_term))
    print(response.status_code)
This code stores the result in the response object and prints its status code. If the request succeeds, the output is 200.
Yet, you’re likely to receive a 429 Too Many Requests status code, as Google identifies this request as coming from a robot.
In this case, you need to adjust the code so that Google does not treat the request as automated.
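Whatever adjustment you make, it helps to detect a 429 and slow down instead of retrying immediately. The following is a minimal sketch of an exponential backoff, not part of the original tutorial: fetch_with_backoff and its parameters are hypothetical names, and fetch is assumed to be any zero-argument callable that returns an object with a status_code attribute, such as a lambda wrapping requests.get. Backing off does not make the request look less automated; it only avoids sending more requests while you are being throttled.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call `fetch()` until it returns a non-429 response or attempts run out.

    `fetch` is any zero-argument callable returning an object with a
    `status_code` attribute, e.g. `lambda: requests.get(url)`.
    The delay doubles after every 429: base_delay, 2 * base_delay, ...
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch()
        if response.status_code != 429:
            return response
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return response  # still 429 after the final attempt
```

With the request from the previous step, a call could look like fetch_with_backoff(lambda: requests.get(url)). The sleep parameter exists only so the waiting can be replaced in tests.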
3. Adding a User-Agent
You can bypass the above error temporarily by adjusting the User-Agent. A User-Agent tells Google Search which browser and device you are accessing from.
Go back to the part where you defined the search term and add a User-Agent to your request headers as follows:
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    headers = {
        "User-Agent": ua
    }
    search_term = "Python"
    page = "https://www.google.com/search?q={}".format(search_term)
    response = requests.get(page, headers=headers)
As you can see, the get() method takes an additional parameter called headers, which accepts a dict of custom request headers such as User-Agent.
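Search terms often contain spaces or special characters, which should be escaped before they go into the URL. As a rough sketch, the hypothetical helper build_search_request below assembles the URL with urlencode from the standard library and returns it together with the headers dict. The helper name and the sample search term are assumptions for illustration; the User-Agent string is the one from the example above.

```python
from urllib.parse import urlencode

# User-Agent string taken from the example above.
DEFAULT_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
)


def build_search_request(search_term, user_agent=DEFAULT_UA):
    """Return (url, headers) for a Google Search request.

    urlencode() escapes spaces and special characters in the search term.
    """
    url = "https://www.google.com/search?" + urlencode({"q": search_term})
    headers = {"User-Agent": user_agent}
    return url, headers


url, headers = build_search_request("web scraping & python")
print(url)
```

The returned values can be passed straight to requests.get(url, headers=headers).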
4. Adding Proxies
A custom User-Agent might work for small searches, but when you start going big, Google Search will start to block the request again.
That’s why when scraping search results, you’ll have to use proxies and rotate them frequently to bypass the CAPTCHA protection.
Doing all these things manually will be cumbersome, and few companies approach it like that. Instead, you can leverage premium proxy services to do all the hard work for you.
There are numerous solutions out there but one of the most highly recommended ones is Oxylabs Web Unblocker. It’s specifically designed with difficult targets in mind and is super easy to implement.
As they currently offer a free trial, we’ll use it to thoroughly test your scraper.
Once you sign up and get a sub-account credential for the Web Unblocker Proxy, you can use your proxy by making simple changes to the existing code.
    username, password = "USERNAME", "PASSWORD"
    proxies = {
        "http": "http://{}:{}@unblock.oxylabs.io:60000".format(username, password),
        "https": "http://{}:{}@unblock.oxylabs.io:60000".format(username, password),
    }
Let’s see what happened in the code above. You’ve created two variables with your sub-account username & password. You also created the proxy dict using those credentials.
After doing that, you can pass the proxies to the get() method as another additional parameter.
    # verify=False turns off TLS certificate verification for this request
    response = requests.get(page, headers=headers, proxies=proxies, verify=False)
And that’s it! Web Unblocker will automatically rotate the proxies for you, so you don’t have to do it manually.
However, you can also rotate your proxies manually by obtaining a list of proxy IPs and configuring them to switch at random or at fixed intervals.
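Manual rotation could be sketched as follows, assuming you already have a list of proxy URLs. The proxy addresses are hypothetical placeholders, and as_proxies, round_robin, and pick_random are illustrative helper names rather than part of any library. The first strategy cycles through the pool in a fixed order, while the second picks a proxy at random for each request.

```python
import itertools
import random

# Hypothetical placeholder proxies; replace them with addresses you can use.
PROXY_POOL = [
    "http://USERNAME:PASSWORD@proxy1.example.com:8080",
    "http://USERNAME:PASSWORD@proxy2.example.com:8080",
    "http://USERNAME:PASSWORD@proxy3.example.com:8080",
]


def as_proxies(proxy_url):
    """Build the dict that requests expects for its `proxies` argument."""
    return {"http": proxy_url, "https": proxy_url}


def round_robin(pool):
    """Yield proxies dicts in a fixed, repeating order."""
    for proxy_url in itertools.cycle(pool):
        yield as_proxies(proxy_url)


def pick_random(pool):
    """Return a proxies dict for a randomly chosen proxy."""
    return as_proxies(random.choice(pool))


rotation = round_robin(PROXY_POOL)
first = next(rotation)   # uses the first proxy in the pool
second = next(rotation)  # uses the second proxy in the pool
```

Each dict can be passed to requests.get() through the proxies parameter, one per request.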
5. Parsing the Search Results
Parsing means transforming the structure of the scraped data to make it readable. Once you make the HTTP request and get a response, you need to find a way to get data into a usable format.
When you scrape with Python, your solution will be the Beautiful Soup library. It’s a Python library often implemented to extract data from HTML.
To start, add the following line:
    soup = BeautifulSoup(response.content, "html.parser")
The soup object contains the parsed Google Search result HTML page. Now, use a browser to inspect the HTML page.
You’ll notice all the search results are wrapped in a div with a common class g. So, let’s grab those divs using the find_all method:
    results = soup.find_all("div", {"class": "g"})
Now, you can simply iterate over the divs and parse all the search results one by one.
    data = []
    for result in results:
        title_elem = result.find("h3")
        link_elem = result.find("a", href=True)
        # Skip blocks that lack a title or a link
        if not title_elem or not link_elem:
            continue
        title = title_elem.text
        url = link_elem.attrs["href"]
        description = ""
        elem = result.find("div", {"class": "lEBKkf"})
        if elem:
            span = elem.find("span")
            if span:
                description = span.text
        print(title, url, description)
        data.append({
            "title": title,
            "url": url,
            "description": description,
        })
The search results are stored in the data list, where each result is a dict containing the title, url, and description.
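Once the results are collected, you will probably want to store them somewhere. The sketch below, which is not part of the original tutorial, writes a list of result dicts as CSV using only the standard library. The write_results helper and the sample row are hypothetical; the column names match the keys of the dicts built above.

```python
import csv
import io


def write_results(rows, handle):
    """Write result dicts to an open text file (or file-like object) as CSV."""
    writer = csv.DictWriter(handle, fieldnames=["title", "url", "description"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample in the same shape as the `data` list.
sample = [
    {"title": "Example", "url": "https://example.com", "description": "A sample row"},
]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_results(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, open a file with open("results.csv", "w", newline="", encoding="utf-8") and pass the file object as handle.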
📝 Note: If you want to make the process easier, you can automate parsing by using third-party tools. Check out our list of the 10 most effective data parsing tools that you can choose from.
6. Handling Pagination
To grab more than ten search results, you’ll have to use pagination. Google Search provides a table of the next pages that you can utilize for this purpose.
You’ll first need to grab the table using the code below:
    pages = []
    table = soup.find("table", {"class": "AaVjTc"})
    if table:  # the pagination table may be missing from the response
        tds = table.find_all("td")
        pages = [td.find("a")["href"] for td in tds if td.find("a")]
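The collected hrefs are expected to be relative paths, and the same page may be linked more than once. As an optional sketch, the hypothetical helper absolute_page_urls below turns them into absolute URLs with urljoin and drops duplicates while keeping the original order. The sample hrefs are made up to show the assumed shape and are not taken from a real response.

```python
from urllib.parse import urljoin


def absolute_page_urls(hrefs, base="https://www.google.com"):
    """Turn relative pagination hrefs into absolute URLs.

    Duplicates are dropped and the original order is kept.
    """
    seen = set()
    urls = []
    for href in hrefs:
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical hrefs in the shape the pagination table is assumed to return.
sample_hrefs = [
    "/search?q=Python&start=10",
    "/search?q=Python&start=20",
    "/search?q=Python&start=10",
]
page_urls = absolute_page_urls(sample_hrefs)
print(page_urls)
```

If you use this helper, request the returned URLs directly rather than prefixing them with the domain again.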
Now, you can scrape data from all these pages by iterating and parsing each page as below:
    for page in pages:
        response = requests.get("https://www.google.com{}".format(page), headers=headers, proxies=proxies, verify=False)
        soup = BeautifulSoup(response.content, "html.parser")
        results = soup.find_all("div", {"class": "g"})
        for result in results:
            title_elem = result.find("h3")
            link_elem = result.find("a", href=True)
            # Skip blocks that lack a title or a link
            if not title_elem or not link_elem:
                continue
            title = title_elem.text
            url = link_elem.attrs["href"]
            description = ""  # reset for every result
            elem = result.find("div", {"class": "lEBKkf"})
            if elem:
                span = elem.find("span")
                if span:
                    description = span.text
            print(title, url, description)
            data.append({
                "title": title,
                "url": url,
                "description": description,
            })
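An alternative to reading the pagination table is to build the page URLs yourself. The sketch below assumes that Google Search accepts a start query parameter as a result offset and shows ten results per page. Both are assumptions that are not covered in this tutorial and may change, and paged_search_urls is a hypothetical helper name.

```python
from urllib.parse import urlencode


def paged_search_urls(search_term, pages=3, per_page=10):
    """Build search URLs for consecutive result pages using a `start` offset."""
    base = "https://www.google.com/search?"
    return [
        base + urlencode({"q": search_term, "start": page * per_page})
        for page in range(pages)
    ]


urls = paged_search_urls("Python")
for url in urls:
    print(url)
```

Each URL can then be requested and parsed in the same way as the first page.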
Conclusion
Scraping Google Search data can be an amazing opportunity to tap into information about market trends, competitors, customer needs, product insights, etc. It can help you gain valuable insights and get a competitive edge.
By using the techniques described in this article, you can easily scrape Google search data and use it to make great business decisions.