How to scrape Booking.com?

Reading time: 7 min read

Written by Artem Minaev · Updated Nov 17, 2023

Edited by Girlie Defensor


In the era of big data, scraping web content has become a significant process in data collection. Many industries, including the hotel sector, can significantly benefit from web scraping, a method that fetches key data from websites. 

This tutorial delves into using Python to scrape data from Booking.com. Such a task is crucial for anyone needing to track competition or monitor the influx of new properties in specific areas. 

Requests vs. Selenium 

This tutorial will use Selenium over traditional HTTP request-based libraries like requests due to Selenium’s ability to handle dynamic content. Booking.com is JavaScript-heavy, meaning the page's content is rendered dynamically. Libraries like requests can fetch the HTML of the page, but they aren’t able to interact with JavaScript-rendered content. 

Note that you’ll be fetching data from all the search results pages, meaning you also need to click the Next button. This is where Selenium comes in. It allows interactions with the web page, making it an ideal tool for this tutorial. 

Depending on many factors, you may need to set up user agents, use more complex headers, and route your requests through a proxy to avoid getting blocked. In many cases, rotating different user agents does the trick, but in other situations, proxy servers are a lifesaver. If you find yourself needing proxies, Oxylabs offers some of the best-quality proxy servers on the market.
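As a rough sketch of user-agent rotation (the header strings below are illustrative examples, not a current or complete list), you could pick one at random and hand it to Chrome through Selenium's options:

```python
import random

# Illustrative user-agent strings; rotate these to look less like a bot.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
]

def pick_user_agent():
    """Return a randomly chosen user-agent string."""
    return random.choice(USER_AGENTS)

# With Selenium, the chosen value would be applied via Chrome options, e.g.:
# options = webdriver.ChromeOptions()
# options.add_argument(f"user-agent={pick_user_agent()}")
# driver = webdriver.Chrome(service=driver_service, options=options)
print(pick_user_agent())
```
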

Set the environment 

Open your terminal, create a new directory, and then create a virtual environment:

Shell

$ mkdir scraper 

$ cd scraper 

$ python3 -m venv venv 

Now activate this virtual environment and install the dependencies: 

Shell

$ source venv/bin/activate 

$ pip install selenium webdriver-manager 

The Selenium library provides the API for controlling the browser driver. The webdriver-manager library installs and updates the Chrome driver directly from code, which would otherwise be a manual process.

Data points to be extracted 

While following this tutorial, you’ll collect the following data points: 

  • Name of the hotel 
  • Rating 
  • Review count 
  • Address of the hotel 
  • Price 

You can easily customize this scraper to collect any other data points you may need. 

Quick Overview of CSS selectors 

Before you delve into the actual scraping, let's discuss CSS selectors, a crucial concept in web scraping. CSS selectors are patterns used to select elements you want to style on a web page. They can select elements based on their ID, class, attribute, or relative position in the HTML document. Compared to XPath, CSS selectors are generally preferred due to their readability, simplicity, and speed.

You can build CSS selectors by following the document structure and identifying unique attributes or combinations that point to the element you want to access. For example, div[data-type="hotel-card"] will select all the div elements with an attribute data-type set to hotel-card.

To create the selectors, open the following page in Chrome: 

https://www.booking.com/searchresults.html?ss=New+York&checkin=2023-12-31&checkout=2024-01-01

This URL opens the search results for hotels in New York, with check-in on New Year's Eve 2023 and check-out on January 1, 2024.
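If you later want to build this URL for other cities or dates, the query string can be assembled with Python's standard library. A small sketch using the same parameter names as the URL above:

```python
from urllib.parse import urlencode

def build_search_url(city, checkin, checkout):
    """Assemble a Booking.com search URL from its query parameters."""
    params = {"ss": city, "checkin": checkin, "checkout": checkout}
    return "https://www.booking.com/searchresults.html?" + urlencode(params)

url = build_search_url("New York", "2023-12-31", "2024-01-01")
print(url)
# → https://www.booking.com/searchresults.html?ss=New+York&checkin=2023-12-31&checkout=2024-01-01
```
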

Once the page loads, right-click the hotel name and select Inspect. It opens the developer tools, where you can build and test your CSS selector. 


You can press Ctrl + F (Windows) or Cmd + F (macOS) on your keyboard while the Developer Tools are open, which lets you search the page with the constructed CSS selector expression. After inspecting the page, you'll find that the following CSS selectors work best for the required data points:

Element             CSS Selector
name                div[data-testid="title"]
review information  [data-testid="review-score"]
price               [data-testid="price-and-discounted-price"]
address             [data-testid="address"]

The review information element contains both the rating and the review count.
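For example, the element's text usually arrives as three newline-separated parts (the sample below is made up; the exact wording varies by hotel), which is why the scraper splits it on "\n" later on:

```python
# Made-up sample of the review element's text:
# score, verbal rating, review count, one per line.
review_text = "8.5\nVery Good\n1,279 reviews"

# Unpack the three lines, discarding the verbal rating.
review_score, _, review_count = review_text.split("\n")
print(review_score)   # → 8.5
print(review_count)   # → 1,279 reviews
```
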

Handling pagination 

For pagination, we can look for the next page button. The following CSS selector matches the Next page button:

button[aria-label="Next page"] 

With this in mind, let's move on to writing the code. 

Approach to web scraping 

To effectively scrape Booking.com, you'll have to perform the following steps: 

  1. Initialize the Selenium web driver 
  2. Open the Booking.com search results page 
  3. Determine the total number of result pages 
  4. Extract the desired data from each hotel listing 
  5. Navigate through all the result pages and repeat step 4 

To get a clearer picture, let's start with the skeleton of our code:

Python

def init_driver():
    pass

def get_total_pages(driver):
    pass

def extract_hotel_info(driver):
    pass

def extract_hotel_data(hotel):
    pass

def navigate_pages(driver, total_pages):
    pass

def main():
    pass

if __name__ == "__main__":
    main()

These are the core functions that will make up your script. You'll flesh out each of these functions by following the next tutorial steps. 

Building the functions 

Initializing the driver 

The function below sets up the Selenium web driver for you. The ChromeDriverManager automatically downloads the driver binary required for Selenium to interact with Chrome.

Python

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def init_driver():
    driver_service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=driver_service)

Getting the total number of pages 

This function fetches the total number of pages in the search results. 

This step is important as it determines the number of times the loop should run: 

Python

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def get_total_pages(driver):
    try:
        total_pages = int(
            driver.find_element(
                By.CSS_SELECTOR, 'div[data-testid="pagination"] li:last-child'
            ).text
        )
    except NoSuchElementException as e:
        print("Error finding total pages: ", e)
        total_pages = 0
    return total_pages

Extracting the hotel container 

First, you need to extract the container that holds one hotel listing. Once you have this, you can run a loop over each item and extract individual hotel information. 

This function extracts the data of all hotels on the current page. It first targets all hotel cards using a CSS selector and then extracts the relevant data from each hotel. 

Instead of using multiple try/except blocks, we'll use the contextlib library in this code, which tells the Python runtime that NoSuchElementException should be suppressed.
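On its own, suppress behaves like a try/except block that silently ignores the named exception. A quick standalone sketch, using KeyError as a stand-in for NoSuchElementException:

```python
from contextlib import suppress

results = []
for item in ({"name": "Hotel A"}, {}, {"name": "Hotel B"}):
    # Items missing the key raise KeyError, which suppress swallows,
    # so the append is simply skipped for them.
    with suppress(KeyError):
        results.append(item["name"])

print(results)  # → ['Hotel A', 'Hotel B']
```
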

Python

from contextlib import suppress

def extract_hotel_info(driver):
    data = []
    all_hotels = driver.find_elements(
        By.CSS_SELECTOR, 'div[data-testid="property-card"]'
    )
    for hotel in all_hotels:
        with suppress(NoSuchElementException):
            result = extract_hotel_data(hotel)
            data.append(result)
    return data

Extracting individual hotel information 

Once you have an HTML block containing exactly one hotel's information, you can use the following function to extract the data you're looking for.

Note that this is the most critical function, which does the job of scraping data. If you need more data points, this is where you make changes: 

Python

def extract_hotel_data(hotel):
    result = {}
    result["name"] = hotel.find_element(
        By.CSS_SELECTOR, 'div[data-testid="title"]'
    ).text
    review_score, _, review_count = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="review-score"]'
    ).text.split("\n")
    result["review_score"] = review_score
    result["review_count"] = review_count
    result["price"] = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="price-and-discounted-price"]'
    ).text
    result["address"] = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="address"]'
    ).text
    return result

Lastly, you also need to access other search results pages. The following code snippet achieves this: 

Python

import time

def navigate_pages(driver, total_pages):
    try:
        decline_cookies = driver.find_element(
            By.CSS_SELECTOR, '[id="onetrust-reject-all-handler"]'
        )
        decline_cookies.click()
    except NoSuchElementException:
        print("No cookies to decline.")
    data = []
    for _ in range(total_pages)[:2]:  # limit to first 2 pages for this example
        data.extend(extract_hotel_info(driver))
        try:
            next_page_btn = driver.find_element(
                By.CSS_SELECTOR, 'button[aria-label="Next page"]'
            )
            next_page_btn.click()
            time.sleep(5)  # wait for the next page to load
        except NoSuchElementException as e:
            print("Error finding next page button: ", e)
            break
    return data

Note that the pages are limited to 2 just for this example. You can delete the [:2] to collect the data from all the result pages.
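The [:2] works because Python's range objects support slicing; a quick sketch of that behavior:

```python
pages = range(13)        # e.g. 13 result pages
print(list(pages[:2]))   # → [0, 1]  (only the first two iterations run)
print(len(pages))        # → 13
```
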

Exporting data 

If you want to export the data to a CSV file, you can add a function as follows (it uses the standard csv module, so add import csv at the top of your script):

Python

import csv

def export_data(data):
    csv_file = "property_data.csv"
    with open(csv_file, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
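To preview what DictWriter produces without writing a file, you can target an in-memory buffer instead (the sample rows below are made up, shaped like the scraper's output):

```python
import csv
import io

# Made-up sample rows shaped like the scraper's output dictionaries.
data = [
    {"name": "Hotel A", "review_score": "8.5", "review_count": "1,279 reviews",
     "price": "$150", "address": "Manhattan, New York"},
    {"name": "Hotel B", "review_score": "7.9", "review_count": "845 reviews",
     "price": "$120", "address": "Brooklyn, New York"},
]

# io.StringIO stands in for the file handle used in export_data().
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=data[0].keys())
writer.writeheader()
writer.writerows(data)
print(buffer.getvalue())
```

Fields that contain commas (like the review count) are quoted automatically by the csv module.
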

Executing all the functions 

Finally, the main() function sets up the driver, fetches the total number of pages, scrapes all hotels, and exports the scraped data into a CSV file: 

Python

def main():
    url = (
        "https://www.booking.com/searchresults.html"
        "?ss=New+York&checkin=2023-12-31&checkout=2024-01-01"
    )
    driver = init_driver()
    driver.get(url)
    total_pages = get_total_pages(driver)
    print(f"Total pages: {total_pages}")
    data = navigate_pages(driver, total_pages)
    driver.quit()
    export_data(data)
    print(data)

if __name__ == "__main__":
    main()

The complete Booking.com scraper script looks like this: 

Python

import csv
import time
from contextlib import suppress

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

def init_driver():
    driver_service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=driver_service)

def get_total_pages(driver):
    try:
        total_pages = int(
            driver.find_element(
                By.CSS_SELECTOR, 'div[data-testid="pagination"] li:last-child'
            ).text
        )
    except NoSuchElementException as e:
        print("Error finding total pages: ", e)
        total_pages = 0
    return total_pages

def extract_hotel_info(driver):
    data = []
    all_hotels = driver.find_elements(
        By.CSS_SELECTOR, 'div[data-testid="property-card"]'
    )
    for hotel in all_hotels:
        with suppress(NoSuchElementException):
            result = extract_hotel_data(hotel)
            data.append(result)
    return data

def extract_hotel_data(hotel):
    result = {}
    result["name"] = hotel.find_element(
        By.CSS_SELECTOR, 'div[data-testid="title"]'
    ).text
    review_score, _, review_count = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="review-score"]'
    ).text.split("\n")
    result["review_score"] = review_score
    result["review_count"] = review_count
    result["price"] = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="price-and-discounted-price"]'
    ).text
    result["address"] = hotel.find_element(
        By.CSS_SELECTOR, '[data-testid="address"]'
    ).text
    return result

def navigate_pages(driver, total_pages):
    try:
        decline_cookies = driver.find_element(
            By.CSS_SELECTOR, '[id="onetrust-reject-all-handler"]'
        )
        decline_cookies.click()
    except NoSuchElementException:
        print("No cookies to decline.")
    data = []
    for _ in range(total_pages)[:2]:  # limit to first 2 pages for this example
        data.extend(extract_hotel_info(driver))
        try:
            next_page_btn = driver.find_element(
                By.CSS_SELECTOR, 'button[aria-label="Next page"]'
            )
            next_page_btn.click()
            time.sleep(5)  # wait for the next page to load
        except NoSuchElementException as e:
            print("Error finding next page button: ", e)
            break
    return data

def export_data(data):
    csv_file = "property_data.csv"
    with open(csv_file, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

def main():
    url = (
        "https://www.booking.com/searchresults.html"
        "?ss=New+York&checkin=2023-12-31&checkout=2024-01-01"
    )
    driver = init_driver()
    driver.get(url)
    total_pages = get_total_pages(driver)
    print(f"Total pages: {total_pages}")
    data = navigate_pages(driver, total_pages)
    driver.quit()
    export_data(data)
    print(data)

if __name__ == "__main__":
    main()

Conclusion 

And there you have it: a step-by-step guide to scraping Booking.com using Python and Selenium. Following this approach, you can extend the script to scrape other dynamic websites. Remember that it's essential to respect the terms and conditions of the website you're scraping. Happy coding, and enjoy the journey!
