Wiki Guide: How to Extract Data from Wikipedia Using an API?

Written by Muninder Adavelli, Digital Growth Strategist at Techjury · Edited by April Grace Asgapo

Updated · Feb 11, 2024 · 8 min read


Wikipedia is the world’s largest online encyclopedia, holding over 59.7 million articles across languages and topics. All pages are free to access, making knowledge more accessible. This is why scrapers see the site as a “treasure trove of information.”

However, manually extracting data from multiple Wikipedia pages takes a lot of work. Going through lengthy articles can take forever before you can get the necessary information. Fortunately, there is a solution: APIs. 

Discover how to extract data from Wikipedia using an API in this article. Dive in!

🔑 Key Takeaways

  • Wikipedia API streamlines data extraction, offering a time-saving alternative to manual methods.
  • Python (version 3.6 onwards) and the Requests library are recommended tools for interacting with Wikipedia's API.
  • Given Wikipedia's high traffic, ethical data extraction is crucial. Users should limit requests, prevent excessive traffic, and credit the source appropriately.

Wikipedia Data Extraction Using an API

Getting data from Wikipedia manually can be challenging and tedious for scrapers because of the sheer volume of pages on the site. That is why most of them automate the data extraction process to save time.

The good thing is that Wikipedia has its own API to help with your data extraction projects. It is free and easy to use. The following sections discuss the prerequisites and the steps for using the Wikipedia API to extract data.

Read on.

📝 Note

Using an API is different from web scraping. While both are data extraction methods, an API provides a structured way to access specific data, whereas web scraping collects data directly from a page's HTML. The two methods have distinct advantages depending on project needs.
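Here is a minimal sketch of the difference, using the same English Wikipedia endpoint that the rest of this guide uses:

import requests

# API call: the response is structured JSON that can be read directly.
api_response = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={'action': 'query', 'format': 'json', 'titles': 'Web scraping'}
)
print(api_response.json()['query']['pages'])

# Web scraping: the response is raw HTML that still has to be parsed.
html_response = requests.get('https://en.wikipedia.org/wiki/Web_scraping')
print(html_response.text[:200])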

Requirements for Extracting Wikipedia Data

Before you start extracting Wikipedia using an API, make sure you have the following prerequisites:

  1. Python - has libraries suitable for data extraction and is compatible with the Wikipedia API. Python 3.6 or later is highly recommended.

  2. PIP - Python's package manager. It installs packages on the local system.

    📝 Note: Python 2.7.9 onwards and Python 3.4 onwards come with PIP pre-installed.

  3. Requests library - a Python library for making HTTP requests. You can install it with pip install requests.

  4. Code Editor or IDE - a software application for writing or developing code. You can use any code editor of your choice.

💡 Did You Know?

The English Wikipedia is the largest Wikipedia edition, holding around 6.8 million articles and adding an average of 542 new articles daily. It is followed by the Cebuano Wikipedia, which has 6.1 million articles.

Three Ways to Extract Data From Wikipedia Using an API

Here is an illustration of the general coding process for extracting data from Wikipedia using Python and the Wikipedia API:

[Image: Coding Steps to Extract Wikipedia with an API]

There are different ways to extract data from Wikipedia since its API has numerous modules. The code for each method depends on what data you want to extract. 

Below are guides on how to extract data from Wikipedia using the API with Python, based on three types of data:

Abstract of a Wikipedia Page

You can get the gist of any Wikipedia page by extracting its abstract. An abstract gives you a preview of the topic, its key points, and other relevant ideas, saving you the tedious work of reading a lengthy article.

Below are the steps to extract the abstract of any Wikipedia article: 

  1. Import the Requests library for HTTP queries.

    import requests

  2. Define the topic. 

    subject = 'Web scraping'

  3. Call the endpoint to access Wikipedia.

    url = 'https://en.wikipedia.org/w/api.php'

🗒️ Note

Requests to Wikipedia and other Wikimedia projects go through the api.php endpoint of the MediaWiki Action API. The API reads the parameters attached to this URL and responds with the requested data.

4. Set the parameters.

params = {
    'action': 'query',
    'format': 'json',
    'titles': subject,
    'prop': 'extracts',
    'exintro': True,
    'explaintext': True,
}

5. Initiate an HTTP GET request to the Wikipedia API using the set parameters.

response = requests.get(url, params=params)

6. Parse the response data as JSON.

data = response.json()

7. Iterate over each page in the response.

for page in data['query']['pages'].values():

8. Display the extracted data on the terminal or console. For this sample, limit it to 227 characters.

    print(page['extract'][:227])

🗒️ Note

You can display all the text or data on the terminal or console. Use the following code:  print(page['extract'])

Final Code

Consolidate all the code. The final script should look like this:

import requests

subject = 'Web scraping'
url = 'https://en.wikipedia.org/w/api.php'

params = {
    'action': 'query',
    'format': 'json',
    'titles': subject,
    'prop': 'extracts',
    'exintro': True,
    'explaintext': True,
}

response = requests.get(url, params=params)
data = response.json()

for page in data['query']['pages'].values():
    print(page['extract'][:227])

Here is the abstract scraped using the code above:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. 
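The loop above assumes the page exists and has an extract. If the title is misspelled or the article is missing, the 'extract' key may be absent from the response, so a slightly more defensive version of the loop could look like this:

for page in data['query']['pages'].values():
    # A missing page comes back without an 'extract' key, so use .get()
    # and fall back to a short message instead of raising a KeyError.
    abstract = page.get('extract')
    if abstract:
        print(abstract[:227])
    else:
        print('No abstract found for "' + subject + '".')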

Number of Pages in Categories

The Wikipedia API also lets you extract how many pages are in a Wikipedia category. Knowing the number of pages helps you gauge the depth of the available information on a particular topic.

In addition, it helps researchers see how data is spread across various fields. Here's how to use the Wikipedia API to get the number of pages in a category:

1. Import the Requests library.

import requests

2. Define the topic.

subject = 'Web scraping'

3. Call the endpoint to access Wikipedia.

url = 'https://en.wikipedia.org/w/api.php'

4. Set the parameters.

params = {
    'action': 'query',
    'format': 'json',
    'titles': f'Category: {subject}',
    'prop': 'categoryinfo'
}

💡 Did You Know?

Formatted string literals (or f-strings) embed Python expressions inside string literals. They were introduced in Python 3.6 and are not available in Python 2.
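For example, this short snippet shows how the expression inside the braces is evaluated and inserted into the string:

subject = 'Web scraping'
# The variable inside the braces is substituted into the string at runtime.
print(f'Category: {subject}')   # prints: Category: Web scraping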

5. Initiate an HTTP GET request to the Wikipedia API.

response = requests.get(url, params=params)

6. Parse the response data as JSON.

data = response.json()

7. Go through all the data on every page.

for page, pages in data['query']['pages'].items():

8. Display the extracted data on the terminal or console. If no data is available, it will print “Invalid”.

    try:
        print(pages["title"] + " has " + str(pages["categoryinfo"]["pages"]) + " pages.")
    except Exception:
        print("Invalid")

Final Code

Consolidate all the code. Your final script should look like this:

import requests

subject = 'Web scraping'
url = 'https://en.wikipedia.org/w/api.php'

params = {
    'action': 'query',
    'format': 'json',
    'titles': f'Category: {subject}',
    'prop': 'categoryinfo'
}

response = requests.get(url, params=params)
data = response.json()

for page, pages in data['query']['pages'].items():
    try:
        print(pages["title"] + " has " + str(pages["categoryinfo"]["pages"]) + " pages.")
    except Exception:
        print("Invalid")

The code will produce a result like this:

Category: Web scraping has 31 pages. 
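The categoryinfo property also reports other counts, such as subcategories and files. If you want those as well, a small variation of the loop (assuming the 'subcats' and 'files' keys are present in the response, as they are for regular categories) could look like this:

for page, pages in data['query']['pages'].items():
    info = pages.get('categoryinfo', {})
    # 'pages', 'subcats', and 'files' are the counts reported by categoryinfo.
    print(pages['title'] + ': '
          + str(info.get('pages', 0)) + ' pages, '
          + str(info.get('subcats', 0)) + ' subcategories, '
          + str(info.get('files', 0)) + ' files')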

Related Topics of a Wikipedia Page

Besides the abstract and the pages in a category, you can also extract the topics related to any Wikipedia article. Knowing the associated concepts helps you better understand your main topic and gives you a clearer view of how your subject relates to other concepts.

Follow the steps below to extract the related topics from a Wikipedia page:

1. Import the Requests library for HTTP queries.

import requests

2. Define the subject. 

subject = 'Web scraping'

3. Call the endpoint to access Wikipedia.

url = 'https://en.wikipedia.org/w/api.php'

4. Set the parameters to search for pages related to the defined topic.

params = {
    'action': 'query',
    'format': 'json',
    'list': 'search',
    'srsearch': subject
}

5. Initiate an HTTP get request and set the response data in JSON format.

response = requests.get(url, params=params)

data = response.json()

6. Iterate over each title in the search results.

for titles in data['query']['search']:

7. Display the extracted data on the terminal or console.

    try:
        print(titles['title'])
    except Exception:
        print("Invalid")

Final Code

Consolidate all the code. The final script should look like this:

import requests

subject = 'Web scraping'
url = 'https://en.wikipedia.org/w/api.php'

params = {
    'action': 'query',
    'format': 'json',
    'list': 'search',
    'srsearch': subject
}

response = requests.get(url, params=params)
data = response.json()

for titles in data['query']['search']:
    try:
        print(titles['title'])
    except Exception:
        print("Invalid")

Running the code above will give you a result like this:

Web scraping
Data scraping
Web crawler
Contact scraping
Beautiful Soup (HTML parser)
Alternative data (finance)
HiQ Labs v. LinkedIn
Scrape
Proxy server
List of web testing tools
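By default, the search module returns only the first batch of results, which is why the list above stops at ten titles. If you need more related topics, the search module also accepts an srlimit parameter. Swapping the params dictionary in the final code for the one below should return up to 20 results:

params = {
    'action': 'query',
    'format': 'json',
    'list': 'search',
    'srsearch': subject,
    'srlimit': 20   # ask for up to 20 results instead of the default 10
}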

Best Practices for Wikipedia Data Extraction

Given Wikipedia's extensive database, users and scrapers flock to the website every day, generating heavy traffic that can exceed 25 billion page views in a month.

Extracting data adds to that traffic, so it is important to practice ethical data extraction. You can do that by monitoring your extraction activities and implementing the following best practices:
  • Limit your requests and be considerate. Scrape at a reasonable rate and keep your requests controlled to avoid being tagged as a possible DDoS attack. Excessive requests can cause congestion and even take a site down.

  • Batch your requests. In December 2023, Wikipedia garnered 10.7 billion page views from desktop and 14.6 billion from mobile. Such numbers create heavy traffic, so minimize your share of it by requesting multiple items in one request (see the sketch after this list).

  • Send requests in series. If you have already sent a request, wait for it to finish before sending a new one.

  • Minimize high edit rates. Also ensure that any edits are credible and of high quality. Wikipedia has millions of active users, and an unrestrained number of revisions can cause the servers to lag.

  • Give credit where credit is due. Although the data is free with no other requirements, it is best practice to include a reference to the borrowed content.

  • Authenticate your application. For apps using data from Wikipedia, authenticate requests using OAuth 2.0 client credentials or the authorization code flow. Authentication provides a secure method for logging in to a Wikipedia or Wikimedia account.
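As an example of batching, the Action API accepts several titles in a single request when they are joined with a pipe character (|). Here is a minimal sketch that fetches the intros of three pages in one call; the titles are only placeholders:

import requests

url = 'https://en.wikipedia.org/w/api.php'
titles = ['Web scraping', 'Web crawler', 'Data scraping']

params = {
    'action': 'query',
    'format': 'json',
    'titles': '|'.join(titles),   # pipe-separated titles = one combined request
    'prop': 'extracts',
    'exintro': True,
    'explaintext': True,
    'exlimit': len(titles)        # return an extract for every requested title
}

response = requests.get(url, params=params)
for page in response.json()['query']['pages'].values():
    print(page['title'] + ': ' + page.get('extract', '')[:80])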

Conclusion

Wikipedia is one of the most visited sites on the Internet. It is a massive repository of knowledge on different topics. That’s why it is a popular site for data extraction.

Manual data extraction is tedious and challenging due to the millions of pages on the website. However, the Wikipedia API makes the data extraction process automated and efficient. 

Although Wikipedia data is free and accessible, practicing ethical data extraction is still necessary. Avoid sending too many requests at once, and always credit the content you borrow.

FAQs on Wikipedia Data Extraction Using an API


What is the difference between parse and query in the Wikipedia API?

The query action fetches structured data about a wiki and its pages, such as lists, page properties, and metadata. The parse action, on the other hand, parses a single page (or raw wikitext) and returns its rendered content.
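As a rough illustration of the difference (the parameters below follow the query examples used earlier, plus the parse module's page and prop parameters), a query call and a parse call for the same article might look like this:

import requests

url = 'https://en.wikipedia.org/w/api.php'

# query: structured data about the page (here, only basic page information).
query_params = {'action': 'query', 'format': 'json', 'titles': 'Web scraping'}
query_data = requests.get(url, params=query_params).json()
print(list(query_data['query']['pages'].values())[0]['title'])

# parse: the rendered content of a single page, returned as HTML.
parse_params = {'action': 'parse', 'format': 'json', 'page': 'Web scraping', 'prop': 'text'}
parse_data = requests.get(url, params=parse_params).json()
print(parse_data['parse']['text']['*'][:200])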

What is the API limit for Wikipedia?

For personal requests, the API limit is 5,000 requests per hour. For anonymous requests, it is 500 requests per hour per IP address.
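To stay well below those limits, you can space out your calls and identify your client with a descriptive User-Agent header, which Wikimedia asks API clients to send. A minimal sketch (the application name and contact address below are placeholders):

import time
import requests

url = 'https://en.wikipedia.org/w/api.php'
# Placeholder User-Agent; replace the name and contact address with your own.
headers = {'User-Agent': 'MyWikiExtractor/0.1 (contact: you@example.com)'}

for subject in ['Web scraping', 'Web crawler']:
    params = {'action': 'query', 'format': 'json', 'titles': subject}
    response = requests.get(url, params=params, headers=headers)
    print(list(response.json()['query']['pages'].values())[0]['title'])
    time.sleep(1)   # pause between requests instead of firing them back to back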
