How to Parse HTML With RegEx?

Written by
Muninder Adavelli

Updated · Nov 17, 2023

Edited by
Girlie Defensor


Parsing HTML is a popular way of extracting relevant data from a website. There are many HTML parsing techniques available, each with different levels of complexity. 

One of them is RegEx, short for regular expression. A regular expression is a character sequence that defines a search pattern for a given text.

This article covers the steps to parse HTML with RegEx, including sample code and some best practices that you should know.

Read on.

HTML Parsing With RegEx

Regular Expression (RegEx) is often used to search, extract, or replace specific strings of text in a larger body of data, like HTML. 

Compared with a plain text search, it is more powerful and highly customizable. Below is an example of a RegEx pattern for selecting tags and their content in HTML:

<(\w+)\s*([^>]*)>(.*?)<\/\1>

Familiarizing yourself with each character sequence’s function will help make sense of a given pattern. 

Here is the breakdown:

<(\w+)\s*

  • It contains the first angle bracket (<) of the opening tag. 
  • (\w+) matches one or more word characters in the tag name. 
  • \s* matches zero or more whitespaces after the tag name. 

([^>]*)>

  • ([^>]*) matches zero or more characters that are not a closing angle bracket (>). 
  • The ^ symbol at the start of the character class negates it, so the class matches anything except >. 
  • * is a “greedy” quantifier, so the group captures all the attributes and their values. 
  • The last angle bracket (>) closes the opening tag.

(.*?)

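  • It is a non-greedy match that captures the content of the tag and stops at the first matching closing tag. 
  • By default, the dot (.) does not match line breaks, so the content must sit on a single line unless a “dot matches all” flag is set. 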

<\/\1>

  • This matches the closing tag by back-referencing the first capture group (\w+) through \1. 
  • The characters <\/ match the literal </ of the closing tag. 
  • The last angle bracket (>) closes the closing tag.
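
To see how the three capture groups line up with the breakdown, here is a minimal sketch that runs the pattern through Python's re module. The sample markup and the variable names are assumptions made up for illustration.

```python
import re

# The tag pattern discussed in the breakdown.
TAG_PATTERN = re.compile(r'<(\w+)\s*([^>]*)>(.*?)<\/\1>')

# Hypothetical sample markup, used only for illustration.
sample = '<p class="intro">Hello, world!</p>'

match = TAG_PATTERN.search(sample)
tag_name, attributes, content = match.groups()

print(tag_name)    # p
print(attributes)  # class="intro"
print(content)     # Hello, world!
```

Group 1 holds the tag name, group 2 the raw attribute text, and group 3 the content between the opening and closing tags.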

RegEx syntax may vary slightly depending on the programming language. In the next section, you will learn how to create an HTML parser with RegEx using a Python script.

How To Use RegEx To Parse HTML?

RegEx can be used in different programming languages such as JavaScript, C++, and C. 

Python supports RegEx natively through the re library, which will be used in this guide.

🎉 Fun Fact: Do you know that 2.1% of websites use JavaScript as their server-side programming language? Meanwhile, less than 0.1% of sites use C and C++.

Before getting into the actual Python script, try to get used to some sample RegEx patterns.

Basic RegEx Patterns for HTML Parsing

There is no need to memorize everything in the process. You can use the basic RegEx patterns here to parse HTML content. 

Take a look at each one.

To match the attribute name and value in a tag that has a single attribute, you can use:

<\w+\s+(\w+)=[\'"]([^\'"]*)[\'"]>

For extracting all HTML comments, use:

<!--(.*?)-->

To get all URLs from anchors, you can run:

<a[^>]*href=[\'"]([^\'"]*)[\'"][^>]*>

To extract all email addresses in an HTML file, enter:

\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
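
The four patterns above can be tried on a small inline string before pointing them at a real page. The following is a rough sketch; the sample HTML and variable names are hypothetical, and the re.DOTALL flag is an addition so that comments spanning several lines are also matched.

```python
import re

# Hypothetical sample document, used only for illustration.
sample_html = """
<!-- navigation -->
<img src="logo.png">
<a class="link" href="https://example.com/about">About</a>
<p>Contact: support@example.com</p>
<!-- footer
     notes -->
"""

attribute_pattern = r'<\w+\s+(\w+)=[\'"]([^\'"]*)[\'"]>'
comment_pattern = r'<!--(.*?)-->'
href_pattern = r'<a[^>]*href=[\'"]([^\'"]*)[\'"][^>]*>'
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

# Only tags with exactly one attribute match this pattern.
attributes = re.findall(attribute_pattern, sample_html)

# re.DOTALL lets "." match line breaks, so multi-line comments are found too.
comments = [
    c.strip() for c in re.findall(comment_pattern, sample_html, flags=re.DOTALL)]

urls = re.findall(href_pattern, sample_html)
emails = re.findall(email_pattern, sample_html)

print(attributes)  # [('src', 'logo.png')]
print(len(comments))  # 2
print(urls)        # ['https://example.com/about']
print(emails)      # ['support@example.com']
```

Note that the anchor tag is skipped by the attribute pattern because it carries two attributes, which is one reason to test each pattern against realistic input.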

Now, it’s time to move on to the suggested scripts in Python.

Setting Up RegEx HTML Parser Using Python

Say you want to extract all the URLs from anchor tags within a webpage. The first part of the code imports the re and urllib.request libraries. 

Both are part of Python’s standard library, so there’s no need to install anything. You can start with this:

import re
import urllib.request

The next part defines the function, sends the request, and decodes the response as UTF-8. That is:

def get_anchor_urls(url):
    try:
        response = urllib.request.urlopen(url)
        html_content = response.read().decode('utf-8')

Next, define the pattern variable for capturing the anchor tags’ URLs. The re.findall() function finds all non-overlapping matches in the requested HTML content and returns the captured URLs.

        pattern = r'<a[^>]*href=[\'"]([^\'"]*)[\'"][^>]*>'
        anchor_tags = re.findall(pattern, html_content)

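There is an issue here. Running the pattern on the raw HTML also captures URLs from anchor markup that appears inside script tags. 

To resolve this, add another pattern that matches whole script blocks, remove those blocks from the HTML, and run the anchor pattern on what remains.

        script_pattern = r'<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>'
        html_without_scripts = re.sub(
            script_pattern, '', html_content, flags=re.IGNORECASE)
        anchor_urls = re.findall(pattern, html_without_scripts)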

The function then returns the anchor URLs, or prints an error message and returns None if the request fails:

        return anchor_urls
    except urllib.error.URLError as e:
        print('Error retrieving webpage:', e)
        return None

To set the URL to request and call the function, add:

webpage_url = 'https://example.com/'
urls = get_anchor_urls(webpage_url)

Finally, print the results using:

if urls:
    print('Anchor URLs:')
    for url in urls:
        print(url)

Putting it all together, here’s how it should look:

import re
import urllib.error
import urllib.request


def get_anchor_urls(url):
    try:
        response = urllib.request.urlopen(url)
        # Decode the response as UTF-8
        html_content = response.read().decode('utf-8')

        # Regex pattern to match anchor tags and capture the URLs
        pattern = r'<a[^>]*href=[\'"]([^\'"]*)[\'"][^>]*>'

        # Regex pattern to match whole script blocks (uses a negative lookahead)
        script_pattern = r'<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>'

        # Remove script blocks so anchors inside them are not captured
        html_without_scripts = re.sub(
            script_pattern, '', html_content, flags=re.IGNORECASE)

        # Find all anchor URLs in the remaining HTML content
        anchor_urls = re.findall(pattern, html_without_scripts)

        return anchor_urls
    except urllib.error.URLError as e:
        print('Error retrieving webpage:', e)
        return None


# Example usage
webpage_url = 'https://example.com/'
urls = get_anchor_urls(webpage_url)

if urls:
    print('Anchor URLs:')
    for url in urls:
        print(url)

After saving and running the script, you have a basic RegEx HTML parser that you can reuse.
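
To check the extraction logic without sending a network request, the anchor and script patterns can be applied to an inline string. This is only a sketch: the helper name extract_anchor_urls and the sample HTML are hypothetical, and it assumes that script blocks should be stripped before the anchor search runs.

```python
import re

ANCHOR_PATTERN = r'<a[^>]*href=[\'"]([^\'"]*)[\'"][^>]*>'
SCRIPT_PATTERN = r'<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>'


def extract_anchor_urls(html_content):
    """Return href values of anchor tags found outside <script> blocks."""
    # Drop every script block first so anchors inside them are ignored.
    without_scripts = re.sub(
        SCRIPT_PATTERN, '', html_content, flags=re.IGNORECASE)
    return re.findall(ANCHOR_PATTERN, without_scripts, flags=re.IGNORECASE)


# Hypothetical page content, used only for illustration.
sample_html = """
<a href="https://example.com/home">Home</a>
<script>
  document.write('<a href="https://example.com/tracker">x</a>');
</script>
<a class="nav" href='/contact'>Contact</a>
"""

found = extract_anchor_urls(sample_html)
print(found)  # ['https://example.com/home', '/contact']
```

Keeping the extraction in a function that takes a string makes it easy to test against saved pages before running it on live sites.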

You can also check out data parsing tools that can help automate the process, saving you the time and effort of building your own parser. 

Best Practices for HTML Parsing

The process does not end with simply setting up a parser of your own. There are a few more things that you should remember when parsing HTML content with RegEx. 

These are:

1. Limit the use of RegEx to simple HTML files

Although powerful, RegEx is not completely geared to handle the full complexity of HTML. 

Most web browsers can render “broken” HTML, such as markup with unclosed or mismatched tags, but a RegEx pattern that expects well-formed tags will miss it.

Some pages have broken or poorly structured HTML, sometimes intentionally. This makes it hard for RegEx to match the relevant items.

Nested tags are also very hard to match, because a pattern cannot reliably pair each opening tag with its own closing tag. Creating specific RegEx patterns for them can be time-consuming and takes a lot of trial and error.
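
The nesting problem is easy to reproduce with the tag pattern from earlier in this article. In this small sketch (the sample markup is made up for illustration), the non-greedy content group stops at the first closing tag it meets, so the outer element is cut short.

```python
import re

TAG_PATTERN = re.compile(r'<(\w+)\s*([^>]*)>(.*?)<\/\1>')

# Hypothetical nested markup, used only for illustration.
nested = '<div><div>inner</div> outer</div>'

match = TAG_PATTERN.search(nested)

print(match.group(0))  # <div><div>inner</div>
print(match.group(3))  # <div>inner
```

The match pairs the outer opening tag with the inner closing tag, which is not what the markup means.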

✅Pro-tip: When parsing complex HTML, you can use the BeautifulSoup Python library. Another option is a JavaScript HTML parser using Node.js with Cheerio.
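
BeautifulSoup and Cheerio are third-party packages. As a dependency-free sketch of the same parser-based idea, the example below swaps in html.parser from Python's standard library instead. The class name AnchorCollector and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.urls.append(value)


# Hypothetical page content, used only for illustration.
sample_html = """
<div><div><a href="/nested">Nested link</a></div></div>
<script>var s = '<a href="/from-script">x</a>';</script>
<A HREF='/upper'>Upper-case tag</A>
"""

collector = AnchorCollector()
collector.feed(sample_html)
collector.close()

print(collector.urls)  # ['/nested', '/upper']
```

The parser treats the script body as plain text and normalizes tag case, so nesting, scripts, and upper-case tags need no special patterns.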

2. Use a RegEx online tester

With RegEx patterns, trial and error does not have to come from repeatedly running the code. There are free RegEx testers online.

You can choose between RegExr and RegEx101. Both tools let you test RegEx patterns against text input to correct or debug them. 

They also let you choose a RegEx flavor that matches your programming language for compatibility.

3. Respect websites’ TOS and robots.txt files

This is an important rule that you should always keep in mind to avoid any legal issues when scraping. 

A website’s Terms of Service may include specifications regarding the responsible use of data on their platform. Make sure to check each site’s TOS.

On the other hand, robots.txt files list the parts of a website that crawlers are asked not to access. You can view the file by adding /robots.txt to the site’s root URL. 

For example:

https://www.example.com/robots.txt

Going against the rules set in robots.txt can mean legal trouble, so you should be mindful of this.
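
A scraper can check these rules in code before requesting a page. Here is a minimal sketch using urllib.robotparser from Python's standard library. The rules and URLs are hypothetical and are parsed from an inline list so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

allowed_public = parser.can_fetch(
    '*', 'https://www.example.com/public/page.html')
allowed_private = parser.can_fetch(
    '*', 'https://www.example.com/private/page.html')

print(allowed_public)   # True
print(allowed_private)  # False
```

For a live site, the same class can load the file itself with set_url() followed by read() instead of parse().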

Conclusion

RegEx is just one of the ways to parse HTML content. While it is a powerful method, it can still fall short when dealing with complex HTML files.

Following the steps above lets you parse simple HTML content. However, if you are dealing with complex HTML, try other tools like BeautifulSoup or Cheerio with Node.js.

FAQs

Can you parse invalid HTML with RegEx?

Yes, but it is not recommended to do so. If you use RegEx to parse invalid HTML, expect some missing information depending on its level of irregularity.

How to use RegEx in HTML tags?

Create a RegEx pattern to match the desired HTML tag. For example, the RegEx pattern <(\w+)[^>]*> will match an HTML opening tag.
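
As a quick illustration of that pattern in Python (the sample markup is hypothetical), re.findall() returns the captured tag names of every opening tag:

```python
import re

OPENING_TAG_PATTERN = r'<(\w+)[^>]*>'

# Hypothetical sample markup, used only for illustration.
sample = '<ul class="menu"><li>One</li><li id="last">Two</li></ul>'

tag_names = re.findall(OPENING_TAG_PATTERN, sample)
print(tag_names)  # ['ul', 'li', 'li']
```

Closing tags are skipped because the slash after < is not a word character.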

Can I use RegEx in Chrome?

Yes. You can use RegEx in the Chrome DevTools when searching through the website’s sources.
