🇧🇷 Leia em Português

Counting HTML tags with HTMLParser

a blackboard with 3 chalk lines

I fell into a case where I wanted to count the tags that were present in an HTML file and I didn’t want to download any library (like BeautifulSoup) to do so. I searched online and realized I could use the HTMLParser to do that.

The problem was that I found this library to be very unintuitive and it took me forever to understand how to do that. I will explain the solution step by step but you can skip to the end to see the final result 👾

My problem with HTMLParser

The problem I had was that my first intuition was to do this:

from html.parser import HTMLParser

parser = HTMLParser()
parser.feed(html) # html is a string

…and nothing happened. I looked around and there wasn’t any method that could help me do the count. I searched online and all the tutorials and answers was telling me to create a new class, but I didn’t understand why.

After some time questioning my sanity, I realized that I was expecting HTMLParser to be just like BeautifulSoup which translates the HTML into a structure I can search on. However, HTMLParser doesn’t do that. It’s actually iterating over the HTML tags but doesn’t do anything with it. The reason why you need to implement a class to inherit from the HTMLParser is to actually implement the methods!

The reason I was not able to do anything with the parser once I fed it the HTML is because the HTML is parsed once I fed it, but there isn’t anything to do with it after it was parsed. Because I hand’t implemented anything…. I couldn’t see anything!

What I needed to do is actually implement a class and everytime I found a new tag, I would increase a counter… something like this:

count_h1 = 0 

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
    	if tag == 'h1':
        	count_h1 += 1

Solution

Finally I decided to use a defaultdict so I could count every tag once it appeared. The final solution was this:

from html.parser import HTMLParser
from collections import defaultdict

class MyHTMLParser(HTMLParser):
    def __init__(self):
        self.count = defaultdict(int)
        super().__init__()

    def handle_starttag(self, tag, attrs):
        self.count[tag] += 1

    def handle_startendtag(self, tag, attrs):
        self.count[tag] += 1

def count_tags(html):
  parser = MyHTMLParser()
  parser.feed(html)
  return parser.count

The handle_starttag investigates tags that have an opening and a closing tag (like <h1></h1>) while the handle_startendtag is used in tags that don’t have a closing argument (like <link />).

Result

If I take this html:

html = """
<html>
  <head>
    <link rel="stylesheet" type="text/css" href="style.css"/>
  </head>
  <body>
        <nav class='navbar navbar-dark bg-dark'>
            <div class='ms-auto'>
                <a href="/smart/notes" class="btn btn-outline-light me-1">Home</a>
                <a href="/smart/notes/new" class="btn btn-outline-light me-1">Create</a>
                <a href="/logout" class="btn btn-outline-light me-1">Logout</a>
            </div>
        </nav>
        <div class="my-5 text-center container">
            <h1 class="my-5">These are the notes:</h1>
                <div class="row row-cols3 g-2">
                    <div class="col">
                        <div class="p-3 border">
                            <a href="/smart/notes/1" class="text-dark text-decoration-non"><h3>An unique note title</h3></a>
                                some text
                        </div>
                    </div>
                    <div class="col">
                        <div class="p-3 border">
                            <a href="/smart/notes/2" class="text-dark text-decoration-non"><h3>Anoter note</h3></a>
                                another text
                        </div>
                    </div>
                </div>
        </div>
    </body>
</html>
"""

And pass it to the function we just created, the result will be:

tags = count_tags(html)
print(tags) # defaultdict(<class 'int'>, {'html': 1, 'head': 1, 'link': 1, 'body': 1, 'nav': 1, 'div': 7, 'a': 5, 'h1': 1, 'h3': 2})

And I can access any HTML tag to see the count:

print(tags['html']) # 1

And because we used a defaultdict, we can actually try to get an HTML tag that isn’t there, and it won’t fail:

print(tags['h6']) # 0

Photo by Miguel Á. Padriñán


Cheers!
Letícia

Comments