How to extract the id attribute values of an element from HTML

I'm trying to work out the overhead of auto-naming ASP.NET server elements. I have a page that contains 7000 lines of HTML rendered from hundreds of nested ASP.NET controls, many of which have id / name attributes that are hundreds of characters long.

What I would ideally like is to extract every HTML attribute value that starts with "ctl00" into a list. The regex search feature in Notepad++ would be ideal, if I knew what the regex should be.

As an example, given this HTML:
<input name="ctl00$Header$Search$Keywords" type="text" maxlength="50" class="search" />

I would like the output to be something like:
name="ctl00$Header$Search$Keywords"

A more advanced search might include the element name (e.g. the control type):
input | name="ctl00$Header$Search$Keywords"

To deal with both the id and name attributes, I'll just rerun the search looking for id instead of name (i.e. I don't need something that searches for both at the same time).

The end result will be an Excel report that lists the number of server controls on the page and the length of the name of each, possibly sorted by control type.

0




4 answers


Answering my own question: the easiest way to do this is to use BeautifulSoup, a Python parser for "dirty" HTML whose tagline is:

"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."



It works and it's available here - http://crummy.com/software/BeautifulSoup
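To make the parser-based approach concrete, here is a minimal sketch. It does not use BeautifulSoup itself; to stay dependency-free it swaps in Python's standard-library html.parser, which is also fairly tolerant of messy markup. The class name Ctl00Collector and the helper collect are made up for illustration. It records the element name, attribute name, value and value length for every attribute whose value starts with "ctl00".

```python
from html.parser import HTMLParser


class Ctl00Collector(HTMLParser):
    """Collects (tag, attribute, value, length) for values with a given prefix.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """

    def __init__(self, prefix="ctl00"):
        super().__init__()
        self.prefix = prefix
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        for name, value in attrs:
            # value is None for attributes written without a value.
            if value is not None and value.startswith(self.prefix):
                self.rows.append((tag, name, value, len(value)))


def collect(html, prefix="ctl00"):
    """Parse the HTML text and return the collected rows."""
    parser = Ctl00Collector(prefix)
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    sample = ('<input name="ctl00$Header$Search$Keywords" type="text" '
              'maxlength="50" class="search" />')
    for tag, attr, value, length in collect(sample):
        print(f'{tag} | {attr}="{value}" ({length} chars)')
```

Because the element name comes along with each row, the same list can be grouped or sorted by control type before it goes into a spreadsheet.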

0




Quick and dirty: search for

\w+\s*=\s*"ctl00[^"]*"

This will match any text that looks like an attribute, e.g. name="ctl00test" or attr = "ctl00longer text". It won't check whether the match actually occurs inside an HTML tag, which is a little more difficult to do and perhaps not necessary. It also won't handle escaped quotes in the attribute value. As usual with regular expressions, the complexity required depends on what exactly you want to match and what your input looks like.
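If you would rather run the search from a script than from the editor, the same pattern works unchanged in Python's re module. This is only a sketch; the function name find_ctl00_attributes is made up for illustration, and the limitations described above (no check that the match is inside a tag, no escaped quotes) still apply.

```python
import re

# The pattern from the answer above, unchanged.
ATTR_RE = re.compile(r'\w+\s*=\s*"ctl00[^"]*"')


def find_ctl00_attributes(html):
    """Return every attribute-like match as it appears in the text."""
    return ATTR_RE.findall(html)


if __name__ == "__main__":
    sample = ('<input name = "ctl00$Header$Search$Keywords" '
              'id="ctl00_Header_Search_Keywords" type="text" class="search" />')
    for match in find_ctl00_attributes(sample):
        print(match)
```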

+1




"7000"? "Hundreds"? Dear God.

Since you're just looking at the source in a text editor, try this: /(id|name)="ct[^"]*"/
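For the length report mentioned in the question, the id/name pattern can be turned into rows that Excel will open as CSV. This is a rough sketch under a few assumptions: the capture group around the value is added here so its length can be measured, and the names length_report and to_csv are hypothetical.

```python
import csv
import io
import re

# The id/name pattern from the answer, with a capture group added around the value.
ID_OR_NAME_RE = re.compile(r'(id|name)="(ct[^"]*)"')


def length_report(html):
    """Return (attribute, value, length) rows, longest value first."""
    rows = [(m.group(1), m.group(2), len(m.group(2)))
            for m in ID_OR_NAME_RE.finditer(html)]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["attribute", "value", "length"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = '<input id="ctl00_A" name="ctl00$Header$Search" />'
    print(to_csv(length_report(sample)))
```

Note that this pattern, like the one it is based on, does not allow spaces around the equals sign.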

0




I suggest XPath, as in this question

-1

