Python: SGMLParser cannot get line number
I wrote a simple class that inherits from SGMLParser. Its purpose is to collect all the links from an HTML page and print the line number on which each link appears.
The class looks like this:
from sgmllib import SGMLParser

class HtmlParser(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.links = []

    # Called by SGMLParser for every opening <a> tag
    def start_a(self, attr):
        href = [v for k, v in attr if k == "href"]
        self.links.append(href[0])
        print(self.getpos())
The problem is that getpos() returns (1, 0) for every link. For example, if you run the following code:
parser = HtmlParser()
parser.feed('''
    <!DOCTYPE html>
    <html>
        <head lang="en">
            <meta charset="UTF-8">
            <title></title>
        </head>
        <body>
            <a href="www.foo-bar.com"></a>
            <a href="http://foo.bar.com"></a>
            <a href="www.google.com"></a>
        </body>
    </html>''')
parser.close()
print(parser.links)
The output will be:
(1, 0)
(1, 0)
(1, 0)
['www.foo-bar.com', 'http://foo.bar.com', 'www.google.com']
Question: why can't I get the line number for links?
1 answer
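You can't get the line number because sgmllib doesn't keep track of it. SGMLParser has a getpos() method, but the parser never updates the position while it reads the input, so getpos() always reports the initial value (1, 0).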
Alternatively, you can use HTMLParser like this:
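# Python 2 module name; in Python 3 use: from html.parser import HTMLParser
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def reset(self):
        HTMLParser.reset(self)
        self.links = []

    # Called for every opening tag, so filter for <a> explicitly
    def handle_starttag(self, tag, attr):
        if tag == 'a':
            href = [v for k, v in attr if k == "href"]
            self.links.append(href[0])
            print(self.getpos())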
parser = MyHTMLParser()
parser.feed('''
    <!DOCTYPE html>
    <html>
        <head lang="en">
            <meta charset="UTF-8">
            <title></title>
        </head>
        <body>
            <a href="www.foo-bar.com"></a>
            <a href="http://foo.bar.com"></a>
            <a href="www.google.com"></a>
        </body>
    </html>''')
parser.close()
print(parser.links)
This prints the expected (line, column) position of each link:
(9, 12)
(10, 12)
(11, 12)
['www.foo-bar.com', 'http://foo.bar.com', 'www.google.com']
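On Python 3, where sgmllib has been removed and HTMLParser lives in html.parser, the same idea might look roughly like the sketch below. The class name LinkCollector, the choice to store (href, line) tuples, and the skipping of anchors without an href are illustrative assumptions and not part of the code above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, line_number) pairs for every <a> tag that has an href.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; anchors without href are skipped
        href = dict(attrs).get("href")
        if href is not None:
            # getpos() returns (line, column) of the tag currently being handled
            line, _column = self.getpos()
            self.links.append((href, line))


html = (
    "<html>\n"
    "<body>\n"
    '<a href="www.foo-bar.com"></a>\n'
    "<a name='no-href'></a>\n"
    '<a href="http://foo.bar.com"></a>\n'
    "</body>\n"
    "</html>"
)

collector = LinkCollector()
collector.feed(html)
collector.close()
print(collector.links)
# [('www.foo-bar.com', 3), ('http://foo.bar.com', 5)]
```

Storing the line next to each link, instead of printing it inside the handler, keeps the positions available after parsing has finished.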