HTML parsing using Python

I need to parse a web page and extract some values from it, so I wrote a Python parser like this:

from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print "Data     :", data

f=open("result.html","r")
s=f.read()
parser = MyHTMLParser()
parser.feed(s)

      

The program reads the HTML file and prints the text data it contains.

I fed it the following HTML fragment, and here the parser works fine:

<tr class='trmenu1'>
<td>Marks Obtained: </td><td colspan=1>75.67 Out of 100</td>
</tr>
<tr class='trmenu1'>
<td>GATE Score: </td><td colspan=1>911</td>
</tr>
<tr class='trmenu1'>
<td>All India Rank: </td><td colspan=1>34</td>
</tr>

      

After parsing the above HTML, the output is:

Data:

Data: Marks Obtained:
Data: 75.67 Out of 100
Data:

Data:

Data:

Data: GATE Score:
Data: 911
Data:

Data:

Data:

Data: All India Rank:
Data: 34

But the parser has to read a larger file, and the fragment above is only a small part of it. The file is too big to paste here, so I uploaded it to the following link: http://www.mediafire.com/?dsgr1gdjvs59c7c When the larger file is fed in, the parser does not read all the entries and leaves blank entries in the output. Part of the output is shown below:

Data: Syllaby

Data:

Data: GATE Score

Data:

Data: GATE results

Data:

Notice the blank entry on the line below "GATE Score", which was 911 in the previous output.

The parser works fine with the small fragment but not with the large file. Why is this happening? I am using Python 2.7.

+3




2 answers


My preferred solution for parsing HTML or XML is lxml and XPath.

A quick and dirty example using XPath:

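from lxml import etree

# Read the whole page and let lxml's forgiving HTML parser build a tree.
with open('result.html', 'r') as f:
    doc = etree.HTML(f.read())

# Select every <tr class="trmenu1"> and print the text of its <td> children.
for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
    print tr.xpath('./td/text()')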

      



Output:

['Registration Number: ', ' CS 2047103']
['Name of the Candidate: ', 'PATIL SANTOSH KUMARRAO        ']
['Examination Paper: ', 'CS - Computer Science and Information Technology']
['Marks Obtained: ', '75.67 Out of 100']
['GATE Score: ', '911']
['All India Rank: ', '34']
['No of Candidates Appeared in CS: ', '156780']
['Qualifying Marks for CS: ', '\r\n\t\t\t\t\t']
['General', 'OBC ', '(Non-Creamy)', 'SC / ST / PD ']
['31.54', '28.39', '21.03 ']

      

This code builds an ElementTree from the HTML data. Using XPath, it selects all <tr> elements whose class attribute is "trmenu1". Then, for each <tr>, it selects and prints the text of its child <td> elements.
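The raw text nodes still carry trailing spaces and, in one row, nothing but whitespace ('\r\n\t\t\t\t\t'). As a rough sketch of how that could be tidied, the variant below strips each text node and drops the empty ones. It is written for Python 3 and uses a small inline fragment modelled on the rows from the question (wrapped in a <table> so the same XPath applies); the variable names are only illustrative.

```python
from lxml import etree  # third-party: pip install lxml

# Inline sample modelled on the question's fragment (an assumption, not the full page).
html = """
<table>
<tr class='trmenu1'>
<td>Marks Obtained: </td><td colspan=1>75.67 Out of 100</td>
</tr>
<tr class='trmenu1'>
<td>GATE Score: </td><td colspan=1>911</td>
</tr>
<tr class='trmenu1'>
<td>All India Rank: </td><td colspan=1>34</td>
</tr>
</table>
"""

doc = etree.HTML(html)

rows = []
for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
    # Strip surrounding whitespace from every text node in the row's cells.
    cells = [text.strip() for text in tr.xpath('./td/text()')]
    # Keep only the cells that still contain something after stripping.
    rows.append([cell for cell in cells if cell])

for row in rows:
    print(row)
```

Each entry in rows is then a clean [label, value] pair, which is easier to post-process than the raw lists.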

+7




If you take a close look at the HTML page on mediafire, you will notice that there are two text blocks that contain "GATE Score":

 line 162: <tr><td class='qlink4' background='webimages/blkbuttona3.jpg' onMouseOut="background='webimages/blkbuttona3.jpg'" onMouseOver="background='webimages/blkbuttonb3.jpg'">&nbsp;<a class="dark2" href="gscore.php" title="GATE Score">GATE Score</a></td></tr>

 line 192: <tr class='trmenu1'><td>GATE Score: </td><td colspan=1>911</td></tr>

      



The problem you are having is probably caused by malformed markup somewhere in the full HTML page you are trying to parse, which is why you only see one "GATE Score" entry followed by its value.
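One way to check this is to feed just those two lines to the standard-library parser on their own. The sketch below does that in Python 3, where the module is html.parser (in Python 2.7 the import is from HTMLParser import HTMLParser, and that older parser may behave differently). TextCollector is a made-up helper name; it stores non-blank text instead of printing it.

```python
from html.parser import HTMLParser  # Python 2.7: from HTMLParser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects non-blank text nodes instead of printing them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only nodes, which show up as empty "Data:" lines
            self.chunks.append(text)


# The two lines quoted above from the full page.
snippet = (
    "<tr><td class='qlink4' background='webimages/blkbuttona3.jpg' "
    "onMouseOut=\"background='webimages/blkbuttona3.jpg'\" "
    "onMouseOver=\"background='webimages/blkbuttonb3.jpg'\">&nbsp;"
    "<a class=\"dark2\" href=\"gscore.php\" title=\"GATE Score\">GATE Score</a></td></tr>\n"
    "<tr class='trmenu1'><td>GATE Score: </td><td colspan=1>911</td></tr>\n"
)

collector = TextCollector()
collector.feed(snippet)
collector.close()  # flush any text buffered after the last tag
print(collector.chunks)
```

In isolation both occurrences and the value 911 are collected, which suggests that whatever trips the parser sits elsewhere in the page rather than in these two lines.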

As suggested in the comments, use BeautifulSoup, which is more tolerant of malformed HTML.
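A minimal sketch of that approach, assuming the beautifulsoup4 package is installed and using Python 3 with the built-in "html.parser" backend. The inline fragment is modelled on the rows from the question, and the results dictionary is just one illustrative way to hold the label/value pairs.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Inline sample modelled on the question's fragment (an assumption, not the full page).
html = """
<table>
<tr class='trmenu1'>
<td>Marks Obtained: </td><td colspan=1>75.67 Out of 100</td>
</tr>
<tr class='trmenu1'>
<td>GATE Score: </td><td colspan=1>911</td>
</tr>
<tr class='trmenu1'>
<td>All India Rank: </td><td colspan=1>34</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

results = {}
for row in soup.find_all("tr", class_="trmenu1"):
    # get_text(strip=True) returns the cell text without surrounding whitespace.
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == 2:  # only simple "label, value" rows
        label, value = cells
        results[label.rstrip(":")] = value

print(results["GATE Score"])
```

For the real page, the html string would be replaced by the contents of result.html; rows with more or fewer than two cells are skipped here and would need their own handling.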

+2

