Parsing fixed format data embedded in HTML in python

Question

I am using the Google App Engine urlfetch API

from google.appengine.api import urlfetch

to get the web page. The result of

result = urlfetch.fetch("http://www.example.com/index.html")

holds the HTML content as a string (in result.content). The data I want to parse is not actually marked up as HTML, so I don't think a Python HTML parser will work for me. I need to parse all the text in the body of the HTML document. The problem is that urlfetch returns the entire HTML document as a single line, with all newlines and extra spaces stripped.

EDIT: OK, I tried a different URL, and apparently urlfetch does not strip newlines; it was the original web page I was trying to parse that served its HTML this way. END EDIT

If the document looks something like this:

<html><head></head><body>
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
...
A4A       288        AAA
</body></html>

result.content will be this, after urlfetch fetches it:

'<html><head></head><body>AAA 123 888 2008-10-30 ABCBBB 987     2009-01-02 JSE...A4A     288            AAA</body></html>'

Using an HTML parser won't help me with data between body tags, so I was going to use regular expressions to parse my data, but as you can see, the last part of one line is concatenated with the first part of the next line, and I don't know how to split it. I tried

result.content.split('\n')

and

result.content.split('\r')

but in both cases the resulting list had only one item. I don't see any parameter in the urlfetch.fetch function that would stop it from removing newlines.
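One way to check whether the fetched content contains any line terminators at all is to compare str.split with str.splitlines. The sketch below uses a hard-coded stand-in string instead of a real urlfetch response, so the name content is only illustrative.

```python
# Stand-in for result.content; the real value would come from urlfetch.fetch().
content = (
    "<html><head></head><body>"
    "AAA 123 888 2008-10-30 ABC"
    "BBB 987     2009-01-02 JSE"
    "</body></html>"
)

# split() returns a one-item list when the separator never occurs in the string.
pieces = content.split("\n")

# splitlines() breaks on \n, \r and \r\n (among others), covering both attempts.
lines = content.splitlines()

# True only if the server actually sent line breaks.
has_line_breaks = len(lines) > 1
```

If has_line_breaks is False, the line structure has to be recovered from the record width instead.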

Any ideas how I can analyze this data? Maybe I need to get it in a different way?

Thanks in advance!

0

python html google-app-engine parsing html-content-extraction

BrianH 03 jan. 09 at 20:32

5 answers

The only suggestion I can think of is to parse it as if it had fixed-width columns. Newlines are insignificant whitespace in HTML, so a page can legitimately omit them.

If you are in control of the original data, put it in a text file, not HTML.
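As a rough sketch of the fixed-width idea, a single 26-character record can be sliced by column offsets. The offsets and the field names (code, first, second, date, tag) are assumptions inferred from the sample rows in the question, not a documented layout.

```python
# Hypothetical column layout inferred from the sample rows: (name, start, end).
FIELDS = [
    ("code", 0, 3),
    ("first", 4, 7),
    ("second", 8, 11),
    ("date", 12, 22),
    ("tag", 23, 26),
]

def parse_record(record):
    """Slice one fixed-width record into a dict; blank columns become None."""
    parsed = {}
    for name, start, end in FIELDS:
        value = record[start:end].strip()
        parsed[name] = value or None
    return parsed

row = parse_record("BBB 987     2009-01-02 JSE")
```

Slicing by position, instead of splitting on spaces, keeps empty columns aligned with their names.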

+2

Jimmy2Times 03 jan. 09 at 20:36

When you have body text as a single long line, you can break it up like this. This assumes that each entry is 26 characters long.

body = "AAA 123 888 2008-10-30 ABCBBB 987     2009-01-02 JSE...A4A     288            AAA"
for i in range(0, len(body), 26):
    line = body[i:i+26]
    # parse the line
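Because a single stray character shifts every following record, it may be worth checking that the body length is an exact multiple of the record width before slicing. This is a minimal sketch; the helper name split_records and the 26-character width are assumptions based on the sample data.

```python
RECORD_WIDTH = 26  # assumed record length, taken from the sample rows

def split_records(body, width=RECORD_WIDTH):
    """Cut body into width-sized records, refusing input that does not divide evenly."""
    if len(body) % width != 0:
        raise ValueError(
            "body length %d is not a multiple of %d" % (len(body), width)
        )
    return [body[i:i + width] for i in range(0, len(body), width)]

records = split_records(
    "AAA 123 888 2008-10-30 ABC"
    "BBB 987     2009-01-02 JSE"
    "A4A     288            AAA"
)
```

Raising an error early is a design choice here; silently accepting a short final record would also work if trailing padding is known to be unreliable.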

+1

S.Lott 04 jan. '09 at 12:18

EDIT: Reading comprehension is a desirable thing. I missed the part about the lines being run together with no separator between them, which is the whole point of the question. So disregard my answer; it is not relevant.

If you know that each row consists of 5 space-separated columns, then (once you've removed the html) you could do something like (untested):

def generate_lines(datastring):
    while datastring:
        # maxsplit=5 gives at most 6 items: 5 columns plus the remainder
        splitresult = datastring.split(' ', 5)
        if len(splitresult) > 5:
            datastring = splitresult[5]
        else:
            datastring = None
        yield splitresult[:5]

for line in generate_lines(data):
    process_data_line(line)

Of course, you can change the delimiter and the number of columns as needed (perhaps even passing them to the generator function as additional parameters) and add error handling where required.

0

Jeff shannon 04 jan. '09 at 1:08

Further suggestions for splitting a string s into 26-character blocks:

As a list:

>>> [s[x:x+26] for x in range(0, len(s), 26)]
['AAA 123 888 2008-10-30 ABC',
 'BBB 987     2009-01-02 JSE',
 'A4A     288            AAA']

As a generator:

>>> for line in (s[x:x+26] for x in range(0, len(s), 26)): print line
AAA 123 888 2008-10-30 ABC
BBB 987     2009-01-02 JSE
A4A     288            AAA

Replace range() with xrange() in Python 2.x if s is very long.

0

akaihola Jan 27. 09:00

Roberto liffredo · Accepted Answer · 2009-01-03T21:13:07+0000

I understand that the document format is the one you posted. In this case, I agree that a parser like Beautiful Soup might not be the best solution.

I am assuming you are already extracting the data of interest (between the BODY tags) with a regex like

import re
data = re.findall(r'<body>([^<]*)</body>', result.content)[0]

then it should be as simple as:

width = 5  # chunk width; set this to the real record length
start = 0
while start < len(data):
    # slice one chunk; the last slice may be shorter than width
    print data[start:start + width]
    start += width

(note: I have not tested this code for edge cases and I expect it to fail. This is only here to show the general idea)
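Putting the two steps together, a hedged sketch of the whole approach might look like the following. The function name extract_records, the 26-character width, and the choice to drop any line breaks the server does send are all assumptions for illustration; html is taken to be a text string such as result.content.

```python
import re

# Assumed record width, based on the sample rows in the question.
RECORD_WIDTH = 26

# DOTALL lets '.' match newlines; IGNORECASE accepts <BODY> as well as <body>.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.DOTALL | re.IGNORECASE)

def extract_records(html, width=RECORD_WIDTH):
    """Return the fixed-width records found between the body tags."""
    match = BODY_RE.search(html)
    if match is None:
        return []
    # Drop any line breaks in case the server does send them.
    data = match.group(1).replace("\r", "").replace("\n", "")
    return [data[i:i + width] for i in range(0, len(data), width)]

html = (
    "<html><head></head><body>"
    "AAA 123 888 2008-10-30 ABC"
    "BBB 987     2009-01-02 JSE"
    "</body></html>"
)
records = extract_records(html)
```

Removing line breaks before slicing means the same function handles pages served with or without newlines.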
