Parsing fixed-format data embedded in HTML in Python
I am using the Google App Engine API
from google.appengine.api import urlfetch
to fetch a web page. The result of
result = urlfetch.fetch("http://www.example.com/index.html")
holds the HTML as a string in result.content. The data I want to parse is not actually marked up as HTML, so I don't think a Python HTML parser will work for me. I need to parse all the text in the body of the HTML document. The only problem is that urlfetch returns the entire HTML document as a single line, with all newlines and extra spaces stripped.
EDIT: OK, I tried a different URL, and apparently urlfetch does not strip newlines; it was the original web page I was trying to parse that served the HTML file this way ... END EDIT
If the document looks something like this:
<html><head></head><body>
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
...
A4A 288 AAA
</body></html>
result.content will be this, after urlfetch fetches it:
'<html><head></head><body>AAA 123 888 2008-10-30 ABCBBB 987 332 2009-01-02 JSE...A4A 288 AAA</body></html>'
Using an HTML parser won't help me with data between body tags, so I was going to use regular expressions to parse my data, but as you can see, the last part of one line is concatenated with the first part of the next line, and I don't know how to split it. I tried
result.content.split('\n')
and
result.content.split('\r')
but in both cases the resulting list had only one item. I don't see any parameter in Google's urlfetch function that would keep the newlines.
Any ideas how I can analyze this data? Maybe I need to get it in a different way?
Thanks in advance!
I understand that the document format is the one you posted. In this case, I agree that a parser like Beautiful Soup might not be the best solution.
I assume you are already extracting the data of interest (the text between the BODY tags) with a regex like
import re
data = re.findall(r'<body>([^<]*)</body>', result.content)[0]
then it should be as simple as:
start = 0
end = 5
while end < len(data):
    print(data[start:end])
    # the next chunk starts exactly where this one ended
    start = end
    end = end + 5
print(data[start:])
(note: I have not tested this code for edge cases and I expect it to fail. This is only here to show the general idea)
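To make the idea above a little more concrete, here is a minimal sketch that pulls the text out of the body with a regex and cuts it into fixed-width pieces. The helper names extract_body and chunks are made up for illustration, and the width of 26 characters is an assumption read off the sample rows in the question, not something the page guarantees.

```python
import re

def extract_body(html):
    """Return the text between <body> and </body>, or '' if there is none."""
    match = re.search(r'<body[^>]*>(.*?)</body>', html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ''

def chunks(data, width):
    """Yield consecutive slices of data, each width characters long.

    The final slice may be shorter if len(data) is not a multiple of width.
    """
    for start in range(0, len(data), width):
        yield data[start:start + width]

# Hypothetical sample modelled on the question, with the "..." rows left out.
html = ('<html><head></head><body>'
        'AAA 123 888 2008-10-30 ABC'
        'BBB 987 332 2009-01-02 JSE'
        'A4A 288 AAA</body></html>')

rows = list(chunks(extract_body(html), 26))
for row in rows:
    print(row)
```

This only works if every row really has the same width; a single row of a different length shifts every slice after it.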
EDIT: Reading comprehension is a desirable thing. I missed the part about the lines running together with no separator between them, which is rather the whole point of the question, isn't it? So disregard my answer; it is not relevant to this case.
If you know that each row consists of 5 space-separated columns, then (once you've removed the html) you could do something like (untested):
def generate_lines(datastring):
    while datastring:
        # split off at most 5 columns; a 6th element is the unparsed rest
        splitresult = datastring.split(' ', 5)
        if len(splitresult) > 5:
            datastring = splitresult[5]
        else:
            datastring = None
        yield splitresult[:5]

for line in generate_lines(data):
    process_data_line(line)
Of course, you can change the separator character and the number of columns as needed (perhaps even pass them to the generator function as additional parameters) and add error handling as needed.
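Because the rows in the question run together with no separator, a plain space split glues the last column of one row to the first column of the next. One possible workaround is to match whole records with a regex instead. The sketch below assumes every record looks like the two full sample rows (three alphanumerics, two 3-digit numbers, an ISO-style date, three uppercase letters); the pattern and the name find_records are illustrative guesses, not a description of the real data.

```python
import re

# Assumed record layout, taken from the sample rows in the question:
# 3 alphanumerics, two 3-digit numbers, a YYYY-MM-DD date, 3 uppercase letters.
RECORD = re.compile(
    r'([A-Z0-9]{3}) (\d{3}) (\d{3}) (\d{4}-\d{2}-\d{2}) ([A-Z]{3})'
)

def find_records(data):
    """Return a list of 5-tuples, one per record matched in data."""
    return RECORD.findall(data)

# Hypothetical input: the body text from the question without the "..." rows.
data = 'AAA 123 888 2008-10-30 ABCBBB 987 332 2009-01-02 JSEA4A 288 AAA'
records = find_records(data)
for record in records:
    print(record)
```

Note that the short trailing row (A4A 288 AAA) does not fit the assumed pattern and is silently skipped, so it would need its own handling.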
Further suggestions for splitting a string s into 26-character blocks:
As a list:
>>> [s[x:x+26] for x in range(0, len(s), 26)]
['AAA 123 888 2008-10-30 ABC',
 'BBB 987 332 2009-01-02 JSE',
 'A4A 288 AAA']
As a generator:
>>> for line in (s[x:x+26] for x in range(0, len(s), 26)): print(line)
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
A4A 288 AAA
Replace range() with xrange() in Python 2.x if s is very long.
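Once the text is cut into 26-character rows, each row can be taken apart by position rather than by searching for spaces. The following is a rough sketch; the field names and column offsets are assumptions inferred from the sample rows, so adjust them to the real layout.

```python
# Assumed (name, start, end) offsets, read off the 26-character sample rows.
# The field names are hypothetical; the question does not say what they mean.
FIELDS = [
    ('code', 0, 3),
    ('first', 4, 7),
    ('second', 8, 11),
    ('date', 12, 22),
    ('tag', 23, 26),
]

def parse_row(row):
    """Map one fixed-width row to a dict of field name -> text."""
    return dict((name, row[start:end]) for name, start, end in FIELDS)

row = 'BBB 987 332 2009-01-02 JSE'
parsed = parse_row(row)
print(parsed['date'])
```

A row shorter than 26 characters, such as the last one in the sample, yields empty or truncated strings for the missing columns, so it is worth checking len(row) before trusting the result.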