Is there a more Pythonic way to combine two HTML header lines with colspans?
I am using BeautifulSoup in Python to parse some HTML. One of the problems I am dealing with is that colspans differ across the header lines. (In my lingo, header lines are the rows that need to be concatenated to get the column headers.) That is, a single cell can span multiple columns above or below it, and words have to be appended or prepended based on the spanning. Below is a routine that does this. I use BeautifulSoup to pull the colspans and the contents of each cell in each row. longHeader is the contents of the header line with the most items, and spanLong is a list of the colspans of each item in that line (shortHeader and spanShort are the counterparts for the shorter line). It works, but it doesn't look very Pythonic.
Also, it won't work if the diff is < 0. I can fix that with the same approach I used to get this far, but before I do, I am wondering if someone can take a quick look and suggest a more Pythonic approach. I have been a SAS programmer for a long time, so I struggle to break the mold: I tend to write code as if I were writing a SAS macro.
longHeader=['','','bananas','','','','','','','','','','trains','','planes','','','','']
shortHeader=['','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
combinedHeader=[]
sumSpanLong=0
sumSpanShort=0
spanDiff=0
longHeaderCount=0

for each in range(len(shortHeader)):
    sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
    sumSpanShort=sumSpanShort+spanShort[each]
    spanDiff=sumSpanShort-sumSpanLong
    if spanDiff==0:
        combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
        longHeaderCount=longHeaderCount+1
        continue
    for i in range(0,spanDiff):
        combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
        longHeaderCount=longHeaderCount+1
        sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
        spanDiff=sumSpanShort-sumSpanLong
        if spanDiff==0:
            combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
            longHeaderCount=longHeaderCount+1
            break

print combinedHeader
You are actually doing a lot of things in this example.
- You have over-processed the Beautiful Soup Tag objects into lists. Leave them as Tags.
- All of these kinds of merge algorithms are hard. It helps to treat the two things being merged symmetrically.
Here's a version that should work directly with the Beautiful Soup Tag objects. Also, this version does not assume anything about the lengths of the two rows.
def merge3( row1, row2 ):
    i1= 0
    i2= 0
    result= []
    while i1 != len(row1) or i2 != len(row2):
        if i1 == len(row1):
            # row1 is exhausted, so take the remaining cells from row2
            result.append( ' '.join(row2[i2].contents) )
            i2 += 1
        elif i2 == len(row2):
            # row2 is exhausted, so take the remaining cells from row1
            result.append( ' '.join(row1[i1].contents) )
            i1 += 1
        else:
            if row1[i1]['colspan'] < row2[i2]['colspan']:
                # Fill extra cols from row1
                c1= row1[i1]['colspan']
                while c1 != row2[i2]['colspan']:
                    result.append( ' '.join(row2[i2].contents) )
                    c1 += 1
            elif row1[i1]['colspan'] > row2[i2]['colspan']:
                # Fill extra cols from row2
                c2= row2[i2]['colspan']
                while row1[i1]['colspan'] != c2:
                    result.append( ' '.join(row1[i1].contents) )
                    c2 += 1
            else:
                assert row1[i1]['colspan'] == row2[i2]['colspan']
                pass
            txt1= ' '.join(row1[i1].contents)
            txt2= ' '.join(row2[i2].contents)
            result.append( txt1 + " " + txt2 )
            i1 += 1
            i2 += 1
    return result
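The idea of treating both rows symmetrically can also be sketched without Beautiful Soup. This is only a minimal sketch, not the code from this answer: it assumes each row has already been reduced to a list of (text, colspan) pairs with integer colspans, `merge_rows` is a hypothetical name, and it is written for Python 3 (the code in this thread is Python 2). It walks both rows by the column at which each cell ends and advances whichever cell ends first, so neither row has to be the "long" one.

```python
from itertools import accumulate


def merge_rows(row1, row2):
    """Merge two header rows given as lists of (text, colspan) pairs.

    Emits one combined string per column segment, where a segment ends
    wherever a cell ends in either row.
    """
    # Column position at which each cell ends, e.g. spans [1, 3] -> [1, 4].
    ends1 = list(accumulate(span for _, span in row1))
    ends2 = list(accumulate(span for _, span in row2))
    if ends1[-1:] != ends2[-1:]:
        raise ValueError("rows do not cover the same number of columns")

    result = []
    i1 = i2 = 0
    while i1 < len(row1) and i2 < len(row2):
        result.append((row1[i1][0] + " " + row2[i2][0]).strip())
        end1, end2 = ends1[i1], ends2[i2]
        # Advance the cell that ends first; advance both on a shared boundary.
        if end1 <= end2:
            i1 += 1
        if end2 <= end1:
            i2 += 1
    return result


# Sample data copied from the question.
longHeader = ['', '', 'bananas', '', '', '', '', '', '', '', '', '',
              'trains', '', 'planes', '', '', '', '']
shortHeader = ['', '', 'bunches', '', 'cars', '', 'trucks', '', 'freight',
               '', 'cargo', '', 'all other', '', '']
spanShort = [1, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3]
spanLong = [1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 3, 1, 3]

merged = merge_rows(list(zip(longHeader, spanLong)),
                    list(zip(shortHeader, spanShort)))
```

With the question's sample lists this gives one entry per cell of the longer row, including 'bananas bunches' and three consecutive 'cars' entries. Unlike the output shown elsewhere in this thread, the sketch strips the stray leading space when one of the two texts is empty.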
Here is a modified version of your algorithm. zip is used to iterate over the short row's lengths and headers, and a class object is used to count and iterate over the long row's items, as well as to combine the headers. A while loop is more appropriate for the inner loop. (Excuse the overly short names.)
class collector(object):
    def __init__(self, header):
        self.longHeader = header
        self.combinedHeader = []
        self.longHeaderCount = 0
    def combine(self, shortValue):
        self.combinedHeader.append(
            [self.longHeader[self.longHeaderCount]+' '+shortValue] )
        self.longHeaderCount += 1
        return self.longHeaderCount

def main():
    longHeader = [
        '','','bananas','','','','','','','','','','trains','','planes','','','','']
    shortHeader = [
        '','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
    spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
    spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
    sumSpanLong=0
    sumSpanShort=0

    combiner = collector(longHeader)
    for sLen,sHead in zip(spanShort,shortHeader):
        sumSpanLong += spanLong[combiner.longHeaderCount]
        sumSpanShort += sLen
        while sumSpanShort - sumSpanLong > 0:
            combiner.combine(sHead)
            sumSpanLong += spanLong[combiner.longHeaderCount]
        combiner.combine(sHead)

    return combiner.combinedHeader
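The counting that the class does here could arguably be left to a plain iterator. The following is a rough sketch under a few assumptions: it is Python 3, `combine_with_iterator` is a hypothetical name, and, like the code in this answer, it assumes every short cell is covered exactly by one or more whole long cells (the diff < 0 case from the question is not handled).

```python
def combine_with_iterator(long_header, span_long, short_header, span_short):
    """Pair each short cell with the long cells that sit under or over it."""
    # One shared iterator replaces the explicit longHeaderCount counter.
    long_cells = iter(zip(long_header, span_long))
    combined = []
    for short_text, short_span in zip(short_header, span_short):
        covered = 0
        # Pull long cells until they cover the width of this short cell.
        while covered < short_span:
            long_text, long_span = next(long_cells)
            combined.append(long_text + ' ' + short_text)
            covered += long_span
    return combined


# Sample data copied from the question.
longHeader = ['', '', 'bananas', '', '', '', '', '', '', '', '', '',
              'trains', '', 'planes', '', '', '', '']
shortHeader = ['', '', 'bunches', '', 'cars', '', 'trucks', '', 'freight',
               '', 'cargo', '', 'all other', '', '']
spanShort = [1, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3]
spanLong = [1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 3, 1, 3]

combined = combine_with_iterator(longHeader, spanLong, shortHeader, spanShort)
```

The result is a flat list of strings rather than a list of one-element lists. If the long row runs out before the short row is covered, `next` raises StopIteration, which a real version would probably want to catch and report.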
Maybe look at the zip function for parts of the problem:
>>> execfile('so_ques.py')
[[' '], [' '], ['bananas bunches'], [' '], [' cars'], [' cars'], [' cars'], [' '], [' trucks'], [' trucks'], [' trucks'], [' '], ['trains freight'], [' '], ['planes cargo'], [' '], [' all other'], [' '], [' ']]
>>> zip(long_header, short_header)
[('', ''), ('', ''), ('bananas', 'bunches'), ('', ''), ('', 'cars'), ('', ''), ('', 'trucks'), ('', ''), ('', 'freight'), ('', ''), ('', 'cargo'), ('', ''), ('trains', 'all other'), ('', ''), ('planes', '')]
>>>
enumerate can help avoid some complex indexing with counters:
>>> diff_list = []
>>> for place, header in enumerate(short_header):
        diff_list.append(abs(span_short[place] - span_long[place]))

>>> new_shortlist = []
>>> for place, num in enumerate(diff_list):
        if num:
            new_shortlist.extend(short_header[place] for item in range(num+1))
        else:
            new_shortlist.append(short_header[place])

>>> new_shortlist
['', '', 'bunches', '', 'cars', 'cars', 'cars', '', 'trucks', 'trucks', 'trucks', '',...
>>> z = zip(new_shortlist, long_header)
>>> z
[('', ''), ('', ''), ('bunches', 'bananas'), ('', ''), ('cars', ''), ('cars', ''), ('cars', '')...
Also, more Pythonic notation can add clarity:
for each in range(len(short_header)):
    sum_span_long += span_long[long_header_count]
    sum_span_short += span_short[each]
    span_diff = sum_span_short - sum_span_long
    if not span_diff:
        combined_header.append...
I guess I am going to answer my own question, but I did get a lot of help. Thanks for all of it. I got S.LOTT's answer to work after a few small corrections. (They may be so small that they cannot be seen; that is an inside joke.) So now the question is: why is this more Pythonic? I think I can see that it is less dense and that it works with the original inputs instead of derived ones. I can't judge whether it is easier to read, although it is easy to read.
S.LOTT Answer Corrected
row1=headerCells[0]
row2=headerCells[1]

i1= 0
i2= 0
result= []
while i1 != len(row1) or i2 != len(row2):
    if i1 == len(row1):
        # row1 is exhausted, so take the remaining cells from row2
        result.append( ' '.join(row2[i2]) )
        i2 += 1
    elif i2 == len(row2):
        # row2 is exhausted, so take the remaining cells from row1
        result.append( ' '.join(row1[i1]) )
        i1 += 1
    else:
        if int(row1[i1].get("colspan","1")) < int(row2[i2].get("colspan","1")):
            c1= int(row1[i1].get("colspan","1"))
            while c1 != int(row2[i2].get("colspan","1")):
                txt1= ' '.join(row1[i1])  # needed to add when working adjust opposing case
                txt2= ' '.join(row2[i2])  # needed to add when working adjust opposing case
                result.append( txt1 + " " + txt2 )  # needed to add when working adjust opposing case
                print 'stayed in middle', 'i1=',i1,'i2=',i2, ' c1=',c1
                c1 += 1
                i1 += 1  # Is this the problem it
        elif int(row1[i1].get("colspan","1"))> int(row2[i2].get("colspan","1")):
            # Fill extra cols from row2 Make same adjustment as above
            c2= int(row2[i2].get("colspan","1"))
            while int(row1[i1].get("colspan","1")) != c2:
                result.append( ' '.join(row1[i1]) )
                c2 += 1
                i2 += 1
        else:
            assert int(row1[i1].get("colspan","1")) == int(row2[i2].get("colspan","1"))
            pass
        txt1= ' '.join(row1[i1])
        txt2= ' '.join(row2[i2])
        result.append( txt1 + " " + txt2 )
        print 'went to bottom', 'i1=',i1,'i2=',i2
        i1 += 1
        i2 += 1
print result
Well, I have an answer now. I thought this through and decided that I needed to use parts of every answer. I still need to figure out whether I want a class or a function, but I have an algorithm that I think is more Pythonic than any of the others. It draws heavily on the answers that some very generous people provided, and I really appreciate them because I have learned a lot.
To save time running the test cases, I am going to paste the complete code I worked out in IDLE and follow it with a sample HTML file. Aside from the class/function decision (I need to think about how I will use this code in my program), I would be happy to see any improvements that make the code more Pythonic.
from BeautifulSoup import BeautifulSoup

original=file(r"C:\testheaders.htm").read()

soupOriginal=BeautifulSoup(original)
all_Rows=soupOriginal.findAll('tr')

header_Rows=[]
for each in range(len(all_Rows)):
    header_Rows.append(all_Rows[each])

header_Cells=[]
for each in header_Rows:
    header_Cells.append(each.findAll('td'))

temp_Header_Row=[]
header=[]
for row in range(len(header_Cells)):
    for column in range(len(header_Cells[row])):
        x=int(header_Cells[row][column].get("colspan","1"))
        if x==1:
            temp_Header_Row.append( ' '.join(header_Cells[row][column]) )
        else:
            for item in range(x):
                temp_Header_Row.append( ''.join(header_Cells[row][column]) )
    header.append(temp_Header_Row)
    temp_Header_Row=[]
combined_Header=zip(*header)

for each in combined_Header:
    print each
The contents of the test file are below. Sorry, I tried to attach the file but could not:
<TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
<TR valign="bottom">
<TD width="40%"> </TD>
<TD width="5%"> </TD>
<TD width="3%"> </TD>
<TD width="3%"> </TD>
<TD width="1%"> </TD>
<TD width="5%"> </TD>
<TD width="3%"> </TD>
<TD width="3%"> </TD>
<TD width="1%"> </TD>
<TD width="5%"> </TD>
<TD width="3%"> </TD>
<TD width="1%"> </TD>
<TD width="1%"> </TD>
<TD width="5%"> </TD>
<TD width="3%"> </TD>
<TD width="1%"> </TD>
<TD width="1%"> </TD>
<TD width="5%"> </TD>
<TD width="3%"> </TD>
<TD width="3%"> </TD>
<TD width="1%"> </TD>
</TR>
<TR style="font-size: 10pt" valign="bottom">
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">FOODS WE LIKE</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2"> </TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2"> </TD>
<TD> </TD>
</TR>
<TR style="font-size: 10pt" valign="bottom">
<TD> </TD>
<TD> </TD>
<TD nowrap align="CENTER" colspan="6">SILLY STUFF</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">OTHER THAN</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="CENTER" colspan="6">FAVORITE PEOPLE</TD>
<TD> </TD>
</TR>
<TR style="font-size: 10pt" valign="bottom">
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">MONTY PYTHON</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">CHERRYPY</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">APPLE PIE</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">MOTHERS</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">FATHERS</TD>
<TD> </TD>
</TR>
<TR style="font-size: 10pt" valign="bottom">
<TD nowrap align="left">Name</TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">SHOWS</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">PROGRAMS</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">BANANAS</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">PERFUME</TD>
<TD> </TD>
<TD> </TD>
<TD nowrap align="right" colspan="2">TOOLS</TD>
<TD> </TD>
</TR>
</TABLE>
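For anyone who wants to try the approach in this answer (repeat each cell's text once per spanned column, then zip the rows) without BeautifulSoup 3 or Python 2, here is a rough sketch using only the Python 3 standard library. It is not the code from this answer: `HeaderRowParser`, `expand` and `combine_header_rows` are hypothetical names, the small `sample_html` string is made up for illustration, and every `<tr>` in the input is assumed to be a header row.

```python
from html.parser import HTMLParser


class HeaderRowParser(HTMLParser):
    """Collect every <tr> as a list of (text, colspan) pairs."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the <td> being read, else None
        self._span = 1

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <TD> arrives as "td".
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = []
            self._span = int(dict(attrs).get("colspan") or 1)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse whitespace (including non-breaking spaces) in the text.
            text = " ".join("".join(self._cell).split())
            self.rows[-1].append((text, self._span))
            self._cell = None


def expand(row):
    """Repeat each cell's text once per column it spans."""
    return [text for text, span in row for _ in range(span)]


def combine_header_rows(html):
    """Return one combined header string per table column."""
    parser = HeaderRowParser()
    parser.feed(html)
    parser.close()
    columns = zip(*(expand(row) for row in parser.rows))
    return [" ".join(part for part in column if part) for column in columns]


sample_html = """
<TABLE>
<TR><TD> </TD><TD colspan="2">SILLY STUFF</TD><TD> </TD></TR>
<TR><TD>Name</TD><TD>SHOWS</TD><TD>PROGRAMS</TD><TD> </TD></TR>
</TABLE>
"""
columns = combine_header_rows(sample_html)
```

Note that zip stops at the shortest row, so rows whose colspans add up to different widths are silently truncated. A real version might want to check the expanded lengths first.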