BeautifulSoup scraping nested tables

I am trying to scrape data from a site that has a large number of tables. I've been reading the BeautifulSoup documentation as well as questions here on Stack Overflow, but I'm still lost.

Below is the table:

      <form action="/rr/" class="form">
        <table border="0" width="100%" cellpadding="2" cellspacing="0" align="left">
          <tr bgcolor="#6699CC">
            <td valign="top"><font face="arial"><b>Useless Data</b></font></td>
    
            <td width="10%"><br /></td>
    
            <td align="right"><font face="arial">Useless Data</font></td>
          </tr>
    
          <tr bgcolor="#DCDCDC">
            <td> <input size="12" name="s" value="data:" onfocus=
            "this.value = '';" /> <input type="hidden" name="d" value="research" />
    				
            <input type="submit" value="Date" /></td>
    
            <td width="10%"><br /></td>
    
          </tr>
        </table>
      </form>
    
      <table border="0" width="100%">
        <tr>
          <td></td>
        </tr>
      </table><br />
      <br />
    
      <table border="0" width="100%">
        <tr>
          <td valign="top" width="99%">
            <table cellpadding="2" cellspacing="0" border="0" width="100%">
              <tr bgcolor="#A0B8C8">
                <td colspan="6"><b>Data to be pulled</b></td>
              </tr>
    
              <tr bgcolor="#DCDCDC">
                <td><font face="arial"><b>Data to be pulled</b></font></td>
    
                <td><font face="arial"><b>Data to be pulled</b></font></td>
    
                <td align="center"><font face="arial"><b>Data to be pulled
                </b></font></td>
    
                <td align="center"><font face="arial"><b>Data to be pulled
                </b></font></td>
    
                <td align="center"><font face="arial"><b>Data to be pulled
                </b></font></td>
    
                <td align="center"><font face="arial"><b>Data to be pulled
                </b></font></td>
              </tr>
    
              <tr>
                <td>Data to be pulled</td>
    
                <td align="center">Data to be pulled</td>
    
                <td align="center">Data to be pulled</td>
    
                <td align="center">Data to be pulled</td>
    
                <td align="center"><br /></td>
              </tr>
    	    </table>
    	  </td>
    	</tr>
      </table>
      


There are quite a few tables on the page, and none of them really have distinctive IDs or tags. My last attempt:

table = soup.find('table', attrs={'border': '0', 'width': '100%'})

      

This only pulls out the first, empty table. I feel like the answer is simple and I am overthinking it.

1 answer


If you are looking for all of the tables rather than just the first one, use find_all instead of find.
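As a rough illustration of the difference, here is a minimal sketch. The HTML string is a hypothetical, trimmed-down stand-in for the page in the question, and the choice of the built-in "html.parser" backend is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: one empty table, then a table
# that contains a nested table holding the data.
HTML = """
<table border="0" width="100%"><tr><td></td></tr></table>
<table border="0" width="100%">
  <tr><td>
    <table cellpadding="2" border="0" width="100%">
      <tr><td>Data to be pulled</td></tr>
    </table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

first = soup.find("table")      # a single Tag, or None if nothing matches
every = soup.find_all("table")  # a list-like ResultSet of every match

# find stops at the first match, which here is the empty table.
print(repr(first.get_text(strip=True)))
# find_all searches recursively, so the nested table is counted too.
print(len(every))
```

Because find_all descends into the whole tree by default, the nested table shows up in the same result list as its parent.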

If you are trying to find a specific table, for example one nested inside another, and the page uses a 90s-style design that makes it impossible to search by id or other attributes, the only option is to search by structure:

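# soup is the BeautifulSoup object built from the page
for table in soup.find_all('table'):
    for subtable in table.find_all('table'):
        # Found it! subtable is a table nested inside another table.
        print(subtable)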

      

And of course, you can flatten this into a single comprehension if you really want to:



subtable = next(subtable for table in soup.find_all('table') 
                for subtable in table.find_all('table'))
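To show where this leads, here is a hedged sketch that takes the first nested table and collects the text of its cells row by row. The helper name nested_table_rows and the sample HTML are hypothetical, and the sketch assumes the nested table does not itself contain further tables.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the markup in the question.
HTML = """
<table border="0" width="100%"><tr><td></td></tr></table>
<table border="0" width="100%">
  <tr>
    <td valign="top" width="99%">
      <table cellpadding="2" cellspacing="0" border="0" width="100%">
        <tr bgcolor="#A0B8C8"><td colspan="6"><b>Title</b></td></tr>
        <tr><td>a</td><td align="center">b</td><td align="center"><br /></td></tr>
      </table>
    </td>
  </tr>
</table>
"""


def nested_table_rows(html):
    """Return the cell text, row by row, of the first table nested in another table."""
    soup = BeautifulSoup(html, "html.parser")
    # Same structural search as above; the default of None avoids
    # StopIteration when the page has no nested table at all.
    subtable = next(
        (sub for table in soup.find_all("table") for sub in table.find_all("table")),
        None,
    )
    if subtable is None:
        return []
    rows = []
    for tr in subtable.find_all("tr"):
        # get_text flattens <font>/<b> wrappers; a cell holding only <br /> yields "".
        rows.append([td.get_text(" ", strip=True) for td in tr.find_all("td")])
    return rows


print(nested_table_rows(HTML))
```

Passing a default to next is a small design choice: without it, a page with no nested table raises StopIteration instead of returning an empty result.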

      

Note that I left off the attrs. If every table on the page has a superset of the same attributes, you're not helping anything by specifying them.
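The point about attributes can be checked with a small sketch. The HTML is again a hypothetical stand-in in which every table carries border="0" and width="100%", as on the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in: all three tables share border="0" and width="100%";
# the nested one has an extra cellpadding attribute.
HTML = """
<table border="0" width="100%"><tr><td></td></tr></table>
<table border="0" width="100%">
  <tr><td>
    <table cellpadding="2" border="0" width="100%">
      <tr><td>Data to be pulled</td></tr>
    </table>
  </td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# attrs only requires the listed attributes to be present with these values;
# extra attributes on a tag do not prevent a match.
filtered = soup.find_all("table", attrs={"border": "0", "width": "100%"})
unfiltered = soup.find_all("table")

# Both searches return the same number of tables, so the filter narrows nothing.
print(len(filtered), len(unfiltered))
```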

This whole thing is obviously ugly and brittle, but with markup like this there really is no way to avoid being brittle.

Using another library, such as lxml.html, that lets you search by XPath might make the code a little more compact, but it will be doing the same thing in the end.
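For comparison, a minimal sketch of the XPath route, assuming lxml is installed. The sample HTML is hypothetical, and the expression //table//table is just one way to say "any table that sits inside another table".

```python
import lxml.html

# Hypothetical sample modelled on the markup in the question.
HTML = """
<html><body>
<table border="0" width="100%"><tr><td></td></tr></table>
<table border="0" width="100%">
  <tr>
    <td valign="top" width="99%">
      <table cellpadding="2" cellspacing="0" border="0" width="100%">
        <tr bgcolor="#A0B8C8"><td colspan="6"><b>Title</b></td></tr>
        <tr><td>a</td><td align="center">b</td><td align="center"><br /></td></tr>
      </table>
    </td>
  </tr>
</table>
</body></html>
"""

doc = lxml.html.fromstring(HTML)

# Every <table> that has another <table> as an ancestor.
nested = doc.xpath("//table//table")

# Cell text for each row of the first nested table.
rows = [
    [td.text_content().strip() for td in tr.xpath("./td")]
    for tr in nested[0].xpath(".//tr")
]
print(rows)
```

The search is still purely structural, so it breaks in the same way as the BeautifulSoup version if the page layout changes.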
