Can't chain find and find_all in BeautifulSoup

I have a book and docs about BeautifulSoup. Both say that I should be able to chain find / find_all and use indices to get exactly what I want from the same page. This does not seem to be the case. Consider the following table.

<tr>
<td><span style="display:none;" class="sortkey">Dresser !</span><span class="sorttext"><b><a href="/wiki/Louise_Dresser" title="Louise Dresser">Louise Dresser</a></b></span></td>
<td><span style="display:none;" class="sortkey">Ship !</span><span class="sorttext"><i><a href="/wiki/A_Ship_Comes_In" title="A Ship Comes In">A Ship Comes In</a></i></span></td>
<td><span style="display:none;" class="sortkey">Pleznik !</span><span class="sorttext">Mrs. Pleznik</span></td>
</tr>
<tr>
<td><span style="display:none;" class="sortkey">Swanson !</span><span class="sorttext"><a href="/wiki/Gloria_Swanson" title="Gloria Swanson">Gloria Swanson</a></span></td>
<td><i><a href="/wiki/Sadie_Thompson" title="Sadie Thompson">Sadie Thompson</a></i></td>
<td><span style="display:none;" class="sortkey">Thompson !</span><span class="sorttext">Sadie Thompson</span></td>
</tr>
<tr>
<th scope="row" rowspan="6" style="text-align:center"><a href="/wiki/1928_in_film" title="1928 in film">1928</a>/<a href="/wiki/1929_in_film" title="1929 in film">29</a><br />
<small><a href="/wiki/2nd_Academy_Awards" title="2nd Academy Awards">(2nd)</a></small></th>
<td style="background:#FAEB86"><b><span style="display:none;" class="sortkey">Pickford !</span><span class="sorttext"><a href="/wiki/Mary_Pickford" title="Mary Pickford">Mary Pickford</a></span> <img alt="Award winner" src="//upload.wikimedia.org/wikipedia/commons/f/f9/Double-dagger-14-plain.png" width="9" height="14" data-file-width="9" data-file-height="14" /></b></td>


For each row of the table, I need to grab the first <td> cell and then the text inside its first nested <a> tag. Louise Dresser will be the first data point, followed by Gloria Swanson and then Mary Pickford.

I thought the following would take me there, but I was wrong, and after six hours I gave up.

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getActresses(URL):
    try:
        html = urlopen(URL)
    except HTTPError:
        print("Page not found.")
        return None
    try:
        bsObj = BeautifulSoup(html, "lxml")
        soup = bsObj.find("table", {"class": "wikitable sortable"})
    except AttributeError:
        print("Error creating/navigating soup object")
    # This is the line that fails:
    data = soup.find_all("tr").find_all("td").find("a").get_text()
    print(data)


getActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress")


This is not the only code I've tried. I've looped through rows, then through table-data cells, and then tried accessing tags. I've tried grabbing tags and then running regexes on them, only to be told I couldn't extract the text I wanted. The most common error I got when trying chained operations (as above) is AttributeError: 'ResultSet' object has no attribute 'find'.

Subscripts (indexing) absolutely do not work either, even when replicating the book's examples (go figure!). Also, I had processes interrupt themselves, which I didn't know was possible.

Any thoughts about what is happening, and why something that should be so simple turns out to be such an ordeal, would be highly appreciated.

+3




1 answer


import requests
from bs4 import BeautifulSoup

def getActresses(URL):
    res = requests.get(URL)

    try:
        soup = BeautifulSoup(res.content, "lxml")
        table = soup.find("table", {"class": "wikitable sortable"})
    except AttributeError:
        print("Error creating/navigating soup object")
        return

    # find_all returns a ResultSet (a list), so iterate over it
    for tr in table.find_all("tr"):
        for td in tr.find_all("td"):
            for a in td.find_all("a"):
                print(a.text)

getActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress")
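To see exactly why the chained call in the question fails, here is a minimal repro. The one-row sample HTML and the parser choice are mine, not from the original page:

```python
from bs4 import BeautifulSoup

# Simplified one-row stand-in for the table in the question
html = "<table><tr><td><a href='#'>Louise Dresser</a></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")  # ResultSet: a list of Tags, not a single Tag
try:
    rows.find_all("td")     # the failing chained call from the question
except AttributeError as exc:
    print(exc)              # 'ResultSet' object has no attribute 'find_all'...

# Index into the list first to get a single Tag, then chaining works again
name = rows[0].find("td").find("a").get_text()
print(name)
```

`find` returns a single `Tag` (or `None`), so it can be chained; `find_all` returns a list, so it must be indexed or iterated first.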


A couple of notes: I used .text instead of get_text() (they give the same result here), and sorry, I used the requests module for the demo instead of urllib.

The key point: find_all always returns a ResultSet, which is a list, so you can't call find or find_all on it directly; you have to loop over it (or index into it) to get at individual tags.

I'm sorry, I'm new to Stack Overflow and don't really know how to write answers yet. Anyway, I believe the code will clear your doubts.
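The triple loop above prints every link in the table, but the question only wants the first link of each row. A trimmed sketch of that pattern, using a simplified stand-in table (the sample HTML is mine, not the live Wikipedia page):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the Wikipedia table, same nesting as the question
html = """
<table class="wikitable sortable">
  <tr><td><a href="#">Louise Dresser</a></td><td><a href="#">A Ship Comes In</a></td></tr>
  <tr><td><a href="#">Gloria Swanson</a></td><td><a href="#">Sadie Thompson</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})

names = []
for row in table.find_all("tr"):
    cell = row.find("td")      # first cell in the row; None for header-only rows
    if cell is None:
        continue
    link = cell.find("a")      # first link inside that cell
    if link is not None:
        names.append(link.get_text())

print(names)
```

Run against the real page, the same loop would collect one actress name per row while skipping header rows and cells without links.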

+6

