Beautiful Soup - Selecting Classes from HTML File

I have an HTML file and I want to take the text from this block shown here:

 <strong class="fullname js-action-profile-name">User Name</strong>
    <span>&rlm;</span>
    <span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>

      

I want it to display as:

User Name
@UserName

      

How do I do this with Beautiful Soup?



3 answers


Use the "text" attribute of the element. Example:

>>> import BeautifulSoup
>>> b = BeautifulSoup.BeautifulStoneSoup(open('/tmp/x.html'), convertEntities=BeautifulSoup.BeautifulStoneSoup.HTML_ENTITIES)

>>> print b.find(attrs={"id": "container"}).text
User Name‏@UserName

      



In x.html I have a div with the id "container" that holds the HTML you specified. Note that I am converting &rlm; to its Unicode character by passing convertEntities to BeautifulStoneSoup. To get a line break between the two names (a browser would not render one there), simply replace u'\u200f' with "\n".
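The replacement step described above can be sketched with bs4 instead of the older BeautifulStoneSoup. This is only a sketch under a few assumptions: the markup is inlined rather than read from x.html, the "html.parser" backend is used, and get_text(strip=True) is my choice for dropping the whitespace between tags; none of that comes from the answer itself.

```python
from bs4 import BeautifulSoup

# The question's markup wrapped in a div with id "container",
# inlined here instead of being read from x.html (assumption).
html = '''<div id="container">
<strong class="fullname js-action-profile-name">User Name</strong>
    <span>&rlm;</span>
    <span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# strip=True trims each text node and skips whitespace-only ones;
# the &rlm; entity has already been decoded to u'\u200f' by the parser.
text = soup.find(attrs={'id': 'container'}).get_text(strip=True)

# Swap the right-to-left mark for a newline to split the two names.
lines = text.replace(u'\u200f', '\n')
print(lines)
```

If the mark is not always present in the real page, looking up the two elements separately is probably more robust than relying on this replacement.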



from bs4 import BeautifulSoup

html = '''<strong class="fullname js-action-profile-name">User Name</strong>
    <span>&rlm;</span>
    <span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

username = soup.find(attrs={'class':'username js-action-profile-name'}).text
fullname = soup.find(attrs={'class':'fullname js-action-profile-name'}).text

print(fullname)
print(username)

      

Outputs:

User Name
@UserName

      



Two notes:

  • Use bs4 if you're starting out / just learning BS.

  • You will probably load your HTML from an external file, so replace html with a file object.
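As a rough illustration of the second note, the sketch below passes an open file object to BeautifulSoup instead of a string. The temporary file only stands in for a real HTML file on disk, and the variable names (markup, path, fh) are my own, not part of the answer.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = '''<strong class="fullname js-action-profile-name">User Name</strong>
    <span>&rlm;</span>
    <span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>'''

# Write the markup to a temporary file; in practice the file already exists.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(markup)
    path = tmp.name

# BeautifulSoup accepts an open file object in place of a string.
with open(path, encoding='utf-8') as fh:
    soup = BeautifulSoup(fh, 'html.parser')

os.remove(path)  # clean up the stand-in file

fullname = soup.find(attrs={'class': 'fullname js-action-profile-name'}).text
username = soup.find(attrs={'class': 'username js-action-profile-name'}).text
print(fullname)
print(username)
```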



This assumes index.html contains the markup from the question:

import BeautifulSoup

def displayUserInfo():

    soup = BeautifulSoup.BeautifulSoup(open("index.html"))
    fullname_ele = soup.find(attrs={"class": "fullname js-action-profile-name"})
    fullname = fullname_ele.contents[0]
    print fullname

    username_ele = soup.find(attrs={"class": "username js-action-profile-name"})
    username = ""
    for child in username_ele.findChildren():
        username += child.contents[0]
    print username

if __name__ == '__main__':
    displayUserInfo()

# prints:
# User Name
# @UserName
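For comparison, here is how the same lookup might look in bs4, matching on a single class rather than the full class string. Searching with attrs={"class": "fullname js-action-profile-name"} only matches when the attribute is exactly that string, whereas the class_ keyword matches any one of the element's classes. The inlined markup and the "html.parser" choice are assumptions made for this sketch.

```python
from bs4 import BeautifulSoup

html = '''<strong class="fullname js-action-profile-name">User Name</strong>
    <span>&rlm;</span>
    <span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>'''

soup = BeautifulSoup(html, 'html.parser')

# class_ matches when the given name is one of the element's classes,
# so the order or presence of the other classes does not matter.
fullname = soup.find('strong', class_='fullname').get_text()

# get_text() concatenates the text of all descendants (<s> and <b> here),
# which replaces the manual loop over child elements.
username = soup.find('span', class_='username').get_text()

print(fullname)
print(username)
```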

      
