BeautifulSoup: extracting value from nodes for children

Question

I have the following HTML:

<td class="section">
    <div style="margin-top:2px; margin-bottom:-10px; ">
    <span class="username"><a href="user.php?id=xx">xxUsername</a></span>
    </div>
    <br>
<span class="comment">
A test comment
</span>
</td>

All I want is to get the username (xxUsername) and the comment text from the SPAN tags. So far I've done this:

results = soup.findAll("td", {"class" : "section"})

It extracts all HTML blocks that match the template above. Now I want to get the values of all the children in a single loop. Is that possible? If not, how do I get information about the child nodes?

python python-2.7 beautifulsoup

Volatil3 Jan 27 '13 at 2:12

2 answers

To get the text from the username or comment <span> elements:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
for el in soup('span', ['username', 'comment']):
    print el.string,

Output

xxUsername 
A test comment
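The same idea can be sketched in Python 3 syntax with the surrounding whitespace stripped. This is only a sketch under a few assumptions: the html string is the snippet from the question, the "html.parser" backend is my choice rather than anything the question specifies, and get_text(strip=True) is swapped in for .string so the newlines around the comment are dropped.

```python
from bs4 import BeautifulSoup

# The snippet from the question, inlined so the example is self-contained.
html = """
<td class="section">
    <div style="margin-top:2px; margin-bottom:-10px; ">
    <span class="username"><a href="user.php?id=xx">xxUsername</a></span>
    </div>
    <br>
<span class="comment">
A test comment
</span>
</td>
"""

# Assumption: the built-in "html.parser" backend; other parsers may
# treat a bare <td> outside a <table> differently.
soup = BeautifulSoup(html, "html.parser")

# class_ accepts a list and matches spans carrying any of the listed classes.
spans = soup.find_all("span", class_=["username", "comment"])

# get_text(strip=True) collects the text and trims leading/trailing whitespace.
texts = [el.get_text(strip=True) for el in spans]
print(texts)
```

The spans come back in document order, so the username precedes the comment.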

jfs Jan 27 at 4:55 am

RocketDonkey · Accepted Answer · 2013-01-27T02:28:43+0000

You can try something like this. It basically does what you did above: it first iterates through all section-classed td elements and then iterates through all the span elements inside each one. This prints out the class as well, just in case you need to be more strict:

In [1]: from bs4 import BeautifulSoup

In [2]: html = # Your html here

In [3]: soup = BeautifulSoup(html)

In [4]: for td in soup.find_all('td', {'class': 'section'}):
   ...:     for span in td.find_all('span'):
   ...:         print span.attrs['class'], span.text
   ...:         
['username'] xxUsername
['comment'] 
A test comment

Or, as a more-complicated-than-necessary one-liner that stores everything in a list:

In [5]: results = [span.text for td in soup.find_all('td', {'class': 'section'}) for span in td.find_all('span')]

In [6]: results
Out[6]: [u'xxUsername', u'\nA test comment\n']

Or, along the same lines, a dictionary whose keys are tuples of the span's classes and whose values are the text itself:

In [8]: results = dict((tuple(span.attrs['class']), span.text) for td in soup.find_all('td', {'class': 'section'}) for span in td.find_all('span'))

In [9]: results
Out[9]: {('comment',): u'\nA test comment\n', ('username',): u'xxUsername'}

Assuming this last version is closer to what you want, I would suggest rewriting it as an explicit loop:

In [10]: results = {}

In [11]: for td in soup.find_all('td', {'class': 'section'}):
   ....:     for span in td.find_all('span'):
   ....:         results[tuple(span.attrs['class'])] = span.text
   ....:         

In [12]: results
Out[12]: {('comment',): u'\nA test comment\n', ('username',): u'xxUsername'}
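One caveat with a single dictionary keyed by class: if the page contains several section cells, each one overwrites the previous entry. A possible way around that, sketched below in Python 3, is to build one dictionary per td. The two-cell sample HTML, the usernames alice and bob, the records name and the "html.parser" backend are all made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two section cells (names invented for illustration).
html = """
<table><tr>
<td class="section">
    <div><span class="username"><a href="user.php?id=1">alice</a></span></div>
    <br>
    <span class="comment">
    First comment
    </span>
</td>
<td class="section">
    <div><span class="username"><a href="user.php?id=2">bob</a></span></div>
    <br>
    <span class="comment">
    Second comment
    </span>
</td>
</tr></table>
"""

# Assumption: the built-in "html.parser" backend.
soup = BeautifulSoup(html, "html.parser")

records = []  # one dict per <td class="section">
for td in soup.find_all("td", class_="section"):
    record = {}
    for span in td.find_all("span"):
        classes = span.get("class")  # list of class names, or None
        if not classes:
            continue  # skip spans that carry no class attribute
        # Use the first class name as the key and the stripped text as the value.
        record[classes[0]] = span.get_text(strip=True)
    records.append(record)

for record in records:
    print(record["username"], "-", record["comment"])
```

Each entry in records then pairs a username with its own comment, which is usually what you want when the template repeats.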
