Select specific children with BeautifulSoup

Question

I am reading up on BeautifulSoup to screen-scrape some pretty heavy HTML pages. Looking through the BeautifulSoup documentation, I can't seem to find an easy way to select child elements.

Given the HTML:

<div id="top">
  <div>Content</div>
  <div>
    <div>Content I Want</div>
  </div>
</div>

I need an easy way to get "Content I Want" when I have the top object. Coming to BeautifulSoup, I thought it would be as easy as something like topobj.nodes[1].nodes[0].string. Instead, I only see attributes and functions that return text nodes, comments, etc. along with the elements.

Am I missing something? Or do I really need to resort to the long form using .find(), or worse, indexing into the .contents list?

The reason is that I don't trust the webpage's whitespace to stay the same, so I want to ignore it and navigate only through the elements.

python html-parsing beautifulsoup

driax 15 oct. '09 at 11:12

1 answer

van · Accepted Answer · 2009-10-15T11:34:56+0000

You are more flexible with find, and to get what you want, you just need to run:

node = p.find('div', text="Content I Want")

But since that might not be how you want to get there, the following options might work better for you:

from BeautifulSoup import BeautifulSoup

xml = """<div id="top"><div>Content</div><div><div>Content I Want</div></div></div>"""
p = BeautifulSoup(xml)

# .contents returns all child nodes of the target div (here a single text node)
print p.div.div.findNextSibling().div.contents
# calling the tag with text=True returns only the text nodes beneath it
print p.div.div.findNextSibling().div(text=True)
# join (and strip) the text values into one string
print ''.join(s.strip() for s in p.div.div.findNextSibling().div(text=True))
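
For anyone on the newer bs4 package (an assumption on my part; the code above uses the older BeautifulSoup 3 import and Python 2 print statements), here is a minimal sketch of the element-only navigation the question asks about. The helper name element_children is hypothetical, not part of the library; it only wraps find_all(True, recursive=False), which returns direct child tags and skips whitespace text nodes and comments.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, including the whitespace between tags.
html = """
<div id="top">
  <div>Content</div>
  <div>
    <div>Content I Want</div>
  </div>
</div>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")
top = soup.find("div", id="top")


def element_children(tag):
    """Hypothetical helper: direct child elements only, no text nodes or comments."""
    return tag.find_all(True, recursive=False)


# Rough equivalent of the question's topobj.nodes[1].nodes[0].string
wanted = element_children(element_children(top)[1])[0]
print(wanted.get_text(strip=True))

# Lookup by text content instead of position (bs4 spelling of find with text=).
by_text = soup.find("div", string="Content I Want")
print(by_text.get_text())
```

Indexing by position like this still breaks if the page adds or removes elements, so the text-based lookup may be the safer choice when the content itself is stable.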
