BeautifulSoup only gets the "generic" text in the td tag, nothing in the nested tags

Question


Say my html looks like this:

<td>Potato1 <span somestuff...>Potato2</span></td>
...
<td>Potato9 <span somestuff...>Potato10</span></td>

I run this with BeautifulSoup:

for tag in soup.find_all("td"):
    print tag.text

And I get:

Potato1 Potato2
....
Potato9 Potato10

Is it possible to just get the text inside the tag, but not the text nested inside the span tag?

+3

python beautifulsoup

Stupid.Fat.Cat 07 jul. 15 at 17:06


2 answers

Another method, which, unlike tag.contents[0], ensures that the text is a NavigableString and not text from a child Tag, is:

[child for tag in soup.find_all("td") 
 for child in tag if isinstance(child, bs.NavigableString)]

Here's an example that highlights the difference:

import bs4 as bs

content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
soup = bs.BeautifulSoup(content, "html.parser")  # name the parser explicitly

print([tag.contents[0] for tag in soup.find_all("td")])
# [u'Potato1 ', <span>FOO</span>, <span>Potato10</span>]

print([child for tag in soup.find_all("td") 
       for child in tag if isinstance(child, bs.NavigableString)])
# [u'Potato1 ', u'Potato9']
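If you want one value per cell rather than a flat list, the same isinstance check can be wrapped in a small helper. This is only a sketch: own_text is a hypothetical name, not part of BeautifulSoup, and the table wrapper, the html.parser choice, and the decision to strip whitespace are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<table><tr>
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
</tr></table>
"""


def own_text(tag):
    """Hypothetical helper: join the strings that are direct children of
    `tag`, skipping anything that lives inside a nested tag."""
    parts = [child for child in tag.children
             if isinstance(child, NavigableString)]
    return "".join(parts).strip()


soup = BeautifulSoup(html, "html.parser")
cells = [own_text(td) for td in soup.find_all("td")]
print(cells)
# ['Potato1', '', 'Potato9']
```

A cell with no direct text yields an empty string, so the result stays aligned with the list of td tags.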

Or, using lxml, you can use the XPath td/text():

import lxml.html as LH

content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
root = LH.fromstring(content)

print(root.xpath('td/text()'))

gives

['Potato1 ', 'Potato9']

+1

unutbu 07 jul. 15 at 17:28


nu11p01n73R · Accepted Answer · 2015-07-07T17:16:08+0000

You can use .contents, like this:

>>> for tag in soup.find_all("td"):
...     print tag.contents[0]
...
Potato1
Potato9

What does it do?

A tag's children are available as a list via .contents.

>>> for tag in soup.find_all("td"):
...     print tag.contents
...
[u'Potato1 ', <span somestuff...="">Potato2</span>]
[u'Potato9 ', <span somestuff...="">Potato10</span>]

Since we are only interested in the first element, we use:

print tag.contents[0]
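Note that contents[0] relies on the text being the first child of the td. Here is a minimal sketch that does not depend on position, assuming a bs4 release that accepts the string argument (older releases call it text). The table wrapper and the html.parser choice are assumptions for illustration, not part of the question.

```python
from bs4 import BeautifulSoup

html = (
    "<table><tr>"
    "<td>Potato1 <span>Potato2</span></td>"
    "<td><span>Potato10</span>Potato9</td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

direct = []
for td in soup.find_all("td"):
    # string=True matches text nodes only; recursive=False limits the
    # search to direct children, so text inside <span> is ignored.
    strings = td.find_all(string=True, recursive=False)
    direct.append("".join(strings).strip())

print(direct)
# ['Potato1', 'Potato9']
```

The second cell puts its span first, a case where contents[0] would return the span tag rather than the text.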
