
BeautifulSoup: get only the text directly inside a td tag, not the text in nested tags

Say my html looks like this:

<td>Potato1 <span somestuff...>Potato2</span></td>
<td>Potato9 <span somestuff...>Potato10</span></td>


I have this BeautifulSoup code:

for tag in soup.find_all("td"):
    print(tag.text)


And I get:

Potato1 Potato2
Potato9 Potato10


Is it possible to just get the text inside the tag, but not the text nested inside the span tag?



2 answers

You can use .contents


>>> for tag in soup.find_all("td"):
...     print(tag.contents[0])


What does it do?

A tag's children are available as a list via .contents:


>>> for tag in soup.find_all("td"):
...     print(tag.contents)
[u'Potato1 ', <span somestuff...="">Potato2</span>]
[u'Potato9 ', <span somestuff...="">Potato10</span>]


Since we are only interested in the first element, we use:

print(tag.contents[0])
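
For reference, here is a minimal self-contained sketch of the same approach applied to the HTML from the question. The class="x" attribute is only a hypothetical stand-in for the attributes elided there, and the "html.parser" backend is an assumption chosen so the example does not depend on an extra parser being installed.

```python
from bs4 import BeautifulSoup

# class="x" is a hypothetical stand-in for the attributes elided in the question.
html = """
<td>Potato1 <span class="x">Potato2</span></td>
<td>Potato9 <span class="x">Potato10</span></td>
"""

soup = BeautifulSoup(html, "html.parser")

# .contents[0] is the first child node of each <td>; in this markup that is
# the leading text, so the text inside the <span> is left out.
first_children = [tag.contents[0] for tag in soup.find_all("td")]

# strip() removes the trailing space that sits in front of the <span>.
texts = [str(node).strip() for node in first_children]
print(texts)  # ['Potato1', 'Potato9']
```

This only works when the first child of every td really is a text node, which holds for the markup above but not in general.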




Another method, which, unlike tag.contents[0], guarantees that you get a NavigableString and not text from a child Tag, is:

[child for tag in soup.find_all("td") 
 for child in tag if isinstance(child, bs.NavigableString)]


Here's an example that highlights the difference:

import bs4 as bs

content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
soup = bs.BeautifulSoup(content, "html.parser")

print([tag.contents[0] for tag in soup.find_all("td")])
# [u'Potato1 ', <span>FOO</span>, <span>Potato10</span>]

print([child for tag in soup.find_all("td") 
       for child in tag if isinstance(child, bs.NavigableString)])
# [u'Potato1 ', u'Potato9']
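
If the direct text of a tag can be split into several pieces around child tags, the NavigableString filter above can be wrapped in a small helper that joins the pieces. The sketch below is one possible way to do that; the helper name own_text and the sample markup are made up for illustration, and the whitespace normalization is a design choice rather than something the answer prescribes.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def own_text(tag):
    """Return the text held directly by `tag`, ignoring text in child tags."""
    pieces = [
        child
        for child in tag.children
        # Comment is a NavigableString subclass, so exclude it explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ]
    # Join the pieces and collapse runs of whitespace into single spaces.
    return " ".join("".join(pieces).split())


html = """
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td>before <span>inner</span> after</td>
"""
soup = BeautifulSoup(html, "html.parser")

results = [own_text(td) for td in soup.find_all("td")]
print(results)  # ['Potato1', '', 'before after']
```

A td that contains only a child tag yields an empty string, so callers can tell "no direct text" apart from a missing cell.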


Or, using lxml, you can use the XPath expression td/text():


import lxml.html as LH

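content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
root = LH.fromstring(content)

print(root.xpath('td/text()'))
# ['Potato1 ', 'Potato9']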



