BeautifulSoup: get only the text directly inside a td tag, not the text in nested tags
Say my HTML looks like this:
<td>Potato1 <span somestuff...>Potato2</span></td>
...
<td>Potato9 <span somestuff...>Potato10</span></td>
I parse it with BeautifulSoup like this:
for tag in soup.find_all("td"):
    print(tag.text)
And I get:
Potato1 Potato2
....
Potato9 Potato10
Is it possible to just get the text inside the tag, but not the text nested inside the span tag?
You can use .contents, like this:
>>> for tag in soup.find_all("td"):
...     print(tag.contents[0])
...
Potato1
Potato9
What does this do?
A tag's children are available as a list via .contents.
>>> for tag in soup.find_all("td"):
...     print(tag.contents)
...
[u'Potato1 ', <span somestuff...="">Potato2</span>]
[u'Potato9 ', <span somestuff...="">Potato10</span>]
Since we are only interested in the first element, we use:
print(tag.contents[0])
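For readers on Python 3, the same idea might look roughly like the sketch below. It assumes every td begins with a text node, as in the question's HTML; the sample markup, the explicit "html.parser" argument, and the variable names are illustrative choices rather than part of the answer above.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question (attributes on <span> omitted).
html = """
<table><tr>
<td>Potato1 <span>Potato2</span></td>
<td>Potato9 <span>Potato10</span></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# contents[0] is the first child of each <td>; here it is the leading text.
# .strip() drops the trailing space that precedes the <span>.
leading = [td.contents[0].strip() for td in soup.find_all("td")]
print(leading)
```

If a td could start with a nested tag instead of text, contents[0] would be that tag rather than a string, so this shortcut only fits markup shaped like the question's.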
Another method, which, unlike tag.contents[0], ensures that each result is a NavigableString and not a child Tag, is:
[child for tag in soup.find_all("td")
for child in tag if isinstance(child, bs.NavigableString)]
Here's an example that highlights the difference:
import bs4 as bs
content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
soup = bs.BeautifulSoup(content, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
print([tag.contents[0] for tag in soup.find_all("td")])
# [u'Potato1 ', <span>FOO</span>, <span>Potato10</span>]
print([child for tag in soup.find_all("td")
for child in tag if isinstance(child, bs.NavigableString)])
# [u'Potato1 ', u'Potato9']
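If one result per td is wanted rather than a flat list, the isinstance filter can be wrapped in a small helper. The sketch below is one possible way to do that; the name own_text, the extra fourth row, and the choice to join the pieces with a single space are assumptions made for illustration.

```python
import bs4 as bs


def own_text(tag):
    """Join the text nodes that are direct children of tag, skipping nested tags."""
    parts = (
        child.strip()
        for child in tag.children
        if isinstance(child, bs.NavigableString)
    )
    # Drop pieces that were whitespace only, then join the rest.
    return " ".join(part for part in parts if part)


content = """
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
<td>before <span>inner</span> after</td>
"""

soup = bs.BeautifulSoup(content, "html.parser")
result = [own_text(td) for td in soup.find_all("td")]
print(result)
# ['Potato1', '', 'Potato9', 'before after']
```

A td with no direct text yields an empty string here, which keeps the output aligned with the cells one to one.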
Or, using lxml, you can use the XPath td/text():
import lxml.html as LH
content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
root = LH.fromstring(content)
print(root.xpath('td/text()'))
gives
['Potato1 ', 'Potato9']
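The flat list above loses track of which td each string came from. One possible way to keep the grouping, sketched here with the same sample content, is to select the td elements first and then evaluate the relative XPath text() on each one; the variable name per_cell is an illustrative assumption.

```python
import lxml.html as LH

content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''

root = LH.fromstring(content)

# For each <td>, text() selects only its direct text-node children,
# so text inside the nested <span> is left out.
per_cell = [td.xpath('text()') for td in root.xpath('td')]
print(per_cell)
```

Each inner list holds the direct text nodes of one td, so a cell that contains only a span shows up as an empty list.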