BeautifulSoup only gets the "generic" text in the td tag, nothing in the nested tags

Say my html looks like this:

<td>Potato1 <span somestuff...>Potato2</span></td>
...
<td>Potato9 <span somestuff...>Potato10</span></td>

      

I process it with BeautifulSoup like this:

for tag in soup.find_all("td"):
    print(tag.text)

      

And I get:

Potato1 Potato2
....
Potato9 Potato10

      

Is it possible to just get the text inside the tag, but not the text nested inside the span tag?

2 answers


You can use .contents, like this:

>>> for tag in soup.find_all("td"):
...     print(tag.contents[0])
...
Potato1
Potato9

      

How does it work?

A tag's children are available as a list through .contents.

>>> for tag in soup.find_all("td"):
...     print(tag.contents)
...
[u'Potato1 ', <span somestuff...="">Potato2</span>]
[u'Potato9 ', <span somestuff...="">Potato10</span>]

      

Since we are only interested in the first element, we use:

print(tag.contents[0])
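
For readers on Python 3, the same idea can be sketched as a self-contained script. The sample markup and the class attribute are made up for illustration (the question elides the real span attributes), and the sketch assumes every cell starts with a text node. The .strip() call only drops the space that sits before the span.

```python
from bs4 import BeautifulSoup

# Illustrative markup modelled on the question; class="x" is a placeholder.
html = """
<table><tr>
<td>Potato1 <span class="x">Potato2</span></td>
<td>Potato9 <span class="x">Potato10</span></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# contents[0] is the first child of each td; here that is the leading text node.
first_nodes = [td.contents[0].strip() for td in soup.find_all("td")]
print(first_nodes)  # ['Potato1', 'Potato9']
```

This only works when the text really is the first child of the cell; if a cell starts with a tag, contents[0] is that tag rather than a string.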

      

Another method, which, unlike tag.contents[0], ensures that each result is a NavigableString and not text taken from a child Tag, is (with bs4 imported as bs):

[child for tag in soup.find_all("td") 
 for child in tag if isinstance(child, bs.NavigableString)]

      


Here's an example that highlights the difference:

import bs4 as bs

content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
# Name the parser explicitly so the result does not depend on what is installed
soup = bs.BeautifulSoup(content, "html.parser")

print([tag.contents[0] for tag in soup.find_all("td")])
# [u'Potato1 ', <span>FOO</span>, <span>Potato10</span>]

print([child for tag in soup.find_all("td") 
       for child in tag if isinstance(child, bs.NavigableString)])
# [u'Potato1 ', u'Potato9']
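
If you need the result grouped per cell rather than as one flat list, one possible variation is to ask each td for its direct string children with find_all(string=True, recursive=False). This is only a sketch: the helper name direct_text is hypothetical, and the string= keyword assumes a reasonably recent bs4 (older releases spell it text=).

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Join only the strings that are direct children of `tag` (hypothetical helper)."""
    # recursive=False restricts the search to direct children,
    # so text inside nested tags such as <span> is skipped.
    pieces = tag.find_all(string=True, recursive=False)
    return "".join(pieces).strip()


html = """
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
"""

soup = BeautifulSoup(html, "html.parser")
texts = [direct_text(td) for td in soup.find_all("td")]
print(texts)  # ['Potato1', '', 'Potato9']
```

A cell with no text of its own yields an empty string, which keeps the output aligned one-to-one with the td tags.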

      




Or, using lxml, you can use the XPath td/text():

import lxml.html as LH

content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
root = LH.fromstring(content)

print(root.xpath('td/text()'))

      

which gives:

['Potato1 ', 'Potato9']

      
