BeautifulSoup: splitting text on <br/> tags

Question

Is it possible to split a tag's text on its <br/> tags?

I have this tag content: [u'+420 777 593 531', <br/>, u'+420 776 593 531', <br/>, u'+420 775 593 531']

I only want to get the numbers. Any advice?

EDIT:

[x for x in dt.find_next_sibling('dd').contents if x!=' <br/>']

Doesn't work at all.

python text newline tags beautifulsoup

Milano slesarik 07 June 15 at 14:17

1 answer

Martijn Pieters · Answer 1 · 2015-06-07T14:20:37+0000

You need to test for tags, which are modelled as Tag instances. Tag objects have a name attribute, whereas text nodes (which are NavigableString instances) do not, which is why getattr() is used with a default here:

[x for x in dt.find_next_sibling('dd').contents if getattr(x, 'name', None) != 'br']
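
The same filter can also be written with an explicit type check. This is only a minimal sketch, assuming bs4 is installed; the sample markup and the helper name text_nodes are made up for illustration and are not part of the question. Text nodes are NavigableString instances, so isinstance() separates them from the <br/> tags, and strip() drops the surrounding whitespace:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical sample markup modelled on the question.
html = (
    "<dl><dt>Phone</dt>"
    "<dd>+420 777 593 531<br/>+420 776 593 531<br/>+420 775 593 531</dd></dl>"
)


def text_nodes(element):
    """Return the stripped, non-empty text children of element, skipping tags such as <br/>."""
    return [
        child.strip()
        for child in element.contents
        if isinstance(child, NavigableString) and child.strip()
    ]


soup = BeautifulSoup(html, "html.parser")
numbers = text_nodes(soup.dt.find_next_sibling("dd"))
print(numbers)
```

Unlike the getattr() version, this also discards whitespace-only strings, which may or may not be what you want.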

Since that <dd> element contains only text and <br/> elements, you can simply get all the contained strings instead:

list(dt.find_next_sibling('dd').stripped_strings)

Demo:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <dt>Term</dt>
... <dd>
...     +420 777 593 531<br/>
...     +420 776 593 531<br/>
...     +420 775 593 531<br/>
... </dd>
... ''', 'html.parser')
>>> dt = soup.dt
>>> [x for x in dt.find_next_sibling('dd').contents if getattr(x, 'name', None) != 'br']
[u'\n    +420 777 593 531', u'\n    +420 776 593 531', u'\n    +420 775 593 531', u'\n']
>>> list(dt.find_next_sibling('dd').stripped_strings)
[u'+420 777 593 531', u'+420 776 593 531', u'+420 775 593 531']
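
The session above is from Python 2 (hence the u'' prefixes). As a rough Python 3 sketch of the same idea, assuming the same sample markup and the built-in html.parser, get_text() with a separator and strip=True should give the same list as stripped_strings; the variable names here are illustrative only:

```python
from bs4 import BeautifulSoup

# Same sample markup as in the demo above.
html = """\
<dt>Term</dt>
<dd>
    +420 777 593 531<br/>
    +420 776 593 531<br/>
    +420 775 593 531<br/>
</dd>
"""

soup = BeautifulSoup(html, "html.parser")
dd = soup.dt.find_next_sibling("dd")

# stripped_strings yields each text node with whitespace removed,
# skipping strings that are empty after stripping.
via_strings = list(dd.stripped_strings)

# get_text(separator, strip=True) joins those same stripped strings
# with the separator, so splitting on it recovers the list.
via_get_text = dd.get_text("\n", strip=True).split("\n")

print(via_strings)
print(via_get_text)
```

The get_text() route is mostly useful when you want one newline-separated string anyway; for a list, stripped_strings is the more direct choice.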
