BeautifulSoup 4: Working With URLs Containing <br />
I am dealing with HTML/XHTML links with BeautifulSoup 4.3.2 and am running into some weirdness related to elements nested inside other elements.
import re
from bs4 import BeautifulSoup
html = BeautifulSoup('<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000<br /></a></body></html>')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
This gives an empty list.
As I already found out, this is caused by the <br /> tag appearing inside the <a> tag. Well, let's replace it with a newline, as someone here suggested.
html.find('br').replaceWith('\n')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
Again an empty list, damn it.
Maybe:
html.find('br').replaceWith('')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
The same result.
But
html = BeautifulSoup('<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000</a></body></html>')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))
[<a href="/track?no=ABCD0000000">ABCD0000000</a>]
- Works great.
So, as I see it, there is no way around this other than to clear or replace the br before feeding the data to bs4.
import re
re.sub(re.compile(r'<br\s*/>', re.IGNORECASE), '\n', '<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000<br /></a></body></html>')
Or does anyone have a better idea?
Thanks for the suggestions and additions.
Regards, ~ S.
One option is to remove all br tags using extract() and then search:
import re
from bs4 import BeautifulSoup
html = BeautifulSoup('<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000<br /></a></body></html>')
for br in html('br'):
    br.extract()
print(html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE)))
This prints:
[<a href="/track?no=ABCD0000000">ABCD0000000</a>]
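As far as I can tell, the reason extract() works where replaceWith() did not is that the text filter is compared against the tag's .string, and .string is None whenever a tag has more than one child. Replacing the br with '\n' or '' still leaves a second (string) child behind. A minimal sketch to check this yourself; the variable names are just for illustration, and I pass 'html.parser' explicitly, which is an assumption about your setup:

```python
from bs4 import BeautifulSoup

MARKUP = '<a href="/track?no=ABCD0000000">ABCD0000000<br /></a>'

# Case 1: untouched markup. The link has two children (text + <br/>).
soup = BeautifulSoup(MARKUP, 'html.parser')
link = soup.find('a')
children_before = len(link.contents)
string_before = link.string          # None, because there are two children

# Case 2: replace <br/> with a newline. Still two children (two strings).
soup.find('br').replace_with('\n')
children_after_replace = len(link.contents)
string_after_replace = link.string   # still None

# Case 3: fresh parse, then remove <br/> entirely. One child remains.
soup = BeautifulSoup(MARKUP, 'html.parser')
link = soup.find('a')
soup.find('br').extract()
children_after_extract = len(link.contents)
string_after_extract = link.string   # the single remaining text node

print(string_before, string_after_replace, string_after_extract)
```

So any fix has to leave the link with exactly one text child, or avoid relying on .string altogether.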
Another option is to check that the href attribute ends with ABCD0000000 using a CSS selector:
html.select('a[href$="ABCD0000000"]')
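Since the selector only looks at the attribute, it should not care whether the br is still inside the link. A self-contained sketch under that assumption (the markup is the one from the question; 'html.parser' and the variable names are my own choices):

```python
from bs4 import BeautifulSoup

MARKUP = ('<html><head></head><body>'
          '<a href="/track?no=ABCD0000000">ABCD0000000<br /></a>'
          '</body></html>')

soup = BeautifulSoup(MARKUP, 'html.parser')

# [href$="..."] matches elements whose href attribute ends with the value.
# The <br/> is left in place on purpose: only the attribute is inspected.
matches = soup.select('a[href$="ABCD0000000"]')

for a in matches:
    print(a['href'])
```

Note that, unlike the regular expression in the question, this attribute match is case-sensitive.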
Another option is to use a function and check that the link text starts with ABCD0000000:
html.find_all(lambda tag: tag.name == 'a' and tag.text.startswith('ABCD0000000'))
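If you want to keep the case-insensitive regular expression from the question, the same idea can be written as a named function that tests the tag's full text via get_text(), which concatenates all nested strings and so is not affected by the br. This is only a sketch; is_matching_link and PATTERN are hypothetical names:

```python
import re
from bs4 import BeautifulSoup

MARKUP = ('<html><head></head><body>'
          '<a href="/track?no=ABCD0000000">ABCD0000000<br /></a>'
          '<a href="/other">something else</a>'
          '</body></html>')

PATTERN = re.compile('ABCD0000000', re.IGNORECASE)

def is_matching_link(tag):
    """Return True for <a> tags whose combined text matches PATTERN."""
    return tag.name == 'a' and PATTERN.search(tag.get_text()) is not None

soup = BeautifulSoup(MARKUP, 'html.parser')

# find_all accepts a function that receives each tag and returns a bool.
links = soup.find_all(is_matching_link)

for a in links:
    print(a.get('href'))
```

This leaves the document untouched, which may matter if you still need the br elements later.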