
BeautifulSoup 4: Working With URLs Containing <br />

I am working with HTML/XHTML links in BeautifulSoup 4.3.2 and am running into some odd behavior when an element contains other elements.

import re
from bs4 import BeautifulSoup

html = BeautifulSoup('<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000<br /></a></body></html>')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))

This gives an empty list.

As I already found, this is caused by the br tag appearing inside the a tag. Well, let's replace it with a newline, as someone here suggested.

html.find('br').replace_with('\n')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))

Again an empty list, damn it.

Maybe an empty string works instead:

html.find('br').replace_with('')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))

The same result.

But without the br tag in the markup:

html = BeautifulSoup('<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000</a></body></html>')
html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE))

[<a href="/track?no=ABCD0000000">ABCD0000000</a>]

This works great.

So, as far as I can see, the only way around this is to remove or replace the br before feeding the data to bs4.

import re
# Raw string so that \s reaches the regex engine unchanged.
re.sub(re.compile(r'<br\s*/>', re.IGNORECASE), '\n', '<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000<br /></a></body></html>')
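One thing to watch with this kind of pre-processing: a pattern that requires the slash will skip a plain `<br>`. Below is a minimal sketch that makes the slash optional. The names `BR_TAG` and `replace_br` are made up for illustration, and the pattern still ignores br tags that carry attributes.

```python
import re

# Matches <br>, <br/>, <br /> and <BR>, but not a br tag with attributes.
BR_TAG = re.compile(r'<br\s*/?>', re.IGNORECASE)


def replace_br(markup, replacement='\n'):
    """Return markup with every bare br tag replaced by `replacement`."""
    return BR_TAG.sub(replacement, markup)


cleaned = replace_br('<a href="/track?no=ABCD0000000">ABCD0000000<br /></a>')
print(repr(cleaned))
```

After this substitution the a tag contains only text, so it parses to a single text node.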

      

Or does anyone have a better idea?

Thanks for the suggestions and additions.

Regards, ~ S.



1 answer


One option is to remove all br tags using extract() and then search:

import re
from bs4 import BeautifulSoup

html = BeautifulSoup('<html><head></head><body><a href="/track?no=ABCD0000000">ABCD0000000<br /></a></body></html>')

# Detach every br tag from the tree so each link is left with a single text child.
for br in html('br'):
    br.extract()

print(html.find_all('a', text=re.compile('ABCD0000000', re.IGNORECASE)))

This prints:

[<a href="/track?no=ABCD0000000">ABCD0000000</a>]
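As far as I understand it, removing the br helps because the text filter is compared against each tag's `.string`, and `.string` is only set when a tag has exactly one child. The sketch below contrasts replacing the br with a newline (two text nodes remain) against extracting it (one text node remains). Passing `'html.parser'` explicitly and the variable names are my own choices, not part of the question.

```python
from bs4 import BeautifulSoup

MARKUP = '<a href="/track?no=ABCD0000000">ABCD0000000<br /></a>'

# Case 1: replace the br with a newline, as tried in the question.
replaced = BeautifulSoup(MARKUP, 'html.parser')
string_before = replaced.a.string          # None: a text node plus the br tag
replaced.br.replace_with('\n')
string_after_replace = replaced.a.string   # still None: two separate text nodes

# Case 2: remove the br entirely.
extracted = BeautifulSoup(MARKUP, 'html.parser')
extracted.br.extract()
string_after_extract = extracted.a.string  # the single remaining text node

print(string_before, string_after_replace, string_after_extract)
```

Note that `get_text()` joins all text nodes, so it returns the link text in both cases.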

      




Another option is to check that the href attribute ends with ABCD0000000 using a CSS selector:

html.select('a[href$="ABCD0000000"]')
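Since the selector looks only at the href attribute, the br inside the link does not matter. Here is a small self-contained sketch; the second link is a made-up example added to show that non-matching links are skipped.

```python
from bs4 import BeautifulSoup

# The second link is hypothetical and only here as a non-matching case.
MARKUP = (
    '<a href="/track?no=ABCD0000000">ABCD0000000<br /></a>'
    '<a href="/track?no=WXYZ1111111">WXYZ1111111<br /></a>'
)
soup = BeautifulSoup(MARKUP, 'html.parser')

# [attr$="value"] matches when the attribute value ends with "value".
links = soup.select('a[href$="ABCD0000000"]')
hrefs = [link['href'] for link in links]
print(hrefs)
```

If the number can appear in the middle of the URL, `*=` (contains) may be a better fit than `$=` (ends with).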

      




Another option is to use a function and check that the link text starts with ABCD0000000:

html.find_all(lambda tag: tag.name == 'a' and tag.text.startswith('ABCD0000000'))
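The lambda is case-sensitive, while the question searched with re.IGNORECASE. A possible variant, sketched here with a hypothetical helper named `is_matching_link`, runs the compiled pattern against `get_text()`, which joins all text inside the link regardless of nested tags. The lowercase link text and the second link are made-up test data.

```python
import re
from bs4 import BeautifulSoup

PATTERN = re.compile('ABCD0000000', re.IGNORECASE)


def is_matching_link(tag):
    """True for a tags whose combined text matches PATTERN."""
    return tag.name == 'a' and PATTERN.search(tag.get_text()) is not None


# Hypothetical markup: lowercase link text plus an unrelated link.
MARKUP = (
    '<a href="/track?no=ABCD0000000">abcd0000000<br /></a>'
    '<a href="/other">something else</a>'
)
soup = BeautifulSoup(MARKUP, 'html.parser')

matches = soup.find_all(is_matching_link)
match_hrefs = [tag['href'] for tag in matches]
print(match_hrefs)
```

This leaves the tree untouched, which may matter if the br tags are needed later.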

      
