Remove tag from text using BeautifulSoup

Question

There are many questions here with a similar title, but I am trying to remove the tag from the soup object itself.

I have a page that contains, among other things, this div:

<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>

I can select <div id="content"> with soup.find('div', id='content'), but I want to remove <div id="blah"> from it.

python html beautifulsoup

Juicy 16 jul. 15 at 10:24

2 answers

The method Tag.decompose removes a tag from the tree. So first find the div tag:

div = soup.find('div', {'id':'content'})

Loop over all of its children except the first:

for child in list(div)[1:]:

and try to decompose the children:

    try:
        child.decompose()
    except AttributeError: pass

import bs4 as bs

content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
soup = bs.BeautifulSoup(content, 'html.parser')  # name the parser explicitly
div = soup.find('div', {'id':'content'})
for child in list(div)[1:]:
    try:
        child.decompose()
    except AttributeError: pass
print(div)

gives

<div id="content">
I want to keep this
</div>
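Note that the loop above also drops the <br /> tag. If only the nested div should go, a minimal sketch (assuming the html.parser backend and the ids from the question) is to look it up by id and decompose just that tag:

```python
from bs4 import BeautifulSoup

html = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''

# Assumption: the built-in html.parser backend; other parsers should behave
# the same way for this snippet.
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", id="content")

# Search only inside the content div for the tag to remove.
blah = content.find("div", id="blah")
if blah is not None:
    blah.decompose()  # removes the tag and everything inside it from the tree

result = str(content)
print(result)
```

The None check keeps the snippet from raising if the page has no such div.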

The equivalent using lxml would be

import lxml.html as LH

content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
root = LH.fromstring(content)

div = root.xpath('//div[@id="content"]')[0]
for child in list(div):  # iterate over a copy, since the loop removes children
    div.remove(child)
print(LH.tostring(div, encoding='unicode'))
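In lxml, remove() also discards the tail text that follows the removed element. As a hedged alternative, lxml.html elements offer drop_tree(), which removes the element and its contents but keeps the tail text. The sketch below targets only the nested div from the question:

```python
import lxml.html as LH

content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
root = LH.fromstring(content)

# Find the nested div relative to the parsed element and drop it.
# drop_tree() removes the element and its children; its tail text is
# joined to the previous sibling or the parent.
for node in root.xpath('.//div[@id="blah"]'):
    node.drop_tree()

result = LH.tostring(root, encoding="unicode")
print(result)
```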

unutbu 16 jul. 15 at 10:36

styvane · Accepted Answer · 2015-07-16T10:32:49+0000

You can use extract if you want to remove a tag or string from the tree.

In [14]: soup = BeautifulSoup("""<div id="content">
   ....: I want to keep this<br /><div id="blah">I want to remove this</div>
   ....: </div>""")

In [15]: blah = soup.find(id='blah')

In [16]: _ = blah.extract()

In [17]: soup
Out[17]: 
<html><body><div id="content">
I want to keep this<br/>
</div></body></html>
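Unlike decompose, extract returns the element it removed, so it can be inspected or reused elsewhere. A small sketch of that difference, assuming the html.parser backend (which does not add the html and body wrappers shown above):

```python
from bs4 import BeautifulSoup

html = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")
blah = soup.find(id="blah")

# extract() detaches the tag from the tree and returns it.
removed = blah.extract()

remaining = str(soup)
removed_text = removed.get_text()

print(remaining)     # the tree without the nested div
print(removed_text)  # text of the detached tag
```

If the removed tag is not needed afterwards, decompose does the same removal and also destroys the tag.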
