Correct way to strip tags, except some, in Python

For example, I have an HTML document that contains markup like this:

<a href="some" class="some" onclick="return false;">anchor</a>
<table id="some">
    <tr>
        <td class="some">
        </td>
    </tr>
</table>
<p class="" style="">content</p>

      

I want to remove the attributes from every tag and keep only the tags themselves (for table, tr, td and th, for example), apart from a few attributes such as href. The result should look like this:

<a href="some">anchor</a>
<table>
    <tr>
        <td>

        </td>
    </tr>
</table>
<p>content</p>

      

I am using a for loop, but my code fetches each tag and clears it one at a time. I think this approach is slow.

What would you suggest? Thanks.

Update #1

In my solution I am using this code to remove tags (taken from Django):

import re

def remove_tags(html, tags):
    """Returns the given HTML with given tags removed."""
    tags = [re.escape(tag) for tag in tags.split()]
    tags_re = '(%s)' % '|'.join(tags)
    starttag_re = re.compile(r'<%s(/?>|(\s+[^>]*>))' % tags_re, re.U)
    endtag_re = re.compile('</%s>' % tags_re)
    html = starttag_re.sub('', html)
    html = endtag_re.sub('', html)
    return html

      

And this code to clean up HTML attributes:

# Note: this code doesn't remove empty tags (tags without content, etc.), such as `<div><img></div>`
import lxml.html.clean

html = 'Some html code'

safe_attrs = lxml.html.clean.defs.safe_attrs
cleaner = lxml.html.clean.Cleaner(safe_attrs_only=True, safe_attrs=frozenset())
html = cleaner.clean_html(html)

      



2 answers


Use BeautifulSoup.

html = """
<a href="some" class="some" onclick="return false;">anchor</a>
<table id="some">
    <tr>
        <td class="some">
        </td>
    </tr>
</table>
<p class="" style="">content</p>
"""

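from bs4 import BeautifulSoup

# Name the parser explicitly; "lxml" wraps the fragment in <html><body>, as in the output below.
soup = BeautifulSoup(html, "lxml")

# Assigning an empty dict drops every attribute of that tag.
soup.table.tr.td.attrs = {}
soup.table.attrs = {}
print(soup.prettify())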

<html>
 <body>
  <a class="some" href="some" onclick="return false;">
   anchor
  </a>
  <table>
   <tr>
    <td>
    </td>
   </tr>
  </table>
  <p class="" style="">
   content
  </p>
 </body>
</html>

      

To clear a tag's contents:

soup = BeautifulSoup(html, "lxml")

soup.table.clear()
print(soup.prettify())

<html>
 <body>
  <a class="some" href="some" onclick="return false;">
   anchor
  </a>
  <table id="some">
  </table>
  <p class="" style="">
   content
  </p>
 </body>
</html>
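Note that clear() empties a tag but leaves the tag itself in place. If the goal is to get rid of the tags, BeautifulSoup also offers unwrap() (drop the tag, keep its contents) and decompose() (drop the tag and everything inside it). A minimal sketch, assuming the built-in "html.parser" backend and a made-up sample string:

```python
from bs4 import BeautifulSoup

# Sample markup made up for this illustration.
html = '<div><table id="some"><tr><td class="some">cell</td></tr></table><p>content</p></div>'

# unwrap(): remove the tags but keep what is inside them.
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(["table", "tr", "td", "th"]):
    tag.unwrap()
unwrapped = str(soup)
print(unwrapped)   # <div>cell<p>content</p></div>

# decompose(): remove the tag together with everything inside it.
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("table"):
    tag.decompose()
decomposed = str(soup)
print(decomposed)  # <div><p>content</p></div>
```

Which of the two fits depends on whether the text inside the removed tags should survive.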

      

To remove a particular attribute:

soup = BeautifulSoup(html, "lxml")

td_tag = soup.table.td
del td_tag['class']
print(soup.prettify())

<html>
 <body>
  <a class="some" href="some" onclick="return false;">
   anchor
  </a>
  <table id="some">
   <tr>
    <td>
    </td>
   </tr>
  </table>
  <p class="" style="">
   content
  </p>
 </body>
</html>
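The examples above touch one tag at a time. To get the result asked for in the question (every attribute removed except a few), the same idea can be applied to all tags in one pass. This is only a sketch: the KEEP whitelist and the strip_attrs name are made up for illustration, and it uses the built-in "html.parser" backend so the fragment is not wrapped in html/body.

```python
from bs4 import BeautifulSoup

# Hypothetical whitelist: attribute names to keep, per tag name.
KEEP = {"a": {"href"}}

def strip_attrs(html, keep=KEEP):
    """Return html with every attribute removed except the whitelisted ones."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):  # True matches every tag
        allowed = keep.get(tag.name, set())
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in allowed}
    return str(soup)

sample = '<a href="some" class="some" onclick="return false;">anchor</a><p class="" style="">content</p>'
print(strip_attrs(sample))  # <a href="some">anchor</a><p>content</p>
```

The document is parsed once and each tag is visited once, so this should avoid the repeated per-tag fetching described in the question.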

      



What you are looking for is called parsing.

BeautifulSoup is one of the most popular and widely used HTML parsing libraries. You can use it to remove tags, and it is pretty well documented.

If you (for some reason) cannot use BeautifulSoup, take a look at the Python module re.
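Regular expressions tend to break on real-world HTML (for example, a ">" inside an attribute value), so if third-party packages are not an option, the standard library's html.parser may be a safer fallback than re. The sketch below swaps in that module rather than re; the AttrStripper class and strip_attrs function are hypothetical names, and comments and doctype declarations are simply dropped.

```python
from html import escape
from html.parser import HTMLParser

class AttrStripper(HTMLParser):
    """Re-emit HTML, keeping only whitelisted attributes (illustrative helper)."""

    def __init__(self, keep):
        # convert_charrefs=False so entities in text are passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.keep = keep  # e.g. {"a": {"href"}}
        self.out = []

    def _open(self, tag, attrs, close=""):
        allowed = self.keep.get(tag, set())
        # Attribute values arrive unescaped, so escape them again on output.
        kept = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name in allowed
        )
        self.out.append("<%s%s%s>" % (tag, kept, close))

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, " /")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

def strip_attrs(html, keep):
    parser = AttrStripper(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

sample = '<a href="some" class="some" onclick="return false;">anchor</a>\n<p class="" style="">content</p>'
print(strip_attrs(sample, {"a": {"href"}}))
```

Whitespace and text between tags pass through untouched, so the output keeps the layout of the input.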
