Correct way to strip tags, except some, in Python
For example, I have HTML that contains markup like this:
<a href="some" class="some" onclick="return false;">anchor</a>
<table id="some">
<tr>
<td class="some">
</td>
</tr>
</table>
<p class="" style="">content</p>
I want to remove the tag attributes and keep only the bare tags (and also be able to delete some tags entirely, such as table, tr, td, th). Note that href on the a tag is kept. So I want to get something like this:
<a href="some">anchor</a>
<table>
<tr>
<td>
</td>
</tr>
</table>
<p>content</p>
I am using a for loop, but my code fetches each tag and clears it one at a time, and I think this approach is slow.
What would you suggest? Thanks.
Update #1
In my solution I am using this code to remove tags (borrowed from Django):
import re

def remove_tags(html, tags):
    """Returns the given HTML with given tags removed."""
    # `tags` is a space-separated string of tag names, e.g. "table tr td th".
    tags = [re.escape(tag) for tag in tags.split()]
    tags_re = '(%s)' % '|'.join(tags)
    # Opening tags, with or without attributes (also self-closing ones).
    starttag_re = re.compile(r'<%s(/?>|(\s+[^>]*>))' % tags_re, re.U)
    endtag_re = re.compile('</%s>' % tags_re)
    html = starttag_re.sub('', html)
    html = endtag_re.sub('', html)
    return html
And this code to clean up HTML attributes:
# But this code doesn't remove empty tags (tags without content, etc.) like this: `<div><img></div>`
import lxml.html.clean
html = 'Some html code'
safe_attrs = lxml.html.clean.defs.safe_attrs
cleaner = lxml.html.clean.Cleaner(safe_attrs_only=True, safe_attrs=frozenset())
html = cleaner.clean_html(html)
Use BeautifulSoup. To remove all attributes from specific tags:
html = """
<a href="some" class="some" onclick="return false;">anchor</a>
<table id="some">
<tr>
<td class="some">
</td>
</tr>
</table>
<p class="" style="">content</p>
"""
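from bs4 import BeautifulSoup
# Name the parser explicitly; "lxml" wraps fragments in <html><body>, as in the output below.
soup = BeautifulSoup(html, "lxml")
# Assign an empty dict to drop every attribute of a tag.
soup.table.tr.td.attrs = {}
soup.table.attrs = {}
print(soup.prettify())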
<html>
<body>
<a class="some" href="some" onclick="return false;">
anchor
</a>
<table>
<tr>
<td>
</td>
</tr>
</table>
<p class="" style="">
content
</p>
</body>
</html>
To clear a tag's contents:
soup = BeautifulSoup(html, "lxml")
soup.table.clear()
print(soup.prettify())
<html>
<body>
<a class="some" href="some" onclick="return false;">
anchor
</a>
<table id="some">
</table>
<p class="" style="">
content
</p>
</body>
</html>
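Note that clear() only empties the tag; the tag itself stays in the tree. If the goal is what remove_tags in the question does (drop the tags but keep what is inside them), BeautifulSoup's unwrap() may be a closer fit. A minimal sketch, assuming a shortened sample string and the built-in "html.parser" (neither is from the original answer):

```python
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption, made up for this illustration).
html = '<table id="some"><tr><td class="some">cell</td></tr></table><p>content</p>'

# "html.parser" ships with Python, so no extra parser install is needed.
soup = BeautifulSoup(html, "html.parser")

# unwrap() removes the tag itself but leaves its children in place.
for tag in soup.find_all(["table", "tr", "td", "th"]):
    tag.unwrap()

unwrapped = str(soup)
print(unwrapped)  # cell<p>content</p>
```

If a tag should disappear together with its contents, decompose() can be used instead of unwrap().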
To remove a particular attribute:
soup = BeautifulSoup(html, "lxml")
td_tag = soup.table.td
del td_tag['class']
print(soup.prettify())
<html>
<body>
<a class="some" href="some" onclick="return false;">
anchor
</a>
<table id="some">
<tr>
<td>
</td>
</tr>
</table>
<p class="" style="">
content
</p>
</body>
</html>
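The examples above address one tag at a time. To get the output the question asks for, the same idea can be applied to every tag in a single pass while keeping a whitelist of attributes. This is only a sketch: KEEP_ATTRS and strip_attrs are hypothetical names, and the whitelist containing only href is inferred from the expected output in the question.

```python
from bs4 import BeautifulSoup

html = """
<a href="some" class="some" onclick="return false;">anchor</a>
<table id="some"><tr><td class="some"></td></tr></table>
<p class="" style="">content</p>
"""

# Hypothetical whitelist: attribute names to keep (inferred from the question).
KEEP_ATTRS = {"href"}

def strip_attrs(markup, keep=KEEP_ATTRS):
    """Return markup with every attribute not listed in `keep` removed."""
    soup = BeautifulSoup(markup, "html.parser")
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}
    return str(soup)

cleaned = strip_attrs(html)
print(cleaned)
```

The document is parsed once and each tag is visited once, so this should not be noticeably slower than clearing tags one by one by hand.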
What you are looking for is called parsing.
BeautifulSoup is one of the most popular and widely used HTML parsing libraries. You can use it to remove tags, and it is pretty well documented.
If you (for some reason) cannot use BeautifulSoup, take a look at the Python module re.
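For the re route, a rough sketch of attribute stripping could look like the following. The names TAG_RE, ATTR_RE and strip_attrs_re are made up for illustration, and the patterns assume simple, well-formed markup with quoted attribute values: a ">" inside an attribute value, unquoted values, or comments containing tags will confuse them, which is why a real parser is usually the safer choice.

```python
import re

# Opening tag: name in group 1, raw attribute text in group 2, optional "/" in group 3.
TAG_RE = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)\b([^>]*?)(/?)>')
# name="value" or name='value' pairs inside the attribute text.
ATTR_RE = re.compile(r'''([\w-]+)\s*=\s*("[^"]*"|'[^']*')''')

def strip_attrs_re(html, keep=("href",)):
    """Drop every attribute whose name is not in `keep` (hypothetical helper)."""
    def rewrite(match):
        name, attr_text, slash = match.groups()
        kept = [f' {a}={v}' for a, v in ATTR_RE.findall(attr_text) if a.lower() in keep]
        return f'<{name}{"".join(kept)}{slash}>'
    return TAG_RE.sub(rewrite, html)

sample = '<a href="some" class="some" onclick="return false;">anchor</a><p class="" style="">content</p>'
result = strip_attrs_re(sample)
print(result)  # <a href="some">anchor</a><p>content</p>
```

Closing tags are left untouched because the tag pattern requires a letter right after "<".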