HTML to Markdown with html2text
I can successfully convert some HTML to markdown in Python using the html2text library, and it looks like this:
import html2text

def mark_down_formatting(html_text, url):
    h = html2text.HTML2Text()

    # Options to transform URLs into absolute links
    h.body_width = 0
    h.protect_links = True
    h.wrap_links = False
    h.baseurl = url

    md_text = h.handle(html_text)
    return md_text
That was fine for the time being, but it has limitations: I can't find any way to customize the output in the documentation.
I don't really need a lot of customization. I only need this HTML tag, <span class="searched_found">example text</span>, to be converted to markdown with whatever delimiter I give, e.g. +example text+.
So I'm looking for a solution to my problem. Since html2text is a nice library that lets me tweak parameters like the link-handling options shown above, a solution based on this library would be ideal.
UPDATE:
I have a solution using the BeautifulSoup library, but I believe this is a temporary patch as it adds another dependency and a lot of unnecessary processing. What I did here was to edit the HTML before converting it to markdown:
from bs4 import BeautifulSoup

def processing_to_markdown(html_text, url, delimiter):
    # Not using the "lxml" parser since I get to see a lot of different
    # HTML, and the "lxml" parser tends to drop content when parsing very
    # big HTML that has some errors inside
    soup = BeautifulSoup(html_text, "html.parser")

    # Find all <span class="searched_found">...</span> tags
    for tag in soup.find_all('span', class_="searched_found"):
        tag.string = delimiter + tag.string + delimiter
        tag.unwrap()  # Removes the tags to only keep the text

    html_text = str(soup)  # unicode(soup) on Python 2
    return mark_down_formatting(html_text, url)
With very long HTML content, this turns out to be quite slow as we parse the HTML twice, once with BeautifulSoup and then with html2text.
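If the extra dependency is the main concern, the same pre-processing step can be sketched with only the standard library's html.parser, re-emitting the markup as it streams through. The names below (SpanDelimiter, rewrite_spans) are mine, and the sketch assumes the target spans are not nested; it does not remove the second parsing pass, since html2text still parses the rewritten HTML afterwards.

```python
from html.parser import HTMLParser

class SpanDelimiter(HTMLParser):
    """Re-emits HTML unchanged, except that
    <span class="searched_found">...</span> becomes its inner text
    wrapped in a delimiter. Assumes these spans are not nested."""

    def __init__(self, delimiter):
        super().__init__(convert_charrefs=False)
        self.delimiter = delimiter
        self.parts = []
        self._inside = False  # True while inside a matched span

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "searched_found":
            self.parts.append(self.delimiter)
            self._inside = True
        else:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span" and self._inside:
            self.parts.append(self.delimiter)
            self._inside = False
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    # Pass entities through untouched instead of decoding them
    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

def rewrite_spans(html_text, delimiter):
    parser = SpanDelimiter(delimiter)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

The rewritten string can then be handed to mark_down_formatting as before; whether this is actually faster than BeautifulSoup on your inputs would need measuring.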