HTML to Markdown with html2text

I can successfully convert HTML to Markdown in Python using the html2text library, and it looks like this:

import html2text


def mark_down_formatting(html_text, url):
    h = html2text.HTML2Text()

    # Disable line wrapping and keep links intact
    h.body_width = 0
    h.protect_links = True
    h.wrap_links = False

    # Base URL used to turn relative links into absolute ones
    h.baseurl = url

    md_text = h.handle(html_text)

    return md_text
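
For example, on a small snippet it does something like this (the URLs are just placeholders, and the exact spacing of the output depends on the html2text version):

sample = '<p>See the <a href="/docs">docs</a> for <b>details</b>.</p>'
print(mark_down_formatting(sample, "https://example.com"))
# Expected: a Markdown string with **details** in bold and the relative
# /docs link resolved against https://example.com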


And that was fine for the time being, but it has its limitations, as I can't find any way to customize the output in the documentation.

I don't really need a lot of customization. I only need this HTML tag

<span class="searched_found">example text</span>

to be converted to whatever Markdown I choose, for example:

+example text+

So I'm looking for a solution to my problem. Since html2text is a nice library that lets me tweak parameters like the link options shown above, it would be nice to have a solution based on this library.
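
What I have in mind is subclassing html2text.HTML2Text and intercepting the span tags myself. Something along these lines might work, but it is only a rough sketch: it relies on the internal handle_tag(tag, attrs, start) hook and the o() output method, which are not part of the documented API and may differ between html2text versions:

import html2text


class SpanMarkingHTML2Text(html2text.HTML2Text):
    """Sketch: wrap the text of <span class="searched_found"> in a delimiter."""

    def __init__(self, delimiter="+", **kwargs):
        super().__init__(**kwargs)
        self.delimiter = delimiter
        self._span_stack = []  # True for spans we are decorating

    def handle_tag(self, tag, attrs, start):
        if tag == "span":
            if start:
                # attrs may be a dict or a list of pairs depending on the version
                classes = (dict(attrs or {}).get("class") or "").split()
                matched = "searched_found" in classes
                self._span_stack.append(matched)
                if matched:
                    self.o(self.delimiter)
                    return
            else:
                if self._span_stack and self._span_stack.pop():
                    self.o(self.delimiter)
                    return
        super().handle_tag(tag, attrs, start)

The rest of mark_down_formatting would stay the same, just instantiating SpanMarkingHTML2Text() instead of html2text.HTML2Text(), which would avoid the extra parsing pass of the workaround below.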

UPDATE:

I have a solution using the BeautifulSoup library, but I believe this is a temporary patch, as it adds another dependency and a lot of unnecessary processing. What I did here was to edit the HTML before converting it to Markdown:

from bs4 import BeautifulSoup


def processing_to_markdown(html_text, url, delimiter):
    # Not using the "lxml" parser since I get to see a lot of different HTML
    # and the "lxml" parser tends to drop content when parsing very big HTML
    # that has some errors inside
    soup = BeautifulSoup(html_text, "html.parser")

    # Finds all <span class="searched_found">...</span> tags
    for tag in soup.find_all('span', class_="searched_found"):
        tag.string = delimiter + tag.string + delimiter
        tag.unwrap()  # Removes the tags to only keep the text

    html_text = str(soup)

    return mark_down_formatting(html_text, url)
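
Calling it with + as the delimiter gives me the kind of output I described above (again, the exact spacing depends on html2text):

sample = 'Some <span class="searched_found">example text</span> in a <b>page</b>.'
print(processing_to_markdown(sample, "https://example.com", "+"))
# Prints roughly: Some +example text+ in a **page**.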


With very long HTML content, this turns out to be quite slow as we parse the HTML twice, once with BeautifulSoup and then with html2text.
