How can I convert an HTML fragment to another HTML fragment?

I have a browser-based editor built on contentEditable, where users can copy/paste or select HTML snippets to place inside.

These snippets can be any type of HTML, so we have to sanitize the content so that it doesn't contain dangerous tags (e.g., <script>, etc.).

I know of several sanitizer libraries that support some kind of whitelist policy (like jsoup on the JVM), but their rules are usually very simple: they say which tags and attributes are whitelisted, and nothing else.

We need more complex rules, for example:

  • Decide which inline styles to keep and which to remove
  • Convert relative links to absolute links
  • Blacklist or whitelist some tags according to their className
  • Allow some URI attributes according to the URI pattern (for example, only allow links to a specific domain)
  • In some cases, "replace" forbidden DOM nodes with their children (to remove the formatting and HTML layout elements without losing the text nodes that were inside the blacklisted tags)
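
For illustration, the rules above can be expressed as one tree rewrite in which a policy function decides, per element, whether to keep it (with filtered attributes), unwrap it (replace it with its children), or drop it. The following is only a minimal TypeScript sketch under stated assumptions: the node types, the `sanitize` function, `examplePolicy`, and all constants (base URL, allowed host, tag/class/style lists) are hypothetical names invented for this example, tag names are assumed to be lower-case, and parsing and serializing are left to a real HTML parser (for example DOMParser in the browser or jsoup on the JVM).

```typescript
// Hypothetical minimal tree model standing in for parsed HTML (not a real DOM API).
type TextNode = { kind: "text"; text: string };
type ElementNode = { kind: "element"; tag: string; attrs: Record<string, string>; children: HtmlNode[] };
type HtmlNode = TextNode | ElementNode;

// Per-element verdict returned by a policy.
type Decision =
  | { action: "keep"; attrs: Record<string, string> } // keep the element with these attributes only
  | { action: "unwrap" }                               // remove the element, keep its children
  | { action: "drop" };                                // remove the element and its whole subtree
type Policy = (el: ElementNode) => Decision;

// Generic rewrite: walks the tree once and applies the policy to every element.
function sanitize(nodes: HtmlNode[], policy: Policy): HtmlNode[] {
  const out: HtmlNode[] = [];
  for (const node of nodes) {
    if (node.kind === "text") { out.push(node); continue; }
    const decision = policy(node);
    if (decision.action === "drop") continue;
    const children = sanitize(node.children, policy);
    if (decision.action === "unwrap") out.push(...children);
    else out.push({ kind: "element", tag: node.tag, attrs: decision.attrs, children });
  }
  return out;
}

// Assumed example settings; adjust to the real application.
const BASE_URL = "https://example.com/docs/";
const ALLOWED_HOST = "example.com";
const KEPT_STYLES = new Set(["color", "font-weight"]);
const BLOCKED_CLASSES = new Set(["tracking-widget"]);
const DROPPED_TAGS = new Set(["script", "style", "iframe"]);
const KEPT_TAGS = new Set(["p", "b", "i", "em", "strong", "span", "ul", "ol", "li"]);

// Naive inline-style filter: assumes no ";" or ":" inside property values.
function filterStyle(style: string): string {
  return style
    .split(";")
    .map((decl) => decl.trim())
    .filter((decl) => KEPT_STYLES.has((decl.split(":")[0] ?? "").trim().toLowerCase()))
    .join("; ");
}

const examplePolicy: Policy = (el) => {
  const classes = (el.attrs["class"] ?? "").split(/\s+/);
  if (DROPPED_TAGS.has(el.tag) || classes.some((c) => BLOCKED_CLASSES.has(c))) {
    return { action: "drop" };
  }
  if (el.tag === "a") {
    const href = el.attrs["href"];
    if (!href) return { action: "unwrap" };
    try {
      const url = new URL(href, BASE_URL); // resolves relative links against the base
      const isWeb = url.protocol === "http:" || url.protocol === "https:";
      if (isWeb && url.hostname === ALLOWED_HOST) return { action: "keep", attrs: { href: url.href } };
    } catch {
      // Unparseable URL: fall through and unwrap the link.
    }
    return { action: "unwrap" };
  }
  if (KEPT_TAGS.has(el.tag)) {
    const attrs: Record<string, string> = {};
    const style = filterStyle(el.attrs["style"] ?? "");
    if (style) attrs["style"] = style;
    return { action: "keep", attrs };
  }
  return { action: "unwrap" }; // unknown tags lose their markup but keep their text
};

// Small constructors for building example trees.
const text = (value: string): HtmlNode => ({ kind: "text", text: value });
const el = (tag: string, attrs: Record<string, string>, children: HtmlNode[]): HtmlNode =>
  ({ kind: "element", tag, attrs, children });

const clean = sanitize([
  el("p", { style: "color: red; position: fixed" }, [
    text("Hello "),
    el("font", {}, [text("world")]),
    el("script", {}, [text("alert(1)")]),
    el("a", { href: "intro.html" }, [text("docs")]),
    el("a", { href: "https://evil.test/" }, [text("elsewhere")]),
  ]),
], examplePolicy);
```

In this sketch the traversal is generic and all project-specific rules live in the policy function, so new rules (per class name, per URI pattern, per style property) can be added without touching the walker. Unknown tags are unwrapped by default, which keeps their text while discarding the markup.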

We've written some code to handle this so far, but I find it very hacky. Is there a known library, standard, or algorithm for this kind of thing? I am not an XML parsing/transformation expert; is there anything I could use, such as XSLT or SAX, that might help me solve this problem?

I am looking for solutions for both the browser (JS) and the JVM (Java or Scala). Any ideas on how to achieve this?


1 answer


Maybe Showdown.js can help you? https://github.com/showdownjs/showdown


