Regexp to remove all HTML tags except <br>
I am trying to write a regexp in JavaScript to remove ALL HTML tags from the input string except <br>.
I use /(<([^>]+)>)/ig for tags and tried a few things like adding [^ (br)] to it, but now I'm just confused.
Can anyone please help? I'm sure this will be a speed competition between SO gurus, so if the answer explains the logic of the expression, I'll pick it over the others.
Edit:
To pre-empt the "don't do this" replies, let me quote the following from Stack Overflow:
While it is true that asking a regular expression to parse arbitrary HTML is similar to asking Paris Hilton to write an operating system, it is sometimes appropriate to parse a limited, known set of HTML.
In this particular case, it's a bunch of text in a div that stays consistent across many pages. I just want to get rid of a few cases (1% max) where users have included span, strong, and a few other formatting tags. It isn't worth spending more time on, as it barely happens across the thousands of pages I process. If you have a better idea that is faster to implement, feel free to post it as an answer ;)
Edit 2
So many comments that I feel I should add a disclaimer: using regexp to parse HTML is bad. It won't work consistently, and there are much better ways. DOMParser is mentioned; there's Cheerio or jsdom on Node.js, and many more libraries that will parse an HTML document correctly (99% of the time). In this case, though, it is more like a line of text containing several <...> tags that I need to remove.
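For the limited case described here, the [^ (br)] attempt fails because a character class excludes individual characters (space, parentheses, b, r), not the sequence "br". A negative lookahead is the usual way to say "a tag that is not br". The following is only a sketch under that assumption; the names NON_BR_TAG and stripTagsExceptBr are made up for illustration:

```javascript
// <            opening angle bracket of a tag
// (?!br\s*\/?>) negative lookahead: fail if what follows is br>, br/> or br />
// [^>]+        otherwise consume everything up to the closing bracket
// >            closing angle bracket
// g = replace every match, i = also treat <BR> as br
const NON_BR_TAG = /<(?!br\s*\/?>)[^>]+>/gi;

// Hypothetical helper: removes every tag except plain <br> variants.
function stripTagsExceptBr(input) {
  return input.replace(NON_BR_TAG, '');
}

console.log(stripTagsExceptBr('a<span>b</span><br>c<BR />d<strong>e</strong>'));
// prints "ab<br>c<BR />de"
```

Note that a <br> carrying attributes, such as <br class="x">, does not match the lookahead and would be stripped like any other tag. This is still not an HTML parser and only suits a limited, known set of markup.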
I ended up using:
.replace(/<br>/gi,'%br%').replace(/(<([^>]+)>)/g,'')
then I split by '%br%' instead of the usual br tag. It is not an HTML parser, and I am sure it will not be able to parse 100% of the World Wide Web, but it solves my specific problem 100% of the time (just tried and tested).
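The placeholder approach above can be sketched as a small function. String.prototype.replace with a string pattern replaces only the first occurrence, so this sketch uses a global regex for the <br> step. The name splitOnBr is hypothetical, and the code assumes the text never contains the literal placeholder '%br%':

```javascript
// Assumption: '%br%' never occurs in the real text.
const PLACEHOLDER = '%br%';

// Hypothetical helper: returns the text chunks that were separated by <br>.
function splitOnBr(input) {
  return input
    .replace(/<br\s*\/?>/gi, PLACEHOLDER) // protect <br>, <br/>, <br />
    .replace(/<[^>]+>/g, '')              // drop all remaining tags
    .split(PLACEHOLDER);                  // one entry per <br>-separated chunk
}

console.log(splitOnBr('<div>one<br>two <span>three</span><br />four</div>'));
// prints [ 'one', 'two three', 'four' ]
```

Swapping the line breaks out before stripping keeps the second regex as simple as possible, at the cost of relying on a placeholder that must not appear in the input.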
Use DOMParser to parse your string, then traverse it (I used the code in this question), extracting the parts you are interested in:
var str = "<div>some text <span>some more</span><br /><a href='#'>a link</a>";
var parser = new DOMParser();
var dom = parser.parseFromString(str, "text/html");
var text = "";
var walkDOM = function (node, func) {
    func(node);
    node = node.firstChild;
    while (node) {
        walkDOM(node, func);
        node = node.nextSibling;
    }
};
walkDOM(dom, function (node) {
    if (node.tagName === 'BR') {
        text += node.outerHTML;
    }
    else if (node.nodeType === 3) { // Text node
        text += node.nodeValue;
    }
});
alert(text);