Regexp remove all html tags except

Question

I am trying to write a regexp in JavaScript to remove ALL HTML tags from the input string except <br>.

I use /(<([^>]+)>)/ig for tags and tried a few things like adding [^ (br)] to it, but now I'm just confused.

Can anyone please help? I'm sure this will be a speed competition between SO gurus, so if the answer explains the logic of the expression, I'll pick it over the others.

Edit:

To head off all the "don't do this" comments, let me quote the following from Stack Overflow:

While it is true that asking a regular expression to parse arbitrary HTML is similar to asking Paris Hilton to write an operating system, it is sometimes appropriate to parse a limited, known set of HTML.

In this particular case, it's a bunch of text in a div that stays consistent across many pages. I just want to get rid of a few cases (1% max) where users have included spans, strongs, and a few other formatting tags. It isn't worth spending more time on, as it barely happens across the thousands of pages I process. If you have a better idea that is faster to implement, feel free to post it as an answer ;)

Edit 2

So many comments that I feel I should add a disclaimer: using regexp to parse HTML is bad. It won't work consistently, and there are much better ways. DOMParser has been mentioned; there's also Cheerio or jsdom on Node.js, and plenty of other libraries that will parse an HTML document correctly (99% of the time). In this case, though, the input is more like a line of text containing several <...> tags that I need to remove.

+3

javascript html regex

xShirase Sep 16 '14 at 19:36


4 answers

Try the following:

/(<((?!br)[^>]+)>)/ig
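Since the question asks for the logic: `(?!br)` is a negative lookahead. Immediately after the opening `<`, it makes the match fail if the next two characters are `br`, so those tags are skipped and everything else is removed. Here is a minimal sketch of how it might be used; the helper name `stripExceptBr` and the sample string are made up for illustration.

```javascript
// Hypothetical helper: remove every tag except those starting with "br".
function stripExceptBr(input) {
  // "<"      opens the tag
  // (?!br)   negative lookahead: give up if "br" follows (any case, due to /i)
  // [^>]+    the rest of the tag, up to but not including ">"
  // ">"      closes the tag
  return input.replace(/(<((?!br)[^>]+)>)/ig, "");
}

const sample = "<div>one<br>two <span>three</span><br />four</div>";
console.log(stripExceptBr(sample)); // one<br>two three<br />four
```

Two caveats worth knowing: any tag whose name merely begins with "br" is kept as well, and a closing `</br>` is removed because the character after `<` is `/`, not `b`.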

+6

G4BB3R Sep 16 '14 at 19:42


Use DOMParser to parse your string, then traverse it (I used the code in this question), extracting the parts you are interested in:

var str = "<div>some text <span>some more</span><br /><a href='#'>a link</a>";
var parser = new DOMParser();
var dom = parser.parseFromString(str, "text/html");
var text = "";

// Depth-first traversal: call func on the node, then on each of its descendants.
var walkDOM = function (node, func) {
    func(node);
    node = node.firstChild;
    while (node) {
        walkDOM(node, func);
        node = node.nextSibling;
    }
};

walkDOM(dom, function (node) {
    if (node.tagName === 'BR') {
        text += node.outerHTML; // keep <br> elements as markup
    }
    else if (node.nodeType === 3) { // Text node
        text += node.nodeValue; // keep the text, drop the surrounding tags
    }
});

alert(text);


+2

Tom Fenech Sep 16 '14 at 20:00


This might work. But whatever the regex, it won't actually be parsing the HTML.

 # /(?!<\/?br\s*\/?>)<[^>]+>/g

 (?! < /? br \s* /? > )
 < [^>]+ >
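To illustrate how this pattern behaves, here is a small sketch (the helper name `stripExceptExactBr` and the sample input are assumptions, not part of the answer). The lookahead here spells out a complete br tag, so only `<br>`, `<br/>`, `<br />` and `</br>` are spared. Lookalikes such as a made-up `<brx>` are removed, and so is a br tag carrying attributes.

```javascript
// Hypothetical helper wrapping the regex from this answer.
function stripExceptExactBr(input) {
  // (?!<\/?br\s*\/?>)  at each "<", refuse to match if a whole br tag starts here
  // <[^>]+>            otherwise consume the tag and drop it
  return input.replace(/(?!<\/?br\s*\/?>)<[^>]+>/g, "");
}

const sample = "a<br>b<br />c</br>d<b>e</b><brx>f";
console.log(stripExceptExactBr(sample)); // a<br>b<br />c</br>def
```

Note that the pattern has no `i` flag, so an uppercase `<BR>` would be stripped; adding `i` is one way to change that if the input needs it.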

0

sln Sep 16 '14 at 19:44


xShirase · Accepted Answer · 2014-09-16T20:03:58+0000

I ended up using:

.replace('<br>','%br%').replace(/(<([^>]+)>)/g,'')

then I split on '%br%' instead of the usual br tag. It is not an HTML parser, and I am sure it could not handle 100% of the World Wide Web, but it solves my specific problem 100% of the time (just tried and tested).
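For anyone reusing the placeholder idea, here is a rough sketch of the whole pipeline. One thing to watch: `String.prototype.replace` with a string pattern replaces only the first occurrence, so this sketch swaps in a global regex for the `<br>` step and also accepts `<br/>` and `<br />`. The function name `splitOnBr` and the sample input are hypothetical.

```javascript
// Hypothetical helper: protect br tags with a placeholder, strip the
// remaining tags, then split the text where the br tags used to be.
function splitOnBr(input) {
  const PLACEHOLDER = "%br%"; // assumed never to occur in the real text
  return input
    .replace(/<br\s*\/?>/gi, PLACEHOLDER) // every <br>, <br/>, <br />
    .replace(/<[^>]+>/g, "")              // drop all other tags
    .split(PLACEHOLDER);
}

console.log(splitOnBr("<p>one<br>two <b>three</b><br/>four</p>"));
// [ 'one', 'two three', 'four' ]
```

If the text could ever contain the literal `%br%`, it would be split there too, so a less likely placeholder may be a safer choice.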
