Java regex to match all HTML elements except for one special case

I have a line with some markup that looks like this:

The quick brown <a href="www.fox.org">fox</a> jumped over the lazy <a href="entry://id=6000009">dog</a> <img src="dog.png" />.

I am trying to remove every tag except anchor elements with "entry://id=" inside. So the desired output from the above example would be:

The quick brown fox jumped over the lazy <a href="entry://id=6000009">dog</a>.

The closest I have come to writing this match is:

<.*?>!<a href=\"entry://id=\\d+\">.*?<\\/a>

But I can't figure out why this doesn't work. Any help (other than "why don't you use a parser" :) would be greatly appreciated!

+2




3 answers


Using this:

((<a href="entry://id=\d+">.*?</a>)|<!\[CDATA\[.*?\]\]>|<!--.*?-->|<.*?>)

      

and combining it with a replaceAll that substitutes $2 will work for your example. The test below demonstrates it:



import static org.junit.Assert.assertEquals;

import java.util.regex.Pattern;

import org.junit.Test;


public class TestStack1305864 {

    @Test
    public void matcherWithCdataAndComments() {
        String s = "The quick <span>brown</span> <a href=\"www.fox.org\">fox</a> jumped over the lazy <![CDATA[ > ]]> <a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        // Expected result. Removing the CDATA section leaves the spaces on both
        // sides of it, so there are two spaces after "lazy".
        String r = "The quick brown fox jumped over the lazy  <a href=\"entry://id=6000009\">dog</a> .";
        // Group 2 captures the anchors to keep; the other alternatives match
        // CDATA sections, comments and any remaining tag without filling group 2.
        String pattern = "((<a href=\"entry://id=\\d+\">.*?</a>)|<!\\[CDATA\\[.*?\\]\\]>|<!--.*?-->|<.*?>)";
        Pattern p = Pattern.compile(pattern);

        // A group that did not take part in the match is substituted as "".
        String t = p.matcher(s).replaceAll("$2");
        System.out.println(t);
        System.out.println(r);
        assertEquals(r, t);
    }
}

      

The idea is to capture the elements you are interested in inside a dedicated group, so that you can insert them back into the string.
That way you can replace every match:
For each element that is not one of the interesting ones, the group will be empty, and the element will be replaced with "". For interesting elements, the group will not be empty and will be written back into the result String.
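
To make the mechanism visible, here is a minimal sketch of the same idea written as an explicit find loop instead of replaceAll. The names KeepEntryLinks and strip are made up for illustration, and the CDATA and comment branches are left out for brevity, so the kept anchor is group 1 here rather than group 2.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class and method names, used only for this illustration.
class KeepEntryLinks {

    // First alternative: the anchors to keep (captured as group 1).
    // Second alternative: any other tag (no capture group).
    private static final Pattern TAG = Pattern.compile(
            "(<a href=\"entry://id=\\d+\">.*?</a>)|<.*?>");

    public static String strip(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // group(1) is null when the match came from the "any other tag" branch.
            String kept = m.group(1);
            String replacement = (kept == null) ? "" : kept;
            // quoteReplacement guards against '$' or '\' inside the kept text.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "The quick brown <a href=\"www.fox.org\">fox</a> jumped over the lazy "
                + "<a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        System.out.println(strip(s));
    }
}
```

Note that only the tags are removed: the space that preceded the img element stays in the output, so it would still need trimming if the exact text from the question is required.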

edit: handle nested < or > in CDATA and comments
edit: see http://martinfowler.com/bliki/ComposedRegex.html for a regex composition pattern designed to make regexes more readable.
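
As a rough sketch of what composing the regex could look like here, the alternatives can be given names and joined with "|". The class, constant and method names are made up for illustration. Because the outer group is dropped in this version, the kept anchor is group 1, so the replacement is $1 instead of $2.

```java
import java.util.regex.Pattern;

// Hypothetical names; the pieces are the same alternatives as in the pattern above.
class ComposedTagPattern {

    // Anchors to keep; the only capturing group, so it is group 1.
    static final String KEEP_LINK = "(<a href=\"entry://id=\\d+\">.*?</a>)";
    // CDATA sections, which may contain '<' or '>'.
    static final String CDATA = "<!\\[CDATA\\[.*?\\]\\]>";
    // Comments, which may also contain '<' or '>'.
    static final String COMMENT = "<!--.*?-->";
    // Any other tag. Must come last, otherwise it would win over the branches above.
    static final String ANY_TAG = "<.*?>";

    static final Pattern PATTERN = Pattern.compile(
            String.join("|", KEEP_LINK, CDATA, COMMENT, ANY_TAG));

    public static String strip(String html) {
        return PATTERN.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(strip(
                "a <!-- <b> --> <i>x</i> <a href=\"entry://id=1\">y</a>"));
    }
}
```

The order of the pieces matters: regex alternation tries the branches left to right, so the specific cases have to be listed before the catch-all tag branch.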

+1




I would not use regular expressions to parse HTML. HTML is not a regular language, and there is no end of edge cases to trip you up.

Check out JTidy instead.

+7




Not easy with regex. I recommend a parser that understands HTML / XML semantics.

If you insist, you can do a multi-step approach like:

  • Replace "<(a\s*href="entry:.*?/a)>"

    with "{{{{\1}}}}"

  • Replace "<(?!/a}}}})[^>]*>"

    with ""

  • Replace "{{{{"

    with "<"

  • Replace "}}}}"

    with ">"
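
In Java, the four steps above could be sketched roughly as follows. The class and method names are made up for illustration. Two details are Java-specific rather than part of the steps themselves: the back-reference \1 is written $1 in a Java replacement string, and the marker swaps in the last two steps use String.replace (literal text) because an unescaped "{" is not a valid regex in Java.

```java
// Hypothetical class and method names, used only for this illustration.
class MultiStepStrip {

    public static String strip(String html) {
        // Step 1: protect the wanted anchors by swapping their outermost
        // '<' and '>' for the markers "{{{{" and "}}}}".
        String t = html.replaceAll("<(a\\s*href=\"entry:.*?/a)>", "{{{{$1}}}}");

        // Step 2: drop every remaining tag, except the protected "</a}}}}".
        t = t.replaceAll("<(?!/a\\}\\}\\}\\})[^>]*>", "");

        // Steps 3 and 4: turn the markers back into angle brackets.
        t = t.replace("{{{{", "<");
        t = t.replace("}}}}", ">");
        return t;
    }

    public static void main(String[] args) {
        String s = "The quick brown <a href=\"www.fox.org\">fox</a> jumped over the lazy "
                + "<a href=\"entry://id=6000009\">dog</a> <img src=\"dog.png\" />.";
        System.out.println(strip(s));
    }
}
```

This sketch assumes the input never contains "{{{{" or "}}}}" on its own; if it does, those sequences are turned into angle brackets as well.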

Be warned that the above is error-prone and will fail at some point. Consider this an ugly hack, not a real solution. Something like the above works well for a one-off edit of a text file in an editor that supports regex, but it is not suitable for reuse in the real world as part of an application's data processing.

+1

