What's the best way to search for specific tokens in a string (in Java)?

I have a string containing markup, and I need to find the positions of the marked-up segments using Java.


string = abc<B>def</B>ghi<B>j</B>kl

Desired output:

segment [n] = start, end

segment [1] = 4, 6
segment [2] = 10, 10




6 answers

Regular expressions should work just fine for this.

Refer to the Javadoc for:

  • java.lang.String.split()
  • the java.util.regex package
  • java.util.Scanner

Note: StringTokenizer is not what you want, because it splits on characters, not strings. The delim string is a list of characters, any one of which acts as a delimiter. This is useful for very simple cases such as an unambiguous comma-separated list.
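
To make the difference concrete, here is a small sketch comparing the two. The class name DelimiterDemo, its helper methods, and the sample input are made up for illustration; only StringTokenizer and String.split come from the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of any library.
class DelimiterDemo {

    // StringTokenizer treats every character of delims as a separate delimiter.
    static List<String> tokenize(String input, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delims);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // String.split treats its argument as a regex, so "<B>" matches the whole tag.
    static List<String> splitOn(String input, String regex) {
        return Arrays.asList(input.split(regex));
    }

    public static void main(String[] args) {
        String input = "aBc<B>def</B>ghi";
        System.out.println(tokenize(input, "<B>")); // [a, c, def, /, ghi]
        System.out.println(splitOn(input, "<B>"));  // [aBc, def</B>ghi]
    }
}
```

Note how the tokenizer also breaks on the plain 'B' in "aBc" and leaves a stray "/" token, which is why it is a poor fit for tag-like delimiters.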



StringTokenizer will give you separate tokens if you want to break a string on delimiter characters. Alternatively, you can use the split() method on String to get the separate strings as an array; split() takes a regex as its delimiter.



Given your example, I would use a regex, and in particular I would look at the grouping functionality that Matcher provides.


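import java.util.regex.Matcher;
import java.util.regex.Pattern;

String inputString = "abc<B>def</B>ghi<B>j</B>kl";

String stringPattern = "(<B>)([a-zA-Z]+)(<\\/B>)";

Pattern pattern = Pattern.compile(stringPattern);
Matcher matcher = pattern.matcher(inputString);

// matches() only succeeds if the whole input matches; find() steps through each occurrence
while (matcher.find()) {

    String firstGroup  = matcher.group(1); // opening tag
    String secondGroup = matcher.group(2); // text between the tags
    String thirdGroup  = matcher.group(3); // closing tag
}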
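
To get from the groups to the start/end positions asked for in the question, one possible approach is to use Matcher.start() and subtract the markup characters seen so far. This is only a sketch: the class SegmentLocator and the method findSegments are names invented here, and it assumes non-nested, properly closed <B> tags and positions that are 1-based and counted in the text with the tags removed (which is what the desired output suggests).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
class SegmentLocator {

    // Reluctant match of whatever sits between one <B> and the next </B>.
    private static final Pattern BOLD = Pattern.compile("<B>(.*?)</B>");

    // Returns {start, end} pairs, 1-based and inclusive, measured in the tag-free text.
    static List<int[]> findSegments(String input) {
        List<int[]> segments = new ArrayList<>();
        Matcher matcher = BOLD.matcher(input);
        int tagChars = 0; // markup characters that precede the current segment text
        while (matcher.find()) {
            tagChars += "<B>".length();
            int start = matcher.start(1) - tagChars + 1;
            int end = start + matcher.group(1).length() - 1;
            segments.add(new int[] {start, end});
            tagChars += "</B>".length();
        }
        return segments;
    }

    public static void main(String[] args) {
        List<int[]> segments = findSegments("abc<B>def</B>ghi<B>j</B>kl");
        for (int i = 0; i < segments.size(); i++) {
            int[] s = segments.get(i);
            System.out.println("segment [" + (i + 1) + "] = " + s[0] + ", " + s[1]);
        }
        // prints:
        // segment [1] = 4, 6
        // segment [2] = 10, 10
    }
}
```

An empty pair such as <B></B> would yield an end one less than its start, so callers may want to filter those out.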




StringTokenizer takes the entire string as an argument, which is not a good idea for large strings. You can also use StreamTokenizer.

You should also look at Scanner.
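
As a rough idea of what the Scanner route could look like, the sketch below sets a regex delimiter that matches either tag and collects the pieces in between. The class name ScannerTokens and the method tokens are invented for this example, and it has only been reasoned through for input shaped like the one in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class for illustration.
class ScannerTokens {

    // Splits the input on <B> or </B> and returns the text pieces in order.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        // "</?B>" matches both the opening and the closing tag.
        try (Scanner scanner = new Scanner(input).useDelimiter("</?B>")) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("abc<B>def</B>ghi<B>j</B>kl")); // [abc, def, ghi, j, kl]
    }
}
```

With strictly alternating tags, every second token is the marked-up text, so the positions can be derived by summing token lengths.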



It's a bit brute force and makes some assumptions, but it works.

public class SegmentFinder {

    public static void main(String[] args) {
        String string = "abc<B>def</B>ghi<B>j</B>kl";
        String startRegExp = "<B>";
        String endRegExp = "</B>";
        int segmentCounter = 0;
        int currentPos = 0; // characters of plain text consumed so far
        String[] array = string.split(startRegExp);
        for (int i = 0; i < array.length; i++) {
            if (i > 0) { // Ignore the first one
                segmentCounter++;
                // this assumes that every start will have exactly one end
                String[] array2 = array[i].split(endRegExp);
                int elementLength = array2[0].length();
                System.out.println("segment[" + segmentCounter + "] = " + (currentPos + 1) + "," + (currentPos + elementLength));
                for (String s : array2) {
                    currentPos += s.length();
                }
            } else {
                // text before the first start tag
                currentPos += array[i].length();
            }
        }
    }
}




Does your input look like your example, and do you only need the text between specific tags? Then a simple StringUtils.substringsBetween(yourString, "<B>", "</B>") from the Apache Commons Lang package ( http://commons.apache.org/lang/ ) should do the job.
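
If pulling in the library is not an option, the same idea can be sketched with plain indexOf calls. To be clear, this is not the Commons Lang implementation: the class BetweenTags and its method are hypothetical stand-ins written for this example, and they simply stop at the first unmatched opening tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plain-JDK stand-in, not the Apache Commons Lang code.
class BetweenTags {

    // Returns every piece of text found between an open marker and the next close marker.
    static List<String> substringsBetween(String text, String open, String close) {
        List<String> found = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(open, from);
            if (start < 0) {
                break; // no further opening marker
            }
            start += open.length();
            int end = text.indexOf(close, start);
            if (end < 0) {
                break; // opening marker without a closing one
            }
            found.add(text.substring(start, end));
            from = end + close.length();
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(substringsBetween("abc<B>def</B>ghi<B>j</B>kl", "<B>", "</B>")); // [def, j]
    }
}
```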

If you need a more general solution, for different and possibly nested tags, you might want to look at a parser that takes HTML input and creates an XML document from it, such as NekoHTML, TagSoup, or JTidy. You can then use XPath on the XML document to access the content.
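
For the XPath step, a minimal sketch might look like the following. It swaps in the JDK's own XML parser instead of NekoHTML, TagSoup, or JTidy, so it assumes the fragment is already well-formed once wrapped in a root element; real-world HTML would need one of those lenient parsers first. The class BoldTextExtractor and the wrapper element name "root" are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration.
class BoldTextExtractor {

    // Returns the text content of every <B> element in the fragment.
    static List<String> boldTexts(String fragment) {
        try {
            // Assumption: wrapping in a single root element yields well-formed XML.
            String xml = "<root>" + fragment + "</root>";
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("//B", doc, XPathConstants.NODESET);

            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse or query the fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(boldTexts("abc<B>def</B>ghi<B>j</B>kl")); // [def, j]
    }
}
```

The expression "//B" also picks up nested <B> elements, which is the main advantage over the string-based approaches.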


