What's the best way to search for specific tokens in a string (in Java)?

I have a string containing markup, and I need to find the marked-up segments using Java.

e.g.

string = abc<B>def</B>ghi<B>j</B>kl

Desired output (positions are 1-based and count only the text, not the tags):

segment [n] = start, end

segment [1] = 4, 6
segment [2] = 10, 10

      

+1




6 answers


Regular expressions should work just fine for this.

Refer to the Javadoc for:

  • java.lang.String.split()
  • java.util.regex package
  • java.util.Scanner

Note: StringTokenizer is not what you want, because it splits on individual characters, not on strings. Its delim argument is a set of characters, any one of which acts as a delimiter. That is useful only for very simple cases such as an unambiguous comma-separated list.
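To illustrate the point about delimiters, here is a minimal sketch comparing the two. The class name, the helper names and the sample input are made up for the illustration; only StringTokenizer and String.split come from the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerVsSplit {

    // StringTokenizer: every character in delims is a delimiter on its own.
    static List<String> tokenize(String input, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, delims);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // String.split: the argument is a regex that is matched as a whole.
    static List<String> split(String input, String regex) {
        return Arrays.asList(input.split(regex));
    }

    public static void main(String[] args) {
        String input = "aBc<B>def</B>ghi";
        // '<', 'B' and '>' each delimit, so the plain 'B' and the '/' are affected too
        System.out.println(tokenize(input, "<B>")); // prints [a, c, def, /, ghi]
        // only the literal sequence <B> delimits
        System.out.println(split(input, "<B>"));    // prints [aBc, def</B>ghi]
    }
}
```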

+8




StringTokenizer will give you separate tokens, but be aware that it treats its delimiter argument as a set of single characters, not as one string. If you want to split on a specific string, use the split() method on String instead. It takes a regex, so to split on more than one tag you need a regex that matches all of them.
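A rough sketch of the split() route, assuming the tags are well formed and never nested (nothing here checks that). The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class SplitOnTags {

    // Returns the text found between <B> and </B>.
    // Assumption: every <B> has exactly one matching </B> and tags do not nest.
    static List<String> taggedPieces(String input) {
        // "</?B>" matches either the opening or the closing tag
        String[] pieces = input.split("</?B>");
        List<String> tagged = new ArrayList<>();
        // pieces alternate: outside, inside, outside, inside, ...
        for (int i = 1; i < pieces.length; i += 2) {
            tagged.add(pieces[i]);
        }
        return tagged;
    }

    public static void main(String[] args) {
        System.out.println(taggedPieces("abc<B>def</B>ghi<B>j</B>kl")); // prints [def, j]
    }
}
```

This gives you the text of each segment but not its position; for positions you still have to add up the piece lengths yourself.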



+2




Given your example, I would use a regex, and in particular look at the grouping functionality that Matcher provides.

Tom

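import java.util.regex.Matcher;
import java.util.regex.Pattern;

String inputString = "abc<B>def</B>ghi<B>j</B>kl";

// Three groups: opening tag, tagged text, closing tag
String stringPattern = "(<B>)([a-zA-Z]+)(<\\/B>)";

Pattern pattern = Pattern.compile(stringPattern);
Matcher matcher = pattern.matcher(inputString);

// find() moves to each successive match; matches() would require the
// whole input to match the pattern and returns false for this string
while (matcher.find()) {

    String firstGroup  = matcher.group(1); // "<B>"
    String secondGroup = matcher.group(2); // "def", then "j"
    String thirdGroup  = matcher.group(3); // "</B>"
}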
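To get from the groups to the start/end numbers in the question, one possible sketch is below. It assumes the positions are 1-based, inclusive and counted in the text with the tags removed, which is how the question's numbers work out. The class name and the tag-counting approach are my own, and it assumes non-nested tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentOffsets {

    // Returns {start, end} pairs: 1-based, inclusive, measured in the
    // tag-free text. Assumption: <B> tags are well formed and not nested.
    static List<int[]> find(String input) {
        Pattern pattern = Pattern.compile("<B>(.*?)</B>");
        Matcher matcher = pattern.matcher(input);
        List<int[]> segments = new ArrayList<>();
        int tagChars = 0; // tag characters seen before the current segment text
        while (matcher.find()) {
            tagChars += "<B>".length();
            // start(1) is a 0-based offset in the raw string; drop the tag
            // characters before it and shift to 1-based
            int start = matcher.start(1) - tagChars + 1;
            int end = start + matcher.group(1).length() - 1;
            segments.add(new int[] {start, end});
            tagChars += "</B>".length();
        }
        return segments;
    }

    public static void main(String[] args) {
        List<int[]> segments = find("abc<B>def</B>ghi<B>j</B>kl");
        for (int i = 0; i < segments.size(); i++) {
            int[] s = segments.get(i);
            System.out.println("segment [" + (i + 1) + "] = " + s[0] + ", " + s[1]);
        }
        // prints:
        // segment [1] = 4, 6
        // segment [2] = 10, 10
    }
}
```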

      

+2




StringTokenizer takes the entire string as an argument, which is not a good idea for large strings. You can also use StreamTokenizer, which reads from a stream instead.

You should also look at Scanner.
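For example, Scanner can read from any Readable rather than from a String held in memory. The sketch below uses a StringReader as a stand-in for a real file or network reader; the class and method names are made up for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerTokens {

    // Reads the tokens separated by <B> or </B> from any Readable source.
    static List<String> tokens(Readable source) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            // the delimiter is a regex, so one pattern covers both tags
            scanner.useDelimiter("</?B>");
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // StringReader is only a placeholder for a larger input source
        System.out.println(tokens(new StringReader("abc<B>def</B>ghi<B>j</B>kl")));
        // prints [abc, def, ghi, j, kl]
    }
}
```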

+1




It's a bit brute force and makes some assumptions, but it works.

public class SegmentFinder
{

    public static void main(String[] args)
    {
        String string = "abc<B>def</B>ghi<B>j</B>kl";
        String startRegExp = "<B>";
        String endRegExp = "</B>";
        int segmentCounter = 0;
        int currentPos = 0;
        String[] array = string.split(startRegExp);
        for (int i = 0; i < array.length; i++)
        {           
            if (i > 0) // Ignore the first one
            {
                segmentCounter++;
                // This assumes that every start tag has exactly one end tag
                String[] array2 = array[i].split(endRegExp);
                int elementLength = array2[0].length();
                System.out.println("segment["+segmentCounter +"] = "+ (currentPos+1) +","+ (currentPos+elementLength) );
                for(String s : array2)
                {
                    currentPos += s.length();  
                }
            }
            else
            {
                currentPos += array[i].length();                
            }
        }
    }
}

      

+1




Does your input look like your example, and do you just need the text between specific tags? Then a simple StringUtils.substringsBetween(yourString, "<B>", "</B>") from the Apache Commons Lang package ( http://commons.apache.org/lang/ ) should do the job.

If you need a more general solution, for different and possibly nested tags, you might want to look at a parser that takes HTML input and creates an XML document from it, such as NekoHTML, TagSoup or JTidy. Then you can use XPath on the XML document to access the content.
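If adding the dependency is not an option, a rough plain-Java stand-in could look like the sketch below. This is a hypothetical helper written for this answer, not the Commons Lang implementation, and it makes the same assumption of non-nested tags.

```java
import java.util.ArrayList;
import java.util.List;

class Between {

    // Hypothetical helper: collects every substring found between open and close.
    static List<String> substringsBetween(String text, String open, String close) {
        List<String> found = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = text.indexOf(open, pos);
            if (start < 0) {
                break; // no further opening tag
            }
            start += open.length();
            int end = text.indexOf(close, start);
            if (end < 0) {
                break; // opening tag without a closing tag: stop
            }
            found.add(text.substring(start, end));
            pos = end + close.length();
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(substringsBetween("abc<B>def</B>ghi<B>j</B>kl", "<B>", "</B>"));
        // prints [def, j]
    }
}
```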
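As a sketch of the XPath step only: if the markup is already well-formed XML, the JDK's own parser is enough, so the example below skips the HTML-cleaning libraries and wraps the fragment in a made-up root element. The class and method names are hypothetical. Real-world HTML would first need to go through one of the parsers named above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSegments {

    // Returns the text content of every B element in the fragment.
    // Assumption: the fragment is well-formed XML apart from lacking a root.
    static List<String> boldText(String markup) {
        try {
            // wrap in a single root element so the fragment parses as a document
            String xml = "<root>" + markup + "</root>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//B", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(boldText("abc<B>def</B>ghi<B>j</B>kl")); // prints [def, j]
    }
}
```

With nested tags, "//B" returns the outer and the inner element separately, and the outer element's text content includes the inner text.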

0

