How to do complex negative lookbehind for token splitting in Java?

I have a series of EDIFACT strings that need to be tokenized on the + character. However, according to the EDIFACT specification, characters can be escaped with a ? character. For example: ?? stands for a literal ?, ?+ for a literal +, and ?: for a literal :. An escaped ?+ is part of a field and therefore should not be considered a delimiter.

I used a negative lookbehind to match a + that is not preceded by a ?:

String delimiter = "\\+";
String[] tokens = data.split("(?<!\\?)" + delimiter);

      

This splits a+b+c into a, b and c, and a?+b+c into a?+b and c.

However, it does not work when the escape sequence ?? is involved. For

a??+b+c

it yields 2 tokens: a??+b and c, whereas it really should be 3 tokens: a?? (that is, a followed by an escaped ?), b and c.

On the other hand, a???+b+c must give two tokens: a???+b and c.

Is there a way to achieve this using negative lookbehind?

You can run the following test if you like.

import java.util.Arrays;

public class Main {
   public static void main(String[] args) {
      assertTokens("a+b+c", "a", "b", "c");
      assertTokens("a?+b+c", "a?+b", "c");
      assertTokens("a??+b+c", "a??", "b", "c");
      assertTokens("a???+b+c", "a???+b", "c");
   }

   private static void assertTokens(String data, String... expectedTokens) {
      String delimiter = "\\+";
      String[] tokens = data.split("(?<!\\?)" + delimiter);

      if(!Arrays.deepEquals(tokens, expectedTokens)) {
         throw new IllegalStateException("Not equals for " + data);
      }
   }
}



2 answers


Instead of splitting, tokenizing is easier with matching. In your case, for splitting to work you would need a variable-length lookbehind of unbounded length, which Java does not support.

Try the following regex:

(?:[^+:?]++|\?.)+

      


(I used the possessive quantifier ++ purely as an optimization, to avoid useless backtracking.)
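For comparison, if split() is still preferred, the lookbehind limitation mentioned above can be worked around with a bounded lookbehind, which Java does accept. The following is only a sketch under stated assumptions: the class name BoundedLookbehindSplit and the limit MAX_PAIRS are hypothetical, and a + preceded by more than MAX_PAIRS consecutive ?? pairs would wrongly be treated as escaped.

```java
import java.util.Arrays;

class BoundedLookbehindSplit {
    // Assumption: a delimiter is never preceded by more than MAX_PAIRS "??" pairs.
    static final int MAX_PAIRS = 10;

    // Splits on a '+' that is preceded by an even number of '?':
    // zero or more complete "??" pairs that are not themselves preceded by a '?'.
    static String[] split(String data) {
        String regex = "(?<=(?<!\\?)(?:\\?\\?){0," + MAX_PAIRS + "})\\+";
        return data.split(regex);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a+b+c", "a?+b+c", "a??+b+c", "a???+b+c"}) {
            System.out.println(s + " -> " + Arrays.toString(split(s)));
        }
    }
}
```

With the four inputs from the question this should yield the expected token lists, for example a??, b and c for a??+b+c. The matching approach avoids the arbitrary limit, which is why it is the one recommended here.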




If you also want to match empty tokens (a++b yielding a, an empty string, and b), the regex gets trickier:

(?:[^+:?\r\n]++|\?.)+|(?<=[+:]|^)(?=[+:]|$)

      


What it does:

  • Either match the same as above (I just added \r\n to the character class so that tokens do not span newlines)
  • Or match an empty string that is:
    • preceded by a token separator or the beginning of a line
    • and followed by a token separator or the end of a line

I added the m (multiline) flag to make this work, so that ^ and $ match the start and end of each line.
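As a rough illustration of the empty-token variant, the pattern above can be compiled with Pattern.MULTILINE (the m flag) and the matches collected in a loop. The class name EmptyTokenMatcher and the helper tokens() are hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyTokenMatcher {
    // The empty-token regex from above; MULTILINE makes ^ and $ match at line boundaries.
    static final Pattern TOKEN = Pattern.compile(
            "(?:[^+:?\\r\\n]++|\\?.)+|(?<=[+:]|^)(?=[+:]|$)",
            Pattern.MULTILINE);

    // Collects every match, including the zero-length matches that stand for empty tokens.
    static List<String> tokens(String data) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(data);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a++b"));
        System.out.println(tokens("a??+b+c"));
    }
}
```

One caveat, as far as I can tell: an escaped separator directly before a real one (a?++b) seems to produce a spurious empty token, because the lookbehind cannot tell the escaped + from a real separator.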



For reference, a complete test using the matching approach:



import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        assertTokens("+", "a+b+c", "a", "b", "c");
        assertTokens("+", "a?+b+c", "a?+b", "c");
        assertTokens("+", "a??+b+c", "a??", "b", "c");
        assertTokens("+", "a???+b+c", "a???+b", "c");
        assertTokens("+", "a?'??+b+c", "a?'??", "b", "c");

        assertTokens("\\:", "a???:b:c", "a???:b", "c");
        assertTokens("\\:", "a????:b:c", "a????", "b", "c");
    }

    private static void assertTokens(String delim, String data, String... expectedTokens) {
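        // Match whole tokens instead of splitting: runs of ordinary characters
        // or escape pairs (a ? followed by any single character).
        Pattern pattern = Pattern.compile("(?:[^" + delim + ":?]++|\\?.)+");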
        Matcher matcher = pattern.matcher(data);

        List<String> tokens = new ArrayList<>();
        while (matcher.find()) {
            tokens.add(matcher.group());
        }

        if(!Arrays.deepEquals(tokens.toArray(), expectedTokens)) {
            for (String token: tokens) {
                System.out.println(token);
            }
            throw new IllegalStateException("Not equals for " + data);
        }
    }
}
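The tokens collected this way still contain the escape characters (a?? rather than a?). If the unescaped field values are needed, one possible follow-up step is sketched below. The class name EdifactUnescape is hypothetical, and the sketch assumes every ? in a token starts a two-character escape pair.

```java
class EdifactUnescape {
    // Replaces every escape pair "?x" with the literal character x.
    // Assumption: a '?' is always followed by the character it escapes.
    static String unescape(String token) {
        return token.replaceAll("\\?(.)", "$1");
    }

    public static void main(String[] args) {
        for (String token : new String[] {"a??", "a?+b", "a???+b", "c"}) {
            System.out.println(token + " -> " + unescape(token));
        }
    }
}
```

Unescaping has to happen after tokenizing; doing it first would turn ?+ into a bare + that can no longer be told apart from a delimiter.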

      
