Regex to split / tokenize a string when the character type changes within the string

Question

I'm racking my brain on this. I have a string, for example: MON-123ABC/456 78#AbCd

What I want is an array or list as shown below.

[0] = MON
[1] = -
[2] = 123
[3] = ABC
[4] = /
[5] = 456
[6] =  ' '  (space character that is between 6 and 7 in the example string)
[7] = 78
[8] = #
[9] = A
[10] = b
[11] = C
[12] = d

Basically I want to split any input string wherever there is a transition from one type of character (alpha, numeric, non-alphanumeric, upper to lower case) to another.

Either a regex or C# code would do. I have a simple regex, 0+|(?<=([1-9]))(?=[1-9])(?!\1), but it only splits numerics; my regex skills are not that good. I have been playing around with some C# code to tokenize the string, but I have trouble with the transitions between character types.

Example 2: Another example input string could be 123qaz  ZBC/45678#Ab-Cd

The string is tokenized on each transition, NOT at each character position. In example 2 there are two spaces between z and Z. As I said before, the transition between types is the key.

+3

string c# regex

Rory 12 nov. 14 at 13:08

4 answers

It's a bit ugly, but you can split with this regex:

(?<=[a-z])(?=[^a-z])|(?<=[A-Z])(?=[^A-Z])|(?<=[0-9])(?=[^0-9])|(?<=[a-zA-Z0-9])(?=[^a-zA-Z0-9])|(?<=[^a-z])(?=[a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[^0-9])(?=[0-9])|(?<=[^a-zA-Z0-9])(?=[a-zA-Z0-9])

Details:

(?<=[a-z])(?=[^a-z])                : split between a lowercase letter and anything else
|(?<=[A-Z])(?=[^A-Z])               : or between an uppercase letter and anything else
|(?<=[0-9])(?=[^0-9])               : or between a digit and a non-digit
|(?<=[a-zA-Z0-9])(?=[^a-zA-Z0-9])   : or between an alphanumeric and a non-alphanumeric
|(?<=[^a-z])(?=[a-z])               : the same four cases in the reverse direction
|(?<=[^A-Z])(?=[A-Z])
|(?<=[^0-9])(?=[0-9])
|(?<=[^a-zA-Z0-9])(?=[a-zA-Z0-9])

This gives me:

("MON", "-", 123, "ABC", "/", 456, " ", 78, "#", "A", "b", "C", "d")
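
For C#, a minimal sketch of how this pattern might be passed to Regex.Split. The class and method names (TransitionSplitter, Split) are invented for illustration and are not part of the answer; the pattern itself is copied unchanged from above.

```csharp
using System.Text.RegularExpressions;

public static class TransitionSplitter
{
    // The lookaround pattern from this answer. Every alternative is zero-width,
    // so Regex.Split cuts at the boundary without consuming any character.
    private const string Pattern =
        @"(?<=[a-z])(?=[^a-z])|(?<=[A-Z])(?=[^A-Z])|(?<=[0-9])(?=[^0-9])" +
        @"|(?<=[a-zA-Z0-9])(?=[^a-zA-Z0-9])" +
        @"|(?<=[^a-z])(?=[a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[^0-9])(?=[0-9])" +
        @"|(?<=[^a-zA-Z0-9])(?=[a-zA-Z0-9])";

    // Hypothetical helper: returns the tokens between type transitions.
    public static string[] Split(string input)
    {
        return Regex.Split(input, Pattern);
    }
}
```

Because each lookbehind needs a character before the split point and each lookahead needs one after it, no split can occur at the very start or end of the input, so no empty tokens are expected. Calling TransitionSplitter.Split("MON-123ABC/456 78#AbCd") should give the same thirteen tokens as listed above.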

+1

Toto 12 nov. 14 at 13:17

What about matching the tokens instead of splitting, with ([A-Z]+|[a-z]+|\d+|[^\da-zA-Z]+)?

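// Requires: using System; using System.Text.RegularExpressions;
int i = 0;
foreach (Match match in Regex.Matches(@"MON-123ABC/456 78#AbCd", @"([A-Z]+|[a-z]+|\d+|[^\da-zA-Z]+)"))
{
    // Each match is one maximal run of a single character type.
    // Items returned by Regex.Matches are always successful matches.
    Console.WriteLine("{0}\t{1}", ++i, match.Value);
}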

This prints the following (token 7 is the single space character):

1       MON
2       -
3       123
4       ABC
5       /
6       456
7
8       78
9       #
10      A
11      b
12      C
13      d

+1

Alex K. 12 nov. 14 at 13:21

(?=-|\/|\s|#)|(?<=-|\/|\s|#)|(?!\d)(?=\d)|(?<=\d)(?=[^\d])|(?<=[a-z])|(?=[a-z])

Try this. See the demo, where each split point is replaced with \n.

http://regex101.com/r/tF5fT5/55

0

vks 12 nov. '14 at 13:30

juharr · Accepted Answer · 2014-11-12T13:27:16+0000

Here is a solution that doesn't use regular expressions.

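// Requires: using System.Collections.Generic; using System.Text;
public static IEnumerable<string> SplitOnType(string str)
{
    StringBuilder builder = new StringBuilder();
    int previousType = -1; // -1 means no character has been seen yet
    foreach (char c in str)
    {
        // Classify the current character.
        int type;
        if ('a' <= c && c <= 'z')
            type = 0; // lowercase letter
        else if ('A' <= c && c <= 'Z')
            type = 1; // uppercase letter
        else if ('0' <= c && c <= '9')
            type = 2; // digit
        else
            type = 3; // anything else

        // On a type transition, emit the token collected so far.
        if (previousType != -1 && type != previousType)
        {
            yield return builder.ToString();
            builder.Clear();
        }

        builder.Append(c);
        previousType = type;
    }

    // Emit the final token, if any.
    if (builder.Length > 0)
        yield return builder.ToString();
}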

Note that this groups all non-alphanumeric characters together, but it can be changed to handle additional groups simply by adding more else if clauses.
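
As an illustration of adding a group, here is a sketch that treats whitespace as its own type, so that a run of spaces is separated from neighbouring punctuation. The TypeSplitter class, the Classify helper and the choice of whitespace as the extra group are assumptions made for this example, not part of the answer above.

```csharp
using System.Collections.Generic;
using System.Text;

public static class TypeSplitter
{
    // Hypothetical helper: maps a character to a category number.
    // Only equality between the numbers matters, not their values.
    private static int Classify(char c)
    {
        if ('a' <= c && c <= 'z') return 0;   // lowercase letter
        if ('A' <= c && c <= 'Z') return 1;   // uppercase letter
        if ('0' <= c && c <= '9') return 2;   // digit
        if (char.IsWhiteSpace(c)) return 3;   // added group: whitespace
        return 4;                             // everything else
    }

    public static IEnumerable<string> SplitOnType(string str)
    {
        var builder = new StringBuilder();
        int previousType = -1; // -1 means no character has been seen yet
        foreach (char c in str)
        {
            int type = Classify(c);
            // On a type transition, emit the token collected so far.
            if (previousType != -1 && type != previousType)
            {
                yield return builder.ToString();
                builder.Clear();
            }
            builder.Append(c);
            previousType = type;
        }
        // Emit the final token, if any.
        if (builder.Length > 0)
            yield return builder.ToString();
    }
}
```

With this variant, an input such as "qaz  ZBC /" should come back as "qaz", the two spaces as one token, "ZBC", a single space, and "/". Further groups can be added the same way, by inserting another test in Classify before the final return.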
