Regex to split / tokenize a string wherever the character type changes
Racking my brain on this. I have a string, for example: MON-123ABC/456 78#AbCd
What I want is an array or list as shown below.
[0] = MON
[1] = -
[2] = 123
[3] = ABC
[4] = /
[5] = 456
[6] = ' ' (space character that is between 6 and 7 in the example string)
[7] = 78
[8] = #
[9] = A
[10] = b
[11] = C
[12] = d
Basically, I want to split any input string wherever there is a transition from one type of character to another (alpha, numeric, non-alphanumeric, or upper to lower case).
Either a regex or C# code would be fine. I have a simple regex, 0+|(?<=([1-9]))(?=[1-9])(?!\1), but it only splits runs of digits; my regex skills are not that good. I have been playing around with some C# code to tokenize the string, but I am having trouble handling the transitions between character types.
Example 2: Another example input string could be 123qaz  ZBC/45678#Ab-Cd
The split should happen at each transition between types, NOT at every character position. In Example 2 there are two spaces between z and Z, and they belong to the same token. As I said before, the transition between types is the key.
Here is a solution that doesn't use regular expressions.
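// Requires: using System.Collections.Generic; using System.Text;
public static IEnumerable<string> SplitOnType(string str)
{
    StringBuilder builder = new StringBuilder();
    int previousType = -1; // -1 means "no character seen yet"
    foreach (char c in str)
    {
        // Classify the current character.
        int type;
        if ('a' <= c && c <= 'z')
            type = 0; // lowercase letter
        else if ('A' <= c && c <= 'Z')
            type = 1; // uppercase letter
        else if ('0' <= c && c <= '9')
            type = 2; // digit
        else
            type = 3; // everything else

        // The type changed: emit the token collected so far and start a new one.
        if (previousType != -1 && type != previousType)
        {
            yield return builder.ToString();
            builder.Clear();
        }
        builder.Append(c);
        previousType = type;
    }
    // Emit the last token, if there is one.
    if (builder.Length > 0)
        yield return builder.ToString();
}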
Note that this groups all non-alphanumeric characters together, as the question describes, but it can be changed to distinguish any additional group simply by adding more else if clauses.
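As a minimal sketch of that kind of extension, the variant below moves the classification into a helper and adds one extra group for whitespace, so that a space and a following symbol no longer end up in the same token. The names TypeSplitter and Classify and the choice of whitespace as the extra group are assumptions made for illustration; they are not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TypeSplitter
{
    // Hypothetical classifier. The numbers are arbitrary labels;
    // only whether two neighbouring characters get the same label matters.
    private static int Classify(char c)
    {
        if ('a' <= c && c <= 'z') return 0; // lowercase ASCII letter
        if ('A' <= c && c <= 'Z') return 1; // uppercase ASCII letter
        if ('0' <= c && c <= '9') return 2; // ASCII digit
        if (char.IsWhiteSpace(c)) return 3; // added group: whitespace
        return 4;                           // everything else
    }

    public static IEnumerable<string> SplitOnType(string str)
    {
        var builder = new StringBuilder();
        int previousType = -1; // -1 means "no character seen yet"
        foreach (char c in str)
        {
            int type = Classify(c);
            if (previousType != -1 && type != previousType)
            {
                yield return builder.ToString();
                builder.Clear();
            }
            builder.Append(c);
            previousType = type;
        }
        if (builder.Length > 0)
            yield return builder.ToString();
    }

    public static void Main()
    {
        // Brackets make the whitespace token visible.
        foreach (string token in SplitOnType("ab -12"))
            Console.WriteLine("[{0}]", token);
    }
}
```

For the input "ab -12" this yields the four tokens "ab", " ", "-" and "12", whereas the four-group version would keep " -" together as one token.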
It's a bit ugly, but you can split with this regex:
(?<=[a-z])(?=[^a-z])|(?<=[A-Z])(?=[^A-Z])|(?<=[0-9])(?=[^0-9])|(?<=[a-zA-Z0-9])(?=[^a-zA-Z0-9])|(?<=[^a-z])(?=[a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[^0-9])(?=[0-9])|(?<=[^a-zA-Z0-9])(?=[a-zA-Z0-9])
Details:
(?<=[a-z])(?=[^a-z]) : split between a lowercase letter and anything that is not a lowercase letter
|(?<=[A-Z])(?=[^A-Z]) : or split between an uppercase letter and anything that is not an uppercase letter
|(?<=[0-9])(?=[^0-9]) : or split between a digit and a non-digit
|(?<=[a-zA-Z0-9])(?=[^a-zA-Z0-9]) : or split between an alphanumeric and a non-alphanumeric
|(?<=[^a-z])(?=[a-z]) : the remaining four alternatives are the same four cases in reverse order
|(?<=[^A-Z])(?=[A-Z])
|(?<=[^0-9])(?=[0-9])
|(?<=[^a-zA-Z0-9])(?=[a-zA-Z0-9])
This gives me:
("MON", "-", 123, "ABC", "/", 456, " ", 78, "#", "A", "b", "C", "d")
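The answer does not say which language produced that output. As a rough sketch, assuming C# as in the question, the same pattern could be passed to Regex.Split. The class name LookaroundSplitDemo and the constant Boundary are made up for this example; the pattern itself is copied unchanged from the answer and is only broken across lines for readability.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundSplitDemo
{
    // Zero-width pattern from the answer: it matches the position
    // between two characters of different types and consumes nothing.
    private const string Boundary =
        @"(?<=[a-z])(?=[^a-z])|(?<=[A-Z])(?=[^A-Z])|(?<=[0-9])(?=[^0-9])" +
        @"|(?<=[a-zA-Z0-9])(?=[^a-zA-Z0-9])" +
        @"|(?<=[^a-z])(?=[a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[^0-9])(?=[0-9])" +
        @"|(?<=[^a-zA-Z0-9])(?=[a-zA-Z0-9])";

    public static string[] SplitOnType(string input)
    {
        // Every alternative needs one character on each side, so the pattern
        // cannot match at the very start or end of the input and therefore
        // adds no empty leading or trailing entries.
        return Regex.Split(input, Boundary);
    }

    public static void Main()
    {
        // Brackets make the single-space token visible.
        foreach (string token in SplitOnType("MON-123ABC/456 78#AbCd"))
            Console.WriteLine("[{0}]", token);
    }
}
```

Because the pattern has no capturing groups, Regex.Split returns only the tokens themselves: 13 strings for the first example, all of them strings (the digits are not converted to numbers).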
What about matching each run of same-type characters instead of splitting, using ([A-Z]+|[a-z]+|\d+|[^\da-zA-Z]+)
int i = 0;
foreach (Match match in Regex.Matches(@"MON-123ABC/456 78#AbCd", @"([A-Z]+|[a-z]+|\d+|[^\da-zA-Z]+)"))
{
    if (match.Success)
    {
        Console.WriteLine("{0}\t{1}", ++i, match.Groups[0]);
    }
}
Output:
1 MON
2 -
3 123
4 ABC
5 /
6 456
7
8 78
9 #
10 A
11 b
12 C
13 d
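Since the question asks for an array or list rather than console output, here is a small sketch of how the matches might be collected with LINQ. The class name MatchTokenizer and the method name Tokenize are hypothetical; the pattern is the one from this answer without the outer group, which is not needed when only the whole match is read.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MatchTokenizer
{
    public static List<string> Tokenize(string input)
    {
        // Cast<Match>() is used because MatchCollection only implements the
        // non-generic IEnumerable on older .NET Framework versions.
        return Regex.Matches(input, @"[A-Z]+|[a-z]+|\d+|[^\da-zA-Z]+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        // Two spaces between "qaz" and "ZBC", as in Example 2 of the question.
        List<string> tokens = Tokenize("123qaz  ZBC");
        Console.WriteLine(string.Join("|", tokens));
    }
}
```

The four alternatives together cover every possible character, so no input is dropped, and the two consecutive spaces come back as a single token. One thing to keep in mind is that \d in .NET also matches non-ASCII decimal digits unless RegexOptions.ECMAScript is used; [0-9] can be substituted if only ASCII digits should count as numeric.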