In RegEx, how do you find a string that is at most three unique characters?

I am going through a large text file and I am looking for lines containing at most three different characters (these characters, however, can be repeated an unlimited number of times). I guess the best way to do this is with some kind of regex.

All help is appreciated.

(I am writing a script in PHP if that helps)

+2


source to share


4 answers


Maybe this will work:

preg_match("/^(.)\\1*(.)?(?:\\1*\\2*)*(.)?(?:\\1*\\2*\\3*)*$/", $string, $matches);
// aaaaa:Pass
// abababcaaabac:Pass
// aaadsdsdads:Pass
// aasasasassa:Pass
// aasdasdsadfasf:Fail

      

Explaination:

/
 ^                 #start of string
 (.)               #match any character in group 1
 \\1*              #match whatever group 1 was 0 or more times
 (.)?              #match any character in group 2 (optional)
 (?:\\1*\\2*)*     #match group 1 or 2, 0 or more times, 0 or more times 
                   #(non-capture group)
 (.)?              #match any character in group 3 (optional)
 (?:\\1*\\2*\\3*)* #match group 1, 2 or 3, 0 or more times, 0 or more times
                   #(non-capture group)
 $                 #end of string
/

      



Added benifit $matches[1], [2], [3]

will contain the three characters you want. The regex searches for the first character, then stores it and matches it until it finds something other than that character, catches it as the second character, matches any of those characters as many times as it can, catches the third character, and matches all three until the match will not complete or the string will not end and the test will pass.

EDIT

This regex will be much faster due to the way the parsing and backtracking engine works, read bobince's answer for an explanation:

/^(.)\\1*(?:(.)(?:\\1|\\2)*(?:(.)(?:\\1|\\2|\\3)*)?)?$/

      

+4


source


Optimizing regular exercise for kids! The gnarf regex is used as a starting point:

^(.)\1*(.)?(?:\1*\2*)*(.)?(?:\1*\2*\3*)*$

      

I noticed there are nested and sequential * here, which can lead to a lot of churn. For example, in 'abcaaax' it will try to match this last string 'as a single \ 1 * of length 3, a \ 1 * of length 2, followed by one \ 1, a \ 1 followed by 2 of length \ 1 *, or three singles matches \ 1s. This problem gets much worse when you have longer strings, especially when nothing stops \ 1 from the same character as \ 2 because of the regex.

^(.)\1*(.)?(?:\1|\2)*(.)?(?:\1|\2|\3)*$

      

It was twice as fast as the original, tested on Python PCRE. (This is faster than setting it up in PHP, sorry.)

This issue still has a problem that (.)?

it can't match anything and then continue with the rest of the match. \1|\2

It will continue to meet the \ 1, even if there is no match \ 2, which leads to a potential rollback of attempting to submit proposals \1|\2

and \1|\2|\3

earlier, when they can not lead to a coincidence. This can be solved by moving the option ?

across all other sentences:

^(.)\1*(?:(.)(?:\1|\2)*(?:(.)(?:\1|\2|\3)*)?)?$

      

It was twice as fast.

There is still a potential problem that any of \ 1, \ 2, and \ 3 can be the same character, potentially causing more bounce when the expression does not match. This will stop it using negative to not match the previous character:



^(.)\1*(?:(?!\1)(.)(?:\1|\2)*(?:(?!\1|\2)(.)(?:\1|\2|\3)*)?)?$

      

However, in Python with my random test data, I didn't notice any significant speedup from this. Your mileage may vary depending on the PHP data, but it may already be good enough. The assignment (* +) might help if it was available here.

No regex works better than the easier-to-use Python alternative:

len(set(s))<=3

      

A similar method in PHP would probably be with count_chars :

strlen(count_chars($s, 3))<=3

      

I haven't tested the speed, but I would very much expect it to be faster than a regex, in addition to being much better than reading.

So basically I was just completely wasting my time playing with regexes. Don't waste your time, try simple string methods first before resorting to regex!

+7


source


At the risk of getting downvoted, I suggest that regular expressions are not designed to handle this situation.

You can match a character or character set, but you cannot remember which characters in the set have already been found to exclude them from further matching.

I suggest you keep the character set, you reset it before starting on a new line, and add items there as you step down that line. Once the number of elements in the set exceeds 3, you discard the current row and move on to the next.

+6


source


to me - as a programmer with fair enough knowledge of regular expressions, this doesn't sound like a problem that you can only solve with Regexp.

You will most likely need to construct the key of the hashMap / array data structure: character value: count and iterate over a large text file, rebuild the map for each line. for each new character, check if the already encountered character has the number 2, if yes, skip the current line.

but I really want to be surprised if one crazy hacker hacker comes up with a solution.

0


source







All Articles