Regex - accept A-Z excluding some letters for a fixed length
I cannot find how to exclude certain characters from a fixed-length part of a string.
With
^XXX-\d{4}-(?![XYT])[A-Z]{4}$
I can exclude X, Y and T from the first character of the last segment, so
XXX-0000-AAAA is ok
XXX-0000-XAAA is not ok
My problem is that I do not want X, Y or T in any part of the last segment:
XXX-0000-AAXA is not ok
XXX-0000-ABXX is not ok
XXX-0000-ABCT is not ok
and so on
How can I do this?
To be more precise: the excluded letters (X, Y, T here) are variable, so a fixed-list solution works but is not convenient.
OK, this seems to be the most efficient correct pattern I can come up with:
^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$
I have set up a battery of test strings that should expose any flaws in the patterns posted on this page. My pattern completes the test in 176 steps and matches correctly. That makes it the best pattern here that uses a negative lookahead, as the OP requested.
An apples-to-apples comparison:
original: 190 steps
^XXX-\d{4}-(?![XYT])[A-Z]{4}$
user1875921: wrong, 300 steps
XXX-\d{4}-(?!.*?[XYT])[A-Z]{4}
Sahil Gulati: correct, 140 steps
^XXX-\d{4}-[ABCDEFGHIJKLMNOPQRSUVWZ]{4}$
[commenter]: correct, 245 steps
^XXX-\d{4}-((?![XYT])[A-Z]){4}$
Bohemian #1: correct, 279 steps
^XXX-\d{4}-(?:(?![XYT])[A-Z]){4}$
melpomene, Bohemian #2: n/a
^XXX-\d{4}-[A-Z&&[^XYT]]{4}$
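The correctness verdicts above can be spot-checked in code. The following is a minimal sketch in Python (the re module), assuming a small made-up battery of strings rather than the one used in the demos. Step counts are specific to the regex debugger and are not reproduced, and the [A-Z&&[^XYT]] pattern is left out because Python's re does not support that syntax.

```python
import re

# Patterns from the comparison that Python's re module can run.
PATTERNS = {
    "original": r"^XXX-\d{4}-(?![XYT])[A-Z]{4}$",
    "user1875921": r"XXX-\d{4}-(?!.*?[XYT])[A-Z]{4}",
    "Sahil Gulati": r"^XXX-\d{4}-[ABCDEFGHIJKLMNOPQRSUVWZ]{4}$",
    "Bohemian #1": r"^XXX-\d{4}-(?:(?![XYT])[A-Z]){4}$",
    "mickmackusa": r"^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$",
}

# Hypothetical battery (not the one from the demos): string -> should it match?
BATTERY = {
    "XXX-0000-AAAA": True,
    "XXX-0000-XAAA": False,
    "XXX-0000-AAXA": False,
    "XXX-0000-ABXX": False,
    "XXX-0000-ABCT": False,
    "XXX-0000-AAAAA": False,  # last segment too long
    "AXXX-0000-AAAA": False,  # extra leading character
}

def failures(pattern):
    """Return the battery strings on which the pattern gives the wrong answer."""
    rx = re.compile(pattern)
    return [s for s, expected in BATTERY.items()
            if (rx.search(s) is not None) != expected]

for name, pattern in PATTERNS.items():
    bad = failures(pattern)
    print(f"{name}: {'correct' if not bad else 'wrong on ' + ', '.join(bad)}")
```

On this battery the unanchored pattern fails only on the two strings with extra characters, which suggests the missing ^ and $ are what make it wrong.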
There are two "elegant" ways to do this. This is the easiest to understand:
^XXX-\d{4}-((?![XYT])[A-Z]){4}$
This is very close to what you had, but it applies the negative lookahead before each repeated character instead of only once.
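Since the excluded letters are variable in the question, the per-character lookahead can be assembled at run time. A minimal sketch in Python, where build_pattern and its parameters are hypothetical names introduced only for illustration:

```python
import re

def build_pattern(exclude, length=4):
    """Build the per-character negative lookahead pattern (hypothetical helper).

    `exclude` is a non-empty string of letters to reject, e.g. "XYT".
    """
    # re.escape keeps characters such as ']' or '^' from breaking the class.
    banned = "".join(re.escape(ch) for ch in exclude)
    return re.compile(rf"^XXX-\d{{4}}-(?:(?![{banned}])[A-Z]){{{length}}}$")

rx = build_pattern("XYT")
for s in ["XXX-0000-AAAA", "XXX-0000-XAAA", "XXX-0000-AAXA", "XXX-0000-ABCT"]:
    print(s, "ok" if rx.match(s) else "not ok")
```

Changing the excluded letters then only means passing a different string, for example build_pattern("AB").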
Another way is to use character class subtraction:
^XXX-\d{4}-[A-Z&&[^XYT]]{4}$
You rarely see this syntax used, so it can be worth using, if for nothing else than to impress your colleagues. Note that not every regex engine supports it; Java's java.util.regex is one that does.
Here's a versatile and fast alternative
TL;DR: ^XXX-[0-9]{4}-[^XYT -@[-²]{4}$
This question highlights a situation where regexes almost require "boolean" ways of expressing character classes, such as ['A-Z' but not 'XYZ']. For that reason this answer is presented (as an edit and update) for the benefit of others facing scenarios similar to the one described by the OP.
Given the lack of direct syntax support for something like ['A-Z' but not 'XYZ'], the only way to achieve this type of logic and to control the order of precedence for overlapping expressions in a regex is to use features like lookaround assertions.
However, using them inefficiently can be extremely costly, as stated in one of the other answers here.
Here are a few reasons why drastically different performance criteria between applications make it impossible to have one generic regex that achieves this:
- For a one-off match, where speed differences are not noticeable, you can favor a more robust or more readable regex. This might matter for portability in code (i.e. almost every regex parser knows what [[-`a-~!-@] means, but some do not know what \W or [:punct:] means).
- At the other end of the scale, where nanoseconds are critical, you would probably re-evaluate many other parts of the process before getting to the regex. Even so, a very inflexible but high-performance pattern might be preferred where the system can accommodate it.
- The variety of the strings has a large impact on performance, so depending on the application some parts of the string may need to be evaluated differently.
- For the same reason, the decision on how to structure a lookaround must be determined by the use case.
- If the strings are part of a database and you are searching through an API or other built-in function, you might need to use a specific syntax or format.
- Besides the expression itself, different regex libraries, functions and extensions can change how a regex is executed through their options. For example, Python's re.findall() can be used as the operational equivalent of a positive lookbehind of unknown repeat length.
- Some regex engines are much faster than others. This can more than make up for differences in efficiency measured in theoretical steps.
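The re.findall() remark in the list above can be illustrated with a short sketch. Python's re module rejects lookbehind of variable width, but when a pattern contains one capturing group, re.findall() returns only the text of that group, which has much the same practical effect. The sample text is made up for illustration.

```python
import re

text = "XXX-0000-SNUR XXX-12-FHDZ XXX-0000-QPSE"  # made-up sample

# A lookbehind of unknown length is not accepted by Python's re module.
try:
    re.compile(r"(?<=XXX-\d+-)[A-Z]{4}")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# With one capturing group, findall returns only the group's text,
# so the variable-length prefix is required but not returned.
segments = re.findall(r"XXX-\d+-([A-Z]{4})", text)
print(lookbehind_ok, segments)
```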
Here's a balanced approach to the question:
^XXX-[0-9]{4}-[^XYT -@[-²]{4}$
Here's an example where 10,000 lines out of 10,100 are matched: https://regex101.com/r/YJ5xME/1
Here's an example where 100 lines out of 10,100 are matched: https://regex101.com/r/d1l5af/1
There is not much performance difference between the two at 10,000 lines. By contrast, this regex:
^XXX-[0-9]{4}-([A-Z](?<=[^XYT])){4}$
takes twice as long to match 10,000 lines.
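A comparison of this kind can be repeated outside the online debugger. The sketch below uses Python's re and timeit on made-up input; the "[" inside the negated class is escaped as a precaution for Python. Timings depend on the engine and the machine, so no figures are claimed here, and the ratio may well differ from the one reported above.

```python
import re
import timeit

negated = re.compile(r"^XXX-[0-9]{4}-[^XYT -@\[-²]{4}$")
lookbehind = re.compile(r"^XXX-[0-9]{4}-([A-Z](?<=[^XYT])){4}$")

# Made-up input: 10,000 matching lines plus 100 non-matching ones.
lines = ["XXX-0000-SNUR"] * 10_000 + ["XXX-0000-04X9"] * 100

def count(rx):
    """Count how many lines the compiled pattern matches."""
    return sum(1 for line in lines if rx.match(line))

for name, rx in [("negated class", negated), ("lookbehind", lookbehind)]:
    seconds = timeit.timeit(lambda: count(rx), number=5)
    print(f"{name}: {count(rx)} matches, {seconds:.3f}s for 5 passes")
```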
This is also compatible with using a variable for the excluded letters, as requested.
Command-line example with bash:
Take a file of lines, stringfile, with content such as:
XXX-0000-SNUR
XXX-0000-FHDZ
XXX-0000-+439
XXX-0000-04X9
XXX-0000-/1Y+
XXX-0000-X/X9
XXX-0000-Y6X9
XXX-0000-XY16
XXX-0000-0T94
XXX-0000-++6Y
XXX-0000-TT+3
XXX-0000-NLNL
XXX-0000-QPSE
Using a variable $exclude like this:
exclude="XYT"
egrep "^XXX-[0-9]{4}-([^$exclude -@[-²]){4}$" < stringfile
Correct matches:
XXX-0000-SNUR
XXX-0000-FHDZ
XXX-0000-NLNL
XXX-0000-QPSE
This is also compatible with extended regular expressions:
- Use cases like GNU find with -iregex (when dealing with filenames)
- egrep (grep -E)
- Anything that supports modern regular expressions as defined in POSIX 1003.2
Is it possible to build the exact expression before matching thousands of lines?
This is where efficiency meets precision. Command-line example:
exclude="XYT"
customCharClass="$(
alpha=ABCDEFGHIJKLMNOPQRSTUVWXYZ
echo "${alpha[@]}" \
| sed -E -e "s/[$exclude]//g")"
egrep "^XXX-[0-9]{4}-([$customCharClass]){4}$" < stringfile
Now this is the regex applied to the string file:
^XXX-[0-9]{4}-([ABCDEFGHIJKLMNOPQRSUVWZ]){4}$
(Please note: no X, Y or T.)
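The same idea of generating the explicit character class before matching can be sketched in Python. The helper name custom_class_pattern and the short sample list are assumptions made for illustration, and the sketch assumes the exclusion list does not remove every letter.

```python
import re
import string

def custom_class_pattern(exclude):
    """Compile a pattern whose last segment allows A-Z minus `exclude`."""
    allowed = "".join(ch for ch in string.ascii_uppercase if ch not in exclude)
    # `allowed` contains only plain letters, so no escaping is needed here.
    return re.compile(rf"^XXX-[0-9]{{4}}-[{allowed}]{{4}}$")

rx = custom_class_pattern("XYT")
print(rx.pattern)

# A few lines taken from the sample stringfile content.
lines = ["XXX-0000-SNUR", "XXX-0000-04X9", "XXX-0000-XY16", "XXX-0000-QPSE"]
matches = [line for line in lines if rx.match(line)]
print(matches)
```

The class is computed once, so the per-line work is a plain character class match with no lookaround.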
- The example above provides easy portability and high performance, and it covers general use when extended character sets are not involved.
- An approach like this one, where the program itself derives the most efficient search pattern, is guaranteed to account for all scenarios.
- Trade-offs and performance criteria take precedence. That is, generating a custom search pattern will certainly take longer when matching just one string, whereas that cost is negligible when searching through thousands of files.
Another answer to this question complements this one.
This regex was posted by @mickmackusa:
^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$
This regex works well and is cleaner than the alternative provided in this answer (however, it requires PCRE).
It performs a little slower (though it is by no means inefficient or wasteful), but it is guaranteed to match only [A-Z] (with XYT excluded).
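The difference in what the two patterns accept can be seen with a character outside the ranges that the negated class rules out. A minimal sketch in Python's re on Unicode strings, with the "[" escaped as a precaution and É (U+00C9) chosen as an arbitrary character above ² (U+00B2); other engines and locales may treat such ranges differently.

```python
import re

negated = re.compile(r"^XXX-[0-9]{4}-[^XYT -@\[-²]{4}$")
lookahead = re.compile(r"^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$")

# "\u00c9" is the letter É, which lies above ² and so escapes the negated ranges.
samples = ["XXX-0000-SNUR", "XXX-0000-AAXA", "XXX-0000-" + "\u00c9" * 4]

results = {s: (bool(negated.match(s)), bool(lookahead.match(s))) for s in samples}
for s, (neg, look) in results.items():
    print(s, "negated class:", neg, "lookahead:", look)
```

Both patterns agree on the plain A-Z inputs; only the negated class accepts the accented segment.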
This highlights the need to evaluate application-specific performance criteria when designing a regular expression that may require lookarounds.