Regex - accept A-Z excluding some letters for a fixed length

Question

I cannot find how to exclude certain characters from a fixed-length part of a string.

With

^XXX-\d{4}-(?![XYT])[A-Z]{4}$

I can exclude X, Y and T from the first character of the last segment, so

XXX-0000-AAAA is ok
XXX-0000-XAAA is not ok

My problem is that I do not want X, Y or T in any part of the last segment:

XXX-0000-AAXA is not ok
XXX-0000-ABXX is not ok
XXX-0000-ABCT is not ok
and so on

How can I do this?

To be more precise, I should add that the letters X, Y and T are variable, so the fixed-list solution works but is not convenient.

+3

regex regex-negation

user1875921 06 May '17 at 9:37

5 answers

Why not just use [A-SU-WZ]{4} for the last part? That is, only match the letters you want in the first place.

Alternatively, repeat the lookahead together with the character: (?:(?![XYT])[A-Z]){4}
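A minimal sketch of both suggestions, assuming Python's re module (the question does not name a regex flavor, so the choice of Python is an assumption). The sample strings come from the question; the helper name accepted is made up for illustration.

```python
import re

# Option 1: list only the allowed letters (A-Z without T, X and Y).
ALLOWED_CLASS = re.compile(r"^XXX-\d{4}-[A-SU-WZ]{4}$")

# Option 2: repeat the negative lookahead together with each letter.
PER_CHAR_LOOKAHEAD = re.compile(r"^XXX-\d{4}-(?:(?![XYT])[A-Z]){4}$")

samples = [
    "XXX-0000-AAAA",
    "XXX-0000-XAAA",
    "XXX-0000-AAXA",
    "XXX-0000-ABXX",
    "XXX-0000-ABCT",
]


def accepted(pattern, strings):
    """Return the strings that the compiled pattern matches."""
    return [s for s in strings if pattern.match(s)]


print(accepted(ALLOWED_CLASS, samples))
print(accepted(PER_CHAR_LOOKAHEAD, samples))
```

Both patterns should accept only XXX-0000-AAAA from this list, since every other sample has X, Y or T somewhere in the last segment.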

+3

melpomene 06 May '17 at 9:40

There are two "elegant" ways to do this. This is the easiest to understand:

^XXX-\d{4}-((?![XYT])[A-Z]){4}$

This is very close to what you had, but it applies the negative lookahead before each repeated character.

Another way is to use character class subtraction (Java-style syntax, which not every regex flavor supports):

^XXX-\d{4}-[A-Z&&[^XYT]]{4}$

You rarely see this syntax used, so it can be worth using, if nothing else, to impress your colleagues.

+3

Bohemian 06 May '17 at 10:19

Regex: XXX-\d{4}-(?!.*?[XYT])[A-Z]{4}

1. XXX-\d{4} matches XXX- and then four digits.

2. (?!.*?[XYT]) is a negative lookahead for X, Y and T.

3. [A-Z]{4} matches four characters in the range A-Z.

Demo Regex Code

+3

Sahil gulati 06 May '17 at 10:32

Here's a versatile and fast alternative

TL;DR: ^XXX-[0-9]{4}-[^XYT -@[-²]{4}$

This question highlights a problem that comes up when using regexes: it almost calls for a "boolean" way of expressing character classes, such as ['A-Z' but not 'XYZ']. For that reason this answer has been edited and updated for the benefit of others facing scenarios similar to the one described by the OP.

Given the lack of direct syntax support such as ['A-Z' but not 'XYZ'], the only way to achieve this type of logic and control the precedence of overlapping expressions in a regex is to use features such as lookaround assertions.

However, using them inefficiently can be extremely costly, as noted in one of the other answers here.

Performance criteria differ so drastically between applications that no single generic regex suits every case. A few considerations:

For a one-off match, where speed differences are not noticeable, a more robust or more portable regex may be preferable. Portability matters in code: almost every regex parser knows what [[-\`a-~!-@] means, but some do not know what \W or [:punct:] means.
At the other end of the scale, where nanoseconds are critical, you would probably re-evaluate many other parts of the process before reaching for a regex at all. Even so, a very inflexible but high-performance pattern may be preferred where the system can accommodate it.
The variety of input strings has a large impact on performance, so depending on the application, some parts of the string may need to be evaluated differently.
For the same reason, how to structure a lookaround must be decided by the use case.
If the strings are part of a database and you are searching through an API or other built-in function, you might need to use a specific syntax or format.
Besides the expression itself, the regex library, its functions and its extensions can change how the regex is executed. For example, Python's re.findall() can be used as the operational equivalent of a positive lookbehind with unknown repeat length.
Some regex engines are much faster than others. This can more than make up for the difference in efficiency when comparing theoretical steps.

Here's a balanced approach to the question:

^XXX-[0-9]{4}-[^XYT -@[-²]{4}$

Here's an example where 10,000 lines out of 10,100 match: https://regex101.com/r/YJ5xME/1

Here's an example where 100 lines out of 10,100 match: https://regex101.com/r/d1l5af/1

There is not much performance difference between the two. By contrast, this regex:

^XXX-[0-9]{4}-([A-Z](?<=[^XYT])){4}$

takes twice as long to match the 10,000 lines.
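The negated class in the balanced pattern excludes characters by code point, so anything outside the listed ranges is still allowed. A small sketch of that behaviour, assuming Python's re module; the samples abcd and ÀÀÀÀ are made up for illustration, and the allow-list pattern [A-SU-WZ] is borrowed from another answer on this page for comparison.

```python
import re

# The balanced pattern. The "[" inside the class is escaped only to keep
# Python's parser unambiguous; it does not change what the class matches.
NEGATED = re.compile(r"^XXX-[0-9]{4}-[^XYT -@\[-²]{4}$")

# An explicit allow-list for comparison (A-Z without T, X and Y).
ALLOW_LIST = re.compile(r"^XXX-[0-9]{4}-[A-SU-WZ]{4}$")

samples = [
    "XXX-0000-SNUR",  # plain uppercase letters
    "XXX-0000-04X9",  # digits fall inside the excluded range " "-"@"
    "XXX-0000-abcd",  # lowercase falls inside the excluded range "["-"²"
    "XXX-0000-ÀÀÀÀ",  # code points above "²" are not excluded
]

for s in samples:
    print(s, bool(NEGATED.match(s)), bool(ALLOW_LIST.match(s)))
```

Under these assumptions the two patterns agree on the first three samples and differ on the last one, which only the negated class accepts. That is the trade-off to keep in mind when extended character sets can appear in the input.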

This is also compatible with a variable exclusion list, as requested

Command line example with bash:

Take a file of strings, stringfile, with content such as:

XXX-0000-SNUR
XXX-0000-FHDZ
XXX-0000-+439
XXX-0000-04X9
XXX-0000-/1Y+
XXX-0000-X/X9
XXX-0000-Y6X9
XXX-0000-XY16
XXX-0000-0T94
XXX-0000-++6Y
XXX-0000-TT+3
XXX-0000-NLNL
XXX-0000-QPSE

Using a variable $exclude like:

exclude="XYT"
egrep "^XXX-[0-9]{4}-([^$exclude -@[-²]){4}$" < stringfile

Correct matches:

XXX-0000-SNUR
XXX-0000-FHDZ
XXX-0000-NLNL
XXX-0000-QPSE

This is also compatible with extended regular expressions

Use cases include:

GNU find with -iregex (when dealing with filenames)
egrep (grep -E)
Anything that supports Modern Regular Expressions as defined in POSIX 1003.2

Is it possible to build the exact expression before matching thousands of lines?

This is where efficiency meets precision. Command line example:

exclude="XYT"
# Build the allowed class by deleting the excluded letters from A-Z
customCharClass="$(
  alpha=ABCDEFGHIJKLMNOPQRSTUVWXYZ
  echo "$alpha" \
  | sed -E -e "s/[$exclude]//g")"

egrep "^XXX-[0-9]{4}-([$customCharClass]){4}$" < stringfile

The regex actually applied to stringfile is now:

^XXX-[0-9]{4}-([ABCDEFGHIJKLMNOPQRSUVWZ]){4}$

(Please note: no X, Y or T)
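The same idea, building the character class from a variable before matching, can be sketched in Python. This is an illustration under assumptions, not part of the original answer: the helper name build_pattern is made up, and the sample lines are a subset of the stringfile contents.

```python
import re
import string


def build_pattern(exclude):
    """Compile a pattern whose last segment allows A-Z minus `exclude`."""
    allowed = "".join(c for c in string.ascii_uppercase if c not in exclude)
    if not allowed:
        raise ValueError("every letter is excluded; nothing can match")
    # re.escape is only a precaution; the remaining characters are letters.
    return re.compile(r"^XXX-[0-9]{4}-[" + re.escape(allowed) + r"]{4}$")


pattern = build_pattern("XYT")
print(pattern.pattern)

lines = ["XXX-0000-SNUR", "XXX-0000-04X9", "XXX-0000-Y6X9", "XXX-0000-NLNL"]
print([s for s in lines if pattern.match(s)])
```

The first print should show the same expanded class as the regex above, and only the SNUR and NLNL lines should be kept. Building the pattern once and reusing the compiled object keeps the per-line cost low when many lines are checked.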

The first example (the negated class) provides easy portability and high performance, and covers general use when extended character sets are not involved.
The example here, where the program builds the most effective search criteria, is guaranteed to account for all scenarios.
Trade-offs and performance criteria take precedence. That is, generating a custom search string will certainly take longer when matching just one string; by contrast, skipping that step is sloppy when searching through thousands of files.

Another answer to this question that complements this answer.

This regex was posted by @mickmackusa:

^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$

This regex works well and is cleaner than the alternative provided in this answer (however, it requires lookahead support, e.g. PCRE).

It performs a little slower (but is by no means inefficient or wasteful), and it is guaranteed to produce only an [A-Z] match (with XYT excluded).

This highlights the need to evaluate application-specific performance criteria when designing a regular expression that may require lookarounds.

+2

hmedia1 06 May '17 at 11:37

mickmackusa · Accepted Answer · 2017-05-07T12:29:01+0000

OK, this seems to be the most efficient correct pattern I can come up with: (Demo)

^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$

I have set up a battery of test strings, which should expose any flaws in the patterns posted on this page. My pattern completes the test in 176 steps and ensures correct matching. This makes it the best way to use a negative lookahead, as requested by the OP.

Apples-to-apples comparison:

original, 190 steps: ^XXX-\d{4}-(?![XYT])[A-Z]{4}$ (user1875921, Demo)
incorrect, 300 steps: XXX-\d{4}-(?!.*?[XYT])[A-Z]{4} (Sahil Gulati, Demo)
correct, 140 steps: ^XXX-\d{4}-[ABCDEFGHIJKLMNOPQRSUVWZ]{4}$ ([commented], Demo)
correct, 245 steps: ^XXX-\d{4}-((?![XYT])[A-Z]){4}$ (Bohemian #1, Demo)
correct, 279 steps: ^XXX-\d{4}-(?:(?![XYT])[A-Z]){4}$ (melpomene, Demo)
n/a: ^XXX-\d{4}-[A-Z&&[^XYT]]{4}$ (Bohemian #2)
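A correctness-only check along the same lines can be sketched in Python. This is an assumption-laden illustration: the battery below is a small made-up set, not the one behind the demos, the dictionary keys are just labels, and Python's re module does not report the step counts quoted above.

```python
import re

patterns = {
    "original": r"^XXX-\d{4}-(?![XYT])[A-Z]{4}$",
    "unanchored lookahead": r"XXX-\d{4}-(?!.*?[XYT])[A-Z]{4}",
    "accepted": r"^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$",
}

# (string, should_match) pairs, chosen for illustration.
battery = [
    ("XXX-0000-AAAA", True),
    ("XXX-0000-XAAA", False),
    ("XXX-0000-AAXA", False),
    ("XXX-0000-ABCT", False),
    ("XXX-0000-AAAAB", False),  # five letters in the last segment
]


def failures(regex):
    """Return the battery strings on which the regex gives the wrong answer."""
    compiled = re.compile(regex)
    return [s for s, expected in battery if bool(compiled.search(s)) != expected]


for name, regex in patterns.items():
    print(name, failures(regex))
```

With this battery the original pattern should wrongly accept AAXA and ABCT (it only checks the first letter), the unanchored pattern should wrongly accept the five-letter string, and the accepted pattern should produce no failures.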
