Regex - accept A-Z excluding some letters for a fixed length

I cannot work out how to exclude certain characters from a fixed-length part of a string.

With

^XXX-\d{4}-(?![XYT])[A-Z]{4}$

I can exclude X, Y and T from the first character of the last segment, so

XXX-0000-AAAA is ok
XXX-0000-XAAA is not ok

      

My problem is that I do not want X, Y or T in any part of the last segment:

XXX-0000-AAXA is not ok
XXX-0000-ABXX is not ok
XXX-0000-ABCT is not ok
and so on

      

How can I do this?

To be more precise: the excluded letters (X, Y, T) are variable, so a fixed list of allowed letters works but is not convenient.



5 answers


OK, this seems to be the most efficient correct pattern I can come up with:

^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$

      

I have set up a battery of test strings that should expose any flaws in the patterns posted on this page. My pattern completes the test in 176 steps and matches correctly. That makes it the best pattern here that uses a negative lookahead, as requested by the OP.
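For anyone who wants to check the pattern outside regex101, here is a minimal Python sketch. It only uses the sample strings from the question (the full test battery is not reproduced here), it does not measure regex101 "steps", and the names PATTERN and is_valid are made up for illustration.

```python
import re

# Pattern from this answer: the lookahead rejects the last segment if
# X, Y or T appears anywhere within its four letters.
PATTERN = re.compile(r"^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$")


def is_valid(code: str) -> bool:
    """Return True when the whole string matches the pattern."""
    return PATTERN.fullmatch(code) is not None


samples = [
    "XXX-0000-AAAA",
    "XXX-0000-XAAA",
    "XXX-0000-AAXA",
    "XXX-0000-ABXX",
    "XXX-0000-ABCT",
]
for sample in samples:
    print(sample, "is ok" if is_valid(sample) else "is not ok")
```

Only the first sample is reported as ok; the other four contain X or T somewhere in the last segment.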



Comparison of apples to apples:

Author         Result     Steps   Pattern
user1875921    original   190     ^XXX-\d{4}-(?![XYT])[A-Z]{4}$
Sahil Gulati   wrong      300     XXX-\d{4}-(?!.*?[XYT])[A-Z]{4}
[commented]    correct    140     ^XXX-\d{4}-[ABCDEFGHIJKLMNOPQRSUVWZ]{4}$
Bohemian #1    correct    245     ^XXX-\d{4}-((?![XYT])[A-Z]){4}$
melpomene      correct    279     ^XXX-\d{4}-(?:(?![XYT])[A-Z]){4}$
Bohemian #2    n/a        -       ^XXX-\d{4}-[A-Z&&[^XYT]]{4}$



Why not just use [A-SU-WZ]{4} for the last part? That is, only match the letters you want in the first place.

Alternatively, make the lookahead part of the repeated group: (?:(?![XYT])[A-Z]){4}
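As a quick sanity check that the two forms accept the same strings, here is a small Python sketch. The reduced alphabet "ASTUWXYZ" is just an assumption to keep the exhaustive comparison short; it mixes allowed and excluded letters.

```python
import re
from itertools import product

explicit = re.compile(r"[A-SU-WZ]{4}")
lookahead = re.compile(r"(?:(?![XYT])[A-Z]){4}")

# Compare both forms on every 4-letter string drawn from a small alphabet
# that contains allowed letters (A, S, U, W, Z) and excluded ones (T, X, Y).
candidates = ["".join(letters) for letters in product("ASTUWXYZ", repeat=4)]
accepted = [c for c in candidates if explicit.fullmatch(c)]
disagreements = [
    c for c in candidates
    if bool(explicit.fullmatch(c)) != bool(lookahead.fullmatch(c))
]

print(len(candidates), len(accepted), len(disagreements))  # prints: 4096 625 0
```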



There are two "elegant" ways to do this. This is the easiest to understand:

^XXX-\d{4}-((?![XYT])[A-Z]){4}$

      

This is very close to what you had, but it applies the negative lookahead before each repeated character.

Another way is to use character class subtraction:

^XXX-\d{4}-[A-Z&&[^XYT]]{4}$

      

You rarely see this syntax used, so it can be worth using, if for nothing else than to impress your colleagues.
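Note that the && form is engine-specific: it is the class intersection syntax of Java's java.util.regex, for instance, and Python's re module does not support it. Since the OP said the excluded letters are variable, here is a hedged Python sketch that builds the first form from a variable. The function name build_pattern is hypothetical, not part of any library.

```python
import re


def build_pattern(excluded: str) -> re.Pattern:
    """Compile the per-character lookahead form for a variable exclusion list.

    `excluded` is a plain string of letters to reject, e.g. "XYT".
    """
    if not excluded:
        # Nothing to exclude; an empty class "[]" would be a syntax error.
        return re.compile(r"XXX-\d{4}-[A-Z]{4}")
    # re.escape keeps characters such as "]" or "^" from breaking the class.
    forbidden = re.escape(excluded)
    return re.compile(rf"XXX-\d{{4}}-(?:(?![{forbidden}])[A-Z]){{4}}")


no_xyt = build_pattern("XYT")
print(bool(no_xyt.fullmatch("XXX-0000-AAAA")))  # True
print(bool(no_xyt.fullmatch("XXX-0000-AAXA")))  # False

no_ab = build_pattern("AB")
print(bool(no_ab.fullmatch("XXX-0000-XYTZ")))   # True
print(bool(no_ab.fullmatch("XXX-0000-XYTA")))   # False
```

fullmatch is used instead of ^ and $ so that the whole string has to match.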



Regex: XXX-\d{4}-(?!.*?[XYT])[A-Z]{4}

1. XXX-\d{4} matches XXX- followed by four digits.

2. (?!.*?[XYT]) is a negative lookahead for X, Y and T.

3. [A-Z]{4} matches 4 characters in the range A-Z.
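Two things are worth checking with this pattern: it has no ^ or $ anchors, and the .*? in the lookahead scans the rest of the line rather than just the four letters. A small Python sketch of how that plays out with re.search (the last two inputs are made up for illustration):

```python
import re

unanchored = re.compile(r"XXX-\d{4}-(?!.*?[XYT])[A-Z]{4}")

# The OP's examples behave as expected.
print(bool(unanchored.search("XXX-0000-AAAA")))   # True
print(bool(unanchored.search("XXX-0000-AAXA")))   # False

# Without anchors, a string with a fifth letter still contains a match.
print(bool(unanchored.search("XXX-0000-AAAAB")))  # True

# The lookahead also sees text that follows the segment on the same line.
print(bool(unanchored.search("XXX-0000-AAAA, X marks the spot")))  # False
```

Whether either behaviour matters depends on the input; for single codes that are already isolated, adding ^ and $ removes the first difference.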



Here's a versatile and fast alternative.

TL;DR: ^XXX-[0-9]{4}-[^XYT -@[-²]{4}$

This question highlights an issue with regexes: the task almost calls for a "boolean" way of expressing character classes, such as ['A-Z' but not 'XYZ']. For that reason this answer is presented (as an edit and update) for the benefit of others facing scenarios similar to the one described by the OP.

Given the lack of direct syntax support for something like ['A-Z' but not 'XYZ'], the only way to achieve this type of logic and control the order of precedence for overlapping expressions in a regex is to use features like lookaround assertions.

However, using them inefficiently can be extremely costly, as stated in one of the other answers here.

Here are a few ways in which drastically different performance criteria between applications make it impossible to have one generic regex for this:

  • When matching a single line, where speed differences are not noticeable, a more robust or more portable regex can be preferable. This matters for portability of code (i.e. almost every regex parser knows what [[-\`a-~!-@] means, but some don't know what \W or [:punct:] means).
  • At the other end of the scale, where nanoseconds are critical, you would re-evaluate many other parts of the process before turning to a regex at all; even so, a very inflexible but high-performance pattern might be preferred where the system can accommodate it.
  • The variety of the strings has a large impact on performance, so depending on the application some parts of the string may be evaluated differently.
  • For the same reason, how to structure the lookaround must be decided per use case.
  • If the strings are part of a database and you are searching through an API or other built-in function, you might need to use a specific syntax or format.
  • Besides the expression itself, the various regex libraries, functions and extensions can change the whole way the regex is executed through their options. For example, Python's re.findall() can be used as the operational equivalent of a positive lookbehind with unknown repeat length.
  • Some regex engines are much faster than others. This can more than make up for the difference in efficiency when comparing theoretical steps.

Here's a balanced approach to the question:

^XXX-[0-9]{4}-[^XYT -@[-²]{4}$
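To make the behaviour of that negated class concrete: it excludes X, Y, T, the range from space to @ and the range from [ to ², and accepts everything else. A small Python sketch under these assumptions: the [ is escaped for Python's re module, and the sample inputs are chosen for illustration.

```python
import re

negated = re.compile(r"^XXX-[0-9]{4}-[^XYT -@\[-²]{4}$")

# Upper-case letters other than X, Y and T are accepted.
print(bool(negated.fullmatch("XXX-0000-SNUR")))  # True

# Digits fall in the excluded range from space to @, and X is excluded.
print(bool(negated.fullmatch("XXX-0000-04X9")))  # False

# Lower-case letters fall in the excluded range from [ to ².
print(bool(negated.fullmatch("XXX-0000-abcd")))  # False

# Characters above ² are not excluded, so accented capitals get through.
accented = "XXX-0000-\u00c0\u00c9\u00ce\u00d5"   # À É Î Õ
print(bool(negated.fullmatch(accented)))         # True
```

The last case is the trade-off of this approach: the class is fast and portable, but it is only equivalent to "A-Z without XYT" when the input cannot contain characters outside the excluded ranges.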

      

Here's an example where 10,000 lines out of 10,100 match: https://regex101.com/r/YJ5xME/1

Here's an example where 100 lines out of 10,100 match: https://regex101.com/r/d1l5af/1

There's not much performance difference between the two. By contrast, this regex: ^XXX-[0-9]{4}-([A-Z](?<=[^XYT])){4}$

takes twice as long to match 10,000 lines.
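Timings like these depend heavily on the engine, so it is worth measuring in your own environment. Below is a rough Python timing harness, not a reproduction of the regex101 figures. The generated lines, the seed and the helper names make_lines and count_matches are assumptions for illustration, and the [ in the negated class is escaped for Python's re module.

```python
import random
import re
import timeit

PATTERNS = {
    "negated class": r"^XXX-[0-9]{4}-[^XYT -@\[-²]{4}$",
    "lookbehind": r"^XXX-[0-9]{4}-([A-Z](?<=[^XYT])){4}$",
    "lookahead": r"^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$",
}


def make_lines(count, seed=0):
    """Generate pseudo-random codes whose last segment is four A-Z letters."""
    rng = random.Random(seed)
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return [
        "XXX-%04d-%s" % (rng.randrange(10000), "".join(rng.choices(letters, k=4)))
        for _ in range(count)
    ]


def count_matches(compiled, lines):
    """Count the lines that the compiled pattern matches."""
    return sum(1 for line in lines if compiled.match(line))


lines = make_lines(10_100)
for name, source in PATTERNS.items():
    compiled = re.compile(source)
    seconds = timeit.timeit(lambda: count_matches(compiled, lines), number=5)
    matches = count_matches(compiled, lines)
    print(f"{name}: {matches} matches, {seconds:.3f}s for 5 passes")
```

Because the generated segments only contain A-Z, all three patterns should report the same number of matches; only the timings differ.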

This is also compatible with supplying the excluded letters from a variable, as requested.

Command-line example with bash:

Take a file stringfile containing lines such as:

  • XXX-0000-SNUR
    XXX-0000-FHDZ
    XXX-0000-+439
    XXX-0000-04X9
    XXX-0000-/1Y+
    XXX-0000-X/X9
    XXX-0000-Y6X9
    XXX-0000-XY16
    XXX-0000-0T94
    XXX-0000-++6Y
    XXX-0000-TT+3
    XXX-0000-NLNL
    XXX-0000-QPSE
    

Using a variable $exclude like this:

  • exclude="XYT"
    egrep "^XXX-[0-9]{4}-([^$exclude -@[-²]){4}$" < stringfile

Correct matches:

  • XXX-0000-SNUR
    XXX-0000-FHDZ
    XXX-0000-NLNL
    XXX-0000-QPSE

This is also compatible with extended regular expressions:

  • Use cases like GNU find with -iregex (when dealing with filenames)
  • egrep (grep -E)
  • Anything that supports Modern Regular Expressions as defined in POSIX 1003.2

Is it possible to build the exact expression before matching thousands of lines?

This is where efficiency meets precision. Command-line example:

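exclude="XYT"
# Build the allowed character class by deleting the excluded letters from A-Z.
alpha=ABCDEFGHIJKLMNOPQRSTUVWXYZ
customCharClass="$(echo "$alpha" | sed -E -e "s/[$exclude]//g")"

# Match with the precomputed class; no lookaround is needed.
egrep "^XXX-[0-9]{4}-([$customCharClass]){4}$" < stringfile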

      

This is the regex that actually gets applied to stringfile:

^XXX-[0-9]{4}-([ABCDEFGHIJKLMNOPQRSUVWZ]){4}$ 
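The same idea, building the allowed class once and then matching with a plain character class, can be sketched in Python. This is an illustrative equivalent of the shell snippet, not part of the original answer; the function name build_allowed_pattern is hypothetical, and the sample lines are a subset of the stringfile contents above.

```python
import re
import string


def build_allowed_pattern(excluded: str) -> re.Pattern:
    """Remove the excluded letters from A-Z and compile the explicit class."""
    allowed = "".join(ch for ch in string.ascii_uppercase if ch not in excluded)
    if not allowed:
        raise ValueError("every letter is excluded")
    return re.compile(rf"^XXX-[0-9]{{4}}-[{allowed}]{{4}}$")


pattern = build_allowed_pattern("XYT")
print(pattern.pattern)  # ^XXX-[0-9]{4}-[ABCDEFGHIJKLMNOPQRSUVWZ]{4}$

sample_lines = [
    "XXX-0000-SNUR",
    "XXX-0000-FHDZ",
    "XXX-0000-+439",
    "XXX-0000-04X9",
    "XXX-0000-XY16",
    "XXX-0000-NLNL",
    "XXX-0000-QPSE",
]
matching = [line for line in sample_lines if pattern.fullmatch(line)]
print(matching)
```

The matching list contains SNUR, FHDZ, NLNL and QPSE, in line with the correct matches listed earlier.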

      

(Please note: no X, Y or T.)

  • The example above provides easy portability and high performance, and covers general use when extended character sets are not involved.
  • An example like this one, where the program decides the most effective search criteria, is guaranteed to account for all scenarios.
  • Trade-offs and performance criteria take precedence. That is, generating a custom search string will certainly take longer when matching just one string; by contrast, that cost is negligible when searching thousands of files.

Another answer to this question complements this one.

This regex was posted by @mickmackusa:

^X{3}-\d{4}-(?![A-Z]{0,3}[XYT])[A-Z]{4}$

      

This regex works well and is cleaner than the alternative provided in this answer (however, it requires PCRE).

It performs a little slower (though it is by no means inefficient or wasteful), but it is guaranteed to match only [A-Z] (with XYT excluded).

This highlights the need to evaluate application-specific performance criteria when designing a regular expression that may require lookarounds.
