Strange behavior of BASH glob / regex ranges

Question

I'm seeing Bash glob / regex ranges (e.g. [A-Z]) behave in unexpected ways.
Is there an explanation for this behavior, or is this a bug?

Let's say I have a variable that I want to remove all uppercase letters from:

$ var='ABCDabcd0123'
$ echo "${var//[A-Z]/}"

As a result, I get the following:

a0123

If I do it with sed, I get the expected result:

$ echo "${var}" | sed 's/[A-Z]//g'
abcd0123

The same thing happens with Bash's inline regex matching:

$ [[ a =~ [A-Z] ]] ; echo $?
1
$ [[ b =~ [A-Z] ]] ; echo $?
0

If I check all lowercase letters from 'a' to 'z', it seems that only "a" is an exception:

$ for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done
a

I don't have case-insensitive matching enabled, and even if I did, it shouldn't make "a" behave differently from the other letters:

$ shopt -p nocasematch
shopt -u nocasematch

For reference, I'm using Cygwin and I don't see this behavior on any other machine:

$ uname
CYGWIN_NT-6.3
$ bash --version | head -1
GNU bash, version 4.3.46(7)-release (x86_64-unknown-cygwin)
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=

EDIT:

I found the exact same issue reported here: https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687
So I'm guessing it's a bug (?) in the "en_GB.UTF-8" collation rather than in Bash.
Setting LC_COLLATE=C does resolve it.

bash regex shell cygwin glob

Thunderbeef Apr 17 17 at 9:23 am

1 answer

Inian · Accepted Answer · 2017-04-17T09:46:07+0000

It certainly has to do with your locale setting. An excerpt from the GNU bash man page, under Pattern Matching:

[..] in the default C locale, [a-dx-z] is equivalent to [abcdxyz]. Many locales sort characters in dictionary order, and in these locales [a-dx-z] is typically not equivalent to [abcdxyz]; it might be equivalent to [aBbCcDdxXyYz], for example. To obtain the traditional interpretation of ranges in bracket expressions, you can force the use of the C locale by setting the LC_COLLATE or LC_ALL environment variable to the value C, or enable the globasciiranges shell option. [..]
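The globasciiranges route mentioned in the excerpt can be sketched roughly as follows. This assumes bash 4.3 or later, where the option exists, and the variable name stripped is only for illustration. As far as I know the option applies to glob-style patterns (as in parameter expansion), so I would not rely on it for the =~ operator.

```shell
#!/usr/bin/env bash
# Sketch: make glob ranges such as [A-Z] use C-locale ordering
# without changing LC_COLLATE or LC_ALL.
shopt -s globasciiranges    # shell option, assumed available (bash 4.3+)

var='ABCDabcd0123'
stripped=${var//[A-Z]/}     # glob pattern inside a parameter expansion
printf '%s\n' "$stripped"   # prints abcd0123
```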

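Your output is consistent with a collation order like aAbBcC...zZ, in which the range from A to Z takes in every lowercase letter except a. In this case, either use the POSIX character classes, e.g. [[:upper:]], or set LC_ALL or LC_COLLATE to C as mentioned above.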

LC_ALL=C var='ABCDabcd0123'
echo "${var//[A-Z]/}"
abcd0123
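If you would rather not change the locale for the whole session, one option is to confine the override to a function. This is only a sketch: strip_upper_c is a made-up helper name, and it relies on bash applying a local assignment to LC_ALL for the duration of the function call.

```shell
#!/usr/bin/env bash
# strip_upper_c is a hypothetical helper, not something bash provides.
# It removes the ASCII capitals A to Z from its first argument.
strip_upper_c() {
    local LC_ALL=C          # C collation only while this function runs
    local s=$1
    printf '%s\n' "${s//[A-Z]/}"
}

strip_upper_c 'ABCDabcd0123'   # prints abcd0123
```

The caller's LC_ALL is left as it was once the function returns, so the rest of the script keeps its normal locale behaviour.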

Also, with this locale set, your uppercase test now fails for every lowercase letter, so the loop prints all of them:

LC_ALL=C; for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done

In addition, under the above locale setting:

[[ a =~ [A-Z] ]] ; echo $?
1
[[ b =~ [A-Z] ]] ; echo $?
1

but the match succeeds against the lowercase range:

[[ a =~ [a-z] ]] ; echo $?
0
[[ b =~ [a-z] ]] ; echo $?
0

That being said, all of this can be avoided by using the POSIX-specified character classes, in a new shell without any locale setting:

echo "${var//[[:upper:]]/}"
abcd0123

and

for l in {a..z}; do [[ $l =~ [[:upper:]] ]] || echo $l; done
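The same check can be wrapped in a small function and run over both cases to see that only the lowercase letters are reported. A minimal sketch; is_upper and not_upper are names I made up for the example.

```shell
#!/usr/bin/env bash
# is_upper is a hypothetical helper: succeeds when its argument is
# exactly one uppercase letter according to the POSIX class.
is_upper() {
    [[ $1 =~ ^[[:upper:]]$ ]]
}

not_upper=''
for l in {a..z} {A..Z}; do
    is_upper "$l" || not_upper+=$l   # collect letters that are not uppercase
done
printf '%s\n' "$not_upper"           # prints abcdefghijklmnopqrstuvwxyz
```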
