Strange behavior of BASH glob / regex ranges
I'm seeing Bash glob/regex ranges (e.g. [A-Z]) behave in unexpected ways.
Is there an explanation for this behavior, or is this a bug?
Let's say I have a variable that I want to remove all uppercase letters from:
$ var='ABCDabcd0123'
$ echo "${var//[A-Z]/}"
As a result, I get the following:
a0123
If I do it with sed, I get the expected result:
$ echo "${var}" | sed 's/[A-Z]//g'
abcd0123
The same thing happens with Bash's built-in regex matching:
$ [[ a =~ [A-Z] ]] ; echo $?
1
$ [[ b =~ [A-Z] ]] ; echo $?
0
If I check all lowercase letters from 'a' to 'z', it seems that only "a" is an exception:
$ for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done
a
I don't have case-insensitive matching, and even if I did, it shouldn't cause the "a" to behave differently:
$ shopt -p nocasematch
shopt -u nocasematch
For reference, I'm using Cygwin and I don't see this behavior on any other machine:
$ uname
CYGWIN_NT-6.3
$ bash --version | head -1
GNU bash, version 4.3.46(7)-release (x86_64-unknown-cygwin)
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=
EDIT:
I found the exact same issue reported here:
https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687
So I'm guessing it's a bug (?) in the "en_GB.UTF-8" collation rather than in Bash.
Setting LC_COLLATE=C does resolve it.
This certainly has to do with your locale setting. An excerpt from the GNU bash man page, under Pattern Matching:
[..] in the default C locale, [a-dx-z] is equivalent to [abcdxyz]. Many locales sort characters in dictionary order, and in these locales [a-dx-z] is typically not equivalent to [abcdxyz]; it might be equivalent to [aBbCcDdxXyYz], for example. To obtain the traditional interpretation of ranges in bracket expressions, you can force the use of the C locale by setting the LC_COLLATE or LC_ALL environment variable to the value C, or enable the globasciiranges shell option. [..]
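To see what the excerpt describes on your own machine, here is a minimal sketch that lists which letters the glob range [A-Z] matches, first with globasciiranges off and then with it on. It assumes bash 4.3 or later (where that option exists). range_members is a hypothetical helper name, not something bash provides. The first line of output depends on your locale, so no expected output is shown for it.

```shell
#!/usr/bin/env bash
# Sketch only: compare [A-Z] membership with and without globasciiranges.
# Assumes bash 4.3 or later; range_members is a made-up helper name.

range_members() {
  local ch out=
  for ch in {a..z} {A..Z}; do
    # Inside [[ ]], the right-hand side of == is matched as a glob pattern.
    [[ $ch == [A-Z] ]] && out+=$ch
  done
  printf '%s\n' "$out"
}

shopt -u globasciiranges   # ranges follow the current locale's collation
locale_members=$(range_members)

shopt -s globasciiranges   # ranges follow the traditional C-locale order
ascii_members=$(range_members)

echo "locale order: $locale_members"
echo "ASCII order:  $ascii_members"
```

Note that the option is documented for bracket expressions in pattern matching; this sketch does not check whether it has any effect on the =~ regex operator.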
In this case, either use the POSIX character classes, e.g. [[:upper:]], or set the locale variable LC_ALL or LC_COLLATE to C as described above.
LC_ALL=C var='ABCDabcd0123'
echo "${var//[A-Z]/}"
abcd0123
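If you would rather not change the locale for the whole shell session, one option is to confine the override to a function. This is a sketch under that assumption. strip_upper is a hypothetical name chosen for illustration, and it relies on bash restoring a local LC_ALL when the function returns.

```shell
#!/usr/bin/env bash
# Sketch only: strip_upper is a made-up helper, not a bash builtin.
strip_upper() {
  local LC_ALL=C                 # override applies only inside this function
  printf '%s\n' "${1//[A-Z]/}"   # remove every character in the range A-Z
}

var='ABCDabcd0123'
result=$(strip_upper "$var")
echo "$result"
```

With the sample value this should print abcd0123, and LC_ALL in the calling shell is left as it was.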
Also, with this locale set, your negated uppercase check now fails to match every lowercase letter, so the loop prints all of them:
LC_ALL=C; for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done
Likewise, under the above locale setting,
[[ a =~ [A-Z] ]] ; echo $?
1
[[ b =~ [A-Z] ]] ; echo $?
1
but the match succeeds for lowercase letters when you use the lowercase range,
[[ a =~ [a-z] ]] ; echo $?
0
[[ b =~ [a-z] ]] ; echo $?
0
That being said, all of this can be avoided by using the POSIX-specified character classes, which work in a new shell without any locale setting,
echo "${var//[[:upper:]]/}"
abcd0123
and
for l in {a..z}; do [[ $l =~ [[:upper:]] ]] || echo $l; done
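The same character classes also work in a case statement, which can be handy when you need to sort characters into more than one bucket. A minimal sketch follows; classify is a hypothetical helper name used only for this example.

```shell
#!/usr/bin/env bash
# Sketch only: classify is a made-up helper that labels a single character
# using POSIX character classes instead of locale-dependent ranges.
classify() {
  case $1 in
    [[:upper:]]) echo upper ;;
    [[:lower:]]) echo lower ;;
    [[:digit:]]) echo digit ;;
    *)           echo other ;;
  esac
}

for ch in A a 0 -; do
  printf '%s: %s\n' "$ch" "$(classify "$ch")"
done
```

Because the classes are defined by character type and not by collation order, the ASCII letter 'a' should be reported as lower here regardless of LC_COLLATE.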