How to grep for exact hexadecimal value of characters
I am trying to use grep on the hexadecimal value of a UTF-8 encoded character range, and I only want that specific character range to be returned. I currently have this:
grep -P -n "[\xB9-\xBF]" $str_st_location >> output_st.txt
But this returns every character that has any of these hexadecimal values anywhere in its hexadecimal representation, i.e. it returns anything from 00B9 to FFB9 as long as B9 is present.
Is there a way that I can indicate with grep that I only want the exact / specific range of hex values that I am looking for?
Input example:
STRING_OPEN
Open
æ–å¼€
Ouvert
Abierto
Открыто
Abrir
Using my grep statement, it should return the 3rd line and 6th line, but it also returns text in my file that is Russian and Chinese, because the ranges for those languages include the hex values I'm looking for:
断开
I am unable to provide more input, unfortunately, as it is work related.
EDIT: Actually, the code snippet below worked!
grep -P -n "[\x{00B9}-\x{00BF}]" $str_st_location > output_st.txt
It found all the damaged characters and there were no false positives. The only problem is that the lines with damaged characters are automatically "repaired": when I open the output file, grep's output is the corrected version of the damaged characters. For example, it finds æ–å¼€ and in the text file it appears as 断开.
Since you are using -P, you are probably using GNU grep, because -P is a GNU grep extension. Your command works for me with GNU grep 2.21, pcre 8.37, and a UTF-8 locale; however, there have been bugs in the past with multibyte characters and character ranges. You are probably using an older version, or perhaps your locale is set to one that uses single-byte characters.
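To see which of those two cases applies, it may help to print the grep version and the character encoding of the current locale. This is only a quick check, assuming GNU grep and a system that provides the POSIX `locale` utility; the variable names are just for illustration.

```shell
# First line of the version banner, e.g. which GNU grep release is installed.
grep_version=$(grep --version | head -n 1)

# Character encoding of the current locale (UTF-8 on a UTF-8 locale).
charmap=$(locale charmap)

printf 'grep:    %s\ncharmap: %s\n' "$grep_version" "$charmap"
```

If the charmap is not UTF-8, grep is not treating C2 B9 as a single character, and a range such as [\xB9-\xBF] is compared against individual bytes.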
If you don't want to upgrade, you can match this character range by matching individual bytes, which should work in older versions. You will need to convert the characters to bytes and search for the byte values. Assuming UTF-8, U+00B9 is C2 B9 and U+00BF is C2 BF. Setting LC_CTYPE to a locale that uses single-byte characters (for example C) ensures that grep matches individual bytes even on versions that correctly support multibyte characters.
LC_CTYPE=C grep -P -n "\xC2[\xB9-\xBF]" "$str_st_location" >> output_st.txt
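As a rough way to try the byte-level match without touching the real file, here is a small sketch on a made-up sample. The file contents are my assumption: line 2 holds the bytes that the damaged text would have if it is stored as UTF-8, and line 3 holds a correctly encoded 断开. I use LC_ALL=C here rather than LC_CTYPE=C because LC_ALL, when set in the environment, overrides LC_CTYPE. The -a flag is added only so that grep versions which classify high bytes as binary data still print the matching line.

```shell
# Hypothetical sample file (contents are an assumption, not the real data):
#   line 2: damaged text stored as UTF-8, which contains the bytes C2 BC (U+00BC)
#   line 3: correctly encoded 断开, bytes E6 96 AD E5 BC 80, no C2 byte
sample=$(mktemp)
printf 'Open\n\303\246\342\200\223\302\255\303\245\302\274\342\202\254\n\346\226\255\345\274\200\nOuvert\n' > "$sample"

# Raw bytes of U+00B9 and U+00BF in UTF-8: c2 b9 c2 bf
printf '\302\271\302\277' | od -An -tx1

# Byte-wise match: a C2 byte followed by a byte in the range B9..BF.
matches=$(LC_ALL=C grep -a -n -P '\xC2[\xB9-\xBF]' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

Only line 2 should be reported. The correctly encoded line does contain a BC byte, which is why the bare [\xB9-\xBF] range matched it, but that byte is not preceded by C2.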