Problems with regular expressions and Cyrillic letters
I've had problems with regex and cyrillic letters in the past, so I was wondering if there is anything I am doing wrong?
Here are two reproducible examples:
Example 1 - Problem with lookahead and lookbehind assertions:
latin <- "city New York, Manhattan\n1st Avenue"
cyrilic <- " , \n1 "
stringr::str_extract(latin, pattern = "(?<=city New York, )[\\w\\s]+(?=\n)")
#returns: Manhattan
stringr::str_extract(cyrilic, pattern = "(?<= , )[\\w\\s]+(?=\n)")
stringr::str_extract(cyrilic, pattern = "(?<= , ).+(?=\n)")
#both return: NA
Example 2 - problem with grep ignore.case = TRUE:
randomWord <- ""
grep(pattern = "", x = randomWord, ignore.case = T)
#returns: integer(0)
Any ideas on how to write regular expressions to make them work in Cyrillic?
My default text encoding is UTF-8 and here is my sessionInfo:
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Bulgarian_Bulgaria.1251 LC_CTYPE=Bulgarian_Bulgaria.1251
[3] LC_MONETARY=Bulgarian_Bulgaria.1251 LC_NUMERIC=C
[5] LC_TIME=Bulgarian_Bulgaria.1251
I'm not sure why str_extract returns NA in this case, since the expression looks valid. However, str_locate and str_detect both work as expected:
stringr::str_detect(cyrilic, "(?<= , )[\\w\\s]+(?=\n)")
#returns TRUE
stringr::str_locate(cyrilic, "(?<= , )[\\w\\s]+(?=\n)")
#returns the start and end positions of the match
A workaround for your problem would be to use substr() in combination with str_locate:
# locate the match once, then cut it out by start and end position
loc <- stringr::str_locate(cyrilic, "(?<= , )[\\w\\s]+(?=\n)")
substr(cyrilic, loc[1], loc[2])
#returns ''
Perhaps the problem lies in how ICU handles the pattern it receives from stringr's str_extract: it seems the lookbehind it gets no longer has a known width. Or there is some other bug in str_extract.
In this case it is much safer to use str_match with a capturing group, which has no problem with pattern length:
> str_match(cyrilic, pattern = " ,\\s*([\\w\\s]+)\n")[,2]
[1] ""
Just access the right group; here it is the second column of the result, which holds the first capturing group.
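As a rough sketch of the capture-group approach, the snippet below uses a hypothetical Cyrillic sample string (it is not the text from the question) written with \u escapes, so the script itself does not depend on the file or console encoding. It assumes stringr is installed.

```r
library(stringr)

# Hypothetical sample text (an assumption, not the asker's data):
# a city name, a comma, a district name, then a second line.
txt <- "\u0433\u0440\u0430\u0434 \u0421\u043e\u0444\u0438\u044f, \u0426\u0435\u043d\u0442\u044a\u0440\n1 \u0443\u043b\u0438\u0446\u0430"

# Match the city name and comma literally, then capture everything
# up to the line break in group 1 instead of using lookarounds.
m <- str_match(txt, "\u0421\u043e\u0444\u0438\u044f,\\s*([\\w\\s]+)\n")

# Column 1 is the whole match, column 2 is the first capturing group.
district <- m[, 2]
print(district)
```

Because the captured text is returned as its own column, no lookbehind is needed.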
Regarding the TRE regex you used with grep, I have also observed various problems in different environments. On my Windows 7 machine, your code returns 1. However, a TRE regex containing literal Unicode letters may fail, so a PCRE regex (perl=TRUE) is the better choice. To make it fully Unicode-aware, add the (*UCP) PCRE verb at the start of the pattern so that \w, \d, etc. can match all Unicode characters. The verb is not needed here, and
> randomWord <- ""
> grep(pattern = "", x = randomWord, ignore.case = T, perl=TRUE)
[1] 1
will work equally well.
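To illustrate both points, here is a minimal base-R sketch. The sample words are hypothetical and written with \u escapes, so they are marked as UTF-8 regardless of the session locale. Both results should be TRUE on an R build whose PCRE has Unicode property support.

```r
# Hypothetical sample words (assumptions, not the asker's data):
lower_word <- "\u043c\u043e\u0441\u043a\u0432\u0430"  # a Cyrillic word in lower case
upper_word <- "\u041c\u041e\u0421\u041a\u0412\u0410"  # the same word in upper case

# Case-insensitive literal match with the PCRE engine instead of TRE.
case_hit <- grepl(lower_word, upper_word, ignore.case = TRUE, perl = TRUE)

# (*UCP) at the very start of the pattern asks PCRE to use Unicode
# properties, so \w can match Cyrillic letters as well as ASCII ones.
ucp_hit <- grepl("(*UCP)^\\w+$", lower_word, perl = TRUE)

print(case_hit)
print(ucp_hit)
```

The (*UCP) verb only matters once the pattern relies on classes such as \w or \d. A purely literal pattern like the first one does not need it.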