Problems with regular expressions and Cyrillic letters

I've had problems with regex and Cyrillic letters in the past, so I'm wondering whether I'm doing something wrong.

Here are two reproducible examples:

Example 1 - Problem with lookahead and lookbehind assertions:

latin <- "city New York, Manhattan\n1st Avenue"
cyrilic <- "  , \n1 "

stringr::str_extract(latin, pattern = "(?<=city New York, )[\\w\\s]+(?=\n)")
#returns: Manhattan

stringr::str_extract(cyrilic, pattern = "(?<=  , )[\\w\\s]+(?=\n)")
stringr::str_extract(cyrilic, pattern = "(?<=  , ).+(?=\n)")
#both return: NA

      

Example 2 - Problem with grep and ignore.case = TRUE:

randomWord <- ""

grep(pattern = "", x = randomWord, ignore.case = T)
#returns: integer(0)

      

Any ideas on how to write regular expressions so that they work with Cyrillic text?

My default text encoding is UTF-8 and here is my sessionInfo:

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Bulgarian_Bulgaria.1251  LC_CTYPE=Bulgarian_Bulgaria.1251   
[3] LC_MONETARY=Bulgarian_Bulgaria.1251 LC_NUMERIC=C                       
[5] LC_TIME=Bulgarian_Bulgaria.1251 

      



2 answers


I'm not sure why str_extract returns NA in this case, since the expression appears to be valid. However, str_locate and str_detect work as expected:

stringr::str_detect(cyrilic, "(?<=  , )[\\w\\s]+(?=\n)")
#returns TRUE
stringr::str_locate(cyrilic, "(?<=  , )[\\w\\s]+(?=\n)")
#returns the start and end positions of the match

      



A workaround for your problem would be to use substr() in combination with str_locate:

# Locate the match once, then cut it out by its start and end positions
loc <- stringr::str_locate(cyrilic, "(?<=  , )[\\w\\s]+(?=\n)")
substr(cyrilic, loc[1], loc[2])
#returns ''

      



Perhaps the problem lies in how ICU handles the pattern it receives from stringr's str_extract: it seems that the lookbehind pattern it gets no longer has a known width. Or there is some other bug in str_extract.

In this case it is much safer to use str_match, which has no problem with the pattern length:

> str_match(cyrilic, pattern = "  ,\\s*([\\w\\s]+)\n")[,2]
[1] ""

      

Then just pick the right capture group; here it is the second column of the resulting matrix.
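
A slightly fuller sketch of the capture-group approach follows. The sample address and the names address, prefix and district are hypothetical stand-ins (not the asker's data), and the Cyrillic text is written with \u escapes so that the script does not depend on the encoding of the source file. It assumes stringr is installed.

```r
library(stringr)

# Hypothetical sample text, equivalent to "град София, Лозенец\n1 улица"
address <- "\u0433\u0440\u0430\u0434 \u0421\u043e\u0444\u0438\u044f, \u041b\u043e\u0437\u0435\u043d\u0435\u0446\n1 \u0443\u043b\u0438\u0446\u0430"

# Literal prefix to anchor on ("София,"), then one capture group for the
# text between the prefix and the first line break
prefix  <- "\u0421\u043e\u0444\u0438\u044f,"
pattern <- paste0(prefix, "\\s*([\\w\\s]+)\n")

# str_match returns a character matrix: column 1 is the whole match,
# column 2 is the first capture group
m <- str_match(address, pattern)
district <- m[, 2]
print(district)
```

Because the prefix is consumed by the match instead of being tested in a lookbehind, the pattern does not rely on the regex engine computing a fixed lookbehind width.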



Regarding the TRE regex you used with grep, I have also observed various problems in different environments. On my Windows 7 machine, your code returns 1. However, a TRE regex with literal Unicode letters may fail, and a PCRE regex is the better choice. To make the pattern fully Unicode-aware, add the (*UCP) PCRE verb at the start of the pattern, so that \w, \d, etc. can match all Unicode characters. The verb is not needed here, though, and

> randomWord <- ""
> grep(pattern = "", x = randomWord, ignore.case = T, perl=TRUE)
[1] 1

      

will work equally well.
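
As a minimal sketch of the (*UCP) verb mentioned above, the following uses only base R. The sample word and the variable names are hypothetical (not the asker's data), and the Cyrillic letters are written with \u escapes so the result does not depend on the encoding of the script file.

```r
# Hypothetical sample word, equivalent to "София"
word  <- "\u0421\u043e\u0444\u0438\u044f"
# The same word in lower case, used as the search pattern
lower <- "\u0441\u043e\u0444\u0438\u044f"

# With perl = TRUE the pattern is handled by PCRE; the leading (*UCP)
# verb makes \w use Unicode properties, so it also covers Cyrillic letters
all_word_chars <- grepl("(*UCP)^\\w+$", word, perl = TRUE)

# Case-insensitive match of the literal Cyrillic pattern under PCRE
ci_match <- grepl(lower, word, ignore.case = TRUE, perl = TRUE)

print(all_word_chars)
print(ci_match)
```

On a Windows machine with a non-UTF-8 locale such as the one in the question, building the strings from \u escapes (or converting them with enc2utf8) is one way to make sure the input is marked as UTF-8 before it reaches the regex engine.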
