Diacritics and Regular Expressions in R

In R, I have a column that should only contain one word. It is created from the content of another column with a regex that keeps only the last word. However, for some rows this doesn't work, in which case R simply copies the whole content of the source column. Here is my R code:

df$precedingWord <- gsub(".*?\\W*(\\w+-?)\\W*$","\\1", df$leftContext, perl=TRUE)

      

precedingWord should only contain one word. It is retrieved from leftContext with a regular expression. This works fine, except with diacritics. Several rows in leftContext contain accented letters such as é and à. For some reason, R does not apply the regex to these rows and just copies the whole string into precedingWord. I find this odd, because the regex should hardly ever match the entire string, as an example in an online regex tester shows. In that example, the test string is a leftContext value and the substitution should be the precedingWord value.

As you can see in the example above, the output of the online regex tester differs from the output I get in R, where I just get an exact copy of leftContext. That does not mean the result in the online tester is what I need, either. The tester treats accented letters as non-word characters, so it does not return the result I want. I want accented letters to be treated as word characters, so that they are included in the match.

If this is the input:

Un premier projet prévoit que l'établissement verserait 11 FF par an et par élève du secondaire et 30 FF par étudiant universitaire, une somme à évaluer et à  
Outre le prêt-à- 
And à 
Sur base de ces données, on cherchera à 
Ce sera encore le cas ce vendredi 19 juillet dans l'é

      

Then this is the result that I expect:

à
prêt-à-
à
à
é

      

This is the regex I already have:

.*?\W*(\w+?-?)\W*$

      

I am already using stringi in my project, so a solution based on it would be fine too.



1 answer


In a Perl-like regex, you can match any Unicode letter with the shorthand class \p{L}, and any character that is not a letter with the inverse class \P{L}. See regular-expressions.info:

You can match one character belonging to the "letter" category with \p{L}. You can match one character not in this category with \P{L}.
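To make the difference between the two classes concrete, here is a small sketch in base R. The sample string and the variable names are made up for illustration. The accented letters are written as \u escapes so that the string is UTF-8 regardless of the locale.

```r
# Hypothetical sample string: "l'été 2019"
x <- "l'\u00e9t\u00e9 2019"

# \P{L} matches every character that is NOT a letter; deleting those
# leaves only the letters, accented ones included.
only_letters <- gsub("\\P{L}", "", x, perl = TRUE)

# \p{L} matches every letter; deleting those leaves the apostrophe,
# the space and the digits.
only_non_letters <- gsub("\\p{L}", "", x, perl = TRUE)

print(only_letters)      # "lété"
print(only_non_letters)  # "' 2019"
```

Note that perl = TRUE is needed here, since \p{L} is a PCRE feature.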



Thus, the regex you can use is

df$precedingWord <- gsub(".*?\\P{L}*(\\p{L}+-?)\\P{L}*$","\\1", df$leftContext, perl=TRUE)
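As a quick check, the sketch below runs this pattern on the sample rows from the question (the first row is shortened, and the accented letters are written as \u escapes so the strings are UTF-8 in any locale). The names last_word, compound_pattern, last_compound and last_stringi are only illustrative. The second pattern is my own variant, assuming that hyphen-joined parts such as prêt-à- should be kept together. The stringi call is one possible alternative, since the question mentions that package, and it only runs if stringi is installed.

```r
# Sample rows from the question (first row shortened).
leftContext <- c(
  "une somme \u00e0 \u00e9valuer et \u00e0  ",           # "... à évaluer et à  "
  "Outre le pr\u00eat-\u00e0- ",                          # "Outre le prêt-à- "
  "And \u00e0 ",                                          # "And à "
  "Sur base de ces donn\u00e9es, on cherchera \u00e0 ",   # "... on cherchera à "
  "Ce sera encore le cas ce vendredi 19 juillet dans l'\u00e9"
)

# Pattern from the answer: the last run of letters plus an optional
# trailing hyphen.
last_word <- gsub(".*?\\P{L}*(\\p{L}+-?)\\P{L}*$", "\\1",
                  leftContext, perl = TRUE)
print(last_word)      # "à" "à-" "à" "à" "é"

# Variant (an assumption about the desired behaviour): also keep earlier
# hyphen-joined letter runs, so a compound like "prêt-à-" stays whole.
compound_pattern <- ".*?\\P{L}*(\\p{L}+(?:-\\p{L}+)*-?)\\P{L}*$"
last_compound <- gsub(compound_pattern, "\\1", leftContext, perl = TRUE)
print(last_compound)  # "à" "prêt-à-" "à" "à" "é"

# Possible stringi alternative: extract the last match directly instead
# of substituting the whole string.
if (requireNamespace("stringi", quietly = TRUE)) {
  last_stringi <- stringi::stri_extract_last_regex(
    leftContext, "\\p{L}+(?:-\\p{L}+)*-?"
  )
  print(last_stringi)
}
```

As far as I can tell, the pattern from the answer returns à- rather than prêt-à- for the second row, because the capture group only allows a single run of letters. If the expected output from the question is required, the variant or the stringi call may be the better fit.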

      
