Different results from dplyr filter starting with identical data

When I tried to answer this question, I came across some very strange behavior. Below I define the same data twice, once with data.frame alone and a second time using mutate. I check that the results are identical. Then I try to perform the same filtering operation. For the first dataset this works, but for the second (identical) dataset it fails. Can anyone explain why?

It looks like the reason for this difference is the use of ñ. But I don't understand why this is a problem for the second dataset but not the first.

library(dplyr)
# define the same data twice
datos1 <- data.frame(año = 2001:2005, gedad = c(letters[1:5]), año2 = 2001:2005)  
datos2 <- data.frame(año = 2001:2005, gedad = c(letters[1:5])) %>% mutate(año2 = año) 
# check that they are identical
identical(datos1, datos2)
# do same operation
datos1 %>% filter(año2 >= 2003)
## año gedad año2
## 1 2003     c 2003
## 2 2004     d 2004
## 3 2005     e 2005
datos2 %>% filter(año2 >= 2003)
## Error in filter_impl(.data, dots) : object 'año2' not found

      

Note. I don't believe this is a duplicate of the original question because I am asking why this difference occurs and the original post asks how to fix it.

EDIT: Since @Khashaa was unable to reproduce the error, here is the output of sessionInfo():

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252    LC_MONETARY=German_Switzerland.1252
## [4] LC_NUMERIC=C                        LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.4.1
## 
## loaded via a namespace (and not attached):
## [1] assertthat_0.1  DBI_0.3.1       lazyeval_0.1.10 magrittr_1.5    parallel_3.1.2  Rcpp_0.11.4     tools_3.1.2  

      


1 answer


I was able to reproduce the error on my machine, which has a Greek system locale, by switching the R locale to German_Switzerland.1252. I also noticed that the variable name changed to aρo2 in the second case, both in the column header and in the error message.

It appears that mutate uses the system locale when creating the new column name, which results in a conversion if the system locale is not the same as the locale used by the console. I was able to run a query against datos2 using the changed column name without an error:

library(dplyr)
Sys.setlocale("LC_ALL","German_Switzerland.1252")
datos1 <- data.frame(año = 2001:2005, gedad = c(letters[1:5]), año2 = 2001:2005)  
datos2 <- data.frame(año = 2001:2005, gedad = c(letters[1:5])) %>% mutate(año2 = año) 

datos1 %>% filter(año2 >= 2003)
##   aρo gedad aρo2
## 1 2003     c 2003
## 2 2004     d 2004
## 3 2005     e 2005
datos2 %>% filter(año2 >= 2003)
##  Error in filter_impl(.data, dots) : object 'aρo2' not found
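# note: "aρo2" in quotes is a string literal, not a column reference, so the
# comparison is TRUE for every row and nothing is actually filtered out
datos2 %>% filter("aρo2" >= 2003)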
## aρo gedad aρo2
## 1 2001     a 2001
## 2 2002     b 2002
## 3 2003     c 2003
## 4 2004     d 2004
## 5 2005     e 2005

      

The fact that ñ appeared in both cases in the original question probably means that the machine's system locale is set to code page 850, a Latin code page in which accented characters have different codes than in Windows-1252.
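The code page argument above can be checked at the byte level. The sketch below is only an illustration, not part of the original reproduction. It assumes that the platform's iconv recognizes the encoding names "CP1252", "CP850" and "CP1253", which can vary between systems. It shows that ñ is a different byte in each Latin code page, and that the Windows-1252 byte for ñ reads as ρ in the Greek code page, which would be consistent with the aρo seen above.

```r
# "\u00f1" is n-tilde written as a Unicode escape, so this snippet is ASCII-only
n_tilde <- "\u00f1"

# Bytes of the same character in three encodings
utf8_bytes   <- charToRaw(enc2utf8(n_tilde))
cp1252_bytes <- iconv(n_tilde, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]
cp850_bytes  <- iconv(n_tilde, from = "UTF-8", to = "CP850",  toRaw = TRUE)[[1]]

print(utf8_bytes)
print(cp1252_bytes)
print(cp850_bytes)

# Reinterpret the Windows-1252 byte as Windows-1253 (Greek)
as_greek <- iconv(rawToChar(cp1252_bytes), from = "CP1253", to = "UTF-8")

# TRUE if the byte decodes to the Greek small letter rho
print(as_greek == "\u03c1")
```

If the byte values differ between the code pages, any step that decodes a name with the wrong code page will silently produce a different character.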

It is "interesting" that:

names(datos2)[[1]]==names(datos1)[[1]]
## [1] TRUE

      

Because

names(datos1)[[1]]
## [1] "aρo"

      



and

names(datos2)[[1]]
## [1] "aρo"

      

This would mean that R itself makes a mess of the conversions and that it is filter that does the correct conversion.
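One possible reason why == reports TRUE here is that R translates strings with different declared encodings to a common encoding before comparing them, so two names can compare equal while being stored as different bytes. The sketch below illustrates that general behavior with two hypothetical strings, x_utf8 and x_latin1. It does not reproduce the dplyr issue itself and makes no claim about what mutate does internally.

```r
# The same text stored in two different encodings
x_utf8   <- "a\u00f1o"                                      # marked as UTF-8
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")    # marked as latin1

# The declared encodings differ
print(Encoding(c(x_utf8, x_latin1)))

# The comparison translates both sides first, so the strings are equal
same_text <- x_utf8 == x_latin1
print(same_text)

# The underlying bytes are not the same
same_bytes <- identical(charToRaw(x_utf8), charToRaw(x_latin1))
print(same_bytes)
```

Printing Encoding(names(df)) and charToRaw() for a suspicious column name is a quick way to see whether two names that look and compare the same are stored differently.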

The moral of it all: don't use non-English characters in names, or else make sure that R is using the same locale as the machine (which is rather fragile).
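A minimal sketch of the first option, using base R only. The helper non_ascii_names() and the replacement name anio are made up for this example. The helper flags column names that contain any byte above 127, so they can be renamed to plain ASCII before they reach dplyr.

```r
# Hypothetical helper: return the column names that contain non-ASCII bytes
non_ascii_names <- function(df) {
  nms <- names(df)
  is_non_ascii <- vapply(
    nms,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1)
  )
  nms[is_non_ascii]
}

# Build a small data frame and give the first column a non-ASCII name
datos <- data.frame(a = 2001:2005, gedad = letters[1:5])
names(datos)[1] <- "a\u00f1o"

# Find the offending names
flagged <- non_ascii_names(datos)
print(flagged)

# Rename them by hand to an ASCII-only alternative
names(datos)[names(datos) %in% flagged] <- "anio"
print(names(datos))
```

Renaming by hand is deliberate here: automatic transliteration through iconv gives platform-dependent results, so an explicit mapping is more predictable.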

UPDATE

There is semi-official confirmation that R does indeed go through the system locale, as it assumes that this is in fact the locale used by the system. Windows, however, uses UTF-16 everywhere, and the "System Locale" is exactly what its description in the "Locales" section says: the locale used for legacy non-Unicode applications.

If I remember correctly, the "System Locale" used to be the basis of the entire system (including the user interface language, etc.) prior to Windows 2000 and NT. Nowadays each user can have a different UI language, but the name has stuck.
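To see which locale and encoding an R session is actually working with, something like the following can be run. It is a small sketch that only uses base R. The codepage element of l10n_info() is reported on Windows only, so it is read defensively here.

```r
# Locale category that controls character handling
ctype <- Sys.getlocale("LC_CTYPE")
cat("LC_CTYPE:", ctype, "\n")

# Summary of the current locale's encoding properties
info <- l10n_info()
cat("UTF-8 locale:", info[["UTF-8"]], "\n")
cat("Latin-1 locale:", info[["Latin-1"]], "\n")

# Only present on Windows
if (!is.null(info$codepage)) {
  cat("Code page:", info$codepage, "\n")
}
```

Comparing this output between the machine where the error appears and one where it does not is a reasonable first diagnostic step.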
