Non-ASCII characters in R, reading from a .sav file

I am trying to read a .sav file in RStudio. The file contains data from a Spanish poll, and when I read it in R - although my default text encoding is already set to ISO-8859-1 - the special characters are not displayed correctly.

For example, the word "Camión" is displayed as

"Cami<c3><b3>n" 

      

although it is correctly displayed as "Camión" in PSPP.

This is what I did:

install.packages("memisc")
library(memisc)
jcv2014 <- as.data.set(spss.system.file('myfile.sav'))

      

Later I wanted to create a list of only variable labels, so I did the following:

library(foreign)
jcv2014.spss <- read.spss("myfile.sav", to.data.frame=FALSE, use.value.labels=FALSE)
jcv2014_vars <- attr(jcv2014.spss, "variable.labels")

      

(I'm not sure if this is the best way to do it, but it worked)

Anyway, this time I still didn't get the correct accents, but the garbled output looked different:

The variable label, which should have been "¿Qué calificación le daría ...", appeared as

"\302\277Qu\303\251 calificaci\303\263n le dar\303\255a..."

      

I'm not sure how to get the correct characters, but they are displayed correctly in PSPP. I've tried changing the default text encoding in RStudio to both ISO-8859-1 and UTF-8, to no avail. I don't know what encoding the original file uses, but I guessed it would be one of those two.

Any ideas?

In case it helps: I am running R version 3.1.1 on OS X Yosemite 10.10.1, and I am using PSPP, not SPSS.

Thanks a lot in advance !!!



1 answer


Can you just set the encoding after you've read the data?

# Here is your string
s <- "\302\277Qu\303\251 calificaci\303\263n le dar\303\255a..."

# it has no declared encoding
Encoding(s)
# [1] "unknown"

# but if you tell iconv() the input is UTF-8, it shows up correctly
iconv(s, 'UTF-8')
# [1] "¿Qué calificación le daría..."

# This also works
Encoding(s) <- 'UTF-8'
s
# [1] "¿Qué calificación le daría..."
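
The same idea can be applied to a whole vector of labels or to a data frame, not just a single string. Below is a minimal sketch, assuming the bytes in the file really are UTF-8 (which the `<c3><b3>` and `\303\263` output both suggest). The `labels` vector stands in for `attr(jcv2014.spss, "variable.labels")`, and `mark_utf8()` is a hypothetical helper, not part of foreign or memisc.

```r
# Hypothetical stand-in for attr(jcv2014.spss, "variable.labels"):
# a named character vector whose bytes are UTF-8 but carry no encoding mark.
labels <- c(P1 = "\302\277Qu\303\251 calificaci\303\263n le dar\303\255a...",
            P2 = "Cami\303\263n")

# The two garbled forms in the question are the same bytes:
# octal \303\263 is hex c3 b3, the UTF-8 encoding of "ó".
charToRaw("\303\263")

# Declare the whole vector as UTF-8 in one step.
Encoding(labels) <- "UTF-8"

# Hypothetical helper: mark character columns and factor levels
# of a data frame as UTF-8.
mark_utf8 <- function(df) {
  for (col in names(df)) {
    if (is.character(df[[col]])) {
      Encoding(df[[col]]) <- "UTF-8"
    } else if (is.factor(df[[col]])) {
      lev <- levels(df[[col]])
      Encoding(lev) <- "UTF-8"
      levels(df[[col]]) <- lev
    }
  }
  df
}

df <- data.frame(vehiculo = c("Cami\303\263n", "Coche"),
                 stringsAsFactors = FALSE)
df <- mark_utf8(df)

# Pure-ASCII strings such as "Coche" stay "unknown"; that is expected.
Encoding(df$vehiculo)
```

Alternatively, `read.spss()` in foreign has a `reencode` argument, so something like `read.spss("myfile.sav", reencode = "UTF-8")` may handle this at read time; whether it helps depends on how the file declares its own encoding.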

      

Here are the results of my call to sessionInfo(). You should post yours as well.


> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reshape2_1.4     hexbin_1.27.0    ggplot2_1.0.0    data.table_1.9.2 yaml_2.1.13     
[6] redshift_0.4     RJDBC_0.2-4      rJava_0.9-6      DBI_0.3.1       

loaded via a namespace (and not attached):
 [1] colorspace_1.2-4 digest_0.6.4     grid_3.1.1       gtable_0.1.2     labeling_0.2    
 [6] lattice_0.20-29  MASS_7.3-33      munsell_0.4.2    plyr_1.8.1       proto_0.3-10    
[11] Rcpp_0.11.2      scales_0.2.4     stringr_0.6.2    tools_3.1.1  

      

Update: It looks like you may not have a locale that supports UTF-8. Here are the locale settings for each category on my system. You can try using Sys.setlocale() and updating them one by one on your system (or just use LC_ALL if you don't feel the need to test each step). See ?Sys.setlocale for more information.

cat_str <- c("LC_COLLATE", "LC_CTYPE", "LC_MONETARY", "LC_NUMERIC",
             "LC_TIME", "LC_MESSAGES", "LC_PAPER", "LC_MEASUREMENT")
sapply(cat_str, Sys.getlocale)

#     LC_COLLATE       LC_CTYPE    LC_MONETARY     LC_NUMERIC        LC_TIME    LC_MESSAGES
#  "en_US.UTF-8"  "en_US.UTF-8"  "en_US.UTF-8"            "C"  "en_US.UTF-8"  "en_US.UTF-8"
#       LC_PAPER LC_MEASUREMENT
#             ""             ""
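
Changing a single category could look like the sketch below. The locale name "en_US.UTF-8" is an assumption taken from my output above; the names available depend on the operating system, and Sys.setlocale() returns an empty string (with a warning) when the request cannot be honored.

```r
# Remember the current setting so it can be restored later if needed.
old_ctype <- Sys.getlocale("LC_CTYPE")

# "en_US.UTF-8" is an assumed locale name; it may not exist on every system.
res <- suppressWarnings(Sys.setlocale("LC_CTYPE", "en_US.UTF-8"))

if (identical(res, "")) {
  message("Locale not available; LC_CTYPE was: ", old_ctype)
} else {
  message("LC_CTYPE is now: ", res)
}

# TRUE when the session's character encoding is UTF-8.
l10n_info()[["UTF-8"]]
```

If the last line reports FALSE, strings marked as UTF-8 may still be printed with escapes such as `<c3><b3>`, which would match what the question describes.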

      
