How to read a Windows-1252 file using Rcpp?

I want to specify the input encoding when reading a Windows-1252 file with Rcpp. I need this because I switch between Linux and Windows environments, while the files are consistently encoded in Windows-1252.

How can I adapt the following so that it works:

String readFile(std::string path) {
  std::ifstream t(path.c_str());
  if (!t.good()){
    std::string error_msg = "Failed to open file ";
    error_msg += "'" + path + "'";
    ::Rf_error(error_msg.c_str());
  }

  const std::locale& locale = std::locale("sv_SE.1252");
  t.imbue(locale); 
  std::stringstream ss;
  ss << t.rdbuf();
  return ss.str();
}

      

The above failed:

Error in eval(expr, envir, enclos) : 
  locale::facet::_S_create_c_locale name not valid

      

I've also tried "Swedish_Sweden.1252", which is the default on my system, to no avail. I tried #include <boost/locale.hpp> as well, but it doesn't seem to be available in Rcpp (v 0.12.0) / BH boost (v. 1.58.0-1).

Update:

After digging a little deeper into this, I'm not sure whether gcc (v. 4.6.3) in RTools (version 3.3) is built with locale support; an SO question points to this possibility. If any argument other than "" or "C" works with std::locale(), it would be interesting to know. I tried several alternatives but nothing works.
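One way to check which locale names a given toolchain accepts is to probe std::locale directly and catch the exception it throws for an unknown name. The following is only a minimal sketch: localeIsAvailable is a hypothetical helper name, not part of Rcpp or the standard library, and it relies solely on the documented behaviour that the std::locale(const char*) constructor throws std::runtime_error when the name is not valid.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: returns true if this C++ runtime can construct
// the named locale, false if the constructor rejects the name.
bool localeIsAvailable(const char* name) {
  try {
    std::locale loc(name);  // throws std::runtime_error for an invalid name
    (void)loc;              // silence unused-variable warnings
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

Calling it with "sv_SE.1252" or "Swedish_Sweden.1252" would presumably return false in the setup described here, which matches the error message shown earlier. If only "C" is accepted, that suggests the C++ runtime was built with minimal locale support.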

Backup solution

I'm not entirely happy with it, but using base::iconv() seems to fix any character problems regardless of the original format, thanks in large part to the from="WINDOWS-1252" argument, which forces the characters to be interpreted correctly. That is, if we want to stay in Rcpp we can simply do:

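#include <Rcpp.h>
#include <fstream>
#include <sstream>
#include <string>
using namespace Rcpp;

// [[Rcpp::export]]
StringVector readFile(std::string path) {
  std::ifstream t(path.c_str());
  if (!t.good()){
    // Rcpp::stop() throws a C++ exception, so destructors still run
    stop("Failed to open file '" + path + "'");
  }

  // No std::locale / imbue() here: constructing "sv_SE.1252" is what
  // raised the error above; the raw bytes are converted by iconv() below.
  std::stringstream ss;
  ss << t.rdbuf();
  StringVector ret = ss.str();

  // Look up base::iconv and call it from C++
  Environment base("package:base");
  Function iconv = base["iconv"];

  // "UTF-8" is the portable spelling of the target encoding name
  ret = iconv(ret, Named("from","WINDOWS-1252"), Named("to","UTF-8"));

  return ret;
}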

      

Note that it is preferable to wrap the function in R rather than fetch iconv from C++ and call it from there: this is both less code and about twice as fast (tested with microbenchmark):

readFileWrapper <- function(path){
   ret <- readFile(path)
   iconv(ret, from = "WINDOWS-1252", to = "UTF-8")
}
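If calling back into R is not wanted at all, the conversion can also be done in plain C++ without std::locale or iconv, because Windows-1252 is a single-byte encoding that differs from Latin-1 only in the range 0x80 to 0x9F. The sketch below is an illustration under stated assumptions, not a tested replacement for iconv: cp1252ToUtf8, appendUtf8 and kCp1252High are hypothetical names, and the five bytes that the code page leaves undefined are mapped to the C1 control characters with the same value.

```cpp
#include <string>

// Code points for bytes 0x80..0x9F in Windows-1252. The five undefined
// bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the C1 control with
// the same value; that choice is an assumption of this sketch.
static const unsigned int kCp1252High[32] = {
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Append one code point (never above U+FFFF here) to out as UTF-8.
static void appendUtf8(std::string& out, unsigned int cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical helper: convert a Windows-1252 byte string to UTF-8.
std::string cp1252ToUtf8(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (std::string::size_type i = 0; i < in.size(); ++i) {
    unsigned char b = static_cast<unsigned char>(in[i]);
    // Bytes 0x00..0x7F and 0xA0..0xFF equal their Unicode code points.
    unsigned int cp = (b >= 0x80 && b <= 0x9F) ? kCp1252High[b - 0x80] : b;
    appendUtf8(out, cp);
  }
  return out;
}
```

The result of readFile could be passed through such a function before being returned. On the R side the string would still need to be marked as UTF-8, for example with Encoding(x) <- "UTF-8", so that R displays it correctly on both platforms.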

      
