How to read a Windows-1252 file using Rcpp?

I want to force the input encoding when reading a Windows-1252 encoded file with Rcpp. I need this because I switch between Linux and Windows environments, while the files are consistently encoded in 1252.

How can I adapt this to work?

String readFile(std::string path) {
  std::ifstream t(path.c_str());
  if (!t.good()){
    std::string error_msg = "Failed to open file ";
    error_msg += "'" + path + "'";
    ::Rf_error(error_msg.c_str());
  }

  const std::locale& locale = std::locale("sv_SE.1252");
  t.imbue(locale); 
  std::stringstream ss;
  ss << t.rdbuf();
  return ss.str();
}

      

The above failed:

Error in eval(expr, envir, enclos) : 
  locale::facet::_S_create_c_locale name not valid

      

I've also tried using "Swedish_Sweden.1252", which is the default on my system, to no avail. I also tried #include <boost/locale.hpp>, but it doesn't seem to be available in Rcpp (v 0.12.0) / BH boost (v. 1.58.0-1).

Update:

After digging a little deeper into this, I'm not sure whether gcc (v. 4.6.3) in RTools (version 3.3) is built with locale support; this SO question points to that possibility. If any argument other than "" or "C" works with std::locale(), it would be interesting to know. I tried several alternatives, but nothing works.
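To check which locale names a given toolchain accepts, a small probe along these lines may help. It is only a sketch: localeAvailable is a hypothetical helper name, and it relies on the fact that the std::locale constructor throws std::runtime_error for a name it cannot resolve, which is where the "name not valid" message comes from.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if the C++ runtime can construct a
// locale with the given name, false if the constructor rejects it.
bool localeAvailable(const std::string& name) {
  try {
    // std::locale throws std::runtime_error for names it does not know.
    std::locale loc(name.c_str());
    (void)loc;  // only the construction matters here
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

Calling localeAvailable("sv_SE.1252") or localeAvailable("Swedish_Sweden.1252") before imbuing the stream would let the function fall back to another approach instead of aborting with an error.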

Backup solution

I'm not entirely happy with it, but using base::iconv() seems to fix any character problems regardless of the original format, thanks in large part to the argument from="WINDOWS-1252" forcing the characters to be interpreted correctly. If we want to stay in Rcpp, we can simply do:

String readFile(std::string path) {
  std::ifstream t(path.c_str());
  if (!t.good()){
    std::string error_msg = "Failed to open file ";
    error_msg += "'" + path + "'";
    ::Rf_error(error_msg.c_str());
  }

  // No locale is imbued here: the bytes are read as-is and iconv() below
  // reinterprets them as Windows-1252.
  std::stringstream ss;
  ss << t.rdbuf();
  Rcpp::StringVector ret = ss.str();

  Environment base("package:base");
  Function iconv = base["iconv"];

  ret = iconv(ret, Named("from","WINDOWS-1252"),Named("to","UTF8"));

  return ret;
}

      

Note that it is preferable to wrap the function in R rather than fetch iconv from C++ and call it from there; this is both less code and about 2x faster (tested with microbenchmark):

readFileWrapper <- function(path){
   ret <- readFile(path)
   iconv(ret, from = "WINDOWS-1252", to = "UTF8")
}
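If calling back into R is undesirable, the conversion can also be done by hand in plain C++ without any locale support. This is a different technique from iconv(): Windows-1252 matches Unicode code points everywhere except the bytes 0x80..0x9F, so a 32-entry table is enough. The sketch below makes a few assumptions: the names cp1252ToUtf8, appendUtf8 and kCp1252High are hypothetical, and the five bytes that the code page leaves undefined are passed through as the C1 control with the same value.

```cpp
#include <string>

// Code points for the Windows-1252 bytes 0x80..0x9F. The five undefined
// bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the C1 control with
// the same value; that choice is an assumption of this sketch.
static const unsigned int kCp1252High[32] = {
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Appends the UTF-8 encoding of a code point below U+10000 to out.
static void appendUtf8(unsigned int cp, std::string& out) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical helper: converts a string of Windows-1252 bytes to UTF-8.
std::string cp1252ToUtf8(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (std::string::size_type i = 0; i < in.size(); ++i) {
    const unsigned char b = static_cast<unsigned char>(in[i]);
    // 0x00..0x7F and 0xA0..0xFF have the same value in Unicode.
    const unsigned int cp =
        (b >= 0x80 && b <= 0x9F) ? kCp1252High[b - 0x80] : b;
    appendUtf8(cp, out);
  }
  return out;
}
```

With this, readFile could return cp1252ToUtf8(ss.str()). The result would still need to be marked as UTF-8 on the R side, for example with Encoding(x) <- "UTF-8", and opening the stream with std::ios::binary avoids any newline translation on Windows.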

      
