Removing a byte order mark in R / C

This SO post has an example of a server that generates JSON with a byte order mark. RFC 7159 says:

Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

Currently yajl, and hence jsonlite, chokes on a BOM in the input. I would like to follow the RFC's suggestion and ignore the BOM at the start of the UTF-8 string if present. What is an efficient way to do this? A naive implementation:

if(substr(json, 1, 1) == "\uFEFF"){
  json <- substring(json, 2)
}

However, substr is a little slow for large strings, and I'm not sure if this is the correct way to do it. Is there a more efficient way in R or C to remove the BOM if there is one?



2 answers

A simple solution:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::string stripBom(std::string x) {
   if (x.size() < 3)
      return x;

   if (x[0] == '\xEF' && x[1] == '\xBB' && x[2] == '\xBF')
      return x.substr(3);

   return x;
}

/*** R
x <- "\uFEFFabcdef"
identical(x, stripBom(x))
*/

This gives:

> x <- "\uFEFFabcdef"

> print(x)
[1] "abcdef"

> print(stripBom(x))
[1] "abcdef"

> identical(x, stripBom(x))
[1] FALSE

> utf8ToInt(x)
[1] 65279    97    98    99   100   101   102

> utf8ToInt(stripBom(x))
[1]  97  98  99 100 101 102
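
The two utf8ToInt results differ only in the leading 65279, which is the code point U+FEFF. As a worked check of why the code above compares against the bytes EF BB BF, the following sketch encodes a code point from the three-byte UTF-8 range by hand. The function name encode_utf8_3 is made up for this illustration and is not part of R or Rcpp.

```c
#include <assert.h>

/* Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes.
 * Hypothetical helper for illustration only; it does not reject
 * surrogate code points. */
void encode_utf8_3(unsigned int cp, unsigned char out[3]) {
    assert(cp >= 0x0800 && cp <= 0xFFFF);
    out[0] = (unsigned char)(0xE0 | (cp >> 12));          /* 1110xxxx: top 4 bits    */
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));  /* 10xxxxxx: middle 6 bits */
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));         /* 10xxxxxx: low 6 bits    */
}
```

Calling encode_utf8_3(0xFEFF, out) fills out with 0xEF, 0xBB, 0xBF, which is exactly the three-byte prefix that stripBom looks for.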


EDIT: It might also be useful to see how R does this internally. There are a number of situations where R strips the BOM (for example, in its scanners and file readers).
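
R's own source is not reproduced here. As a rough sketch of the general approach a file reader could take, the function below peeks at the first three bytes of a stream and rewinds if they are not a BOM. The name skip_bom_stream is hypothetical, and the sketch assumes a seekable FILE* (it would not work on a pipe).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Consume a leading UTF-8 BOM from a seekable stream, if there is one.
 * Returns 1 if a BOM was skipped, 0 otherwise.
 * Hypothetical helper; not taken from R's sources. */
int skip_bom_stream(FILE *fp) {
    unsigned char head[3];
    long start;
    size_t n;

    assert(fp != NULL);
    start = ftell(fp);
    n = fread(head, 1, sizeof head, fp);
    if (n == 3 && memcmp(head, "\xEF\xBB\xBF", 3) == 0)
        return 1;                 /* stream is now positioned after the BOM */

    clearerr(fp);                 /* a short read may have set the EOF flag */
    fseek(fp, start, SEEK_SET);   /* not a BOM: go back to where we started */
    return 0;
}
```

After the call, the stream is positioned at the first byte of real content in both cases, so the caller can read on without caring whether a BOM was present.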



Based on Kevin's Rcpp example, I used the following C function to check for the BOM:

SEXP R_parse(SEXP x) {
  /* get data from R */
  const char* json = translateCharUTF8(asChar(x));

  /* ignore BOM as suggested by RFC */
  if(json[0] == '\xEF' && json[1] == '\xBB' && json[2] == '\xBF'){
    warning("JSON string contains UTF8 byte-order-mark!");
    json = json + 3;
  }

  /* parse json */
  char errbuf[1024];
  yajl_val node = yajl_tree_parse(json, errbuf, sizeof(errbuf));

  /* ... rest of the function omitted ... */
}

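The pointer-offset idea can also be tried outside of R and yajl. Below is a minimal sketch, assuming the caller knows the buffer length rather than relying on a terminating NUL; skip_utf8_bom is a hypothetical name. Since it only advances a pointer, the cost does not depend on the size of the input and no copy is made, which addresses the concern about substr on large strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer past a leading UTF-8 BOM, if present, and shrink *len
 * to match. The buffer itself is neither copied nor modified.
 * Hypothetical helper for illustration. */
const char *skip_utf8_bom(const char *buf, size_t *len) {
    assert(buf != NULL && len != NULL);
    if (*len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) {
        *len -= 3;
        return buf + 3;
    }
    return buf;   /* no BOM, or input shorter than three bytes */
}
```

The explicit length check means inputs shorter than three bytes are never read past their end, so the function is also safe for buffers that are not NUL-terminated.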


