UTF-8 to UTF-32 on iterators using the STL

I have a char iterator, a std::istreambuf_iterator<char> wrapped in a couple of adaptors, that yields UTF-8 bytes. I want to read a single UTF-32 character (a char32_t) from it. Can I do this with the STL? How?

There is std::codecvt_utf8<char32_t>, but it apparently only works on char*, not on iterators.

Here's a simplified version of my code:

#include <iostream>
#include <sstream>
#include <iterator>

// in the real code some boost adaptors etc. are involved
// but the important point is: we're dealing with a char iterator.
typedef std::istreambuf_iterator< char > iterator;

char32_t read_code_point( iterator& it, const iterator& end )
{
    // how do I do this conversion?
    // codecvt_utf8<char32_t>::in() only works on char*
    return U'\0';
}

int main()
{
    // actual code uses std::istream so it works on strings, files etc.
    // but that's irrelevant for the question
    std::stringstream stream( u8"\u00FF" );
    iterator it( stream );
    iterator end;
    char32_t c = read_code_point( it, end );
    std::cout << std::boolalpha << ( c == U'\u00FF' ) << std::endl;
    return 0;
}

      

I know Boost.Regex has an iterator for this, but I would like to avoid Boost libraries that are not header-only, and this feels like something the STL should be capable of.



1 answer


I don't think you can do it directly with codecvt_utf8 or any other standard library component. To use codecvt_utf8, you will need to copy bytes from the iterator stream into a buffer and convert the buffer.

Something like this should work:

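#include <codecvt>
#include <cwchar>
#include <iterator>
#include <locale>
#include <stdexcept>

// "iterator" is the typedef from the question
char32_t read_code_point( iterator& it, const iterator& end )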
{
  char32_t result;
  char32_t* resend = &result + 1;
  char32_t* resnext = &result;
  char buf[7];  // room for 3-byte UTF-8 BOM and a 4-byte UTF-8 character
  char* bufpos = buf;
  const char* const bufend = std::end(buf);
  std::codecvt_utf8<char32_t> cvt;
  while (bufpos != bufend && it != end)
  {
    *bufpos++ = *it++;
    std::mbstate_t st{};
    const char* be = bufpos;
    const char* bn = buf;
    auto conv = cvt.in(st, buf, be, bn, &result, resend, resnext);
    if (conv == std::codecvt_base::error)
      throw std::runtime_error("Invalid UTF-8 sequence");
    if (conv == std::codecvt_base::ok && bn == be)
      return result;
    // otherwise read another byte and try again
  }
  if (it == end)
    throw std::runtime_error("Incomplete UTF-8 sequence");
  throw std::runtime_error("No character read from first seven bytes");
}

      

This appears to do more work than necessary, re-scanning the whole UTF-8 sequence in [buf, bufpos) on every iteration (and making a virtual function call to codecvt_utf8::do_in each time). In theory, an implementation of codecvt_utf8::in could read an incomplete multibyte sequence and store state information in the mbstate_t argument, so that the next call would resume where the previous one stopped, consuming only the new bytes rather than reprocessing the incomplete multibyte sequence it had already seen.



However, implementations are not required to use the mbstate_t argument to store state between calls, and in practice at least one implementation of codecvt_utf8::in (the one I wrote for GCC) does not use it at all. From my experiments, the libc++ implementation doesn't seem to use it either. This means they stop converting before an incomplete multibyte sequence and leave the from_next pointer (the bn argument here) pointing at the start of that incomplete sequence, so the next call must start from that position and (hopefully) supply enough additional bytes to complete the sequence, allowing a full Unicode character to be read and converted to char32_t. Since you are only trying to read a single code point, that means nothing gets converted at all, because stopping before an incomplete multibyte sequence means stopping at the first byte.
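To see this stopping behavior for yourself, a small experiment along the following lines may help. It is only a sketch: conversion_report and try_convert are hypothetical names introduced here, and what it reports for an incomplete sequence depends on the standard library in use.

```cpp
#include <cassert>
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical helper type (not a standard facility): records what a
// single call to codecvt_utf8::in() did with the bytes it was given.
struct conversion_report
{
    std::codecvt_base::result status;
    std::size_t consumed;   // input bytes that from_next was moved past
    std::size_t produced;   // number of char32_t values written
    char32_t value;         // the converted code point, if produced == 1
};

// Hypothetical helper: tries to convert [first, last) into one char32_t.
conversion_report try_convert(const char* first, const char* last)
{
    std::codecvt_utf8<char32_t> cvt;
    std::mbstate_t st{};
    char32_t out = 0;
    const char* from_next = first;
    char32_t* to_next = &out;
    const auto status = cvt.in(st, first, last, from_next,
                               &out, &out + 1, to_next);
    // in() must leave from_next somewhere inside the input range.
    assert(from_next >= first && from_next <= last);
    return { status,
             static_cast<std::size_t>(from_next - first),
             static_cast<std::size_t>(to_next - &out),
             out };
}
```

Passing both bytes of "\xC3\xBF" should report one code point produced. Passing only the first byte should report none, and on implementations that ignore mbstate_t you would expect consumed to be 0 as well, which is the case the function above works around by retrying with one more byte.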

It is possible that some implementations do use the mbstate_t argument, so you could modify the function above to handle that case, but to be portable it would still have to cope with implementations that ignore mbstate_t. Supporting both kinds of implementation would complicate the function considerably, so I kept it simple and wrote a form that should work with all implementations, even those that do use the mbstate_t. Since you're only going to read up to 7 bytes at a time (in the worst case; the average case might be only one or two bytes, depending on the input text), the cost of re-scanning the first few bytes each time shouldn't be huge.

To get better performance from codecvt_utf8, you should avoid converting one code point at a time, since it is designed to convert arrays of characters, not individual ones. Since you always need to copy into a char buffer anyway, you could copy larger chunks from the input iterator sequence and convert whole chunks. This would reduce the likelihood of seeing incomplete multibyte sequences: only the last 1-3 bytes at the end of a chunk would need to be reprocessed if the chunk ends in an incomplete sequence, and everything earlier in the chunk would already have been converted.
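As a rough sketch of that chunked approach, assuming it is acceptable to convert the whole input up front: read_all_code_points and the 256-byte chunk size are illustrative choices made here, not anything from the standard library. The mbstate_t is kept alive across calls so that an implementation which does store state in it could resume, while any bytes left unconverted are carried over to the front of the next chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts all of [it, end) from UTF-8 to UTF-32,
// handing codecvt_utf8 one chunk of bytes at a time.
template <typename InputIt>
std::u32string read_all_code_points(InputIt it, InputIt end)
{
    std::codecvt_utf8<char32_t> cvt;
    std::mbstate_t st{};              // shared by all calls to in()
    std::u32string out;
    char buf[256];
    char32_t converted[sizeof buf];   // at most one code point per input byte
    std::size_t pending = 0;          // unconverted bytes carried over in buf

    while (it != end)
    {
        // Top up the buffer behind any bytes carried over from last time.
        std::size_t len = pending;
        for (; len != sizeof buf && it != end; ++it)
            buf[len++] = *it;

        const char* const data_end = buf + len;
        const char* from_next = buf;
        char32_t* to_next = converted;
        const auto res = cvt.in(st, buf, data_end, from_next,
                                converted, converted + sizeof buf, to_next);
        if (res == std::codecvt_base::error)
            throw std::runtime_error("Invalid UTF-8 sequence");
        assert(from_next >= buf && from_next <= data_end);
        out.append(converted, static_cast<std::size_t>(to_next - converted));

        // Whatever was not consumed is the start of an incomplete sequence.
        pending = static_cast<std::size_t>(data_end - from_next);
        if (it == end && (pending != 0 || res == std::codecvt_base::partial))
            throw std::runtime_error("Incomplete UTF-8 sequence");
        if (pending == len)
            throw std::runtime_error("Converter made no progress");
        std::copy(from_next, data_end, buf);   // move the tail to the front
    }
    return out;
}
```

The function takes its iterators by value, so it could be called with the iterator type from the question or with the begin() and end() of a std::string. A real reader would probably hand out code points one at a time from an internal buffer instead of returning everything at once.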

To get the best performance when reading individual code points, you should probably avoid codecvt_utf8 entirely and either roll your own (if you only need UTF-8 to UTF-32BE, it's not that hard) or use a third-party library such as ICU.
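For illustration, a hand-rolled decoder might look roughly like this. It is a minimal sketch, not a vetted implementation: decode_utf8 is a hypothetical name, it returns the code point as a native char32_t value, and it assumes you want overlong forms, surrogates and values above U+10FFFF rejected with an exception.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical helper, not part of the standard library: decodes one
// UTF-8 encoded code point from [it, end) and advances it past the
// bytes that were consumed.
template <typename InputIt>
char32_t decode_utf8(InputIt& it, const InputIt& end)
{
    if (it == end)
        throw std::runtime_error("No input");

    const unsigned char lead = static_cast<unsigned char>(*it);
    ++it;

    int extra;        // number of continuation bytes expected
    char32_t cp;      // payload bits collected so far
    char32_t lowest;  // smallest code point allowed for this length
    if (lead < 0x80)                { extra = 0; cp = lead;        lowest = 0; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; lowest = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; lowest = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; lowest = 0x10000; }
    else
        throw std::runtime_error("Invalid UTF-8 lead byte");

    for (int i = 0; i < extra; ++i)
    {
        if (it == end)
            throw std::runtime_error("Incomplete UTF-8 sequence");
        const unsigned char c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80)   // continuation bytes look like 10xxxxxx
            throw std::runtime_error("Invalid UTF-8 continuation byte");
        ++it;
        cp = (cp << 6) | (c & 0x3F);
    }

    // A four-byte sequence carries at most 21 payload bits.
    assert(cp <= 0x1FFFFF);

    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::runtime_error("Overlong or out-of-range UTF-8 sequence");
    return cp;
}
```

With the iterator typedef from the question, read_code_point( it, end ) could then simply return decode_utf8( it, end ). Because it reads each byte exactly once and never buffers, it avoids both the re-scanning and the virtual call discussed above.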
