How to decode ubyte [] to specified encoding?

Question

How to decode ubyte [] to specified encoding?

Problem: how do I parse a file when its encoding is only known at runtime?

The encoding can be UTF-8, UTF-16, Latin-1, or something else.

The goal is to convert a ubyte[] to a string using the chosen encoding, because std.stdio.File.byChunk and std.mmfile.MmFile give you the data as ubyte[].


d phobos

bioinfornatics 10 Mar 12 at 20:01


4 answers

Raxillan · Answer 1 · 2012-03-11T03:17:45+0000

Are you trying to convert a text file to UTF-8? If so, Phobos has a function for this: @trusted string toUTF8(in char[] s). See http://dlang.org/phobos/std_utf.html for details.

Sorry if this is not what you want.

bioinfornatics · Answer 2 · 2012-03-10T21:09:56+0000

I found a way, though using std.algorithm.reduce might be better:

import std.encoding;
import std.stdio;

void main()
{
    auto f = File("pathToAfFile.txt", "r");
    // The encoding name can be chosen at runtime.
    auto e = EncodingScheme.create("utf-8");
    const(ubyte)[] pending; // bytes not yet decoded, carried across chunks
    foreach (ubyte[] chunk; f.byChunk(4096))
    {
        pending ~= chunk;
        // Keep a short tail so a sequence split across two chunks is not
        // decoded early (assumes valid input, at most 4 bytes per sequence).
        while (pending.length >= 4)
            write(e.decode(pending)); // decode advances pending past one code point
    }
    // Flush whatever is left after the last chunk.
    while (pending.length > 0)
        write(e.decode(pending));
}
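To make the "encoding chosen at runtime" part explicit, the same idea can be wrapped in a small function. This is only a sketch: decodeBytes is a made-up name, and it assumes the whole input is already in memory, is validly encoded, and does not end in the middle of a sequence.

```d
import std.array : appender;
import std.encoding : EncodingScheme;

// Hypothetical helper (not part of Phobos): decodes `bytes` with the
// encoding scheme named at runtime and returns a UTF-8 string.
// Assumes the input is validly encoded and complete.
string decodeBytes(const(ubyte)[] bytes, string encodingName)
{
    // create throws an EncodingException if the name is not recognized.
    auto scheme = EncodingScheme.create(encodingName);
    auto result = appender!string();
    while (bytes.length > 0)
        result.put(scheme.decode(bytes)); // consumes one code point from the front
    return result.data;
}
```

For example, decodeBytes(data, "latin1") and decodeBytes(data, "utf-8") go through the same code path; only the name passed in changes.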

Vladimir Panteleev · Answer 3 · 2012-03-11T13:21:54+0000

D strings are already UTF-8. No transcoding required. You can use validate from std.utf to check if the file contains valid UTF-8. If you use readText from std.file, it will do the validation for you.
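If the bytes are already in memory as ubyte[] (from byChunk, for instance), the validate route could look roughly like this. utf8FromBytes is a hypothetical name, and the sketch assumes the buffer holds the complete text rather than a chunk that might end mid-sequence.

```d
import std.utf : UTFException, validate;

// Hypothetical helper: copies the bytes into a string and checks that they
// are well-formed UTF-8. Nothing is transcoded; the bytes are reinterpreted.
string utf8FromBytes(const(ubyte)[] bytes)
{
    auto text = cast(string) bytes.idup; // idup makes an immutable copy first
    validate(text);                      // throws UTFException if malformed
    return text;
}
```

This only covers the UTF-8 case; bytes in another encoding still have to be decoded.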

David Eagen · Answer 4 · 2012-09-25T12:21:59+0000

File.byChunk returns a range whose front is a ubyte[].

UTF-8 uses 1 to 4 bytes per code point (the original design allowed up to 6), so just make sure you always have at least 6 bytes buffered and you can use the std.encoding decoder to turn them into a dchar. Then you can use toUTF8 from std.utf to convert the result to a normal string instead of a dstring.

The convert function below will convert any range of unsigned-integer arrays to a string.

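import std.encoding, std.stdio, std.traits, std.utf;

void main()
{
    File input = File("test.txt");

    string data = convert(input.byChunk(512));

    writeln("Data: ", data);
}

// Accepts any range whose elements are arrays of an unsigned integer type.
string convert(R)(R chunkRange)
if (isArray!(typeof(chunkRange.front)) && isUnsigned!(typeof(chunkRange.front[0])))
{
    ubyte[] inbuffer;
    dchar[] outbuffer;

    while (inbuffer.length > 0 || !chunkRange.empty)
    {
        // Keep at least 6 bytes buffered so a sequence is never split
        // (valid UTF-8 needs at most 4).
        while ((inbuffer.length < 6) && !chunkRange.empty)
        {
            inbuffer ~= chunkRange.front;
            chunkRange.popFront();
        }

        // View the bytes as UTF-8 code units; decode consumes one code point.
        auto text = cast(const(char)[]) inbuffer;
        outbuffer ~= std.encoding.decode(text);
        inbuffer = inbuffer[$ - text.length .. $]; // drop the consumed bytes
    }

    return toUTF8(outbuffer); // Convert to string instead of dstring
}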
