Text encoding problems

I am having problems with text encoding. Parsing the website gives me the Data.Text string

"Project - Fran\195\167ois Dubois",

which I need to write to a file. So I am using Data.Text.Lazy.Encoding.encodeUtf8 to convert it to a ByteString. The problem is that this distorts the output:

"Project - Francois Dubois".

What am I missing here?

+3




2 answers


If you got Fran\195\167ois inside your Data.Text, then you already have a UTF-8 encoded François there.

This is unfortunate, because Data.Text[.Lazy] is supposed to hold Unicode text (UTF-16 code units internally), so the two values 195 and 167 are interpreted as the Unicode code points 195 and 167, which are 'Ã' and '§' respectively. If you then encode your text as UTF-8, they are converted to the byte sequences c3 83 ([195,131]) and c2 a7 ([194,167]) respectively.
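This double encoding can be checked directly; a minimal sketch using the text and bytestring packages (module and value names are mine, not from the question):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

main :: IO ()
main = do
  -- The mis-decoded text: code points 195 ('Ã') and 167 ('§')
  let mojibake = TL.pack "Fran\195\167ois"
  -- Encoding it again as UTF-8 yields the double-encoded bytes
  -- described above: c3 83 and c2 a7 in place of the single ç.
  print (BL.unpack (TLE.encodeUtf8 mojibake))
  -- [70,114,97,110,195,131,194,167,111,105,115]
```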

The most likely way to get into this situation is that the data received from the site was UTF-8 encoded, but was interpreted as ISO-8859-1 (Latin-1) encoded (or as another 8-bit encoding; ISO-8859-15 is also widespread).

The correct way to deal with it is to avoid the situation altogether (which may not be possible, unfortunately).

If your data source correctly declares its encoding (as a website should), honour that declaration and decode the data accordingly. If an incorrect encoding is declared, you are of course out of luck, and if no encoding is declared at all, you have to guess (the natural guess nowadays is UTF-8, at least for languages that use a variant of the Latin alphabet).
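A sketch of the "decode it correctly in the first place" approach, assuming the raw page body arrives as a lazy ByteString (the function name decodeBody is mine):

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Decode the raw bytes according to the declared charset before they
-- ever become Text; use TLE.decodeLatin1 instead for ISO-8859-1 pages.
decodeBody :: BL.ByteString -> TL.Text
decodeBody = TLE.decodeUtf8

main :: IO ()
main =
  -- The bytes 195,167 (c3 a7) decode to the single character ç.
  print (decodeBody (BL.pack [70,114,97,110,195,167,111,105,115]))
```

Note that TLE.decodeUtf8 throws an exception on malformed input; Data.Text.Lazy.Encoding also offers decodeUtf8' and decodeUtf8With for recoverable handling.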




If the situation cannot be avoided, the easiest ways to fix it after the fact are:

  • replacing occurrences of the offending sequence with the desired one before encoding:

    encodeUtf8 $ replace (pack "Fran\195\167ois") (pack "Fran\231ois") contents

  • assuming everything else is ASCII or unintentional UTF-8, interpreting the Text code units as bytes:

    Data.ByteString.Lazy.Char8.pack $ Data.Text.Lazy.unpack contents

The first is more efficient, but becomes inconvenient if there are many different target sequences (for example, caused by different accented letters). The latter only works under the stated assumption (no code units above 255 in the Text) and is rather inefficient for long texts.
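The second approach can be extended into a general round-trip repair, assuming every code unit in the damaged Text is below 256: reinterpret the code points as raw bytes, then decode those bytes as the UTF-8 they originally were (the function name repairUtf8 is mine):

```haskell
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Char8.pack truncates each character to its low byte, which is exactly
-- what we want here: code point 195 becomes byte 195, and so on.
-- Decoding the resulting bytes as UTF-8 recovers the intended text.
repairUtf8 :: TL.Text -> TL.Text
repairUtf8 = TLE.decodeUtf8 . BLC.pack . TL.unpack

main :: IO ()
main = print (repairUtf8 (TL.pack "Project - Fran\195\167ois Dubois"))
```

The repaired Text can then be written out with encodeUtf8 as originally intended.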

+5




I'm not entirely sure whether less can display UTF-8 encoded characters correctly. GVim can. You can check this link on SO to see how to view UTF-8 data in gVim.

As for the other issue of getting this into the graph, I think you need to set the encoding on the command line as described in the graphviz FAQ on non-ASCII characters.



From what you explain, I think there is no problem with how the data is saved. If you pass the encoding to graphviz correctly, I think your problem will be solved.

PS: I made this an answer as it is easier to include descriptive links.

0








