Text encoding problems

I am having problems with text encoding. Parsing the website gives me the Data.Text string

"Project - Fran\195\167ois Dubois",

which I need to write to a file. So I am using Data.Text.Lazy.Encoding.encodeUtf8 to convert it to a ByteString. The problem is that this distorts the output:

"Project - Francois Dubois".

What am I missing here?

+3




2 answers


If you got Fran\195\167ois inside your Data.Text, then you already have a UTF-8 encoded François there.

This is unfortunate, because Data.Text[.Lazy] is supposed to hold Unicode text (UTF-16 code units internally), so the two values 195 and 167 are interpreted as the Unicode code points 195 and 167, which are 'Ã' and '§' respectively. If you then encode your text as UTF-8, they are converted to the byte sequences c3 83 ([195,131]) and c2 a7 ([194,167]) respectively.
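This double encoding can be checked directly; a minimal sketch using the text and bytestring packages (module and value names are mine, not from the question):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

main :: IO ()
main = do
  -- The mis-decoded text: code points 195 ('Ã') and 167 ('§')
  let mojibake = TL.pack "Fran\195\167ois"
  -- Encoding it again as UTF-8 yields the double-encoded bytes
  -- described above: c3 83 and c2 a7 in place of the single ç.
  print (BL.unpack (TLE.encodeUtf8 mojibake))
  -- [70,114,97,110,195,131,194,167,111,105,115]
```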

The most likely way to get into this situation is that the data received from the site was UTF-8 encoded, but was interpreted as ISO-8859-1 (Latin-1) encoded (or as another 8-bit encoding; ISO-8859-15 is also widespread).

The correct way to deal with it is to avoid the situation altogether (which may not be possible, unfortunately).

If your data source correctly declares its encoding (as a website should), honour that declaration and decode the data accordingly. If an incorrect encoding is declared, you are of course out of luck, and if no encoding is declared at all, you have to guess (the natural guess nowadays is UTF-8, at least for languages that use a variant of the Latin alphabet).
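A sketch of the "decode it correctly in the first place" approach, assuming the raw page body arrives as a lazy ByteString (the function name decodeBody is mine):

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Decode the raw bytes according to the declared charset before they
-- ever become Text; use TLE.decodeLatin1 instead for ISO-8859-1 pages.
decodeBody :: BL.ByteString -> TL.Text
decodeBody = TLE.decodeUtf8

main :: IO ()
main =
  -- The bytes 195,167 (c3 a7) decode to the single character ç.
  print (decodeBody (BL.pack [70,114,97,110,195,167,111,105,115]))
```

Note that TLE.decodeUtf8 throws an exception on malformed input; Data.Text.Lazy.Encoding also offers decodeUtf8' and decodeUtf8With for recoverable handling.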




If the situation cannot be avoided, the easiest ways to fix it after the fact are:

  • replacing occurrences of the offending sequence with the desired one before encoding:

    encodeUtf8 $ replace (pack "Fran\195\167ois") (pack "Fran\231ois") contents

  • assuming everything else is ASCII or unintentional UTF-8, interpreting the Text code units as bytes:

    Data.ByteString.Lazy.Char8.pack $ Data.Text.Lazy.unpack contents

The first is more efficient, but becomes inconvenient if there are many different target sequences (for example, caused by different accented letters). The latter only works under the stated assumption (no code units above 255 in the Text) and is rather inefficient for long texts.
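The second approach can be extended into a general round-trip repair, assuming every code unit in the damaged Text is below 256: reinterpret the code points as raw bytes, then decode those bytes as the UTF-8 they originally were (the function name repairUtf8 is mine):

```haskell
import qualified Data.ByteString.Lazy.Char8 as BLC
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Char8.pack truncates each character to its low byte, which is exactly
-- what we want here: code point 195 becomes byte 195, and so on.
-- Decoding the resulting bytes as UTF-8 recovers the intended text.
repairUtf8 :: TL.Text -> TL.Text
repairUtf8 = TLE.decodeUtf8 . BLC.pack . TL.unpack

main :: IO ()
main = print (repairUtf8 (TL.pack "Project - Fran\195\167ois Dubois"))
```

The repaired Text can then be written out with encodeUtf8 as originally intended.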

+5




I'm not entirely sure whether less can display UTF-8 encoded characters correctly. GVim can. You can check this link on SO to see how to view UTF-8 data in gVim.

As for the other issue of getting this into the graph, I think you need to set the encoding on the command line as described in the graphviz FAQ on non-ASCII characters.



From what you explain, I think there is no problem with how the data is saved. If you pass the encoding to graphviz correctly, I think your problem will be solved.

PS: I made this an answer as it is easier to include descriptive links.

0








