Detecting a file's charset dynamically in C++

I am trying to read a file that may be in any charset / code page, but I don't know which locale to set in order to read the file correctly.

Below is a code snippet in which I read a file encoded as windows-1256, but I want to detect the encoding dynamically from the file being read so that I can set the locale accordingly.

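std::wifstream input{ filename.c_str() };
// Imbue the locale before reading, otherwise it has no effect on the data already read.
// ".1256" is a Windows (MSVC) locale name for code page 1256.
input.imbue(std::locale(".1256"));
std::wstring content{ std::istreambuf_iterator<wchar_t>(input), std::istreambuf_iterator<wchar_t>() };
contents = ws2s(content); // Convert wstring to CString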

      



2 answers


In general, this cannot be done accurately using just the contents of a text file. You usually have to rely on some external information. For example, if the file was downloaded over HTTP, the encoding should be given in a response header.

Some file formats specify the encoding inside the file itself. XML, for example: <?xml version="1.0" encoding="XXX"?>.
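As a rough illustration of reading such an in-file declaration, here is a minimal sketch. The function name xml_declared_encoding is hypothetical, and it assumes the file is in an ASCII-compatible encoding (otherwise the declaration itself cannot be read as plain bytes) and that there is no whitespace around the = sign, which real XML permits.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the value of the encoding pseudo-attribute
// in a leading XML declaration, or an empty string if there is none.
// Assumption: no whitespace around '=' and an ASCII-compatible encoding.
std::string xml_declared_encoding(const std::string& text) {
    // The declaration must be the very first thing in the file.
    if (text.compare(0, 5, "<?xml") != 0) return "";
    const std::size_t decl_end = text.find("?>");
    if (decl_end == std::string::npos) return "";

    // Only accept "encoding=" if it appears inside the declaration.
    const std::string key = "encoding=";
    const std::size_t key_pos = text.find(key);
    if (key_pos == std::string::npos || key_pos > decl_end) return "";

    // The value is delimited by matching single or double quotes.
    const std::size_t quote_pos = key_pos + key.size();
    if (quote_pos >= decl_end) return "";
    const char quote = text[quote_pos];
    if (quote != '"' && quote != '\'') return "";
    const std::size_t value_end = text.find(quote, quote_pos + 1);
    if (value_end == std::string::npos || value_end > decl_end) return "";

    return text.substr(quote_pos + 1, value_end - quote_pos - 1);
}
```

An empty result means the file does not say which encoding it uses, so some other source of information is still needed.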

Unicode encodings can be detected if the file starts with a byte order mark (BOM), but the BOM is optional.
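A BOM check could look roughly like the sketch below. The function name detect_bom and the returned labels are made up for illustration; the byte sequences are the standard BOMs for UTF-8, UTF-16 and UTF-32. Note one ambiguity: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the encoding implied by a leading byte
// order mark, or an empty string if the data does not start with one.
std::string detect_bom(const std::string& bytes) {
    auto starts_with = [&](const char* sig, std::size_t n) {
        return bytes.size() >= n && bytes.compare(0, n, sig, n) == 0;
    };
    // Check the 4-byte marks first: the UTF-32LE BOM begins with the
    // same two bytes as the UTF-16LE BOM.
    if (starts_with("\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (starts_with("\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (starts_with("\xEF\xBB\xBF", 3))     return "UTF-8";
    if (starts_with("\xFF\xFE", 2))         return "UTF-16LE";
    if (starts_with("\xFE\xFF", 2))         return "UTF-16BE";
    return ""; // No BOM: the encoding is still unknown.
}
```

Since the BOM is optional, an empty result says nothing about the encoding; in particular, a windows-1256 file never has one.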



You can usually guess that the encoding is wide-character if the file contains a zero byte, which would represent the string terminator in a narrow-character encoding, before the end of the file. Likewise, if you find two consecutive zero bytes aligned on a 2-byte boundary (before the end), then the encoding is probably 4 bytes wide.

Alternatively, you can try to guess the encoding based on the frequency of certain characters. This can have some unintended consequences.
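The zero-byte heuristic can be sketched as follows. The function name guess_code_unit_width is hypothetical, and for simplicity the sketch inspects every byte, including the last one. It only guesses the width of a code unit; it does not tell you which encoding of that width is in use, and binary files will fool it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: guesses the code-unit width in bytes (1, 2 or 4)
// from where zero bytes occur. This is a heuristic, not a proof.
int guess_code_unit_width(const std::string& bytes) {
    bool any_zero = false;
    bool aligned_zero_pair = false;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] != '\0') continue;
        any_zero = true;
        // Two consecutive zero bytes starting on a 2-byte boundary
        // suggest 4-byte code units (e.g. UTF-32).
        if (i % 2 == 0 && i + 1 < bytes.size() && bytes[i + 1] == '\0')
            aligned_zero_pair = true;
    }
    if (aligned_zero_pair) return 4;
    if (any_zero) return 2; // Likely 2-byte code units (e.g. UTF-16).
    return 1;               // No zero bytes: probably a narrow encoding.
}
```

A result of 1 still leaves every narrow encoding (UTF-8, windows-1256, and so on) as a candidate.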



Let me be blunt and say you can't.

Let me elaborate: a file is just a bunch of 0s and 1s stored on your disk. An encoding is the way to interpret these 0s and 1s. You have to provide the information on how to interpret them, namely by specifying the encoding.

A typical way of doing this is to write a header to indicate the encoding.

This is an HTML header:



<head>
  <title>Page Title</title>
  <meta charset="UTF-8">
</head>

      

As you can see, the encoding must be specified one way or another.

From time to time you will see applications that guess the encoding. They usually do this with heuristics on the byte distribution, but this is unreliable and often produces gibberish.

As a side note, try to use UTF-8 everywhere; the other encodings are messy, to say the least.
