How can I figure out which code page I'm looking at?

Question

I have a device with some documentation on how to send it text. It uses 0x00-0x7F to send special characters such as accented characters, euro signs, ...

I assume they copied an existing code page and made some changes, but I have no idea how to determine which code page is closest to the one in my documentation.

In theory, this should be easy to do. For example, they map Á to 0x41, so if I could find a way to go through all the code pages and find the ones that have that character at that position, that would be a piece of cake.

However, all I can find on the internet are links to code page dumps like the ones I'm already looking at, or software that uses heuristics to read text and guess the most likely code page. Surely someone out there has made it possible to look up which code page one is looking at?

unicode codepages

Thomas vander stichele 06 jan. '09 at 9:08

5 answers

Alan moore · Answer 1 · 2009-01-06T14:14:21+0000

If it uses 0x00 to 0x7F for "special" characters, how does it encode regular ASCII characters?

In most character encodings, Á has the code 193 (0xC1). If you subtract 128 from that, you get 65 (0x41). Perhaps your "code page" is just the upper half of one of the standard encodings such as ISO-8859-1 or windows-1252, with the most significant bit set to zero instead of one (i.e., 128 subtracted from each code).

If that's the case, I would expect to find a flag that you can set to tell the device whether to interpret the next chunk of code points using the "upper" or "lower" half of the encoding. I don't know of any system that uses this scheme, but it's the most reasonable explanation I can think of for this situation.
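The high-bit hypothesis above can be checked mechanically. The following is a minimal sketch, not a definitive tool: it assumes Python 3.9+ and uses the codec names listed in the standard library's `encodings.aliases` table as the set of candidate code pages. The helper name `codecs_with_char_at` is made up for this illustration.

```python
import encodings.aliases


def codecs_with_char_at(char: str, byte_value: int) -> list[str]:
    """Return the Python codec names that decode the single byte
    `byte_value` to exactly `char` (hypothetical helper)."""
    matches = []
    # The alias table maps many names to a smaller set of codec modules.
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            decoded = bytes([byte_value]).decode(name)
        except (UnicodeError, LookupError):
            # Not a text encoding, not available on this platform,
            # or the byte is invalid or incomplete in this codec.
            continue
        if decoded == char:
            matches.append(name)
    return matches


# The device documents Á at 0x41; under the hypothesis, the standard
# code page would have it at 0x41 | 0x80 == 0xC1.
candidates = codecs_with_char_at("Á", 0x41 | 0x80)
print(candidates)
```

The result should include `latin_1` and `cp1252` among others, so a single character only narrows the field. Calling the same helper with `0x41` unchanged tests the alternative reading that Á really sits in the lower half of some standard code page.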

Bombe · Answer 2 · 2009-01-06T09:14:40+0000

There is no way to automatically detect the code page without additional information. Below the display layer it's just bytes, and all bytes are created equal. There's no way to say "I'm a 0x41 from this and this code page"; there's only "I'm 0x41. Show me!"

Statement · Answer 3 · 2009-01-06T14:19:14+0000

What endianness does the system use? Perhaps the bit order is being flipped?

Osama Al-Maadeed · Answer 4 · 2009-01-06T10:03:09+0000

In most code pages, 0x41 is just a normal "A"; I don't think any of the standard code pages have an "Á" in that position. Either there is a control character somewhere before the A that adds the accent, or the device uses a non-standard code page.

I see no point in using the "closest code page"; you just need to use the documentation you received with the device.

Your last sentence is puzzling: what do you mean by "searchable which code page on the page"?

If you include your entire codepage, the people here on SO might be more helpful and give you more information on this issue, since the single data point 0x41 = Á doesn't help much.
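To illustrate why more data points help, here is a rough sketch that ranks Python's built-in codecs by how many entries of a device table they reproduce. Only the `0x41` = Á entry comes from the question; the other two entries are hypothetical placeholders standing in for the real documentation. The `high_bit` switch, which sets the top bit before comparing, is also an assumption, and `rank_codecs` is a made-up name.

```python
import encodings.aliases

# Device code -> documented character. Only 0x41 is from the question;
# the other two rows are invented for illustration.
device_table = {
    0x41: "Á",
    0x49: "É",  # hypothetical entry
    0x24: "€",  # hypothetical entry
}


def rank_codecs(table: dict[int, str], high_bit: bool = False) -> list[tuple[str, int]]:
    """Return (codec name, number of matching entries), best match first."""
    scores = []
    for name in sorted(set(encodings.aliases.aliases.values())):
        hits = 0
        for code, char in table.items():
            byte_value = code | 0x80 if high_bit else code
            try:
                if bytes([byte_value]).decode(name) == char:
                    hits += 1
            except (UnicodeError, LookupError):
                # Unusable codec or undecodable byte: counts as a miss.
                pass
        if hits:
            scores.append((name, hits))
    # Sort by hit count, highest first; ties stay in alphabetical order.
    return sorted(scores, key=lambda item: -item[1])


ranking = rank_codecs(device_table, high_bit=True)
for name, hits in ranking[:5]:
    print(f"{name}: {hits}/{len(device_table)} entries match")
```

With these placeholder rows, `iso8859_15` matches all three entries while `latin_1` matches only two, because the two encodings differ at 0xA4. A complete table from the device documentation would separate the candidates much more sharply.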

user18015 · Answer 5 · 2009-01-14T15:35:39+0000

A bit of a random idea, but if you can get a significant amount of text out of the device, you could try running it through something like the detect function at http://chardet.feedparser.org/ .
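A minimal sketch of that idea, assuming the third-party `chardet` package is installed (it is not part of the standard library). `chardet.detect` takes a bytes object and returns a dictionary that includes an `encoding` guess and a `confidence` value. The wrapper name `guess_encoding` and the sample text are made up for this illustration.

```python
from typing import Optional

try:
    import chardet  # third-party; may not be installed
except ImportError:
    chardet = None


def guess_encoding(raw: bytes) -> Optional[dict]:
    """Return chardet's guess for `raw`, or None if chardet is missing."""
    if chardet is None:
        return None
    return chardet.detect(raw)


# Stand-in for text captured from the device (hypothetical sample).
sample = "Ceci est un texte accentué: é à ç ù. ".encode("latin_1") * 20
guess = guess_encoding(sample)
if guess is None:
    print("chardet is not installed")
else:
    print(guess["encoding"], guess["confidence"])
```

Detectors like this are statistical and only know standard encodings, so the guess may be unreliable if the device really uses a remapped 7-bit scheme.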
