Good resources for learning different types of character encoding and converting between them

One thing I have never really understood is character encoding: how text is actually represented in memory. Encoding-related code often puzzles me to the point that I just copy an example from the Internet without understanding what it does. I feel this is a very important and very overlooked subject that more people should take the time to get right (myself included).

I'm looking for some good resources for learning about the different types of encoding and how to convert between them (preferably in C#). Both books and online resources are welcome.

Thanks.


Edit 1:

Thanks for the answers so far. I'm especially looking for more information on how .NET handles encoding. I know this may sound vague, but I'm not sure exactly what to ask. I guess I'm curious how encoding is represented in the C# string class, and whether that class itself can manage different types of encoding or whether there are separate classes for that.

+2




3 answers


I would start with this question: what is a character?

  • The logical identity: the code point. Unicode assigns a number (the code point) to each character; it is not necessarily tied to any particular bit/byte form. Encodings (such as UTF-8) define the mapping to byte values.
  • The bits and bytes: the encoded form. One or more bytes per code point, with values determined by the encoding used.
  • The thing you see on the screen: the grapheme. A grapheme is created from one or more code points. This is the stuff at the presentation end of things.

This code converts in.txt from windows-1252 to UTF-8 and saves it as out.txt.

using System;
using System.IO;
using System.Text;
public class Enc {
  public static void Main(String[] args) {
    Encoding win1252 = Encoding.GetEncoding(1252);
    Encoding utf8 = Encoding.UTF8;
    // The reader decodes windows-1252 bytes into UTF-16 chars;
    // the writer encodes those chars back out as UTF-8.
    using(StreamReader reader = new StreamReader("in.txt", win1252)) {
      using(StreamWriter writer = new StreamWriter("out.txt", false, utf8)) {
        char[] buffer = new char[1024];
        while(reader.Peek() >= 0) {   // Peek() returns -1 at end of stream
          int r = reader.Read(buffer, 0, buffer.Length);
          writer.Write(buffer, 0, r);
        }
      }
    }
  }
}

There are two transformations taking place here. First, the bytes are decoded from windows-1252 into UTF-16 (the native form of .NET strings and chars) in the char buffer. The buffer is then encoded to UTF-8 as it is written out.
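If the file is small enough to read in one go, the same two-step conversion can be sketched more compactly with the File.ReadAllText/File.WriteAllText overloads that take an Encoding (the class name here is just illustrative):

using System.IO;
using System.Text;

public class EncWholeFile {
  public static void Main() {
    Encoding win1252 = Encoding.GetEncoding(1252);

    // Decode the whole file from windows-1252 into a UTF-16 string...
    string text = File.ReadAllText("in.txt", win1252);

    // ...then encode that string back out as UTF-8.
    File.WriteAllText("out.txt", text, Encoding.UTF8);
  }
}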

Code points

Some examples of code points:

  • U+0041 is LATIN CAPITAL LETTER A (A)
  • U+00A3 is POUND SIGN (£)
  • U+042F is CYRILLIC CAPITAL LETTER YA (Я)
  • U+1D50A is MATHEMATICAL FRAKTUR CAPITAL G (𝔊)

Encodings

Anywhere you work with characters, they will be encoded in some form. C# uses UTF-16 for its char type, which it defines as 16 bits wide.
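As a rough illustration (not part of the original answer), a code point outside the Basic Multilingual Plane does not fit into a single 16-bit char and is stored as a surrogate pair:

using System;

public class CharWidth {
  public static void Main() {
    // U+1D50A lies outside the Basic Multilingual Plane, so it cannot fit in
    // one 16-bit char; C# stores it as a surrogate pair of two chars.
    string g = "\U0001D50A";   // MATHEMATICAL FRAKTUR CAPITAL G (𝔊)

    Console.WriteLine(g.Length);                                  // 2 (UTF-16 code units)
    Console.WriteLine(char.IsHighSurrogate(g[0]));                // True
    Console.WriteLine(char.ConvertToUtf32(g, 0).ToString("X"));   // 1D50A
  }
}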



You can think of an encoding as a table mapping between code points and byte representations.

CODEPOINT      UTF-16BE     UTF-8        WINDOWS-1252
U+0041 (A)     00 41        41           41
U+00A3 (£)     00 A3        C2 A3        A3
U+042F (Я)     04 2F        D0 AF        -
U+1D50A (𝔊)    D8 35 DD 0A  F0 9D 94 8A  -


The System.Text.Encoding class provides the types and methods for performing these conversions.
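For example, a row of the table above can be reproduced with Encoding.GetBytes; a minimal sketch (BitConverter.ToString just prints the bytes as hex, e.g. "C2-A3"):

using System;
using System.Text;

public class TableRow {
  public static void Main() {
    string pound = "\u00A3";  // POUND SIGN (£)

    // Note: Encoding.GetEncoding(1252) works out of the box on .NET Framework;
    // newer runtimes may need the System.Text.Encoding.CodePages package.
    Print("UTF-16BE    ", Encoding.BigEndianUnicode.GetBytes(pound));   // 00-A3
    Print("UTF-8       ", Encoding.UTF8.GetBytes(pound));               // C2-A3
    Print("windows-1252", Encoding.GetEncoding(1252).GetBytes(pound));  // A3
  }

  static void Print(string label, byte[] bytes) {
    Console.WriteLine(label + ": " + BitConverter.ToString(bytes));
  }
}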

Graphemes

The grapheme you see on the screen may be constructed from more than one code point. The character e-acute (é) can be represented with two code points: LATIN SMALL LETTER E U+0065 and COMBINING ACUTE ACCENT U+0301.

('é' is more usually represented by the single code point U+00E9. You can switch between them using normalization. Not all combining sequences have a single-character equivalent, though.)
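A short sketch of that difference, using the standard string.Normalize and StringInfo APIs:

using System;
using System.Globalization;
using System.Text;

public class Graphemes {
  public static void Main() {
    string composed   = "\u00E9";    // é as a single code point
    string decomposed = "e\u0301";   // e + COMBINING ACUTE ACCENT

    Console.WriteLine(composed == decomposed);   // False: different code point sequences
    Console.WriteLine(composed == decomposed.Normalize(NormalizationForm.FormC));  // True

    // Both forms display as a single grapheme (a "text element" in .NET terms).
    Console.WriteLine(new StringInfo(decomposed).LengthInTextElements);  // 1
  }
}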

Conclusions

  • When you encode a C# string to a particular encoding, you are converting from UTF-16 to that encoding.
  • Encoding can be a lossy conversion: most non-Unicode encodings can encode only a subset of the existing characters.
  • Since not all code points fit into a single C# char, the number of chars in a string can be greater than the number of code points, and the number of code points can be greater than the number of graphemes displayed.
  • The "length" of a string is context sensitive, so you need to know which meaning you want and use the appropriate algorithm; how this is handled is determined by the programming language you are using (see the sketch after this list).
  • Because many encodings use identical byte values for the basic Latin characters, it is easy to be lulled into a false sense that everything is ASCII.
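To illustrate the point about length, here is a rough sketch that counts the same string three ways; the code-point count is done by hand since there is no single built-in property for it:

using System;
using System.Globalization;

public class Lengths {
  public static void Main() {
    // "𝔊" (a surrogate pair) followed by a decomposed "é" (e + combining acute).
    string s = "\U0001D50A" + "e\u0301";

    int chars = s.Length;                                     // 4 UTF-16 code units

    int codePoints = 0;
    int i = 0;
    while (i < s.Length) {                                    // count code points by hand
      i += char.IsSurrogatePair(s, i) ? 2 : 1;
      codePoints++;
    }                                                         // 3 code points

    int graphemes = new StringInfo(s).LengthInTextElements;   // 2 graphemes

    Console.WriteLine(chars + " chars, " + codePoints + " code points, " + graphemes + " graphemes");
  }
}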

(This is a little longer than I intended, and probably more than you wanted, so I'll stop. I wrote an even longer, Java-oriented treatment of encoding here.)

+2




Wikipedia has a pretty good explanation of character encoding in general: http://en.wikipedia.org/wiki/Character_encoding .

If you are looking for details on UTF-8, which is one of the most popular character encodings, you should read the UTF-8 and Unicode FAQ .



And, as pointed out, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is a very good primer.

+2




There's Joel's famous article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)": http://www.joelonsoftware.com/articles/Unicode.html

Edit: While that article is more about text encodings, re-reading your question I think you may be more interested in things like HTML encoding and URL encoding, which are for escaping special characters that have special meaning in HTML or URLs (like < and > in HTML, or ? and = in URLs).
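If that is what you are after, here is a minimal sketch using WebUtility.HtmlEncode and Uri.EscapeDataString (available from .NET 4; earlier versions have System.Web.HttpUtility instead):

using System;
using System.Net;

public class Escaping {
  public static void Main() {
    // HTML encoding escapes characters that have markup meaning in HTML.
    Console.WriteLine(WebUtility.HtmlEncode("a < b & c > d"));
    // prints: a &lt; b &amp; c &gt; d

    // URL encoding escapes characters that are significant in URLs.
    Console.WriteLine(Uri.EscapeDataString("name=value&x=1"));
    // prints: name%3Dvalue%26x%3D1
  }
}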

+1








