UTF-8 encoding?

What is UTF-8 encoding, and why are more text files saved in this format than in others?

For example, I typed "A" in Notepad and saved it in UTF-8 format.

After that, the file size is 4 bytes. Why?

0




3 answers


This is almost certainly because whatever you used to save the file also wrote a byte order mark (BOM), which in UTF-8 is 0xEF 0xBB 0xBF.
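To see where the 4 bytes come from, here is a minimal Python sketch. It assumes the editor wrote the 3-byte sequence 0xEF 0xBB 0xBF followed by the single byte for "A"; Python's `utf-8-sig` codec is used here only as a stand-in for that behaviour, and the variable names are made up for illustration.

```python
import codecs

# "utf-8-sig" prepends the 3-byte sequence when encoding; plain "utf-8" does not.
with_bom = "A".encode("utf-8-sig")
without_bom = "A".encode("utf-8")

print(with_bom)          # b'\xef\xbb\xbfA'
print(len(with_bom))     # 4
print(len(without_bom))  # 1

# The first three bytes are exactly 0xEF 0xBB 0xBF.
starts_with_bom = with_bom.startswith(codecs.BOM_UTF8)
print(starts_with_bom)   # True
```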



As for UTF-8 itself, it is a Unicode encoding that uses more bytes for higher Unicode code points. Importantly, ASCII characters are stored as single bytes (the same bytes as in ASCII), so any ASCII file is also a UTF-8 file with the same text. This web page has more, as does Wikipedia.
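A short Python sketch of that variable-length behaviour. The sample characters and variable names are arbitrary choices for illustration, not taken from the answer.

```python
# One sample character from each UTF-8 length class (written as escapes).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each code point to the number of bytes UTF-8 needs for it.
lengths = {ord(ch): len(ch.encode("utf-8")) for ch in samples}
for code_point, n in lengths.items():
    print(f"U+{code_point:04X} -> {n} byte(s)")

# Pure ASCII text encodes to the same bytes in ASCII and in UTF-8.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```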

+6




Because a BOM (byte order mark) was inserted at the beginning of the file.

The BOM is the special character U+FEFF. At the start of a file it has no meaning other than as a way to detect the file's encoding. You can read about it here: http://unicode.org/faq/utf_bom.html#BOM



In the case of UTF-8, the BOM is encoded as \xEF\xBB\xBF, which adds 3 extra bytes. Notepad and other text editors look for a BOM to guess the file encoding. If the editor sees \xFF\xFE, it will assume the file is UCS-2 in little-endian format. \xFE\xFF means UCS-2 in big-endian format.
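The guessing step could be sketched roughly like this in Python. `sniff_bom` and the `BOMS` table are hypothetical names for illustration; this is not how Notepad is actually implemented. It uses Python's UTF-16 BOM constants for the two-byte marks and deliberately ignores UTF-32, whose little-endian BOM also begins with \xFF\xFE.

```python
import codecs

# Known byte order marks, paired with a human-readable label.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),          # b'\xef\xbb\xbf'
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),  # b'\xff\xfe'
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),  # b'\xfe\xff'
]

def sniff_bom(data: bytes):
    """Return a label for the BOM at the start of data, or None if there is none."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None

print(sniff_bom(b"\xef\xbb\xbfA"))  # UTF-8
print(sniff_bom(b"\xff\xfeA\x00"))  # UTF-16 LE
print(sniff_bom(b"\xfe\xff\x00A"))  # UTF-16 BE
print(sniff_bom(b"A"))              # None
```

A file with no BOM returns None, in which case an editor has to fall back on some other heuristic or a default encoding.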

+2




That is only because of the BOM (byte order mark). UTF-8 only uses more than one byte for characters whose numeric value is greater than 127 (non-ASCII).

Not all text editors do this. Notepad is infamous for it (a useless UTF-8 BOM).
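Since some editors write the BOM and others do not, a reader may want to tolerate both. One way to do that in Python, shown here as a sketch with made-up variable names, is to decode with `utf-8-sig`, which drops a leading BOM if present and otherwise behaves like plain UTF-8:

```python
from_notepad = b"\xef\xbb\xbfA"  # bytes as saved with a BOM
from_other = b"A"                # bytes as saved without one

# "utf-8-sig" strips the BOM when it is there and accepts input without it.
text_a = from_notepad.decode("utf-8-sig")
text_b = from_other.decode("utf-8-sig")
print(text_a == text_b == "A")   # True

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
raw = from_notepad.decode("utf-8")
print(repr(raw))                 # '\ufeffA'
```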

+2

