Definition of ISO-8859-1 versus US-ASCII encoding

I am trying to determine if I should use

PrintWriter pw = new PrintWriter(outputFilename, "ISO-8859-1");

      

or

PrintWriter pw = new PrintWriter(outputFilename, "US-ASCII");

      

I was reading "All About Character Sets" to work out the character set of an example file, because I have to create a file in the same encoding from Java code.

When my example file contains "European" letters (Norwegian: å ø æ), the following command tells me that the file encoding is "iso-8859-1":

file -bi example.txt

      

However, when I take a copy of the same example file and modify it to contain different data, without any Norwegian text (say I replace "Bjørn" with "Bjorn"), the same command tells me the file's encoding is "us-ascii":

file -bi example-no-european-letters.txt

      

What does it mean? Is ISO-8859-1 in practice the same as US-ASCII if there are no "European" characters in it?

Should I just use "ISO-8859-1" encoding and everything will be fine?



2 answers


If the file contains only 7-bit US-ASCII characters, it can be read as US-ASCII. That says nothing about which encoding was intended. It might just be a coincidence that there were no characters that require a different encoding.

ISO-8859-1 (and -15) is a common European encoding capable of encoding äöåéü and other characters. Its first 128 code points (0 to 127) are the same as in US-ASCII (as is often the case, for convenience).
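The overlap can be checked directly from Java. The following is a minimal sketch (the class and method names are only illustrative): it encodes the same string with both charsets and compares the resulting bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiVsLatin1 {
    // Returns true when the text encodes to identical bytes in both charsets.
    static boolean sameBytes(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(ascii, latin1);
    }

    public static void main(String[] args) {
        // Only 7-bit characters: both encodings produce the same bytes.
        System.out.println(sameBytes("Bjorn"));
        // The escape below is the letter o-with-stroke (U+00F8).
        // ISO-8859-1 stores it as the single byte 0xF8, while US-ASCII
        // cannot represent it, so getBytes substitutes '?'.
        System.out.println(sameBytes("Bj\u00f8rn"));
    }
}
```

The first call should report identical bytes and the second should not, which matches what the `file` command showed for the two example files.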



However, you can't just pick an encoding and assume "everything will be fine". The very common UTF-8 encoding is also a superset of US-ASCII, but it encodes characters such as äöå as two bytes each, whereas ISO-8859-1 uses one byte.
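To make the byte-count difference concrete, here is a small sketch (class and method names are hypothetical) that measures how many bytes "å" takes in each encoding, and what happens when UTF-8 bytes are wrongly decoded as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLengths {
    // Number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // Encodes as UTF-8, then (wrongly) decodes the bytes as ISO-8859-1.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String aRing = "\u00e5"; // the letter a-with-ring (U+00E5)
        System.out.println(encodedLength(aRing, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(encodedLength(aRing, StandardCharsets.UTF_8));      // 2
        // The two UTF-8 bytes 0xC3 0xA5 come back as two separate characters.
        System.out.println(misread(aRing).length()); // 2
    }
}
```

This is why a reader and a writer that disagree about the encoding produce garbled text for exactly the non-ASCII characters, while plain ASCII text passes through unharmed.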

TL;DR: Don't assume anything about encodings. Find out what was intended and used. If you can't find out, inspect the data to work out which encoding is correct (as you noted yourself, multiple encodings may appear to work, at least temporarily).
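If inspecting the data is the only option, one rough approach is to try strict decoding with a few candidate charsets. The sketch below is a heuristic under stated assumptions, not a reliable detector: the class name, method names, and the order of candidates are all choices made for this example. Any byte sequence is valid ISO-8859-1 as far as Java's decoder is concerned, so that case is only a fallback guess.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingGuess {
    // True when the bytes decode in the charset without any error.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Tries the narrowest candidate first; the last branch is only a guess.
    static String guess(byte[] data) {
        if (decodesCleanly(data, StandardCharsets.US_ASCII)) {
            return "US-ASCII";
        }
        if (decodesCleanly(data, StandardCharsets.UTF_8)) {
            return "UTF-8";
        }
        return "ISO-8859-1 or another 8-bit encoding";
    }

    public static void main(String[] args) {
        byte[] plain = "Bjorn".getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = "Bj\u00f8rn".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(plain));
        System.out.println(guess(latin1));
    }
}
```

Note that a "US-ASCII" result here only means that no byte above 127 was seen, which is the same caveat as above: it says nothing about the encoding the author intended.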



It depends on which characters the document uses. ASCII is a 7-bit encoding, and ISO-8859-1 is an 8-bit encoding that supports some additional characters. But basically, if you are going to reproduce a document from an input stream, I recommend ISO-8859-1 encoding. It will work for text files opened in programs such as Notepad and MS Word.



If you are using several different international characters, you need to check which encoding supports those particular characters, for example UTF-8.
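One way to run that check in Java is `CharsetEncoder.canEncode`. The following is a minimal sketch (the class and method names are illustrative) that tests whether a given text fits a candidate charset before it is written out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CanEncodeCheck {
    // True when every character of the text can be represented in the charset.
    static boolean fits(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String name = "Bj\u00f8rn"; // contains o-with-stroke (U+00F8)
        String euro = "\u20ac";     // the euro sign (U+20AC)
        System.out.println(fits(name, StandardCharsets.US_ASCII));   // false
        System.out.println(fits(name, StandardCharsets.ISO_8859_1)); // true
        System.out.println(fits(euro, StandardCharsets.ISO_8859_1)); // false
        System.out.println(fits(euro, StandardCharsets.UTF_8));      // true
    }
}
```

A check like this matters for the `PrintWriter` in the question, because a writer created with a charset name replaces characters it cannot encode with a substitute (typically `?`) rather than raising an error.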
