The special character '\u0098' is read as '\u02dc' using charCodeAt()

Question

I am generating test.js from Java, as shown below. test.js defines the d() function and calls it with the special character ˜ ('\u0098') as its argument.

The d() function should display the charCodeAt() value of this special character, which should be 152. However, it displays 732.

Note that characters 152 and 732 are both represented by the special character ˜, as shown below.

http://www.fileformat.info/info/unicode/char/098/index.htm

http://www.fileformat.info/info/unicode/char/2dc/index.htm

How do I make the d() function display 152 instead of 732? (Is it a charset issue?) Thanks.

TEST.JAVA

public void doPost(HttpServletRequest req, HttpServletResponse res)
throws ServletException, IOException
{
    res.setHeader("Content-Type", "text/javascript;charset=ISO-8859-1");
    res.setHeader("Content-Disposition","attachment;filename=test.js");
    res.setCharacterEncoding("ISO-8859-1");
    PrintWriter printer=res.getWriter();
    printer.write("function d(a){a=(a+\"\").split(\"\");alert(a[0].charCodeAt(0));};d(\""); // Writes beginning of d() function
    printer.write('\u0098'); // Writes special character as parameter of d()
    printer.write("\");"); // Writes end of d() function
    printer.close();
}

TEST.JS generated by TEST.JAVA

function d(a)
{
  a=(a+"").split("");
  alert(a[0].charCodeAt(0));
};
d("˜"); // Note special character representing '\u0098'

test.html

<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>
<body>
<script type="text/javascript" charset="ISO-8859-1" src="test.js"></script>
</body>
</html>

javascript unicode servlets character-encoding iso-8859-1

Arturo 09 Apr 12 at 22:19

2 answers

Try:

    printer.write("\\u0098");

JavaScript understands \uNNNN escapes too, so you can explicitly build the string with the desired character code.

Pointy 09 Apr 12 at 22:22

bobince · Accepted Answer · 2012-04-09T23:59:09+0000

"Note that characters 152 and 732 are both represented by the special character ˜, as shown below."

Not really. ˜ is unambiguously the character U+02DC (732), so charCodeAt is doing the right thing. The character U+0098 (152) is an invisible control code that is almost never used.

The trick is that "ISO-8859-1" has a different meaning to Java and to web browsers. For Java, it really is the ISO-8859-1 standard, which maps exactly onto the first 256 Unicode code points. That includes a range of little-used C1 control characters at 128-159.

However, for a web browser, "ISO-8859-1" actually means Windows code page 1252 (Western European), an encoding that puts assorted useful characters in the 128-159 block instead. This behaviour derives from early web browsers that simply used the machine's default code page. When proper Unicode and encoding support was added to browsers, compatibility concerns dictated continued support for the Windows characters, despite their being mislabelled as ISO-8859-1.

So when you write the character U+0098 from Java as ISO-8859-1, you get the byte 0x98, which the browser then reads as U+02DC. This is usually harmless, because no one really wants to use the C1 control codes in the U+0080-U+009F range. But it is certainly confusing.
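To make the mismatch concrete, here is a minimal sketch that models how a single byte is decoded under each interpretation. The names CP1252_HIGH, decodeLatin1Byte and decodeCp1252Byte are illustrative only, not part of any API, and the sketch assumes that the five bytes cp1252 leaves unassigned map to the C1 control of the same number, as the WHATWG Encoding Standard's windows-1252 index does.

```javascript
// Code points that windows-1252 assigns to bytes 0x80-0x9F, in byte order.
// Assumption: the unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to the
// C1 control with the same number.
var CP1252_HIGH = [
  0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
  0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
];

// True ISO-8859-1 (what Java writes): the byte value is the code point.
function decodeLatin1Byte(b) {
  return b;
}

// What a browser does with the "ISO-8859-1" label: bytes 0x80-0x9F go
// through the windows-1252 table, everything else is unchanged.
function decodeCp1252Byte(b) {
  return (b >= 0x80 && b <= 0x9F) ? CP1252_HIGH[b - 0x80] : b;
}

console.log(decodeLatin1Byte(0x98)); // 152 (U+0098, the C1 control)
console.log(decodeCp1252Byte(0x98)); // 732 (U+02DC, the small tilde)
```

Outside the 0x80-0x9F block the two functions agree, which is why the mislabelling usually goes unnoticed.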

This ancient quirk, along with the related one of treating character references &#...; in the 128-159 range as cp1252 bytes, is finally documented and standardized as part of HTML5, but only for the HTML parsing rules. (Not XHTML5, as that follows the saner XML rules.) So the fileformat.info page quoted above seems misleading in showing U+0098 rendered as ˜.

If you really need to recover the cp1252 byte number of a character, you will have to use a lookup table, because this information is not made visible to JavaScript. For example:

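// Characters that cp1252 maps to bytes 0x80-0x9F, in byte order. The bytes
// cp1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) repeat the preceding
// character as a placeholder, so indexOf never returns their positions.
var CP1252EXTRAS= '\u20ac\u20ac\u201a\u0192\u201e\u2026\u2020\u2021\u02c6\u2030\u0160\u2039\u0152\u0152\u017d\u017d\u017d\u2018\u2019\u201c\u201d\u2022\u2013\u2014\u02dc\u2122\u0161\u203a\u0153\u0153\u017e\u0178';

// Returns the cp1252 byte for a single-character string, or -1 if none exists.
function getCodePage1252Byte(s) {
    var ix= CP1252EXTRAS.indexOf(s);
    if (ix!==-1)
        return 128+ix;
    // Outside 128-159, cp1252 matches the first 256 Unicode code points.
    var c= s.charCodeAt(0);
    if (c<128 || c>=160 && c<256)
        return c;
    return -1;
}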

You probably don't want to do this, though. In general the answer is not to use ISO-8859-1 at all, but to stick to good old UTF-8 ("the only reasonable encoding").

In any case, <script charset="..."> is not supported by every browser, and neither is Content-Type: text/javascript;charset=... . There is no reliable way to serve JavaScript in a different encoding from the including page. Unless you are 100% sure that every including page uses exactly the same encoding as your script, the only safe way forward is to keep your JavaScript ASCII-safe by outputting JavaScript \unnnn escape sequences instead of literal bytes.

(An ASCII-safe JSON encoder can help you do this.)
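As a rough illustration of what "ASCII-safe" output means, the sketch below turns a string into a quoted JavaScript literal in which every non-ASCII UTF-16 code unit is written as a \uNNNN escape. It is written in JavaScript for brevity; in the setup from the question the same transformation would have to happen on the server before the script is written out. The function name toAsciiSafeLiteral is hypothetical.

```javascript
// Build a double-quoted string literal containing only ASCII bytes.
// JSON.stringify handles quotes, backslashes and low control characters;
// the replace step escapes every remaining code unit from U+0080 upward.
function toAsciiSafeLiteral(s) {
  return JSON.stringify(s).replace(/[\u0080-\uffff]/g, function (ch) {
    var hex = ch.charCodeAt(0).toString(16);
    return '\\u' + ('0000' + hex).slice(-4);
  });
}

var literal = toAsciiSafeLiteral('\u0098');
console.log(literal);                              // "\u0098" (8 ASCII characters)
console.log(JSON.parse(literal).charCodeAt(0));    // 152
```

Because the output contains only ASCII, it decodes identically whether the browser treats the file as ISO-8859-1, cp1252 or UTF-8, so charCodeAt reports 152 regardless of the page's charset.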
