How to handle GET parameters containing non-utf8 characters?

In a Node.js / Express app, I have to deal with GET requests whose parameters may be encoded in ISO-8859-1.

Unfortunately, the qs parser seems to handle only ASCII and UTF-8:

> qs.parse('foo=bar&xyz=foo%20bar')
{ foo: 'bar', xyz: 'foo bar' }        // works fine
> qs.parse('foo=bar&xyz=T%FCt%20T%FCt')
{ foo: 'bar', xyz: 'T%FCt%20T%FCt' }  // ISO-8859-1 breaks; should be "Tüt Tüt"
> qs.parse('foo=bar&xyz=m%C3%B6p')
{ foo: 'bar', xyz: 'möp' }            // UTF-8 works fine

Is there a hidden option or some other clean way to make this work with other encodings? The main problem with the default behavior is that I cannot tell whether decoding failed: the returned value could just as well have been a literal string that merely looks like a percent-encoded sequence.



2 answers


Well, URL encoding should always be UTF-8; anything else can be treated as an encoding attack, and you can simply reject the request. There is no such thing as a non-UTF-8 character. I don't know why your application receives query strings in other encodings, but browsers will do the right thing as long as you declare the encoding in your pages. For API requests, you can specify UTF-8 and reject invalid UTF-8 as a bad request.
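To make the "reject invalid UTF-8" approach concrete, here is a minimal sketch: `decodeURIComponent` throws a `URIError` on percent-escapes that are not valid UTF-8, so wrapping it lets you fail fast instead of passing garbled values through. The function name `strictDecode` is just an illustration, not part of any library.

```javascript
// Sketch: strict decoding that signals failure instead of silently
// passing the raw string through (strictDecode is a hypothetical name).
function strictDecode(value) {
  try {
    // decodeURIComponent throws URIError on byte sequences that are
    // not valid UTF-8, e.g. the lone 0xFC from ISO-8859-1 "ü".
    return decodeURIComponent(value);
  } catch (e) {
    return null; // signal "bad request" to the caller
  }
}

console.log(strictDecode('m%C3%B6p'));      // valid UTF-8: "möp"
console.log(strictDecode('T%FCt%20T%FCt')); // invalid UTF-8: null
```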

If you really mean ISO-8859-1, then this is very simple, because its byte values match the Unicode code points exactly.



// Each %XX escape is one ISO-8859-1 byte, and ISO-8859-1 bytes map 1:1
// to Unicode code points U+0000..U+00FF.
'T%FCt%20T%FCt'.replace( /%([a-f0-9]{2})/gi, function( f, m1 ) {
    return String.fromCharCode(parseInt(m1, 16));
}); // => 'Tüt Tüt'

Although on the web it will practically never actually be ISO-8859-1; in practice it is Windows-1252.
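If you do want to treat the input as Windows-1252, the only difference from ISO-8859-1 is the 0x80-0x9F range, which Windows-1252 maps to printable characters such as curly quotes and the euro sign. A sketch (the `decode1252` name and the inline table are illustrative, not from any library):

```javascript
// Windows-1252 overrides for the 0x80-0x9F range; all other bytes
// map to the same Unicode code point as in ISO-8859-1.
const cp1252 = {
  0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
  0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030, 0x8A: 0x0160,
  0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019,
  0x93: 0x201C, 0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
  0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
  0x9E: 0x017E, 0x9F: 0x0178
};

function decode1252(str) {
  return str.replace(/%([a-f0-9]{2})/gi, function (f, hex) {
    const byte = parseInt(hex, 16);
    return String.fromCharCode(cp1252[byte] || byte);
  });
}

console.log(decode1252('T%FCt%20T%FCt')); // "Tüt Tüt"
console.log(decode1252('%93quoted%94'));  // curly quotes: "quoted"
```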



Perhaps node-iconv is the solution. Do you know in advance which encoding is used?



var qs = require('qs');
var Iconv = require('iconv').Iconv;

var parsed = qs.parse('foo=bar&xyz=T%FCt%20T%FCt');
// qs leaves the value percent-encoded because it is not valid UTF-8,
// so decode the %XX escapes into raw ISO-8859-1 bytes first.
var bytes = Buffer.from(parsed.xyz.replace(/%([a-f0-9]{2})/gi, function (f, hex) {
    return String.fromCharCode(parseInt(hex, 16));
}), 'binary');

var iconv = new Iconv('ISO-8859-1', 'UTF-8');
var xyz = iconv.convert(bytes).toString(); // "Tüt Tüt"







