Testing Unicode Query String Handling in Perl
I was trying to write an example for testing query string parsing when I ran into a Unicode problem. In short, the letter "Omega" (Ω) does not seem to be decoded correctly.
- Unicode: U+2126
- 3-byte sequence: \xe2\x84\xa6
- URI encoded: %E2%84%A6
So I wrote this test program to check that I can "decode" Unicode query strings using URI::Encode.
use strict;
use warnings;
use utf8::all; # use before Test::Builder clones STDOUT, etc.
use URI::Encode 'uri_decode';
use Test::More;

sub parse_query_string {
    my $query_string = shift;
    my @pairs = split /[&;]/ => $query_string;
    my %values_for;
    foreach my $pair (@pairs) {
        my ( $key, $value ) = split( /=/, $pair );
        $_ = uri_decode($_) for $key, $value;
        $values_for{$key} ||= [];
        push @{ $values_for{$key} } => $value;
    }
    return \%values_for;
}

my $omega = "\N{U+2126}";
my $query = parse_query_string('alpha=%E2%84%A6');

is_deeply $query, { alpha => [$omega] }, 'Unicode should decode correctly';
diag $omega;
diag $query->{alpha}[0];
done_testing;
And the test result:
query.t ..
not ok 1 - Unicode should decode correctly
# Failed test 'Unicode should decode correctly'
# at query.t line 23.
# Structures begin differing at:
# $got->{alpha}[0] = 'â¦'
# $expected->{alpha}[0] = 'Ω'
# Ω
# â¦
1..1
# Looks like you failed 1 test of 1.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests
Test Summary Report
-------------------
query.t (Wstat: 256 Tests: 1 Failed: 1)
Failed test: 1
Non-zero exit status: 1
Files=1, Tests=1, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.05 cusr 0.00 csys = 0.09 CPU)
Result: FAIL
It seems to me that URI::Encode might be broken here, but switching to URI::Escape and using the uri_unescape function reports the same error. What am I missing?
URI-encoded characters simply represent a UTF-8 byte sequence. URI::Encode and URI::Escape decode them to a UTF-8 byte string, and neither goes on to decode those bytes as UTF-8 into characters (which is the correct behavior for a general-purpose URI decoding library).
Put differently, your code basically does:
is "\N{U+2126}", "\xe2\x84\xa6"
and it will fail, since when comparing, Perl upgrades the latter to a 3-character Latin-1 string.
You have to manually decode the input value with Encode::decode_utf8 after uri_decode, or compare the UTF-8 encoded byte sequence instead.
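Both options can be sketched roughly as follows. This is only an illustration, not the original poster's final code: percent_decode is a hypothetical helper standing in for uri_decode / uri_unescape so that the example runs with core modules only, and the two-argument-limited split and the empty-string default are additions of this sketch.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for uri_decode / uri_unescape:
# turns each %XX escape into the corresponding raw byte.
sub percent_decode {
    my $s = shift;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

sub parse_query_string {
    my $query_string = shift;
    my %values_for;
    for my $pair ( split /[&;]/, $query_string ) {
        my ( $key, $value ) = split /=/, $pair, 2;
        $value = '' unless defined $value;

        # Undo the percent-encoding first, then decode the UTF-8 bytes
        # into a character string.
        $_ = decode( 'UTF-8', percent_decode($_) ) for $key, $value;
        push @{ $values_for{$key} }, $value;
    }
    return \%values_for;
}

# Option 1: decode the bytes, then compare characters with characters.
my $query = parse_query_string('alpha=%E2%84%A6');
my $got   = $query->{alpha}[0];

# Option 2: leave the bytes alone and compare against the UTF-8 encoding
# of the expected character.
my $raw_bytes      = percent_decode('%E2%84%A6');
my $expected_bytes = encode( 'UTF-8', "\x{2126}" );
```

With option 1, $got is a one-character string equal to "\x{2126}"; with option 2, both sides of the comparison are three-byte strings.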
The problem is revealed by an inaccuracy in your own description of it. What you are actually dealing with is:
- Unicode code point: U+2126
- UTF-8 encoding of that code point: \xe2\x84\xa6
- URI encoding of those UTF-8 bytes: %E2%84%A6
The problem is that you only undid one of the two encodings.
The solutions have already been presented. I just wanted to provide an alternative explanation.
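To make the layering concrete, here is a minimal round trip using only the core Encode module. The percent-escaping and unescaping regexes are hand-written assumptions for illustration, not the implementation used by URI::Encode or URI::Escape.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Layer 0: one Unicode character (code point U+2126).
my $char = "\x{2126}";

# Layer 1: UTF-8 encode the character into three bytes.
my $bytes = encode( 'UTF-8', $char );

# Layer 2: percent-escape every byte outside the unreserved set.
( my $escaped = $bytes ) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;

# Going back, the layers must be removed in reverse order.
# Undo layer 2: percent-unescape back to raw bytes.
( my $unescaped = $escaped ) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

# Undo layer 1: decode the UTF-8 bytes back into a character.
my $decoded = decode( 'UTF-8', $unescaped );
```

Stopping after the unescape step leaves $unescaped as the three bytes \xe2\x84\xa6, which is what the failing test compared against the single character in $char.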
I would recommend that you take a look at Why does modern Perl avoid UTF-8 by default? for a detailed discussion of this topic.
I would add to the discussion:
- You will notice a lot of odd glyphs on the page. This was intentional on the part of the author.
- I tried the Symbola font recommended in the thread and it looked terrible on Windows 7. YMMV.
- Reading Why does modern Perl avoid UTF-8 by default? too often can lead to depression and lingering doubts about your life choices.