Testing Unicode Query String Handling in Perl

Question

I am trying to write an example that tests query string parsing, and I have run into a Unicode problem. In short, the letter "Omega" (Ω) does not seem to be decoded correctly.

Unicode: U+2126
3-byte sequence: \xe2\x84\xa6
URI encoded: %E2%84%A6

So I wrote this test program to check that I can "decode" Unicode query strings using URI::Encode.

use strict;
use warnings;
use utf8::all;    # use before Test::Builder clones STDOUT, etc.
use URI::Encode 'uri_decode';
use Test::More;

sub parse_query_string {
    my $query_string = shift;
    my @pairs = split /[&;]/ => $query_string;

    my %values_for;
    foreach my $pair (@pairs) {
        my ( $key, $value ) = split( /=/, $pair );
        $_ = uri_decode($_) for $key, $value;
        $values_for{$key} ||= [];
        push @{ $values_for{$key} } => $value;
    }
    return \%values_for;
}

my $omega = "\N{U+2126}";
my $query = parse_query_string('alpha=%E2%84%A6');
is_deeply $query, { alpha => [$omega] }, 'Unicode should decode correctly';

diag $omega;
diag $query->{alpha}[0];

done_testing;

And the test result:

query.t .. 
not ok 1 - Unicode should decode correctly
#   Failed test 'Unicode should decode correctly'
#   at query.t line 23.
#     Structures begin differing at:
#          $got->{alpha}[0] = 'â¦'
#     $expected->{alpha}[0] = 'Ω'
# Ω
# â¦
1..1
# Looks like you failed 1 test of 1.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests 

Test Summary Report
-------------------
query.t (Wstat: 256 Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
Files=1, Tests=1,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.05 cusr  0.00 csys =  0.09 CPU)
Result: FAIL

It seems to me that URI::Encode might be broken here, but switching to URI::Escape and using the uri_unescape function reports the same error. What am I missing?

+3

query-string perl unicode testing

Ovid Apr 10 12 at 9:45 am


4 answers

URI escaping represents octets and knows nothing about character encodings, so you must decode from UTF-8 octets to the characters themselves, for example:

$_ = decode_utf8(uri_decode($_)) for $key, $value;
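For illustration, here is a minimal sketch of the question's parser with that second decoding step added. To keep it dependent on core modules only, it uses a hypothetical helper, percent_decode, as a stand-in for uri_decode; that helper and the limit argument to split are assumptions of this sketch, not part of the original code.

```perl
use strict;
use warnings;
use Encode 'decode_utf8';

# Hypothetical stand-in for uri_decode/uri_unescape: %XX escapes -> raw octets.
sub percent_decode {
    my $s = shift;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

sub parse_query_string {
    my $query_string = shift;
    my %values_for;
    foreach my $pair ( split /[&;]/, $query_string ) {
        # Limit of 2 keeps any '=' inside the value intact.
        my ( $key, $value ) = split /=/, $pair, 2;
        # Undo both layers: percent-escapes first, then UTF-8.
        $_ = decode_utf8( percent_decode( $_ // '' ) ) for $key, $value;
        push @{ $values_for{$key} }, $value;
    }
    return \%values_for;
}

my $query = parse_query_string('alpha=%E2%84%A6');
print "decoded to U+2126\n" if $query->{alpha}[0] eq "\N{U+2126}";
```

With uri_decode from URI::Encode in place of percent_decode, the loop body should look like the one-liner above, plus a `use Encode 'decode_utf8';` line at the top of the file.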

+5

ilmari Apr 10 12 at 10:06


The problem can be seen in the imprecise details of your own description of the problem. What you are actually dealing with is:

Unicode code point: U+2126
UTF-8 encoding of the code point: \xe2\x84\xa6
URI encoding of the UTF-8 encoding: %E2%84%A6

The problem is that you only reversed one of the encodings.

The solutions have already been presented. I just wanted to provide an alternative explanation.
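As a rough sketch of those layers, the following applies both encodings to U+2126 and then reverses them in the opposite order. The percent-encoding and decoding are hand-rolled with regexes purely for illustration (this version escapes every octet, which a real URI encoder would not do); the variable names are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $char = "\N{U+2126}";              # layer 0: one Unicode code point

# Layer 1: UTF-8 encode the code point into octets.
my $octets = encode_utf8($char);      # "\xE2\x84\xA6"

# Layer 2: percent-encode each octet (illustrative, escapes everything).
( my $escaped = $octets ) =~ s/(.)/sprintf( '%%%02X', ord($1) )/ges;

# Reverse the layers in the opposite order: percent-decode, then UTF-8 decode.
( my $unescaped = $escaped ) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
my $roundtrip = decode_utf8($unescaped);

print "$escaped\n";                   # prints %E2%84%A6
print "round trip ok\n" if $roundtrip eq $char;
```

Stopping after the percent-decoding step leaves you holding $unescaped, the three UTF-8 octets, which is the situation in the question.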

+4

ikegami Apr 10 12 at 16:05


I would recommend that you take a look at "Why does modern Perl avoid UTF-8 by default?" for a detailed discussion of this topic.

I would add to the discussion:

You will notice a lot of odd glyphs on the page. This was intentional on the part of the author.
I tried the Symbola font recommended in the thread, and it looked terrible on Win 7. YMMV.
Reading "Why does modern Perl avoid UTF-8 by default?" too often can lead to depression and lingering doubts about your life choices.

0

converter42 Apr 10 12 at 13:20


miyagawa · Accepted Answer · 2012-04-10T10:06:12+0000

URI-encoded characters just represent UTF-8 octet sequences. URI::Encode and URI::Escape just decode them to a UTF-8 byte string; neither decodes the bytes as UTF-8 (which is the correct behavior for a generic URI decoding library).

Put differently, your code basically does:

is "\N{U+2126}", "\xe2\x84\xa6"

and it fails because, when comparing, perl upgrades the latter to a 3-character Latin-1 string.
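A small sketch of that mismatch, using only core Perl: it prints the length and code points of each side of the comparison. The variable names are chosen for this example.

```perl
use strict;
use warnings;

my $expected = "\N{U+2126}";      # one character
my $got      = "\xe2\x84\xa6";    # three characters, each below U+0100

# Show how many characters each string holds and which code points they are.
printf "expected: %d char(s): %s\n", length($expected),
    join( ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $expected );
printf "got:      %d char(s): %s\n", length($got),
    join( ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $got );

print "strings differ\n" if $got ne $expected;
```

The second line of output lists U+00E2 U+0084 U+00A6, three separate Latin-1 range characters rather than the single U+2126 the test expects.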

You have to manually decode the input value with Encode::decode_utf8 after uri_decode, or compare the UTF-8 encoded byte sequence instead.
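The second option can be sketched as follows, assuming the value under test is the raw octet string that URI decoding returns (hard-coded here as $got for illustration). Instead of decoding the result, the expectation is encoded to UTF-8 so that both sides are byte strings.

```perl
use strict;
use warnings;
use Encode 'encode_utf8';
use Test::More;

my $got      = "\xe2\x84\xa6";              # raw octets, as URI decoding returns them
my $expected = encode_utf8("\N{U+2126}");   # encode the expectation instead

is $got, $expected, 'octets match the UTF-8 encoding of U+2126';
done_testing;
```

Which side to convert is a design choice: decoding the input early keeps the rest of the program working with characters, while comparing octets only makes the test pass and leaves the parser returning bytes.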
