Abnormal behavior when comparing a Unicode character to a Unicode character range

Question

For some reason, I am getting unexpected results when comparing Unicode character ranges.

To summarize, my minimal test ("\u1000".."\u1200") === "\u1100" returns false where I expect it to be true, while the same test against "\u1001" returns true as expected. I find this completely incomprehensible. The results of the < operator are also interesting: they contradict ===.

The following code is a good minimal illustration:

# encoding: utf-8

require 'pp'

a = "\u1000"
b = "\u1200"

r = (a..b)

x = "\u1001"
y = "\u1100"

pp a, b, r, x, y

puts "a < x = #{a < x}"
puts "b > x = #{b > x}"

puts "a < y = #{a < y}"
puts "b > y = #{b > y}"

puts "r === x = #{r === x}"
puts "r === y = #{r === y}"

I would naively expect both === operations to result in "true" here. However, the actual output from this program is:

ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.3.0]
"\u1000"
"\u1200"
"\u1000".."\u1200"
"\u1001"
"\u1100"
a < x = true
b > x = true
a < y = true
b > y = true
r === x = true
r === y = false

Can anyone enlighten me?

(Note that I'm on 1.9.3 on Mac OS X and I'm explicitly setting the encoding to utf-8.)


ruby

Perry 04 Apr 12 at 22:35


2 answers

It looks like Range doesn't mean what we think it means.

What I think is happening is that you are creating a range that tries to include letters, numbers and punctuation marks. Ruby cannot do this and does not "understand" that you essentially want an array of code points.

This causes the Range#to_a method to break down:

("\u1000".."\u1099").to_a.size  #=> 55
("\u1100".."\u1199").to_a.size  #=> 154
("\u1200".."\u1299").to_a.size  #=> 73

The zinger is when you combine all three:

("\u1000".."\u1299").to_a.size  #=> 55

Ruby 1.8.7 appears to work as expected, but as Matt points out in the comments, that is because "\u1000" there is just the literal string "u1000": 1.8.7 does not treat it as a Unicode escape.

The C source for String#succ shows that it doesn't just return the next code point:

Returns the successor to <i>str</i>. The successor is calculated by
incrementing characters starting from the rightmost alphanumeric (or
the rightmost character if there are no alphanumerics) in the
string. Incrementing a digit always results in another digit, and
incrementing a letter results in another letter of the same case.
Incrementing nonalphanumerics uses the underlying character set's
collating sequence.
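As a small illustration of the rules quoted above, the following sketch calls String#succ on a few ASCII strings. The variable names examples and successors are only for this illustration; the inputs are chosen to show a letter carry, a digit carry, and the non-alphanumeric fallback.

```ruby
# Each input exercises one rule from the String#succ documentation.
examples = ["az", "zz", "a9", "9", "<<koala>>", "***"]

# Pair every input with its successor.
successors = examples.map { |s| [s, s.succ] }

successors.each { |s, nxt| puts "#{s.inspect}.succ => #{nxt.inspect}" }
# "az".succ => "ba"                (rightmost letter wraps, carry moves left)
# "zz".succ => "aaa"               (carry past the leftmost char adds one)
# "a9".succ => "b0"                (digit wraps to a digit, letter is bumped)
# "9".succ => "10"
# "<<koala>>".succ => "<<koalb>>"  (only the rightmost alphanumeric changes)
# "***".succ => "**+"              (no alphanumerics: rightmost char is bumped)
```

None of these results is simply "the next code point after the last character", which is the behavior the question assumed.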

Range is doing something different from just next, next, next.

A range with these characters follows the ASCII sequence:

('8'..'A').to_a
=> ["8", "9", ":", ";", "<", "=", ">", "?", "@", "A"]

But using #succ is completely different:

'8'.succ
=> '9'

'9'.succ
=> '10'  # if we were in a Range.to_a, this would be ":"


joelparkerhenderson 04 Apr at 11:17 pm


dbenhur · Accepted Answer · 2012-04-04T23:41:43+0000

ACTION: I've submitted this behavior as bug #6258 to ruby-lang.

There's something strange about the collation order in this range of characters:

irb(main):081:0> r.to_a.last.ord.to_s(16)
=> "1036"
irb(main):082:0> r.to_a.last.succ.ord.to_s(16)
=> "1000"
irb(main):083:0> r.min.ord.to_s(16)
=> "1000"
irb(main):084:0> r.max.ord.to_s(16)
=> "1200"

The min and max for the range are the expected values from your input, but if we turn the range into an array, the last element is "\u1036" and its successor is "\u1000". Under the covers, Range#=== must be enumerating the String#succ sequence rather than simply bounds-checking against min and max.
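One way to locate such a wrap-around without reading irb output by hand is to walk the succ sequence and stop at the first step whose code point does not increase. The sketch below does that; first_succ_anomaly and its max_steps guard are hypothetical names introduced here, not part of Ruby, and the result for the Unicode range may differ between Ruby versions, so no output is claimed for it.

```ruby
# Hypothetical helper: walk from `from` towards `to` using String#succ and
# return the first [string, successor] pair where the successor's leading
# code point is not greater than the current one. Returns nil if the walk
# reaches `to` (or exhausts max_steps) without finding such a step.
def first_succ_anomaly(from, to, max_steps = 0x10000)
  current = from
  max_steps.times do
    return nil if current == to
    nxt = current.succ
    return [current, nxt] if nxt.ord <= current.ord
    current = nxt
  end
  nil
end

pair = first_succ_anomaly("\u1000", "\u1200")
if pair
  puts "succ anomaly: U+%04X -> U+%04X" % pair.map(&:ord)
else
  puts "no anomaly found on this Ruby"
end
```

On the 1.9.3 build from the question, the irb session above suggests this would report the step from U+1036 to U+1000.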

If we look at the source (click to toggle it) for Range#===, we see that it dispatches to Range#include?. The source of Range#include? shows special handling for strings: if the answer can be determined from string length alone, or all the strings involved are ASCII, we get simple bounds checking. Otherwise it dispatches to super, which means #include? is answered by Enumerable#include?, which enumerates using Range#each, which again has special handling for strings and dispatches to String#upto, which enumerates with String#succ.
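If a plain bounds check is what you want, Range#cover? (available in Ruby 1.9) compares its argument against the endpoints with <=> and never enumerates the range. A minimal sketch, reusing the values from the question:

```ruby
r = ("\u1000".."\u1200")
y = "\u1100"

# cover? only asks whether r.begin <= y <= r.end.
covered = r.cover?(y)

# The same check written out by hand.
manual = (r.begin <= y) && (y <= r.end)

puts "r.cover?(y)   = #{covered}"   # r.cover?(y)   = true
puts "manual bounds = #{manual}"    # manual bounds = true
```

Note that cover? is also true for longer strings that sort between the endpoints, for example ("a".."z").cover?("cc"), so it is not a test for "is a single character in this block".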

String#succ has a bunch of special handling when the string contains is_alpha or is_digit characters (which shouldn't be true for U+1036); otherwise it increments the final character using enc_succ_char. At this point I lose the trail, but presumably it calculates a successor using the encoding and collation information associated with the string.

BTW, as a workaround, you can use a range of integer ordinals and test against ordinals if you only care about single characters. E.g.:

r = (a.ord..b.ord)
r === x.ord
r === y.ord
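The workaround can be wrapped in a small helper. This is only a sketch: codepoint_in_range? is a hypothetical name, and it assumes every argument is a single-character string, since String#ord looks only at the first character.

```ruby
# Hypothetical helper: true if the first code point of `char` lies between
# the first code points of `first` and `last`, inclusive. Integer ranges are
# checked by bounds, so String#succ is never involved.
def codepoint_in_range?(char, first, last)
  (first.ord..last.ord) === char.ord
end

a = "\u1000"
b = "\u1200"

puts codepoint_in_range?("\u1001", a, b)  # true
puts codepoint_in_range?("\u1100", a, b)  # true
puts codepoint_in_range?("\u1201", a, b)  # false
```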
