Abnormal behavior when comparing a Unicode character to a Unicode character range
For some reason, I am getting unexpected results when comparing Unicode character ranges.
To summarize, my minimal test code ("\u1000".."\u1200") === "\u1100" is false where I expect it to be true, while the same test against "\u1001" is true, as expected. I find this completely incomprehensible. The results of the < operator are also interesting: they contradict ===.
The following code is a good minimal illustration:
# encoding: utf-8
require 'pp'
a = "\u1000"
b = "\u1200"
r = (a..b)
x = "\u1001"
y = "\u1100"
pp a, b, r, x, y
puts "a < x = #{a < x}"
puts "b > x = #{b > x}"
puts "a < y = #{a < y}"
puts "b > y = #{b > y}"
puts "r === x = #{r === x}"
puts "r === y = #{r === y}"
I would naively expect both === operations to result in "true" here. However, the actual output from this program is:
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.3.0]
"\u1000"
"\u1200"
"\u1000".."\u1200"
"\u1001"
"\u1100"
a < x = true
b > x = true
a < y = true
b > y = true
r === x = true
r === y = false
Can anyone enlighten me?
(Note that I'm on 1.9.3 on Mac OS X and I'm explicitly setting the encoding to utf-8.)
ACTION: I've filed this behavior as bug #6258 with ruby-lang.
There is something strange about the collation order in this range of characters:
irb(main):081:0> r.to_a.last.ord.to_s(16)
=> "1036"
irb(main):082:0> r.to_a.last.succ.ord.to_s(16)
=> "1000"
irb(main):083:0> r.min.ord.to_s(16)
=> "1000"
irb(main):084:0> r.max.ord.to_s(16)
=> "1200"
The min and max for the range are the expected values from your input, but if we turn the range into an array, the last element is "\u1036" and its successor wraps back around to "\u1000". Under the hood, Range#=== must be enumerating the String#succ sequence rather than simply bounds-checking against min and max.
If we look at the source for Range#===, we see that it dispatches to Range#include?. The Range#include? source shows special handling for strings: if the answer can be determined from the string lengths alone, or all the strings involved are ASCII, we get a simple bounds check. Otherwise it calls super, which means #include? is answered by Enumerable#include?, which enumerates using Range#each, which again has special handling for strings and dispatches to String#upto, which enumerates with String#succ.
String#succ has a bunch of special handling when the string contains is_alpha or is_digit characters (which shouldn't be true for U+1036); otherwise it increments the final character using enc_succ_char. At this point I lose the trail, but presumably it calculates the successor using the encoding and collation information associated with the string.
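To see the difference between a bounds check and an enumeration directly, here is a minimal sketch. It relies on Range#cover?, which only compares the argument against the endpoints with <=> and never calls String#succ. The helper names bounds_check? and enumerated_check? are made up for illustration, and what include? returns depends on the Ruby version, so no expected value is shown for it.

```ruby
# Hypothetical helper: pure bounds check, compares against the endpoints only.
def bounds_check?(lo, hi, str)
  (lo..hi).cover?(str)
end

# Hypothetical helper: goes through Range#include?, which may enumerate
# with String#succ for non-ASCII strings, depending on the Ruby version.
def enumerated_check?(lo, hi, str)
  (lo..hi).include?(str)
end

lo = "\u1000"
hi = "\u1200"

["\u1001", "\u1100"].each do |probe|
  code = probe.ord.to_s(16)
  puts "U+#{code}: cover? = #{bounds_check?(lo, hi, probe)}, include? = #{enumerated_check?(lo, hi, probe)}"
end
```

If the two columns disagree for U+1100 on your interpreter, you are seeing the enumeration path described above.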
BTW, as a workaround, you can use a range of integer ordinals and test against ordinals, if you only care about single characters. E.g.:
r = (a.ord..b.ord)
r === x.ord
r === y.ord
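The ordinal workaround can be wrapped up as a small helper. This is only a sketch: char_between? is a hypothetical name, and it assumes every argument is a single character, since String#ord only looks at the first character of a string.

```ruby
# Hypothetical helper: true when the single character ch lies between
# lo and hi (inclusive) by code point. Integer ranges do a plain
# numeric bounds check, so String#succ is never involved.
def char_between?(lo, hi, ch)
  (lo.ord..hi.ord) === ch.ord
end

puts char_between?("\u1000", "\u1200", "\u1001")  # true
puts char_between?("\u1000", "\u1200", "\u1100")  # true
puts char_between?("\u1000", "\u1200", "\u1201")  # false
```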
It looks like Range doesn't mean what we think it means.
What I think is happening is that you are creating a range that tries to include letters, numbers and punctuation marks. Ruby cannot do this and does not "understand" that you essentially want an array of code points.
This causes the Range#to_a method to fall apart:
("\u1000".."\u1099").to_a.size #=> 55
("\u1100".."\u1199").to_a.size #=> 154
("\u1200".."\u1299").to_a.size #=> 73
The zinger is when you put all three together:
("\u1000".."\u1299").to_a.size #=> 55
Ruby 1.8.7 appears to work as expected, but as Matt points out in the comments, that is only because "\u1000" is just the literal "u1000" there, since 1.8.7 has no Unicode support.
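If what you actually want is the array of code points mentioned above, one way to get it is to enumerate integers and convert each one back to a character. This is a sketch, not what Range does internally; chars_between is a hypothetical helper, and it assumes the range contains no code points that are invalid in UTF-8 (such as surrogates), for which Integer#chr would raise.

```ruby
# Hypothetical helper: every character from lo to hi (inclusive) in
# code point order, built from an integer range rather than String#succ.
def chars_between(lo, hi)
  (lo.ord..hi.ord).map { |cp| cp.chr(Encoding::UTF_8) }
end

chars = chars_between("\u1000", "\u1099")
puts chars.size                 # 154, i.e. 0x1099 - 0x1000 + 1
puts chars.first == "\u1000"    # true
puts chars.last == "\u1099"     # true
```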
The C source for String#succ doesn't just return the next code point:
Returns the successor to <i>str</i>. The successor is calculated by
incrementing characters starting from the rightmost alphanumeric (or
the rightmost character if there are no alphanumerics) in the
string. Incrementing a digit always results in another digit, and
incrementing a letter results in another letter of the same case.
Incrementing nonalphanumerics uses the underlying character set
collating sequence.
The range does something other than just next, next, next.
A range over these characters walks the ASCII sequence:
('8'..'A').to_a
=> ["8", "9", ":", ";", "<", "=", ">", "?", "@", "A"]
But using #succ is completely different:
'8'.succ
=> '9'
'9'.succ
=> '10' # if we were in a Range.to_a, this would be ":"