Why do sets of characters appear to be ordered?

I have always thought that sets are unordered, but I noticed that sets of characters appear to be ordered:

(seq #{\e \c \b \z \a}) 

=> (\a \b \c \e \z)

      

If I include other kinds of characters, they appear to be ordered by character code:

(seq #{\e \A \c \space \b \z \a})

=> (\space \A \a \b \c \e \z)

      

Why are characters sorted by their code, while sets of numbers come out in arbitrary order?



1 answer


This is because Character/hashCode is directly tied to the ordinal of the character, and sets are based on hash maps. But if you add enough characters that their hashes start to collide within the hash map's buckets, the apparent ordering no longer holds:

; the whole alphabet is small enough to avoid collisions
user=> (apply str (set "abcdefghijklmnopqrstuvwxyz"))
"abcdefghijklmnopqrstuvwxyz"
; and observe the hashes are indeed sequential
user=> (map hash (set "abcdefghijklmnopqrstuvwxyz"))
(97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122)

; but go from 26 to 36 elements, and you start to see collisions
user=> (apply str (set "0123456789abcdefghijklmnopqrstuvwxyz"))
"abcdefghijklmno0p1q2r3s4t5u6v7w8x9yz"
user=> (map hash (set "0123456789abcdefghijklmnopqrstuvwxyz"))
(97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 48 112 49 113 50 114 51 115 52 116 53 117 54 118 55 119 56 120 57 121 122)
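To see where that interleaving comes from, here is a rough sketch. It assumes, as far as I understand the current implementation, that the hash map behind a set is a trie that consumes the hash five bits at a time, lowest bits first. trie-key is a made-up helper for illustration, not part of Clojure.

```clojure
;; Assumption: the set's backing trie indexes each level by 5 bits of the hash,
;; starting from the lowest bits. trie-key is a hypothetical helper.
(defn trie-key [c]
  (let [h (hash c)]
    [(bit-and h 31)                        ; slot at the first level
     (bit-and (bit-shift-right h 5) 31)])) ; slot at the second level

;; \0 (hash 48) and \p (hash 112) share first-level slot 16
(map trie-key [\0 \p])
;; => ([16 1] [16 3])

;; sorting by those slots reproduces the order seen above
(apply str (sort-by trie-key "0123456789abcdefghijklmnopqrstuvwxyz"))
;; => "abcdefghijklmno0p1q2r3s4t5u6v7w8x9yz"
```

With only a to z, every character gets its own first-level slot (1 through 26), so the slots happen to follow the character codes. The digits reuse slots 16 through 25, which is why each one lands next to a letter.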

      



But of course, as you know, this is not specified behavior, just how the implementation currently works.
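If you actually need the elements in order, a sorted set is probably the safer choice, since it orders by compare rather than by hash. A minimal sketch:

```clojure
;; sorted-set orders elements with compare, independent of their hashes
(seq (sorted-set \e \A \c \space \b \z \a))
;; => (\space \A \a \b \c \e \z)

;; the 36-character example stays in character-code order too
(apply str (apply sorted-set "0123456789abcdefghijklmnopqrstuvwxyz"))
;; => "0123456789abcdefghijklmnopqrstuvwxyz"
```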

Now you're asking why this doesn't happen for numbers: the reason is that Clojure deliberately avoids it. (.hashCode 1) returns 1, as Java defines its hash codes, but Clojure's hash function uses murmur3, which returns very different values for numbers rather than just returning the input: (hash 1) gives 1392991556. I'm not an expert on this, but I think the main motivation for using murmur3 instead of Java's built-in hash function is to avoid hash collisions, for security reasons. Timing attacks or something?
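You can compare the two hash functions side by side at the REPL. This is a small sketch that assumes Clojure 1.6 or later, which I believe is where the murmur3 hashing came in:

```clojure
;; characters: Clojure's hash matches Java's hashCode, which is the char code
(map (juxt int #(.hashCode %) hash) [\a \b])
;; => ([97 97 97] [98 98 98])

;; longs: hashCode is the value itself, while hash is scrambled
(.hashCode 1)  ; => 1
(hash 1)       ; => 1392991556
```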
