Multibyte character problem with .match?

Question

The following code is something I'm starting to test for use in the Texas Hold 'Em game I'm working on.

My question is why, when I run the following code, the output for the card containing "♥" comes back with "\u" in it. I feel confident that the multibyte character is causing the problem: for the second puts, I replaced ♦ with d in the string array and it returned what I expected. See below:

My code:

#! /usr/bin/env ruby
# encoding: utf-8

table_cards = ["|2♥|", "|8♥|", "|6d|", "|6♣|", "|Q♠|"]

# Array of cards

player_1_face_1 = "8"
player_1_suit_1 = "♦"

# Player 1 face and suit of first card he has

player_1_face_2 = "6"
player_1_suit_2 = "♥"

# Player 1 face and suit of second card he has

test_str_1 = /(\D8\D{2})/.match(table_cards.to_s)

# EX: Searching for match between face values on (player 1 |8♦|) and the |8♥| on the table

test_str_2 = /(\D6\D{2})/.match(table_cards.to_s)

# EX: Searching for match between face values on (player 1 |6♥|) and the |6d| on the table

puts "#{test_str_1}"
puts "#{test_str_2}"

Displays:

|8\u

|6d|

My goal was for the first puts to return: |8♥|

I'm not so much looking for a single solution to this (there may be more than one), but rather an explanation, as simple as possible, of what is causing this problem and why. Thanks in advance for any information on what is going on here and how I can solve it.

ruby regex multibyte

Arc_X 09 jan. '15 at 21:30

1 answer

joelparkerhenderson · Accepted Answer · 2015-01-09T22:58:40+0000

The "\u" you see is the start of a Unicode escape sequence.

For example, the Unicode character "HEAVY BLACK HEART" (U+2764) can be written as "\u2764".

A friendly site for browsing Unicode characters: http://unicode-table.com/en/sets/

Can you run interactive Ruby in your shell and print a heart like this?

irb
irb> puts "\u2764"
❤

When I run your code in my Ruby, I get the result you expect:

test_str_1 = /(\D8\D{2})/.match(table_cards.to_s)
=> #<MatchData "|8♥|" 1:"|8♥|">

What happens if you try a regex that is more specific to your cards?

 test_str_1 = /(\|8[♥♦♣♠]\|)/.match(table_cards.to_s)

In your example output, you don't see the Unicode heart character you want. Instead, your output prints "\u", which is the start of a Unicode escape, but then doesn't print the rest of the expected sequence, the four hex digits of the code point (for example "2764").

See user Tin Man's comment, which discusses the encoding of your console. If that is the cause, then I expect the more specific regex to match, but the output will still print incorrectly.
See David Knipe's comment, which says the output looks truncated because the regex matches only 4 characters. If that is the cause, I expect the more specific regex to match and also print the correct result.
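
The truncation described in David Knipe's comment can be reproduced with a short sketch. It assumes (the question does not confirm this) that Ruby's default external encoding was not UTF-8 when the script ran. In that case Array#to_s, which is an alias for inspect, writes "♥" as the six ASCII characters \u2665, and \D8\D{2} then matches only "|8\u". The names escaped, truncated and found are illustrative, not from the original code.

```ruby
# encoding: utf-8

table_cards = ["|2♥|", "|8♥|", "|6d|", "|6♣|", "|Q♠|"]

# Assumption: simulate a non-UTF-8 locale, which may mirror the asker's setup.
original_external = Encoding.default_external
Encoding.default_external = Encoding::US_ASCII

# Array#to_s is inspect, so non-ASCII characters are escaped as \uXXXX.
escaped = table_cards.to_s

# \D8\D{2} now matches "|", "8", "\" and "u": four characters, no heart.
truncated = escaped[/\D8\D{2}/]

# Restore the original setting before doing anything else.
Encoding.default_external = original_external

# Matching each card string directly never goes through inspect.
found = table_cards.find { |card| card =~ /\A\|8\D\|\z/ }

puts truncated   # prints |8\u
puts found       # prints |8♥| on a UTF-8 console
```

Matching each card on its own, as in the last step, lets the regex see the real "♥" character whatever the locale is, although a non-UTF-8 console may still display it incorrectly.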

(The rest of this answer is Unix-specific; if you're on Windows, you can ignore it.)

To show your system locale settings, try this in your shell:

echo $LC_ALL
echo $LC_CTYPE

If they are not set to a value containing "UTF-8" (such as en_US.UTF-8), try this in your shell:

export LC_ALL=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8

Then run your code, and be sure to use the same shell.
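
To check from inside Ruby which encodings were picked up from the environment, a small sketch like the following may help. It only reads standard settings; the variable names locale_vars and external are illustrative.

```ruby
# encoding: utf-8

# Collect the locale variables Ruby can see (nil means the variable is unset).
locale_vars = %w[LC_ALL LC_CTYPE LANG].map { |name| [name, ENV[name]] }.to_h

# The encoding Ruby assumes for I/O, normally derived from the locale.
external = Encoding.default_external

locale_vars.each { |name, value| puts "#{name}=#{value.inspect}" }
puts "default_external: #{external}"
puts "source encoding:  #{__ENCODING__}"
```

If default_external reports something other than UTF-8, the locale settings are a likely place to look.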

If this works and you want to make it permanent, one way is to add them here:

# /etc/environment
LC_ALL=en_US.UTF-8
LC_CTYPE=en_US.UTF-8

Then source that file from your .bashrc, .zshrc, or whatever shell startup file you are using.
