Invalid byte sequence in UTF-8 Ruby

Question

I have a line like this:

"abce\xC3".sub("a","A")

When I execute the line, I get the following error:

ArgumentError: invalid byte sequence in UTF-8
    from (irb):20:in `sub'
    from (irb):20
    from /home/vijay/.rvm/rubies/ruby-2.0.0-p598/bin/irb:12:in `<main>'

Can someone help me solve this problem?

+3

ruby character-encoding ruby-2.0

Vijay 07 jul. 15 at 15:20

3 answers

You need to figure out what \xC3 is supposed to be. Does it represent the character Ã?

You see the error because \xC3 on its own is not a valid byte sequence in the (default) UTF-8 encoding. You can fix the encoding of the String first (which depends on the answer to the question above) and then do the replacement.

"abce\xC3".force_encoding("iso-8859-1").sub('a', 'A')

Or, if the encoding doesn't matter, say because you are processing a sequence of bytes rather than a sequence of characters, you can force the encoding to ASCII-8BIT.

"abce\xC3".force_encoding("ASCII-8BIT").sub('a', 'A')

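To make the two options above concrete, here is a minimal sketch. It assumes the stray byte really is ISO-8859-1 text in the first case, which only you can confirm for your data. The variable names are illustrative. `String#valid_encoding?` tells you up front whether `sub` would raise.

```ruby
# Sketch: check validity first, then pick an interpretation of the bytes.
raw = "abce\xC3"        # tagged UTF-8 by default, but the last byte is incomplete

raw.valid_encoding?     # false, so sub would raise ArgumentError

# Interpretation 1 (assumption: the data is really ISO-8859-1 text).
# dup avoids mutating raw; encode then transcodes the bytes to real UTF-8.
latin1_as_utf8 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
result_text = latin1_as_utf8.sub("a", "A")

# Interpretation 2: treat the data as opaque bytes.
# String#b returns a copy tagged ASCII-8BIT.
result_bytes = raw.b.sub("a", "A")
```

With the first interpretation the result is valid UTF-8 and the last character is Ã. With the second the result stays a 5-byte binary string.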
+5

Arie xiao 07 jul. 15 at 15:28

Regarding your comment / actual problem:

"ITZVÃ" is the content of the file. When I read the file with

 z = File.open("x")
 z.read(5)

the output should be ITZV\xC3\x83, but instead I get ITZV\xC3

This is because in UTF-8, Ã is a multibyte character, i.e. your string is 5 characters, but 6 bytes:

"ITZVÃ".chars #=> ["I", "T", "Z", "V", "Ã"]
"ITZVÃ".bytes #=> [ 73,  84,  90,  86, 195, 131]

z.read(5) reads 5 bytes from your file, thus returning an incomplete UTF-8 string:

require 'tempfile'

z = Tempfile.new('foo')
z << 'ITZVÃ'

z.rewind
z.read(5) #=> "ITZV\xC3"

You should read 6 bytes instead:

z.rewind
z.read(6) #=> "ITZV\xC3\x83"

Note that read, when called with a length, always returns ASCII-8BIT encoded strings. You have to set a different encoding manually:

z.rewind
z.read(6).force_encoding('utf-8') #=> "ITZVÃ"
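
If you want "the first 5 characters" rather than a byte count, one alternative is to open the file with an explicit encoding and read it character by character. This is only a sketch: the temp file stands in for your real file, and `tmp` and `first_five` are illustrative names.

```ruby
require 'tempfile'

# Hypothetical setup: a temp file holding "ITZVÃ" (5 characters, 6 bytes).
tmp = Tempfile.new('foo')
tmp.binmode                  # write the bytes as-is, no transcoding
tmp.write("ITZV\u00C3")      # \u00C3 is Ã
tmp.close

# Open with an explicit external encoding and read characters, not bytes.
first_five = File.open(tmp.path, 'r:UTF-8') do |z|
  z.each_char.first(5).join
end

first_five.encoding          # UTF-8
first_five.valid_encoding?   # true
first_five.bytesize          # 6

tmp.unlink                   # remove the temp file
```

Because `each_char` decodes using the file's external encoding, a multibyte character is never cut in half, and no `force_encoding` call is needed afterwards.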

+2

Stefan 08 jul. 15 at 7:37

shivam · Accepted Answer · 2015-07-07T15:38:54+0000

As Arie has already answered, the error occurs because \xC3 is an invalid byte sequence.

If you are using Ruby 2.1+ you can also use String#scrub to replace invalid bytes with a given replacement character (U+FFFD by default for UTF-8 strings). Here:

a = "abce\xC3"
# => "abce\xC3" 
a.scrub
# => "abce "
a.scrub.sub("a","A")
# => "Abce "
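
String#scrub also accepts an explicit replacement string or a block, which helps if you would rather drop the bad bytes or make them visible. A short sketch, with illustrative variable names:

```ruby
a = "abce\xC3"

# Replace each invalid sequence with a string of your choice.
with_mark = a.scrub("?")     # => "abce?"

# Drop invalid bytes entirely.
dropped = a.scrub("")        # => "abce"

# Block form: receives the invalid bytes and returns their replacement.
as_hex = a.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }  # => "abce<c3>"
```

All three results are valid UTF-8, so `sub` and other string methods work on them without raising.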
