Invalid byte sequence in UTF-8 Ruby

I have a line like this: "abce\xC3".sub("a","A"). When I execute it, I get the following error:

ArgumentError: invalid byte sequence in UTF-8
    from (irb):20:in `sub'
    from (irb):20
    from /home/vijay/.rvm/rubies/ruby-2.0.0-p598/bin/irb:12:in `<main>'

      

Can someone help me solve this problem?



3 answers


As Arie has already answered, the error occurs because \xC3 on its own is an invalid byte sequence in UTF-8 (it is a lead byte with no continuation byte after it).

If you are using Ruby 2.1+, you can also use String#scrub to replace invalid bytes with a given replacement character. Here:



a = "abce\xC3"
# => "abce\xC3" 
a.scrub
# => "abce "
a.scrub.sub("a","A")
# => "Abce "
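If the default replacement character is not what you want, scrub also accepts a replacement string or a block. A minimal sketch, assuming Ruby 2.1+; the helper name safe_sub is made up for illustration and is not part of Ruby:

```ruby
# Hypothetical helper: scrub only when the string is actually broken,
# then run the substitution.
def safe_sub(str, pattern, replacement)
  str = str.scrub('?') unless str.valid_encoding?
  str.sub(pattern, replacement)
end

a = "abce\xC3"
a.valid_encoding?      # => false
a.scrub('?')           # => "abce?"
# The block receives the invalid bytes, here rendered as hex.
a.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }  # => "abce<c3>"
safe_sub(a, 'a', 'A')  # => "Abce?"
```

Keep in mind that scrubbing throws the bad byte away, so it only fits when that byte carries no information you need.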

      



You need to figure out what \xC3 is supposed to be. Does it represent the character Ã?

You see the error because \xC3 is not a valid byte sequence in the (default) UTF-8 encoding. You can fix the encoding of the String first (which requires answering the question above) and then do the replacement.

"abce\xC3".force_encoding("iso-8859-1").sub('a', 'A')
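Note that force_encoding only relabels the bytes; it does not convert them. If the rest of the program expects UTF-8, one possible follow-up is to transcode with encode. This is a sketch that assumes the data really is ISO-8859-1; the helper name latin1_to_utf8 is hypothetical:

```ruby
# Hypothetical helper: treat the raw bytes as ISO-8859-1, then transcode to UTF-8.
def latin1_to_utf8(str)
  # dup so the caller's string is not relabelled in place
  str.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

utf8 = latin1_to_utf8("abce\xC3")
utf8.bytes            # => [97, 98, 99, 101, 195, 131]
utf8.valid_encoding?  # => true
utf8.sub('a', 'A')    # => "AbceÃ"
```

The single byte 0xC3 becomes the two-byte UTF-8 sequence 0xC3 0x83, which is how Ã is stored in UTF-8.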

      



Or, if the encoding doesn't matter, say because you are processing a sequence of bytes and not a sequence of characters, you can force the encoding to ASCII-8BIT.

"abce\xC3".force_encoding("ASCII-8BIT").sub('a', 'A')
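With a binary string the bytes pass through untouched and the result keeps its ASCII-8BIT label. As a small sketch, String#b (available since Ruby 2.0) returns a binary copy and leaves the receiver alone, which may be preferable to relabelling in place; the variable names are only for illustration:

```ruby
original = "abce\xC3"
binary   = original.b            # copy tagged ASCII-8BIT
result   = binary.sub('a', 'A')  # works: every byte is valid in binary

result.bytes                              # => [65, 98, 99, 101, 195]
result.encoding == Encoding::ASCII_8BIT   # => true
original.encoding == Encoding::UTF_8      # => true, the original is unchanged
```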

      



Regarding your comment / actual problem:

"ITZVÃ" is the content of the file. When I read the file:

 z = File.open("x")
 z.read(5)

      

Then the output should be ITZV\xC3\x83, but instead I get ITZV\xC3.

This is because in UTF-8, Ã is a multibyte character, i.e. your string is 5 characters but 6 bytes:

"ITZVÃ".chars #=> ["I", "T", "Z", "V", "Ã"]
"ITZVÃ".bytes #=> [ 73,  84,  90,  86, 195, 131]
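The same mismatch can be checked without a file. In this sketch byteslice stands in for a byte-oriented read, and the Ã is written as the escape \u00C3 so the example does not depend on the source file's encoding:

```ruby
s = "ITZV\u00C3"   # "ITZVÃ"

s.length     # => 5 (characters)
s.bytesize   # => 6 (bytes)

# Cutting after 5 bytes splits the two-byte Ã in half.
s.byteslice(0, 5).valid_encoding?  # => false
s.byteslice(0, 6).valid_encoding?  # => true
```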

      

z.read(5) reads 5 bytes from your file, thus returning an incomplete UTF-8 string:

require 'tempfile'

z = Tempfile.new('foo')
z << 'ITZVÃ'

z.rewind
z.read(5) #=> "ITZV\xC3"

      

You should read 6 bytes instead:

z.rewind
z.read(6) #=> "ITZV\xC3\x83"

      

Note that read, when called with a length, always returns ASCII-8BIT encoded strings. You have to set a different encoding manually:

z.rewind
z.read(6).force_encoding('utf-8') #=> "ITZVÃ"
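If what you actually want is the first 5 characters rather than a fixed number of bytes, one option is to open the file with an explicit external encoding and read character by character. This is a sketch, not the only way to do it; the helper name read_chars is hypothetical, and Tempfile.create needs Ruby 2.1+:

```ruby
require 'tempfile'

# Hypothetical helper: read the first n characters (not bytes) of a UTF-8 file.
def read_chars(path, n)
  File.open(path, 'r:UTF-8') do |io|
    # each_char decodes using the external encoding; first(n) stops after n characters
    io.each_char.first(n).join
  end
end

Tempfile.create('foo') do |z|
  z.binmode
  z.write("ITZV\u00C3")   # "ITZVÃ", 6 bytes on disk
  z.flush
  read_chars(z.path, 5)   # => "ITZVÃ"
end
```

Because the reading is done per character, you no longer need to know in advance how many bytes each character occupies.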

      
