Invalid byte sequence in UTF-8 Ruby
I have a line like this:
"abce\xC3".sub("a","A")
When I execute it, I get the following error:
ArgumentError: invalid byte sequence in UTF-8
from (irb):20:in `sub'
from (irb):20
from /home/vijay/.rvm/rubies/ruby-2.0.0-p598/bin/irb:12:in `<main>'
Can someone help me solve this problem?
As Arie has already answered, the error occurs because \xC3 on its own is an invalid byte sequence in UTF-8.
If you are using Ruby 2.1+ you can also use String#scrub to replace invalid bytes with a given replacement character (by default U+FFFD, the Unicode replacement character). Here:
a = "abce\xC3"
# => "abce\xC3"
a.scrub
# => "abce "
a.scrub.sub("a","A")
# => "Abce "
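To expand on the scrub example, here is a minimal sketch that checks validity first and passes an explicit replacement. The helper name sanitize_utf8 and the "?" replacement are my own assumptions for illustration, not part of Ruby; only valid_encoding? and scrub are real String methods.

```ruby
# Hypothetical helper (not part of Ruby): returns the string unchanged when it
# is already valid, otherwise a copy with invalid bytes replaced.
def sanitize_utf8(str, replacement = "?")
  return str if str.valid_encoding?
  str.scrub(replacement)
end

broken = "abce\xC3"
broken.valid_encoding?             # false, \xC3 is an incomplete UTF-8 sequence

clean = sanitize_utf8(broken)      # "abce?"
result = clean.sub("a", "A")       # "Abce?"

# scrub also accepts a block that receives each run of invalid bytes,
# which helps when you want to see what was dropped.
hex = broken.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }   # "abce<c3>"
```

Note that scrub discards information: once the byte is replaced, you can no longer tell what it was meant to be.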
You need to figure out what \xC3 is supposed to be. Does it represent the character Ã?
You see the error because \xC3 is not a valid byte sequence in the (default) UTF-8 encoding. You can fix the encoding of the string first (depending on the answer to the question above) and then do the replacement.
"abce\xC3".force_encoding("iso-8859-1").sub('a', 'A')
Or, if the encoding doesn't matter, say you are processing a sequence of bytes and not a sequence of characters, you can force the encoding to ASCII-8BIT.
"abce\xC3".force_encoding("ASCII-8BIT").sub('a', 'A')
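One point worth spelling out: force_encoding only relabels the existing bytes, it does not convert them. If you want a valid UTF-8 string afterwards, you would transcode with encode. The sketch below assumes the data really is ISO-8859-1; that is an assumption about your input, not something the error message tells you.

```ruby
raw = "abce\xC3"

# Assumption: the bytes are actually ISO-8859-1 (Latin-1) text.
# dup first, because force_encoding modifies the receiver in place.
latin1 = raw.dup.force_encoding("ISO-8859-1")   # relabel only, bytes unchanged
utf8   = latin1.encode("UTF-8")                 # transcode: \xC3 becomes \xC3\x83

latin1.bytes            # [97, 98, 99, 101, 195]
utf8.bytes              # [97, 98, 99, 101, 195, 131]
utf8.valid_encoding?    # true
replaced = utf8.sub("a", "A")   # "AbceÃ"
```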
Regarding your comment / actual problem:
"ITZVÃ" is the content of the file. When I read the file with
z = File.open("x"); z.read(5)
the output should be ITZV\xC3\x83, but instead I get ITZV\xC3
This is because in UTF-8, Ã is a multibyte character, i.e. your string is 5 characters, but 6 bytes:
"ITZVÃ".chars #=> ["I", "T", "Z", "V", "Ã"]
"ITZVÃ".bytes #=> [ 73, 84, 90, 86, 195, 131]
z.read(5) reads 5 bytes from your file, thus returning an incomplete UTF-8 string:
require 'tempfile'
z = Tempfile.new('foo')
z << 'ITZVÃ'
z.rewind
z.read(5) #=> "ITZV\xC3"
You should read 6 bytes instead:
z.rewind
z.read(6) #=> "ITZV\xC3\x83"
Note that read, when called with a length, always returns ASCII-8BIT encoded strings. You have to set a different encoding manually:
z.rewind
z.read(6).force_encoding('utf-8') #=> "ITZVÃ"
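If you do not know the byte count in advance, one alternative is to read by characters instead of bytes. This is only a sketch: it assumes the file is UTF-8 encoded and reopens the temp file by path with an explicit 'r:UTF-8' mode, which is my own choice and not taken from the question.

```ruby
require 'tempfile'

z = Tempfile.new('foo')
z << "ITZV\u00C3"   # "ITZVÃ": 5 characters, 6 bytes
z.flush             # make sure the bytes are on disk before reopening

# Assumption: the file content is UTF-8. each_char yields whole characters,
# so a multibyte character is never cut in half.
first_five = File.open(z.path, 'r:UTF-8') do |f|
  f.each_char.first(5).join
end

first_five                   # "ITZVÃ"
first_five.bytesize          # 6
first_five.valid_encoding?   # true

z.close
z.unlink
```

Reading by characters is slower than a single read(n), so for large files it may be preferable to read bytes and handle a trailing partial character yourself.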