How can I check if a string contains Latin characters like é in Ruby?
Given:
str1 = "é" # Latin accent
str2 = "囧" # Chinese character
str3 = "ジ" # Japanese character
str4 = "e" # English character
How can I distinguish str1 (accented Latin characters) from the other strings?
Update:
Considering
str1 = "\xE9" # Latin accent é, actually stored as the byte \xE9 when read from a file
How will the answer differ?
I would first strip all plain ASCII characters with gsub, and then check with a regex whether any Latin characters are left. This should detect accented Latin characters.
def latin_accented?(str)
  # Remove plain ASCII first, then look for any remaining Latin-script character.
  str.gsub(/\p{Ascii}/, "") =~ /\p{Latin}/
end

latin_accented?("é")  #=> 0 (truthy)
latin_accented?("囧") #=> nil (falsy)
latin_accented?("ジ") #=> nil (falsy)
latin_accented?("e")  #=> nil (falsy)
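For the "\xE9" case from the question's update, the same check can still be used once the bytes are transcoded. Below is a minimal sketch, assuming the file is ISO-8859-1 (Latin-1) encoded, which the question does not confirm; `latin1_to_utf8` is a hypothetical helper name used only for illustration.

```ruby
# Hypothetical helper: treat the raw bytes as ISO-8859-1 (an assumption
# about the file) and transcode them to UTF-8.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Same check as above, repeated so the sketch is self-contained.
def latin_accented?(str)
  str.gsub(/\p{Ascii}/, "") =~ /\p{Latin}/
end

raw  = "\xE9".b               # the single byte 0xE9, as read in binary mode
utf8 = latin1_to_utf8(raw)    # "\u00E9", i.e. é in UTF-8
puts latin_accented?(utf8) ? "accented Latin" : "no accented Latin"
```

Without the transcoding step, a lone "\xE9" byte is not valid UTF-8, so `valid_encoding?` returns false for such a string. If the file uses a different encoding, the `force_encoding` argument would need to change accordingly.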
I would use a two step approach:
- Rule out strings containing non-Latin characters by attempting to encode the string as Latin-1 (ISO-8859-1).
- Test for accented characters with a regular expression.
Example:
def is_accented_latin?(test_string)
  test_string.encode("ISO-8859-1") # just to see if it raises an exception
  test_string.match(/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿ]/)
rescue Encoding::UndefinedConversionError
  false
end
I highly recommend that you choose the accented characters you want to screen for yourself rather than copying my list; I may have missed a few. Also note that this always returns false for strings containing non-Latin characters, even if the string also contains an accented Latin character.
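The caveat about mixed strings can be seen in a short usage sketch. The method is repeated from above so the snippet runs on its own; the sample strings are illustrations only and are not part of the original answer.

```ruby
def is_accented_latin?(test_string)
  test_string.encode("ISO-8859-1") # raises if a character is outside Latin-1
  test_string.match(/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿ]/)
rescue Encoding::UndefinedConversionError
  false
end

# Illustrative inputs: accented Latin, plain ASCII, non-Latin, and a mix.
samples = ["é", "e", "囧", "é囧"]
samples.each do |s|
  verdict = is_accented_latin?(s) ? "accented Latin" : "rejected"
  puts "#{s}: #{verdict}"
end
```

Only the first sample is accepted. The mixed string "é囧" is rejected because the Latin-1 encoding attempt fails before the regular expression is ever tried.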