Detecting damaged characters in a UTF-8 encoded text file
I have a text file that was edited with the wrong character encoding, so some lines contain mojibake and corrupt characters when I open it as UTF-8. Which scripting language would be most effective for detecting these corrupted symbols? Perl is not an option. I am basically trying to scan the file with a script and output the line number, and possibly the offset, of each damaged character. How should I do it? I was thinking about using AWK, but I don't know what regex to use to find corrupted characters. If someone could point me in the right direction, that would be great.
More complete input:
I want the script to tell me the number of the line that contains the damaged characters, which would be the fifth line in the above example. The text file also contains several languages: English, Chinese, French, Spanish, Russian, Portuguese, Turkish, French_Euro, German, Dutch, Flemish, Korean, and Portuguese_Moz. It also has some special characters such as #, ! and ***.
I used this if statement to get the above output:
if ($1 ~ /[^\x00-\x7F]/) {
    print NR ":", $0 > "output.txt";
    count++;
}
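This prints every line that contains a character outside the ASCII range, so correctly encoded non-English lines are listed along with the damaged ones: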
$ awk '/[^\x00-\x7F]/{ print NR ":", $0 }' file
1: Interruptor EC não está em DESLOCAR
4: 辅助驾驶室门关é—
5: Porte cab. aux. fermée
7: Ð"верь аппаратной камеры закрыта
13: 高压ä¿æŠ¤æ‰‹æŸ„å‘下
14: Barrière descendue
16: Огранич. Планка ВВК опущ.
19: Barra de separação descida
22: DP未å¯åŠ¨
23: Puiss. rép. non activée
25: !!! ВнешнÑÑ Ð¼Ð¾Ñ‰Ð½Ð¾ÑÑ‚ÑŒ не включена
26: Potência Dist Não Ativada
28: Potência dist não activada
31: 机车未移动
33: Motor no se está moviendo
34: Локомотив неподвижен
35: Auto Não se Movendo
37: A não se move
40: 机车状况å…许自动åœæœº
41: Conditions auto\npermettent arrêt auto
43: УÑтановки локомотива\nПредуÑматривают Ð °Ð²Ñ‚оматичеÑкую оÑтановку
44: Condições da moto\nPermitem Auto Parada
Is that good enough? If not, edit your question to display a more verbose sample of input, including cases for which the above doesn't work.
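If listing every non-ASCII line is too broad, a narrower heuristic might help. The following is only a rough sketch in Python rather than AWK, and it assumes the damage came from UTF-8 bytes being read as Windows-1252 (or Latin-1) and saved again. Under that assumption, a damaged character tends to show up as a character whose legacy byte value is a UTF-8 lead byte, followed by one whose byte value is a continuation byte. The names `as_legacy_byte`, `find_suspects` and `scan` are made up for this example. It can miss damage where bytes were dropped, and it can occasionally flag valid text.

```python
import sys


def as_legacy_byte(ch):
    """Return the byte ch maps to in cp1252 (Latin-1 as a fallback), else None."""
    try:
        return ch.encode("cp1252")[0]
    except UnicodeEncodeError:
        # cp1252 leaves a few bytes undefined; fall back to the Latin-1 value.
        return ord(ch) if ord(ch) < 0x100 else None


def find_suspects(line):
    """Return 1-based character offsets where a mojibake pair seems to start."""
    offsets = []
    i = 0
    while i < len(line) - 1:
        lead = as_legacy_byte(line[i])
        cont = as_legacy_byte(line[i + 1])
        # Heuristic: UTF-8 lead byte (0xC2-0xF4) followed by a
        # continuation byte (0x80-0xBF), both seen as separate characters.
        if (lead is not None and cont is not None
                and 0xC2 <= lead <= 0xF4 and 0x80 <= cont <= 0xBF):
            offsets.append(i + 1)
            i += 2
        else:
            i += 1
    return offsets


def scan(path):
    """Yield (line_number, offsets, text) for each suspicious line in path."""
    # errors="replace" keeps the scan going if the file has invalid bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            offsets = find_suspects(text)
            if offsets:
                yield number, offsets, text


if __name__ == "__main__" and len(sys.argv) > 1:
    for number, offsets, text in scan(sys.argv[1]):
        print(f"{number}:{','.join(map(str, offsets))}: {text}")
```

Saved under a name of your choosing and run with the file as its only argument, this prints the line number, the character offsets of the suspected pairs, and the line itself. A line such as "Porte cab. aux. fermée" is not reported, because the "é" is not followed by a character that maps to a continuation byte.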