Detecting damaged characters in a UTF-8 encoded text file

I have a text file that was edited with the wrong character encoding, so some lines show mojibake and corrupted characters when I open it as UTF-8. Which scripting language would be most effective for detecting these corrupted symbols? Perl is not an option. I am trying to scan the text file with a script and output the line number, and possibly the offset, where each damaged character is found. How should I do it? I was thinking about using AWK, but I don't know what regex to use to find corrupted characters. If someone could point me in the right direction, that would be great.

More complete input:

I want the script to report the line number of each damaged character, which would be the fifth line in the above example. The text file also mixes several languages: English, Chinese, French, Spanish, Russian, Portuguese, Turkish, French_Euro, German, Dutch, Flemish, Korean, and Portuguese_Moz. It also contains some special characters such as #, ! and ***.

I used this if statement to get the above output:

if ($1 ~ /[^\x00-\x7F]/) {
    print NR ":", $0 > "output.txt";
    count++;
}

      


1 answer


This prints every line that contains at least one character outside the ASCII range, prefixed with its line number:

$ awk '/[^\x00-\x7F]/{ print NR ":", $0 }' file
1: Interruptor EC não está em DESLOCAR
4: 辅助驾驶室门关闭
5: Porte cab. aux. fermée
7: Ð"верь аппаратной камеры закрыта
13: 高压ä¿æŠ¤æ‰‹æŸ„å‘下
14: Barrière descendue
16: Огранич. Планка ВВК опущ.
19: Barra de separação descida
22: DP未å¯åŠ¨
23: Puiss. rép. non activée
25: !!! ВнешнÑÑ Ð¼Ð¾Ñ‰Ð½Ð¾ÑÑ‚ÑŒ не включена
26: Potência Dist Não Ativada
28: Potência dist não activada
31: 机车未移动
33: Motor no se está moviendo
34: Локомотив неподвижен
35: Auto Não se Movendo
37: A não se move
40: 机车状况å…许自动åœæœº
41: Conditions auto\npermettent arrêt auto
43: УÑтановки локомотива\nПредуÑматривают Ð     °Ð²Ñ‚оматичеÑкую оÑтановку
44: Condições da moto\nPermitem Auto Parada
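
If the offset within the line is wanted as well, the same idea can be extended with awk's match(), which sets RSTART to the position of the first match. The following is only a minimal sketch. The function name first_non_ascii is made up for illustration, and the pattern [^\t -~] (anything other than tab and printable ASCII) is swapped in for the \x escapes above because hex escapes are not required by POSIX awk. Running under LC_ALL=C should make awk work on bytes, so the reported offset is a 1-based byte offset, not a character count.

```shell
# Hypothetical helper (name chosen for illustration only).
# Prints "line:offset: text" for the first byte on each line that is
# neither a tab nor printable ASCII.
# Usage: first_non_ascii file    (reads stdin when no file is given)
first_non_ascii() {
    LC_ALL=C awk '
        # match() sets RSTART to the 1-based position of the first match;
        # in the C locale that position is counted in bytes.
        match($0, /[^\t -~]/) {
            print NR ":" RSTART ": " $0
        }
    ' "$@"
}
```

Like the one-liner above, this flags every non-ASCII character, legitimate or damaged. It also flags stray control characters such as a carriage return.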

      



Is that good enough? If not, edit your question to show a more complete sample of the input, including cases for which the above doesn't work.
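
If flagging every non-ASCII line turns out to be too broad, one possible narrower heuristic (an assumption about how the file was damaged, not something confirmed by the question) is to look for the byte pattern left behind when UTF-8 text is decoded as Latin-1 or Windows-1252 and then saved as UTF-8 again: a character in U+00C0..U+00FF immediately followed by one in U+0080..U+00BF, as in "Ã©" or "Ð¼". The function name likely_mojibake below is hypothetical, and the sketch uses octal escapes so that it does not depend on hex escape support.

```shell
# Hypothetical helper (name chosen for illustration only).
# Flags lines containing a pair of characters typical of double-encoded UTF-8.
# Usage: likely_mojibake file    (reads stdin when no file is given)
likely_mojibake() {
    LC_ALL=C awk '
        # \303[\200-\277] is the UTF-8 encoding of U+00C0..U+00FF (e.g. Ã, Ð, ä)
        # \302[\200-\277] is the UTF-8 encoding of U+0080..U+00BF (e.g. ©, ¼, ¿)
        match($0, /\303[\200-\277]\302[\200-\277]/) {
            print NR ":" RSTART ": " $0
        }
    ' "$@"
}
```

This is a heuristic only. It will miss damage where the second character became Windows-1252 punctuation (such as … or œ) or was dropped entirely, and it could in principle flag legitimate text that happens to contain such a pair.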
