Detecting damaged characters in a UTF-8 encoded text file

Question

I have a text file that was edited with the wrong character encoding, so some lines contain mojibake and corrupt characters when I open it as UTF-8. Which scripting language would be most effective for detecting these corrupted symbols? Perl is not an option. I am basically trying to scan the text file with a script and output the line number, and possibly the offset, of each damaged character. How should I do it? I was thinking about using awk, but I don't know what regex to use to find corrupted characters. If someone could point me in the right direction, that would be great.

More complete input:

I want the script to tell me the number of each line that contains damaged characters, which would be the fifth line in the above example. The text file also contains several languages: English, Chinese, French, Spanish, Russian, Portuguese, Turkish, French_Euro, German, Dutch, Flemish, Korean, and Portuguese_Moz. It also has some special characters such as #, ! and ***.

I used this if statement to get the above output:

if ($1 ~ /[^\x00-\x7F]/) {
    print NR ":", $0 > "output.txt";
    count++;
}

scripting regex awk encoding utf-8

user2056389 09 June '15 at 17:30

1 answer

Ed Morton · Accepted Answer · 2015-06-10T18:52:59+0000

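This prints every line that contains at least one character outside the ASCII range (the \x hex escapes work in GNU awk; other awks may not support them):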

$ awk '/[^\x00-\x7F]/{ print NR ":", $0 }' file
1: Interruptor EC nÃ£o estÃ¡ em DESLOCAR
4: è¾…åŠ©é©¾é©¶å®¤é—¨å…³é—
5: Porte cab. aux. fermÃ©e
7: Ð"Ð²ÐµÑ€ÑŒ Ð°Ð¿Ð¿Ð°Ñ€Ð°Ñ‚Ð½Ð¾Ð¹ ÐºÐ°Ð¼ÐµÑ€Ñ‹ Ð·Ð°ÐºÑ€Ñ‹Ñ‚Ð°
13: é«˜åŽ‹ä¿æŠ¤æ‰‹æŸ„å‘ä¸‹
14: BarriÃ¨re descendue
16: ÐžÐ³Ñ€Ð°Ð½Ð¸Ñ‡. ÐŸÐ»Ð°Ð½ÐºÐ° Ð’Ð’Ðš Ð¾Ð¿ÑƒÑ‰.
19: Barra de separaÃ§Ã£o descida
22: DPæœªå¯åŠ¨
23: Puiss. rÃ©p. non activÃ©e
25: !!! Ð’Ð½ÐµÑˆÐ½ÑÑ Ð¼Ð¾Ñ‰Ð½Ð¾ÑÑ‚ÑŒ Ð½Ðµ Ð²ÐºÐ»ÑŽÑ‡ÐµÐ½Ð°
26: PotÃªncia Dist NÃ£o Ativada
28: PotÃªncia dist nÃ£o activada
31: æœºè½¦æœªç§»åŠ¨
33: Motor no se estÃ¡ moviendo
34: Ð›Ð¾ÐºÐ¾Ð¼Ð¾Ñ‚Ð¸Ð² Ð½ÐµÐ¿Ð¾Ð´Ð²Ð¸Ð¶ÐµÐ½
35: Auto NÃ£o se Movendo
37: A nÃ£o se move
40: æœºè½¦çŠ¶å†µå…è®¸è‡ªåŠ¨åœæœº
41: Conditions auto\npermettent arrÃªt auto
43: Ð£ÑÑ‚Ð°Ð½Ð¾Ð²ÐºÐ¸ Ð»Ð¾ÐºÐ¾Ð¼Ð¾Ñ‚Ð¸Ð²Ð°\nÐŸÑ€ÐµÐ´ÑƒÑÐ¼Ð°Ñ‚Ñ€Ð¸Ð²Ð°ÑŽÑ‚ Ð     °Ð²Ñ‚Ð¾Ð¼Ð°Ñ‚Ð¸Ñ‡ÐµÑÐºÑƒÑŽ Ð¾ÑÑ‚Ð°Ð½Ð¾Ð²ÐºÑƒ
44: CondiÃ§Ãµes da moto\nPermitem Auto Parada

Is that good enough? If not, edit your question to show a more complete sample of input, including cases where the awk command above doesn't work.
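Because the file legitimately contains Chinese, Russian, and accented Latin text, a plain non-ASCII test flags correct lines as well as damaged ones. One possible heuristic, sketched here in Python and not taken from the original answer, is to check whether a run of non-ASCII characters can be encoded in a single-byte codec and then decoded as valid UTF-8; if so, it was probably UTF-8 text misread in that codec. The choice of cp1252 as the wrong codec is an assumption about how the file was damaged, and the name suspected_mojibake is hypothetical.

```python
import re

NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")


def suspected_mojibake(line, wrong_codec="cp1252"):
    """Return (column, garbled, repaired) for runs that look double-encoded.

    wrong_codec is an assumption: the single-byte encoding the UTF-8 bytes
    were presumably misread as. Columns are 1-based character positions.
    """
    hits = []
    for match in NON_ASCII_RUN.finditer(line):
        garbled = match.group()
        try:
            # Undo the suspected misreading: back to bytes, then read as UTF-8.
            repaired = garbled.encode(wrong_codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Either the run cannot be expressed in wrong_codec (for example
            # genuine Chinese or Cyrillic) or the bytes are not valid UTF-8
            # (for example a genuine lone accented letter): treat as intact.
            continue
        hits.append((match.start() + 1, garbled, repaired))
    return hits
```

This is only a heuristic. It misses damage where bytes were dropped or are undefined in the assumed codec, and it can misfire on genuine text that happens to contain such a character sequence, so the reported lines still need a manual look.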
