How do I remove all diacritics from a file?

Question


I have a file that contains many diacritical vowels. I need to make these replacements:

Replace ā, á, ǎ and à with a.
Replace ē, é, ě and è with e.
Replace ī, í, ǐ and ì with i.
Replace ō, ó, ǒ and ò with o.
Replace ū, ú, ǔ and ù with u.
Replace ǖ, ǘ, ǚ, and ǜ with ü.
Replace Ā, Á, Ǎ, and À with A.
Replace Ē, É, Ě and È with E.
Replace Ī, Í, Ǐ and Ì with I.
Replace Ō, Ó, Ǒ and Ò with O.
Replace Ū, Ú, Ǔ and Ù with U.
Replace Ǖ, Ǘ, Ǚ and Ǜ with Ü.

I know that I can replace them one at a time with this:

sed -i 's/ā/a/g' ./file.txt

Is there a better way to replace all of these?

+25

bash replace sed

Village Apr 18 12 at 10:17


8 answers

This might work for you:

sed -i 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/' file

+8

potong Apr 18 '12 at 13:30


I like iconv, as it handles all variations of accents:

cat non-ascii.txt | iconv -f utf8 -t ascii//TRANSLIT//IGNORE > ascii.txt

+6

Fedir RYKHTIK 02 Sep At 15:56


This can be done with the tr(1) command. For example:

tr 'āáǎàēéěèīíǐì...' 'aaaaeeeeiii...' <infile >outfile

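You may need to check or change the LANG environment variable according to the character set used. This also assumes a tr implementation that handles multibyte characters; GNU tr works on single bytes, so it does not translate UTF-8 input correctly.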

+2

ktf Apr 18 12 at 10:27


You can use something like this:

  sed -e 's/[àâ]/a/g;s/[ọõ]/o/g;s/[íì]/i/g;s/[êệ]/e/g'

Just add more characters inside the [..] as needed. Do not separate them with commas, or the commas themselves will be replaced too.
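As a sketch of how the full table from the question could be written in this style: the version below uses alternation (ā|á|ǎ|à) instead of bracket expressions, because a bracket expression only treats ā as one character when sed runs in a multibyte (UTF-8) locale, while alternation of literal strings also works bytewise. The function name strip_tone_marks is made up for illustration, and the input is assumed to be UTF-8 with precomposed characters.

```shell
#!/bin/sh
# strip_tone_marks is a hypothetical helper name, not from the thread.
# It reads text on stdin and writes the converted text to stdout.
# Assumption: input is UTF-8 with precomposed (single code point) letters.
strip_tone_marks() {
    sed -E \
        -e 's/ā|á|ǎ|à/a/g' -e 's/Ā|Á|Ǎ|À/A/g' \
        -e 's/ē|é|ě|è/e/g' -e 's/Ē|É|Ě|È/E/g' \
        -e 's/ī|í|ǐ|ì/i/g' -e 's/Ī|Í|Ǐ|Ì/I/g' \
        -e 's/ō|ó|ǒ|ò/o/g' -e 's/Ō|Ó|Ǒ|Ò/O/g' \
        -e 's/ū|ú|ǔ|ù/u/g' -e 's/Ū|Ú|Ǔ|Ù/U/g' \
        -e 's/ǖ|ǘ|ǚ|ǜ/ü/g' -e 's/Ǖ|Ǘ|Ǚ|Ǜ/Ü/g'
}

# Small demonstration on a sample line.
printf '%s\n' 'Nǐ hǎo, lǜ sè' | strip_tone_marks   # prints: Ni hao, lü se
```

To change a file in place, as in the question, the same -e expressions can be passed to sed -i instead of being wrapped in a function.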

+2

hungnv Apr 18 12 at 10:36


You can use man iso_8859_1 (or your char set) or od -bc to identify the octal representation of the diacritic. Then use gawk to replace it:

{ gsub(/\344/,"a"); print $0 }

This replaces ä with a.
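A minimal sketch of that idea as a complete command, under two stated assumptions: the helper names are invented for illustration, and awk is run with LC_ALL=C so that the pattern is matched byte by byte. Octal 344 is ä only in ISO-8859-1; in a UTF-8 file the same letter is the two bytes 303 244, so the second helper shows the pattern that case would need.

```shell
#!/bin/sh
# Hypothetical helper names; the answer itself only gives the gsub() call.

# ISO-8859-1 input: ä is the single byte 0xE4 (octal 344).
latin1_strip_a_umlaut() {
    LC_ALL=C awk '{ gsub(/\344/, "a"); print }'
}

# UTF-8 input: ä is the two bytes 0xC3 0xA4 (octal 303 244).
utf8_strip_a_umlaut() {
    LC_ALL=C awk '{ gsub(/\303\244/, "a"); print }'
}

# printf octal escapes build the test bytes without depending on the
# encoding of this script.
printf 'M\344dchen\n'     | latin1_strip_a_umlaut   # prints: Madchen
printf 'M\303\244dchen\n' | utf8_strip_a_umlaut     # prints: Madchen
```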

+1

Rich traube 09 jul. 16 at 21:57


It might not work simply because the locale for your language is not installed.

Set LC_ALL to an installed locale, for example:

export LC_ALL=en_US.iso88591

Note that a complete list of locales is available via:

locale -a
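The two commands can be combined into a guard that only exports the locale when it is actually installed. This is a rough sketch: has_locale is a made-up helper name, and the comparison is a plain case-insensitive string match, so it assumes the name is spelled the way locale -a prints it on your system.

```shell
#!/bin/sh
# has_locale is a hypothetical helper: it succeeds if the given name
# appears as a whole line (ignoring case) in the output of `locale -a`.
has_locale() {
    locale -a 2>/dev/null | grep -qixF -- "$1"
}

wanted='en_US.iso88591'   # the locale name used in this answer
if has_locale "$wanted"; then
    export LC_ALL="$wanted"
else
    echo "locale $wanted is not installed; LC_ALL left unchanged" >&2
fi
```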

0

Bruno 02 dec. 13 at 16:23


If you, like me, only need to replace accents in some special places in your file, you can do so using this regex

echo '{"doNotReplaceKey":"bábögêjírù","replaceValueKey":"bábögêjírù","anotherNotReplaceKey":"bábögêjírù"}' \
    | sed -e ':a;s/replaceValueKey":"\([^"]*\)[áâàãä]/replaceValueKey":"\1a/g;ta' \
    | sed -e ':a;s/replaceValueKey":"\([^"]*\)[éêèë]/replaceValueKey":"\1e/g;ta'  \
    | sed -e ':a;s/replaceValueKey":"\([^"]*\)[íîìï]/replaceValueKey":"\1i/g;ta'  \
    | sed -e ':a;s/replaceValueKey":"\([^"]*\)[óôòõö]/replaceValueKey":"\1o/g;ta' \
    | sed -e ':a;s/replaceValueKey":"\([^"]*\)[úûùü]/replaceValueKey":"\1u/g;ta'

Output

{"doNotReplaceKey":"bábögêjírù","replaceValueKey":"babogejiru","anotherNotReplaceKey":"bábögêjírù"}

0

Thiago mata June 29. 16:05


Kent · Accepted Answer · 2012-04-18T10:35:36+0000

If you check the man page of the tool iconv:

//TRANSLIT
When the string "//TRANSLIT" is appended to to-code, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated by one or more similar-looking characters.

So we can do:

kent$  cat test1
    Replace ā, á, ǎ, and à with a.
    Replace ē, é, ě, and è with e.
    Replace ī, í, ǐ, and ì with i.
    Replace ō, ó, ǒ, and ò with o.
    Replace ū, ú, ǔ, and ù with u.
    Replace ǖ, ǘ, ǚ, and ǜ with ü.
    Replace Ā, Á, Ǎ, and À with A.
    Replace Ē, É, Ě, and È with E.
    Replace Ī, Í, Ǐ, and Ì with I.
    Replace Ō, Ó, Ǒ, and Ò with O.
    Replace Ū, Ú, Ǔ, and Ù with U.
    Replace Ǖ, Ǘ, Ǚ, and Ǜ with Ü.


kent$  iconv -f utf8 -t ascii//TRANSLIT test1
    Replace a, a, a, and a with a.
    Replace e, e, e, and e with e.
    Replace i, i, i, and i with i.
    Replace o, o, o, and o with o.
    Replace u, u, u, and u with u.
    Replace u, u, u, and u with u.
    Replace A, A, A, and A with A.
    Replace E, E, E, and E with E.
    Replace I, I, I, and I with I.
    Replace O, O, O, and O with O.
    Replace U, U, U, and U with U.
    Replace U, U, U, and U with U.
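Since iconv has no in-place option like sed -i, here is a minimal sketch of applying it to a file and writing the result back. The function name to_ascii_in_place is hypothetical, and the exact transliteration can depend on the iconv implementation and the current locale, so treat the output shown above as one possible result rather than a guarantee.

```shell
#!/bin/sh
# to_ascii_in_place is a hypothetical helper name, not part of iconv.
# It converts to a temporary file first, then copies the result back so
# the original file keeps its permissions and is untouched on failure.
to_ascii_in_place() {
    tmp=$(mktemp) || return 1
    if iconv -f UTF-8 -t ASCII//TRANSLIT "$1" > "$tmp"; then
        cat "$tmp" > "$1"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage (file.txt is the file name from the question):
#   to_ascii_in_place ./file.txt
```

Note that in the output above ǖ, ǘ, ǚ and ǜ come out as u, not ü as the question asked, because ü itself is not an ASCII character. If ü has to be preserved, an explicit sed mapping is the better fit.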
