Removing diacritics from Greek text automatically

I have a decompiled StarDict dictionary as a tab file:

κακός <tab> bad

where <tab> stands for a tab character.

Unfortunately, the way words are defined requires that all diacritics are included in the query. So if I want to search for ζῷον, I need all the iotas and circumflexes to be correct.

Thus, I would like to convert the entire file so that the keyword has its diacritics removed, while the original accented form is kept at the start of the definition. So the line will become

κακος <tab> <h3>κακός</h3> <br/> bad

      

I know that I could read the file line by line in bash as described here [1]:

# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line
do
    command
done < file

      

But how do I automate the string conversion itself? I heard about iconv [2] but was unable to achieve the desired transformation using it. My best bet is to use a bash script.


Also, is there an automatic way to transliterate Greek, e.g. using the Perseus method?

Perseus' way of doing it


Edit: Maybe we could use Unicode code points? Notice that U+1F0x, U+1F8x for x < 8, etc. are all variants of the letter α. This would reduce the amount of manual work involved. I would also accept a C++ solution.
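
The observation about the code point ranges can be checked mechanically. The following is a minimal sketch in Python (not one of the languages asked for above, chosen only because its standard `unicodedata` module exposes Unicode decompositions); the helper name `base_letter` is made up for this illustration.

```python
import unicodedata

def base_letter(ch):
    """Return the first code point of the canonical decomposition (NFD) of ch."""
    return unicodedata.normalize("NFD", ch)[0]

# The two ranges named above: U+1F0x and U+1F8x for x < 8.
ranges = [range(0x1F00, 0x1F08), range(0x1F80, 0x1F88)]

# Collect the base letter of every code point in those ranges.
bases = {base_letter(chr(cp)) for r in ranges for cp in r}
print(bases)  # {'α'}
```

The upper halves of the same rows (x ≥ 8) decompose to the capital Α instead, so the decomposition data already carries the grouping that would otherwise have to be typed by hand.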

[1] http://en.kioskea.net/faq/1757-how-to-read-a-file-line-by-line
[2] How to remove all diacritics from a file?


2 answers


You can easily remove diacritics from a string using Perl:

$_=NFKD($_);s/\p{InDiacriticals}//g;

      

For example:

$ echo 'ὦὢῶὼώὠὤ ᾪ' | perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g'
ωωωωωωω Ω

      



It works like this:

  • -CS enables UTF-8 on Perl's standard streams (stdin, stdout, stderr).
  • -MUnicode::Normalize loads the Unicode::Normalize module, which provides NFKD().
  • -e executes the script given on the command line; -p loops over the input lines and prints each one after the script has run. -n provides the same loop without printing, so it is redundant when -p is given.
  • NFKD() converts the line to Unicode Normalization Form KD; this means accents and other diacritics are decomposed into separate combining characters, making it easier to remove them in the next step.
  • s/\p{InDiacriticals}//g removes every character in the Unicode block Combining Diacritical Marks (U+0300 to U+036F).

This should work for removing diacritics in any script or language with good Unicode support, not just Greek.
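
To go from a single string to the tab file in the question, the same two steps (decompose, then drop combining marks) can be wrapped in a per-line conversion. This is a rough sketch in Python, not the Perl from this answer; the function names `strip_diacritics` and `convert_line` are invented here, and the output layout is taken from the example line in the question.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose with NFKD, then drop U+0300..U+036F (Combining Diacritical Marks)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not "\u0300" <= ch <= "\u036f")

def convert_line(line):
    """Turn 'headword<TAB>definition' into the searchable form from the question."""
    headword, sep, definition = line.rstrip("\n").partition("\t")
    if not sep:
        # No tab on this line: return it unchanged rather than guess.
        return line.rstrip("\n")
    return f"{strip_diacritics(headword)}\t<h3>{headword}</h3> <br/> {definition}"

print(convert_line("κακός\tbad"))
print(strip_diacritics("ζῷον"))
```

Applying `convert_line` to every line of the file read as UTF-8 would produce the converted dictionary. Only the headword is stripped, so the accented form survives inside the `<h3>` element.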


I am not as familiar with Ancient Greek as I am with Modern Greek (which really only uses two diacritics).

However, I went through the vowels and collected the forms in which each one is combined with diacritics. This gave me the following list:

ἆἂᾶὰάἀἄ 
ἒὲέἐἔ 
ἦἢῆὴήἠἤ 
ἶἲῖὶίἰἴ 
ὂὸόὀὄ 
ὖὒῦὺύὐὔ 
ὦὢῶὼώὠὤ  

      

I saved this list as a file (test.txt) and passed it to this sed command:

cat test.txt | sed -e 's/[ἆἂᾶὰάἀἄ]/α/g;s/[ἒὲέἐἔ]/ε/g;s/[ἦἢῆὴήἠἤ]/η/g;s/[ἶἲῖὶίἰἴ]/ι/g;s/[ὂὸόὀὄ]/ο/g;s/[ὖὒῦὺύὐὔ]/υ/g;s/[ὦὢῶὼώὠὤ]/ω/g'

      



Credit to hungnv

This is a simple sed command. It takes each of the accented characters in a bracket group and replaces it with the unaccented base letter. The result of the above command is:

ααααααα
εεεεε
ηηηηηηη
ιιιιιιι
οοοοο
υυυυυυυ
ωωωωωωω
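
The same table-driven replacement can be sketched in Python with `str.translate`, which may be easier to extend than one long sed expression. The groups below are copied from the list above; the names `GROUPS`, `TABLE` and `replace_listed` are only illustrative.

```python
# Accented forms from the list above, keyed by the unaccented vowel they map to.
GROUPS = {
    "α": "ἆἂᾶὰάἀἄ",
    "ε": "ἒὲέἐἔ",
    "η": "ἦἢῆὴήἠἤ",
    "ι": "ἶἲῖὶίἰἴ",
    "ο": "ὂὸόὀὄ",
    "υ": "ὖὒῦὺύὐὔ",
    "ω": "ὦὢῶὼώὠὤ",
}

# One translation table: every accented character points at its base vowel.
TABLE = str.maketrans(
    {accented: base for base, group in GROUPS.items() for accented in group}
)

def replace_listed(text):
    """Replace only the characters that appear in GROUPS; leave the rest alone."""
    return text.translate(TABLE)

print(replace_listed("ὦὢῶὼώὠὤ"))
```

Characters that are not in the table pass through unchanged. That includes forms missing from the list, such as the rough-breathing vowels and the ῷ in ζῷον, so the list would need to be extended before it covers a whole dictionary.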

      


Regarding Greek transliteration: the image from your post is intended to help the user enter Greek into the site you were using by means of similar glyphs, not always similar sounds. These are poor transliterations. For example, β is most often transliterated as v, ψ as ps, φ as ph, etc.
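
A sound-based transliteration is again just a lookup table, except that one Greek letter may map to several Latin letters. The sketch below, in Python, contains only the three correspondences mentioned above; it is deliberately incomplete, and the names `PARTIAL` and `transliterate_partial` are hypothetical.

```python
# Only the three correspondences named above; a real scheme needs the full alphabet.
PARTIAL = {"β": "v", "ψ": "ps", "φ": "ph"}

def transliterate_partial(text):
    """Replace the letters found in PARTIAL and leave all other characters untouched."""
    return "".join(PARTIAL.get(ch, ch) for ch in text)

print(transliterate_partial("βψφ"))  # vpsph
```

A dictionary lookup per character is used instead of `str.translate` only to keep the one-to-many mappings explicit; either would work here.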
