Removing diacritics from Greek text automatically
I have a decompiled stardict dictionary as a tab file
κακός <tab> bad
where <tab> stands for a tab character.
Unfortunately, the way words are defined requires that all diacritics are included in the query. So if I want to search for ζῷον, I need all the iotas and circumflexes to be correct.
Thus, I would like to convert the entire file so that the keyword has its diacritics removed, while the original spelling is kept as a heading in the definition. So the line will become
κακος <tab> <h3>κακός</h3> <br/> bad
I know that I could read the file line by line in bash as described here [1]:
while IFS= read -r line
do
    command
done < file
But how do I automate the string conversion itself? I heard about iconv [2] but was unable to achieve the desired transformation with it. My best bet is to use a bash script.
Also, is there an automatic way to transliterate Greek, e.g. using the Perseus method?
Edit: Maybe we could use Unicode code points? Notice that U+1F0x, U+1F8x for x < 8, etc. are all variants of the letter α. This would reduce the amount of manual work involved. I would also accept a C++ solution.
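The code point observation in the edit can be checked mechanically rather than by eye. The following is a minimal Python sketch, assuming the standard `unicodedata` module; the helper name `base_letter` is made up for illustration. It canonically decomposes each character in the two ranges mentioned and reports the first code point of the result.

```python
import unicodedata

def base_letter(ch: str) -> str:
    """Return the first code point of the canonical (NFD) decomposition of ch."""
    return unicodedata.normalize("NFD", ch)[0]

# The two ranges named in the question: U+1F00..U+1F07 and U+1F80..U+1F87.
for start in (0x1F00, 0x1F80):
    for cp in range(start, start + 8):
        ch = chr(cp)
        print(f"U+{cp:04X} {ch} -> {base_letter(ch)} ({unicodedata.name(ch)})")
```

Since the Unicode database already records which base letter each precomposed character belongs to, there is no need to enumerate the ranges by hand.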
[1] http://en.kioskea.net/faq/1757-how-to-read-a-file-line-by-line
[2] How to remove all diacritics from a file?
You can easily remove diacritics from a string using Perl:
$_=NFKD($_);s/\p{InDiacriticals}//g;
For example:
$ echo 'ὦὢῶὼώὠὤ ᾪ' | perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g'
ωωωωωωω Ω
It works like this:
- -CS enables UTF-8 on Perl's standard streams (stdin, stdout and stderr).
- -MUnicode::Normalize loads the library for Unicode normalization.
- -e executes the script given on the command line; -n loops over the input line by line; -p does the same and also prints each line after the script has run.
- NFKD() converts the line to Unicode Normalization Form KD (compatibility decomposition); this means accented characters are decomposed into a base letter followed by separate combining characters, making the accents easy to remove in the next step.
- s/\p{InDiacriticals}//g removes all characters that Unicode places in its Combining Diacritical Marks block.
This should work for removing diacritics in any script or language with good Unicode support, not just Greek.
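For readers who prefer something other than a Perl one-liner, the same two steps (decompose, then drop the combining marks) can be sketched in Python. This is an illustration under stated assumptions, not a drop-in replacement: it assumes `\p{InDiacriticals}` corresponds to the Combining Diacritical Marks block U+0300..U+036F, and `convert_line` is a hypothetical helper that builds the line layout asked for in the question.

```python
import unicodedata

# Assumption: this range is the Combining Diacritical Marks block that
# \p{InDiacriticals} matches in the Perl one-liner.
COMBINING_START, COMBINING_END = 0x0300, 0x036F

def strip_diacritics(text: str) -> str:
    """Decompose with NFKD, then drop code points in U+0300..U+036F."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        if not COMBINING_START <= ord(ch) <= COMBINING_END
    )

def convert_line(line: str) -> str:
    """Hypothetical helper: turn 'keyword<TAB>definition' into
    'stripped keyword<TAB><h3>keyword</h3> <br/> definition'."""
    line = line.rstrip("\n")
    keyword, tab, definition = line.partition("\t")
    if not tab:
        return line  # no tab found: leave the line unchanged
    return f"{strip_diacritics(keyword)}\t<h3>{keyword}</h3> <br/> {definition}"

print(strip_diacritics("ζῷον"))
print(convert_line("κακός\tbad"))
```

To convert the whole dictionary, one could open the tab file with `encoding="utf-8"`, apply `convert_line` to each line and write the results to a new file.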
I am not as familiar with Ancient Greek as I am with Modern Greek (which really only uses two diacritics).
However, I went through the vowels and collected the forms in which they are combined with diacritics. This gave me the following list:
ἆἂᾶὰάἀἄ ἒὲέἐἔ ἦἢῆὴήἠἤ ἶἲῖὶίἰἴ ὂὸόὀὄ ὖὒῦὺύὐὔ ὦὢῶὼώὠὤ
I saved this list as a file (test.txt) and passed it to this sed command:
cat test.txt | sed -e 's/[ἆἂᾶὰάἀἄ]/α/g;s/[ἒὲέἐἔ]/ε/g;s/[ἦἢῆὴήἠἤ]/η/g;s/[ἶἲῖὶίἰἴ]/ι/g;s/[ὂὸόὀὄ]/ο/g;s/[ὖὒῦὺύὐὔ]/υ/g;s/[ὦὢῶὼώὠὤ]/ω/g'
This is a simple sed substitution. It takes each of the listed accented characters and replaces it with the corresponding unaccented letter. The result of the above command is:
ααααααα εεεεε ηηηηηηη ιιιιιιι οοοοο υυυυυυυ ωωωωωωω
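The same lookup-table idea can be sketched in Python with `str.translate`, which may be easier to extend than a long sed expression. The names `GROUPS`, `TABLE` and `strip_listed` are made up for this illustration, and the character lists are copied from the sed command above. Both the table and the input are normalized to NFC first, an assumption made so that visually identical characters encoded differently (for example the oxia and tonos variants of ά) still match.

```python
import unicodedata

# Accented forms grouped by base letter, copied from the sed command.
GROUPS = {
    "α": "ἆἂᾶὰάἀἄ",
    "ε": "ἒὲέἐἔ",
    "η": "ἦἢῆὴήἠἤ",
    "ι": "ἶἲῖὶίἰἴ",
    "ο": "ὂὸόὀὄ",
    "υ": "ὖὒῦὺύὐὔ",
    "ω": "ὦὢῶὼώὠὤ",
}

# Map every listed character (in NFC form) to its base letter.
TABLE = str.maketrans({
    accented: base
    for base, chars in GROUPS.items()
    for accented in unicodedata.normalize("NFC", chars)
})

def strip_listed(text: str) -> str:
    """Replace only the characters listed in GROUPS; everything else is kept."""
    return unicodedata.normalize("NFC", text).translate(TABLE)

print(strip_listed("ὦὢῶὼώὠὤ"))
```

Like the sed command, this only handles the characters that were listed: capitals, iota subscripts (as in ζῷον) and diaereses pass through unchanged unless they are added to the table.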
Regarding Greek transliteration: the image from your post is intended to help the user enter Greek on the site you were using, by mapping keys to similar-looking glyphs, not always to similar sounds. These are poor transliterations. For example, β is most often transliterated as v, ψ as ps, φ as ph, etc.