Lemmatization of Italian sentences for frequency counting

I would like to lemmatize some Italian text in order to do some word frequency counting and further research on the output of this lemmatized content.

I prefer lemmatization over stemming because I can take the meaning of a word from the context of the sentence (for example, distinguish between a verb and a noun: the Italian word "porta" can be the noun "door" or a form of the verb "portare") and obtain words that actually exist in the language, rather than word roots that usually don't make sense.

I came across a library called pattern (pip2 install pattern), which is needed to complement nltk in order to lemmatize Italian. However, I am not sure the approach below is correct, because each word is lemmatized on its own and not in the context of a sentence.

I should probably give pattern the responsibility of tokenizing the sentence (and annotating each word with metadata about verbs / nouns / adjectives, etc.) and then extracting the lemmatized word, but I haven't been able to do that, and I'm not even sure it is possible at the moment.
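
Conceptually, what I have in mind is something like the sketch below (untested, just to illustrate the idea; I am assuming that, with lemmata=True, every token returned by pattern.it.parse(...).split() carries the lemma as its last field, which is what the output of my code further down also suggests):

import pattern.it

# untested sketch: let pattern tokenize and tag the whole sentence,
# then keep the lemma (last field) of every token
def lemmatize_sentence(sentence):
    parsed = pattern.it.parse(sentence, lemmata=True)  # tokenization / tagging left to pattern
    lemmas = []
    for tagged_sentence in parsed.split():  # one list of tokens per sentence
        for token in tagged_sentence:       # token = [word, pos, chunk, pnp, lemma]
            lemmas.append(token[-1])        # the lemma should be the last field
    return lemmas

print(lemmatize_sentence("Stasera mangio la pizza con le verdure."))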

Also: in Italian some articles are elided with an apostrophe, so for example "l'appartamento" (in English "the apartment") is actually two words: "lo" and "appartamento". Right now I cannot find a way to split these two words with the combination of nltk and pattern, so I cannot calculate the word frequency correctly.

import nltk
import string
import pattern.it  # importing the submodule makes pattern.it.parse available

# dictionary of Italian stop-words (requires the NLTK 'stopwords' corpus to be downloaded)
it_stop_words = nltk.corpus.stopwords.words('italian')
# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()

# the following function just gets the lemma out of the original input word
# (but right now it may be losing the context of the sentence the word comes
# from, i.e. the same word could be a noun/verb/adjective depending on context)
def lemmatize_word(input_word):
    in_word = input_word  # .decode('utf-8') may be needed for Python 2 byte strings
    word_it = pattern.it.parse(
        in_word,
        tokenize=False,
        tag=False,
        chunk=False,
        lemmata=True
    )
    # take the lemma field of the first token of the first sentence
    the_lemmatized_word = word_it.split()[0][0][4]
    return the_lemmatized_word

it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."

# 1st tokenize the sentence(s)
word_tokenized_list = nltk.tokenize.word_tokenize(it_string)
print("1) NLTK tokenizer, num words: {} for list: {}".format(len(word_tokenized_list), word_tokenized_list))

# 2nd remove punctuation and convert everything to lower case
word_tokenized_no_punct = [x.lower() for x in word_tokenized_list if x not in string.punctuation]
print("2) Clean punctuation, num words: {} for list: {}".format(len(word_tokenized_no_punct), word_tokenized_no_punct))

# 3rd remove stop words (for the Italian language)
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
print("3) Clean stop-words, num words: {} for list: {}".format(len(word_tokenized_no_punct_no_sw), word_tokenized_no_punct_no_sw))

# 4.1 lemmatize the words
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lemmatize_word(x) for x in word_tokenized_no_punct_no_sw]
print("4.1) lemmatizer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_lemmatized), word_tokenize_list_no_punct_lc_no_stowords_lemmatized))

# 4.2 snowball stemmer for Italian
word_tokenize_list_no_punct_lc_no_stowords_stem = [ita_stemmer.stem(i) for i in word_tokenized_no_punct_no_sw]
print("4.2) stemmer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_stem), word_tokenize_list_no_punct_lc_no_stowords_stem))

# difference between stemmer and lemmatizer
print(
    "For original word(s) '{}' and '{}' the stemmer: '{}' '{}' (count 1 each), the lemmatizer: '{}' '{}' (count 2)"
    .format(
        word_tokenized_no_punct_no_sw[1],
        word_tokenized_no_punct_no_sw[6],
        word_tokenize_list_no_punct_lc_no_stowords_stem[1],
        word_tokenize_list_no_punct_lc_no_stowords_stem[6],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1]
    )
)

      

Gives this output:

1) NLTK tokenizer, num words: 20 for list: ['Ieri', 'sono', 'andato', 'in', 'due', 'supermercati', '.', 'Oggi', 'volevo', 'andare', "all'ippodromo", '.', 'Stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure', '.']
2) Clean punctuation, num words: 17 for list: ['ieri', 'sono', 'andato', 'in', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure']
3) Clean stop-words, num words: 12 for list: ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'pizza', 'verdure']
4.1) lemmatizer, num words: 12 for list: [u'ieri', u'andarsene', u'due', u'supermercato', u'oggi', u'volere', u'andare', u"all'ippodromo", u'stasera', u'mangiare', u'pizza', u'verdura']
4.2) stemmer, num words: 12 for list: [u'ier', u'andat', u'due', u'supermerc', u'oggi', u'vol', u'andar', u"all'ippodrom", u'staser', u'mang', u'pizz', u'verdur']
For original word(s) 'andato' and 'andare' the stemmer: 'andat' 'andar' (count 1 each), the lemmatizer: 'andarsene' 'andarsene' (count 2)

      

  • How can I effectively lemmatize sentences with pattern, using its own tokenizer (so that lemmas are recognized as nouns / verbs / adjectives, etc.)?
  • Is there a Python alternative to pattern for Italian lemmatization with nltk?
  • How can I split the articles that are attached to the next word by an apostrophe?


1 answer


I will try to answer your question knowing that I don't know much about Italian!

1) As far as I know, handling the apostrophe is mainly the tokenizer's responsibility, and the nltk Italian tokenizer seems to have failed here.

3) A simple thing you can do about this is to split on the apostrophe yourself (although you would probably have to use the re package for more complex patterns), for example:

# split each token on the apostrophe, then flatten the resulting lists of pieces
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]

      

This gives:

['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', 'all', 'ippodromo', 'stasera', 'mangio', 'pizza', 'verdure']
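
If you need something a bit more robust than a plain split (for example to also catch the typographic apostrophe ’), a re-based variant could look like this (an untested sketch, reusing the variable from above):

import re

# untested sketch: split the elided article off the following word,
# e.g. "all'ippodromo" -> ["all", "ippodromo"]
word_tokenized_no_punct_no_sw_no_apostrophe = [
    part
    for token in word_tokenized_no_punct_no_sw
    for part in re.split(u"['\u2019]", token)  # handle the straight and the typographic apostrophe
    if part                                    # drop empty strings left by a trailing apostrophe
]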

      

2) An alternative to pattern would be treetagger, even if it is not the easiest install of all (you need the Python package and the tool itself); however, after that part it works on both Windows and Linux.



A simple example using the sentence above:

import treetaggerwrapper 
from pprint import pprint

it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
tags = tagger.tag_text(it_string)
pprint(treetaggerwrapper.make_tags(tags))

      

pprint gives:

[Tag(word=u'Ieri', pos=u'ADV', lemma=u'ieri'),
 Tag(word=u'sono', pos=u'VER:pres', lemma=u'essere'),
 Tag(word=u'andato', pos=u'VER:pper', lemma=u'andare'),
 Tag(word=u'in', pos=u'PRE', lemma=u'in'),
 Tag(word=u'due', pos=u'ADJ', lemma=u'due'),
 Tag(word=u'supermercati', pos=u'NOM', lemma=u'supermercato'),
 Tag(word=u'.', pos=u'SENT', lemma=u'.'),
 Tag(word=u'Oggi', pos=u'ADV', lemma=u'oggi'),
 Tag(word=u'volevo', pos=u'VER:impf', lemma=u'volere'),
 Tag(word=u'andare', pos=u'VER:infi', lemma=u'andare'),
 Tag(word=u"all'", pos=u'PRE:det', lemma=u'al'),
 Tag(word=u'ippodromo', pos=u'NOM', lemma=u'ippodromo'),
 Tag(word=u'.', pos=u'SENT', lemma=u'.'),
 Tag(word=u'Stasera', pos=u'ADV', lemma=u'stasera'),
 Tag(word=u'mangio', pos=u'VER:pres', lemma=u'mangiare'),
 Tag(word=u'la', pos=u'DET:def', lemma=u'il'),
 Tag(word=u'pizza', pos=u'NOM', lemma=u'pizza'),
 Tag(word=u'con', pos=u'PRE', lemma=u'con'),
 Tag(word=u'le', pos=u'DET:def', lemma=u'il'),
 Tag(word=u'verdure', pos=u'NOM', lemma=u'verdura'),
 Tag(word=u'.', pos=u'SENT', lemma=u'.')]

      

Under the hood it also tokenized all'ippodromo quite nicely into al and ippodromo (hopefully correctly) before lemmatizing. Now we just need to apply stop-word and punctuation removal and everything will be fine.
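
For example, building on the snippet above (only a sketch; the exact filtering rules are an assumption on my part), you could drop punctuation and stop-words and count the remaining lemmas with collections.Counter:

from collections import Counter
import string

import nltk

# sketch only: count lemma frequencies from the TreeTagger output above,
# skipping punctuation and Italian stop-words
it_stop_words = set(nltk.corpus.stopwords.words('italian'))

lemma_counts = Counter(
    tag.lemma.lower()
    for tag in treetaggerwrapper.make_tags(tags)    # 'tags' comes from the snippet above
    if hasattr(tag, 'lemma')                        # skip NotTag entries, just in case
    and tag.lemma not in string.punctuation         # drop '.', ',' etc.
    and tag.lemma.lower() not in it_stop_words      # drop Italian stop-words
)
print(lemma_counts.most_common())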

See the documentation on how to install the TreeTaggerWrapper library for Python.
