How is this useful?

Simple question: when do we stem or lemmatise words? Is the output useful for all NLP processes, or are there applications where using the full form of words can lead to more precision or recall?

+3




3 answers


In the context of machine-learning-based NLP, this makes your training data denser. It reduces the size of the vocabulary (the number of distinct words used in the corpus) by a factor of two or three (even more for languages with many inflected forms, such as French, where a single lemma can generate dozens of surface words in the case of verbs, for example).

With the same corpus but smaller input dimensions, ML will perform better. Recall in particular should improve.
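As a rough illustration of that vocabulary shrinkage, here is a toy suffix-stripping stemmer (a deliberate simplification; real systems would use a Porter-style stemmer or a lemmatizer, and the suffix list below is an assumption made for the example):

```python
# Toy suffix-stripping stemmer, just to show how many surface forms
# collapse onto one stem. Not a real stemmer.
def toy_stem(word):
    for suffix in ("ers", "ing", "ed", "es", "er", "s"):
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

corpus = ["index", "indexes", "indexing", "indexed", "indexers",
          "run", "runs", "runner", "running"]

vocabulary = set(corpus)            # 9 distinct surface forms
stems = {toy_stem(w) for w in corpus}

print(len(vocabulary), "surface forms ->", len(stems), "stems")
```

Note that a stem need not be a real word ("running" and "runner" both map to "runn" here, as Porter stems often do); it only needs to be a consistent key.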



The downside is that in some cases the actual word form (as opposed to its stem) matters, and then your system won't be able to use it. Thus, you may lose some precision.

+4




When do we stem or lemmatise words?

Stemming is a useful "normalization" technique for words. Consider as an example a search over a corpus of documents. More specifically, we could prepare a bunch of documents to be searched via some sort of search index. When we build the search index, we take similar terms and map them to a root word, so that searches on other forms of the word still match our documents.

Consider for example the following terms:

  • indexers
  • indexing
  • indexed


Let's say we convert each one to the term "index" in our search index. Whenever we come across one of these forms, we will use the root form "index" instead of the word contained in the document.

Likewise, we perform the same step before launching a search query, e.g. "database indexing".

The query will be converted to "database index", matching all documents that contain some form of "index", which increases the relevance of the search results.
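The indexing-plus-query flow described above can be sketched with a tiny in-memory inverted index (the sample documents, the `toy_stem` helper, and the AND-style query semantics are all assumptions made for this example; real engines such as Lucene do the same thing inside their analyzers):

```python
# Sketch of a stem-based inverted index: both documents and queries
# pass through the same stemming step.
from collections import defaultdict

def toy_stem(word):
    # Very crude stemmer, enough to map indexers/indexing/indexed -> "index".
    for suffix in ("ers", "ing", "ed", "es"):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)]
    return word

docs = {
    1: "indexers build the search structures",
    2: "indexing a database takes time",
    3: "the files were indexed overnight",
}

# Map stems (not surface forms) to the documents containing them.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        inverted[toy_stem(token)].add(doc_id)

def search(query):
    # Apply the same normalization to the query, then intersect postings.
    result = None
    for token in query.lower().split():
        hits = inverted.get(toy_stem(token), set())
        result = hits if result is None else result & hits
    return sorted(result or [])

print(search("database indexing"))  # -> [2]: only doc 2 has both stems
```

A query for "indexers" alone would match all three documents, since every document contains some form of "index".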

In full text search, keeping the original words alongside their stems matters when doing phrase searches, where we might spell out a grammatically exact phrase such as "Doug likes indexing databases". In that context, we need the full form "indexing", not just its stem.

+1




Stemming is very useful for a variety of tasks. For example, if you are doing document similarity, it is much better to normalize the data: remove possessives, remove stop words, lowercase everything, strip punctuation, and split into tokens. Another suggestion is to sort the words within a term. That is not so bad with bigrams, but it can look strange with much larger terms.

  • Stack Exchange's
  • stack exchange
  • STACK EXCHANGE
  • Exchange, Stack
  • Stack Exchange (WEB)
  • StAcK Exchanges

All of them must be normalized to "stack exchange" for further computation.
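A minimal sketch of that normalization pipeline (the exact rules — dropping parentheticals, stripping a trailing plural "s", sorting tokens — are assumptions made for the example):

```python
import re

def normalize(term):
    """Collapse surface variants of a term onto one canonical key."""
    term = re.sub(r"\([^)]*\)", " ", term)   # drop parentheticals like "(WEB)"
    term = term.lower().replace("'s", " ")   # lowercase, remove possessive
    tokens = re.findall(r"[a-z]+", term)     # strip punctuation and split
    # Crude plural stripping as a stand-in for real stemming.
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return " ".join(sorted(tokens))          # sort so word order is ignored

variants = ["Stack Exchange's", "stack exchange", "STACK EXCHANGE",
            "Exchange, Stack", "Stack Exchange (WEB)", "StAcK Exchanges"]

print({normalize(v) for v in variants})  # a single canonical key
```

With token sorting the canonical key comes out as "exchange stack" rather than "stack exchange"; what matters is that every variant maps to the same key.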

+1








