How is this useful?
In the context of machine-based NLP, stemming makes your training data denser. It reduces the size of the dictionary (the number of distinct words used in the corpus) by a factor of two or three, and even more for highly inflected languages such as French, where a single stem can generate dozens of forms in the case of verbs, for example.
With the same corpus but a smaller input dimensionality, the ML model will perform better; recall, in particular, should improve.
The downside is that in some cases the actual word (as opposed to its stem) matters, and then your system won't be able to use it. Thus, you may lose some precision.
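To make the dictionary-shrinking effect concrete, here is a minimal sketch with a toy suffix-stripper (my own illustrative rules, not Porter's or any real stemming algorithm): five distinct surface forms collapse to a single stem.

```python
# Toy suffix-stripper: illustrative only, far cruder than a real stemmer.
SUFFIXES = ("ers", "ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

corpus = ["indexers", "indexing", "indexed", "index", "indexes"]
raw_vocab = set(corpus)                        # 5 distinct surface forms
stemmed_vocab = {toy_stem(w) for w in corpus}  # collapses to {"index"}
print(len(raw_vocab), len(stemmed_vocab))      # prints: 5 1
```

The same idea, applied across a whole corpus, is what shrinks the dictionary by the factors mentioned above.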
When do we stem or lemmatise words?
Stemming is a useful "normalization" technique for words. Consider as an example searching a corpus of documents. More specifically, suppose we prepare a bunch of documents for some sort of search index. When we build the search index, we collapse related terms to their root word, so that searches on other forms of a word still match our documents.
Consider, for example, the following terms:
- indexers
- indexing
- indexed
Let's say we convert each one to the term "index" in our search index. Whenever we come across one of these forms, we use the root "index" instead of the word as it appears in the document.
Likewise, we perform the same step before running a search query, e.g. "database indexing". The query is converted to "database index", matching all documents that contain some form of "index", which increases the relevance of the search results.
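The index-time / query-time symmetry described above can be sketched as follows (reusing a toy suffix-stripper; the function names and sample documents are my own, not any real search library's API):

```python
from collections import defaultdict

def toy_stem(word):
    """Toy suffix-stripper, illustrative only."""
    for suffix in ("ers", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each stemmed term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
    return index

def search(index, query):
    """Stem the query terms the same way, then intersect the posting sets."""
    postings = [index.get(toy_stem(t), set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "indexing a database",
    2: "the indexers indexed every database",
    3: "cooking recipes",
}
index = build_index(docs)
print(sorted(search(index, "database indexing")))  # prints: [1, 2]
```

Because both sides stem with the same rules, the query "database indexing" matches documents that say "indexed" or "indexers" just as well.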
In full-text search, keeping the original terms (not just their stems) is useful for phrase searches, where we may spell out a grammatically correct phrase, something like the exact phrase "Doug likes indexing databases". In that context, we need the full word "indexing" in the full-text search.
Stemming is very useful for a variety of tasks. For example, if you are computing document similarity, it is much better to normalize the data first: remove the genitive ('s), drop stop words, lowercase everything, strip punctuation, and stem. Another suggestion is to sort the words within each term. This is harmless with bigrams, but it can look strange with much longer terms.
- Stack Exchange's
- stack exchange
- STACK EXCHANGE
- Exchange, Stack
- Stack Exchange (WEB)
- StAcK Exchanges
All of them must be normalized to "stack exchange" for further computation.
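One possible normalization pipeline for the variants above, as a sketch (the function and its rules are my own, not from any particular library): lowercase, drop parenthesised qualifiers, remove the genitive, strip punctuation, singularise a trailing plural, and sort the tokens so that the reordered form "Exchange, Stack" collapses too. Note that with token sorting the canonical key comes out alphabetical, i.e. "exchange stack".

```python
import re

def normalize(term):
    term = term.lower()
    term = re.sub(r"\(.*?\)", "", term)   # drop qualifiers like "(web)"
    term = term.replace("'s", "")         # remove the genitive before
    term = re.sub(r"[^\w\s]", " ", term)  # ...stripping other punctuation
    # naive singularisation of a trailing plural, then sort the tokens
    tokens = [t[:-1] if t.endswith("s") else t for t in term.split()]
    return " ".join(sorted(tokens))

variants = [
    "Stack Exchange's",
    "stack exchange",
    "STACK EXCHANGE",
    "Exchange, Stack",
    "Stack Exchange (WEB)",
    "StAcK Exchanges",
]
print({normalize(v) for v in variants})  # prints: {'exchange stack'}
```

All six variants map to the same key, so any downstream computation treats them as one term.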