Is Latent Semantic Indexing (LSI) a statistical classification algorithm?

Is Latent Semantic Indexing (LSI) a statistical classification algorithm? Why or why not?

Basically, I'm trying to understand why the Wikipedia page for statistical classification doesn't mention LSI. I'm just getting into this area, and I'm trying to figure out how all the different approaches to classification relate to each other.

+2




4 answers


No, they are not exactly the same thing. Statistical classification is designed to separate items as cleanly as possible: to make a crisp decision about whether item X belongs with the items in group A or the items in group B.



LSI, on the other hand, is designed to show the degree of similarity or difference between items, and above all to find the items that are most similar to a given item. So while the two are related, they are not the same thing.
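To make the "find similar items" behavior concrete, here is a minimal sketch of LSI-style similarity search. The term-document matrix and the choice of k = 2 latent dimensions are made up for illustration; real LSI would start from a large, weighted (e.g. tf-idf) matrix.

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (terms = rows, docs = columns).
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [0, 1, 1, 2],
], dtype=float)

# Truncated SVD: keep k latent dimensions (the "LSI" step).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # each row: one document in latent space

def most_similar(query_idx):
    """Rank the other documents by cosine similarity to the query document."""
    q = doc_vecs[query_idx]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)  # most similar first
    return [int(i) for i in order if i != query_idx]

ranking = most_similar(0)  # documents ranked by similarity to document 0
```

Note that this returns a graded ranking of neighbors, not a class label; that is exactly the contrast with classification drawn above.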

+5




LSI/LSA is ultimately a dimensionality reduction technique, and it is usually combined with a nearest-neighbor algorithm to turn it into a classification system. By itself, it is only a way to "index" the data in a lower-dimensional space using SVD.
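The pairing this answer describes can be sketched in a few lines: reduce the data with a truncated SVD, then classify new points by their nearest training neighbor in the reduced space. The two-cluster synthetic data, the choice of k = 2, and the 1-nearest-neighbor rule are all assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two fuzzy clusters in a 10-D "term space", 20 points each.
X = np.vstack([rng.normal(0, 1, (20, 10)) + 3,
               rng.normal(0, 1, (20, 10)) - 3])
y = np.array([0] * 20 + [1] * 20)

# Dimensionality reduction via truncated SVD (the "LSI" step).
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:k].T  # project every point onto the top-k right singular vectors

def knn_predict(x):
    """Classify a raw point by its single nearest training neighbor
    in the reduced space."""
    z = x @ Vt[:k].T
    dists = np.linalg.norm(Z - z, axis=1)
    return int(y[np.argmin(dists)])

# A new point at the center of cluster 0 should get label 0.
pred = knn_predict(np.ones(10) * 3)
```

The SVD step alone only produces the coordinates `Z`; the classification decision comes entirely from the nearest-neighbor rule bolted on afterwards.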



+3




Have you read about LSI on Wikipedia? It says LSI uses matrix factorization (SVD), which in turn is sometimes used in classification.

+1




The main divide in machine learning is between "supervised" and "unsupervised" modeling.

Usually the phrase "statistical classification" refers to supervised models, but not always.

With supervised methods, the training set contains a ground-truth label, and you build the model to predict that label. When you evaluate the model, the goal is to produce the best guess at (or probability distribution over) the true label, which you will not have at evaluation time. There is usually a performance metric, and it is very clear what a right or wrong answer is.

Unsupervised methods attempt to group a large number of data points, which may look like a complex mix of different kinds, into a smaller number of "similar" categories. The data within each category should be alike in some "interesting" or "deep" way. Since there is no ground truth, you cannot judge "right vs. wrong", only "more vs. less" interesting or useful.

Similarly, at evaluation time you can assign new examples to one of the clusters (a crisp classification) or give a set of weights describing how similar the example is to each cluster's "archetype" (a soft assignment).
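The crisp-vs.-soft assignment described above can be sketched as follows. The two cluster "archetypes" (centroids) and the exponential weighting are invented for illustration; any monotone similarity function would do.

```python
import numpy as np

# Hypothetical cluster archetypes (centroids) in 2-D.
archetypes = np.array([[0.0, 0.0],     # cluster A
                       [10.0, 10.0]])  # cluster B

def assign(x):
    """Return both a crisp cluster label and soft membership weights."""
    dists = np.linalg.norm(archetypes - x, axis=1)
    hard = int(np.argmin(dists))     # crisp: nearest archetype wins
    weights = np.exp(-dists)         # soft: similarity decays with distance
    soft = weights / weights.sum()   # normalize to a membership distribution
    return hard, soft

hard, soft = assign(np.array([1.0, 1.0]))  # a point near cluster A
```

The crisp label throws away the degree of similarity; the soft weights keep it, which is closer in spirit to what LSI reports.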

So, in a sense, both supervised and unsupervised models produce something like a "prediction" (a predicted class or cluster label), but they are inherently different.

Often, the goal of an unsupervised model is to provide smarter, more compact inputs for a subsequent supervised model.

+1








