Difference between implicit and explicit semantic analysis
I'm trying to work through the paper "Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis".
One component of the system it describes that I am currently dealing with is the difference between implicit and explicit semantic analysis.
I am writing a doc to capture my understanding, but it is somewhat "cobbled together" from sources that I do not understand 100%, so I would like to know whether what I came up with is accurate. Here it is:
When implementing a process like singular value decomposition (SVD) or Markov chain Monte Carlo methods, a corpus of documents can be partitioned on the basis of inherent characteristics and assigned to categories by applying different weights to the features that constitute each individual data item. In this high-dimensional space it is often difficult to determine the combination of factors leading to an outcome or result; the variables of interest are "hidden", or latent.
By defining a set of humanly intelligible categories, i.e. Wikipedia article pages, as a basis for comparison, [Gabrilovich et al. 2007] have devised a system whereby the criteria used to distinguish a datum are readily comprehensible. From the text we note that "semantic analysis is explicit in the sense that we manipulate manifest concepts grounded in human cognition, rather than ‘latent concepts’ used by Latent Semantic Analysis".
With that we have now established Explicit Semantic Analysis in opposition to Latent Semantic Analysis.
Is it accurate?
Information on this topic is somewhat sparse. This question purportedly addresses a similar issue, although not entirely.
The difference between Latent Semantic Analysis and the so-called Explicit Semantic Analysis lies in the corpus used and in the size of the vectors that represent the meaning of a word.
Latent Semantic Analysis starts from document-based word vectors, which capture the association between each word and the documents in which it appears, typically with a weighting function such as tf-idf. It then reduces the dimensionality of these vectors to (usually) around 300 using singular value decomposition. Unlike the original dimensions (which correspond to documents), these 300 new dimensions have no direct interpretation; that is why they are called "latent". LSA can then be used to classify texts by combining all the word vectors in the text.
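As a rough sketch of that pipeline (the toy corpus, the scikit-learn calls and the two-component reduction below are my own illustration, not anything taken from the paper):

```python
# Minimal LSA sketch on a toy corpus (assumed data, not from the paper):
# tf-idf document-term matrix -> truncated SVD -> low-dimensional "latent" space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the cat sat on the mat",
    "a man fed the cat",
    "a man drove the car",
    "the car was parked on the street",
]

# Rows = documents, columns = terms, weighted with tf-idf.
tfidf = TfidfVectorizer()
doc_term = tfidf.fit_transform(corpus)

# Reduce to 2 latent dimensions; a real LSA setup would use ~300 dimensions
# on a much larger corpus.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_latent = svd.fit_transform(doc_term)   # documents in the latent space
term_latent = svd.components_.T            # terms in the same latent space

# The latent dimensions have no direct interpretation -- they are just
# directions of maximal variance in the original document space.
for term, vec in zip(tfidf.get_feature_names_out(), term_latent):
    print(f"{term:10s} {vec}")
```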
From the paper you mention, I understand that Explicit Semantic Analysis is also a document-based model: it represents words in terms of the Wikipedia articles in which they appear. However, it differs from Latent Semantic Analysis in that (a) the corpus (Wikipedia) cannot be freely chosen, and (b) there is no dimensionality reduction. Again, the word vectors in a text can be combined to classify or otherwise interpret the text.
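A hedged sketch of that contrast: below, a few made-up snippets stand in for Wikipedia articles, and a word's ESA vector is simply its tf-idf weight in each article, with no SVD step. The article texts and the cosine helper are assumptions for illustration only.

```python
# Minimal ESA-style sketch (toy stand-ins for Wikipedia articles; assumed data).
# A word is represented directly by its weight in each named article, so every
# dimension is an explicit, human-readable concept and no SVD is applied.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

articles = {
    "Cat":       "the cat is a small domesticated carnivorous mammal",
    "Car":       "a car is a wheeled motor vehicle used for transport",
    "Astronomy": "astronomy studies celestial objects such as stars and planets",
}

tfidf = TfidfVectorizer()
article_term = tfidf.fit_transform(list(articles.values())).toarray()  # articles x terms
vocab = tfidf.vocabulary_

def esa_vector(word):
    """The word's tf-idf weight in each article; dimensions are article titles."""
    col = vocab.get(word)
    return np.zeros(len(articles)) if col is None else article_term[:, col]

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b) / denom

print(dict(zip(articles, esa_vector("cat"))))            # weight per article title
print(cosine(esa_vector("car"), esa_vector("vehicle")))  # related via the Car article
```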
Simple explanation:
ESA - uses a knowledge base such as Wikipedia to build an inverted index that maps each word to the concepts in which it occurs (i.e. the names of the Wikipedia pages on which the word appears). It then works with this vector representation of words, where each word becomes a vector over page names with 0/1 entries (a small sketch of such an index follows after this list).
LSA - uses singular value decomposition to project a word-document matrix into a lower-rank space, so that the dot product between the vector representations of words that never co-occur in any document, but that co-occur with a similar set of words (e.g. imagine "cat" and "car" never occur together in any document, but "cat" occurs with "man" in some document D_1 and "car" occurs with "man" in another document D_2), becomes higher (a tiny numerical check of this follows below).
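A minimal sketch of the inverted index mentioned under ESA above, with made-up page names and a naive whitespace tokenizer:

```python
# Toy inverted index: word -> set of page names in which it occurs (made-up pages).
# A word's representation is then a 0/1 vector over the page names.
from collections import defaultdict

pages = {
    "Cat": "the cat is a small domesticated mammal",
    "Car": "a car is a wheeled motor vehicle",
    "Man": "a man walked the cat and washed the car",
}

inverted = defaultdict(set)
for name, text in pages.items():
    for word in text.split():
        inverted[word].add(name)

page_names = list(pages)

def binary_vector(word):
    """1 for each page the word occurs on, 0 otherwise."""
    return [1 if name in inverted[word] else 0 for name in page_names]

print(sorted(inverted["cat"]))   # ['Cat', 'Man']
print(binary_vector("cat"))      # [1, 0, 1] over ['Cat', 'Car', 'Man']
```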
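And a tiny numerical check of the LSA claim: the two documents D_1 and D_2 below mirror the cat/car/man example, the counts are invented, and the rank-1 truncation is chosen only to make the effect visible.

```python
# Tiny numerical check (invented counts, rank-1 truncation for illustration).
# "cat" and "car" never share a document, so their raw dot product is 0, but
# both co-occur with "man", so their dot product becomes positive after SVD.
import numpy as np

#                  D_1  D_2
counts = np.array([[1,   0],    # cat appears only in D_1
                   [0,   1],    # car appears only in D_2
                   [1,   1]])   # man appears in both

cat, car = counts[0], counts[1]
print("raw dot product:", cat @ car)                # 0 -- no shared document

# Project the word vectors onto the top singular direction (rank-1 "LSA").
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
reduced = U[:, :1] * S[:1]                          # words in a 1-D latent space

cat_r, car_r = reduced[0], reduced[1]
print("latent dot product:", float(cat_r @ car_r))  # > 0 -- similarity recovered
```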