Extracting a feature vector from a single word

Usually features are extracted from text with the bag-of-words approach: counting words and computing various measures, such as tf-idf values — for example: How to include words as numerical features in a classification

But my problem is different: I want to extract a feature vector for a single word. I want to know, for example, that potatoes and fries are close to each other in vector space, since they are both made from potatoes. I want to know that milk and cream are also close, as are hot and warm, stone and hard, etc.

What is this problem called? Can I discover the similarities and features of words simply by analyzing a large number of documents?

I will not be working with English, so I cannot use existing lexical databases.

+3




3 answers


Hmm, feature extraction (like tf-idf) on text data is based on statistics. You, on the other hand, are looking for meaning (semantics), so a method like tf-idf won't work for you.

There are 3 basic levels in NLP:

  • morphological analysis
  • syntactic analysis (parsing)
  • semantic analysis


(the higher the level, the harder the problem :)). Morphology is well understood for most languages. Parsing is a harder problem (it deals with things like identifying the verb and the noun in a given sentence, ...). Semantic analysis is the hardest of all, because it deals with meaning, which is quite difficult to represent in a machine, has many exceptions, and is language-specific.

As far as I understand, you want to know relationships between words. This can be done using so-called dependency treebanks (or just treebanks): http://en.wikipedia.org/wiki/Treebank . A treebank is a database/graph of sentences where each word can be thought of as a node and each relation as an arc. There is a good treebank for Czech, and there are some for English as well, but for many less-covered languages it may be difficult to find one...

+3




user1506145,

Here's a simple idea that I've used in the past. Collect a large number of short documents, such as Wikipedia articles, and count the words in each document. For the i-th document and j-th word, let

I = the number of documents,

J = the number of words,

x_ij = the number of times the j-th word appears in the i-th document, and

y_ij = ln(1 + x_ij).

Let [U, D, V] = svd(Y) be a singular value decomposition of Y, so that Y = U * D * transpose(V), where U is I x I, D is an I x J diagonal matrix, and V is J x J.

You can use (V_1j, V_2j, V_3j, V_4j) as a feature vector in R^4 for the j-th word.
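The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not the answerer's actual code: the toy documents, the choice of 2 dimensions instead of 4, and the cosine comparison are my own additions.

```python
import numpy as np

# Toy corpus: each "document" is a short bag of tokens (made up for illustration).
docs = [
    ["potato", "fries", "oil"],
    ["potato", "fries", "salt"],
    ["potato", "fries"],
    ["milk", "cream", "butter"],
    ["milk", "cream", "cheese"],
]

vocab = sorted({w for d in docs for w in d})
col = {w: k for k, w in enumerate(vocab)}

# x_ij = number of times the j-th word appears in the i-th document
X = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w in d:
        X[i, col[w]] += 1

# y_ij = ln(1 + x_ij), then the SVD: Y = U @ diag(s) @ Vt
Y = np.log1p(X)
U, s, Vt = np.linalg.svd(Y, full_matrices=False)

# Feature vector for the j-th word: its first few coordinates in V
# (the answer suggests 4; 2 is enough for this tiny corpus).
def word_vec(w, dim=2):
    return Vt[:dim, col[w]]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(word_vec("potato"), word_vec("fries")))  # close to 1: same documents
print(cosine(word_vec("potato"), word_vec("cream")))  # close to 0: disjoint documents
```

This is essentially latent semantic analysis: words that occur in the same documents end up with similar columns of V.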

+1




I'm surprised the previous answers didn't mention word embeddings. A word embedding algorithm produces a word vector for each word in a given dataset. These algorithms learn word vectors from context. For example, looking at the contexts in the following sentences, we can tell that "clever" and "smart" are somehow related, because their contexts are almost the same.

He is a clever guy.
He is a smart guy.

To do this, you can build a co-occurrence matrix. However, that is too inefficient at scale. A well-known technique developed for this purpose is called Word2Vec. It can be studied in the following papers:
https://arxiv.org/pdf/1411.2738.pdf
https://arxiv.org/pdf/1402.3722.pdf
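The co-occurrence-matrix idea mentioned above can be sketched directly in NumPy. This is my own minimal illustration (the window size of 2 and the two example sentences are assumptions), just to show why words with shared contexts come out similar:

```python
import numpy as np

# Two sentences with near-identical contexts for "clever" and "smart".
sentences = [
    "he is a clever guy".split(),
    "he is a smart guy".split(),
]

vocab = sorted({w for s in sentences for w in s})
idx = {w: k for k, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2 word window.
C = np.zeros((len(vocab), len(vocab)))
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if i != j:
                C[idx[w], idx[sent[j]]] += 1

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "clever" and "smart" never co-occur, yet their context rows are
# identical, so the cosine similarity of their rows is 1.0.
print(cosine(C[idx["clever"]], C[idx["smart"]]))
```

Word2Vec learns low-dimensional vectors with the same property without ever materializing this V x V matrix, which is what makes it practical on large corpora.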

I use it for Swedish. It is quite effective at detecting similar words, and it is completely unsupervised.

Implementations are available in the gensim package and in TensorFlow.

0



