How to use gensim for LDA on news articles?

I am trying to get a list of topics from a large corpus of news articles, and I am planning to use gensim to extract the topic distribution for each document with LDA. I want to know the format of the preprocessed articles required by the gensim implementation of LDA, and how to convert the original articles to that format. I saw this link about running LDA on the Wikipedia dump, but the corpus there is already in a processed state, and its format was not mentioned anywhere.


2 answers


There is an offline training step and an online step for incoming documents.

Offline training

Suppose you have a large corpus such as Wikipedia or a collection of downloaded news articles.

For each article/document:

  • Get the raw text
  • Lemmatize it. Gensim has utils.lemmatize
  • Create a dictionary
  • Create a bag-of-words representation


Then you train a TF-IDF model and convert the whole corpus to TF-IDF space. Finally, you train your LDA model on the TF-IDF corpus.

Online

With an incoming news article, you do much the same:

  • Lemmatize it
  • Create a bag-of-words representation using the dictionary
  • Convert it to TF-IDF space using the TF-IDF model
  • Convert it to LDA space


I don't know if I understood the problem correctly, but gensim supports multiple corpus formats. You can find a list of them here.



If you want to process natural language, you must preprocess the text first. You can follow the step-by-step tutorial on the gensim website here. It is explained quite well.







