How to use gensim for LDA on news articles?
I am trying to get a list of topics from a large corpus of news articles. I plan to use gensim to extract the topic distribution for each document with LDA. I want to know what format the processed articles need to be in for gensim's LDA implementation, and how to convert the original articles to that format. I saw this link about running LDA on the Wikipedia dump, but the corpus there is already in a processed state, and its format was not mentioned anywhere.
There is an offline training step and an online step for incoming documents.
Offline training
Suppose you have a large corpus like Wikipedia or a bunch of news loaded.
For each article / document:
- You get the raw text
- You lemmatize it (gensim has utils.lemmatize)
- You build a dictionary
- You create a bag-of-words representation
Then you train a TF-IDF model and convert the whole corpus to TF-IDF space. Finally, you train your LDA model on the TF-IDF corpus.
Online
With an incoming news article, you do much the same:
- Lemmatize it
- Create a bag-of-words representation using the dictionary
- Convert it to TF-IDF space using the TF-IDF model
- Convert it to LDA space.
I don't know if I understood the problem correctly, but gensim supports several corpus formats. You can find a list of them here.
If you want to process natural language, you have to tokenize the text first. You can follow the step-by-step tutorial on the gensim website here. It is explained quite well.