How to use gensim for LDA on news articles?
I am trying to get a list of topics from a large corpus of news articles. I plan to use gensim to extract the topic distribution for each document with LDA. I want to know what format the processed articles need to be in for gensim's LDA implementation, and how to convert the original articles to that format. I saw this link about running LDA on the Wikipedia dump, but the corpus there is already in a processed state, and its format was not mentioned anywhere.
There is an offline training step and an online step for incoming documents.
Offline training
Suppose you have a large corpus like Wikipedia or a bunch of news loaded.
For each article / document:
- You get the raw text
- You lemmatize it (gensim has utils.lemmatize)
- You build a dictionary
- You create a bag-of-words representation
Then you train a TF-IDF model and convert the whole corpus to TF-IDF space. Finally, you train your LDA model on the TF-IDF corpus.
Online
With an incoming news article, you do much the same:
- Lemmatize it
- Create a bag-of-words representation using the dictionary
- Convert it to TF-IDF space using the TF-IDF model
- Convert it to LDA space.
I don't know if I understood the problem correctly, but gensim supports several corpus formats. You can find a list of them here.
If you want to process natural language, you have to tokenize the text first. You can follow the step-by-step tutorial on the gensim website here. It is explained quite well.