LDA with Python - input files

Question

I am using the lda library in Python and running the example below. Does anyone know the format of X, vocab and titles? I can't find any documentation.

import numpy as np
import lda
X = lda.datasets.load_reuters()
vocab = lda.datasets.load_reuters_vocab()
titles = lda.datasets.load_reuters_titles()

python scikit-learn lda

user1011332 May 18 '15 at 11:21 PM

1 answer

user2707389 · Answer 1 · 2015-05-19T03:42:10+0000

X is a matrix whose rows are titles and whose columns are vocab words. It is a bag-of-words representation of the title text.

X
Out[8]: 
array([[1, 0, 1, ..., 0, 0, 0],
       [7, 0, 2, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [1, 0, 1, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0]], dtype=int32)

In the above matrix, each row is the bag-of-words representation of an individual title. Each column corresponds to a specific word.

vocab[:5]
Out[5]: ('church', 'pope', 'years', 'people', 'mother')

So the entry at row i, column j of matrix X is the frequency of the j-th vocab word in the i-th title.

titles[:1]
Out[11]: ('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20',)

The title "UK: Prince Charles ..." mentions the word church once, pope 0 times, years once, etc.
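
To make the row/column reading concrete, here is a minimal sketch that looks up the words occurring in one row. It uses a small made-up matrix rather than the real Reuters data, so the counts and titles are hypothetical, and word_counts is just an illustrative helper, not part of the lda library.

```python
import numpy as np

# Toy stand-ins for the Reuters data (hypothetical values, not the real dataset).
vocab = ('church', 'pope', 'years', 'people', 'mother')
titles = ('0 toy title A', '1 toy title B')
X = np.array([[1, 0, 1, 0, 0],
              [7, 0, 2, 0, 3]], dtype=np.int32)

def word_counts(doc_index):
    """Return {word: count} for the words that occur in one row of X."""
    row = X[doc_index]
    # np.nonzero gives the column indices where the count is not zero.
    return {vocab[j]: int(row[j]) for j in np.nonzero(row)[0]}

counts_first = word_counts(0)
print(titles[0], counts_first)
```

The same indexing should work on the arrays returned by the load_reuters functions, since X has one row per title and one column per vocab entry.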

In [13]: type(titles)
Out[13]: tuple

In [14]: type(vocab)
Out[14]: tuple

In [15]: type(X)
Out[15]: numpy.ndarray
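
Given those types, one way to get your own documents into the same shape is to build a vocab tuple and a count row per document. The sketch below is only an illustration under simplifying assumptions: the texts and titles are made up, tokenization is a plain whitespace split, and build_rows is a hypothetical helper.

```python
from collections import Counter

# Hypothetical mini-corpus; whitespace tokenization is a simplifying assumption.
titles = ('0 first toy document', '1 second toy document')
texts = ('pope visits church', 'church years church')

# Vocabulary: one column per distinct word, in sorted order.
vocab = tuple(sorted({word for text in texts for word in text.split()}))

def build_rows(texts, vocab):
    """Return one list of word counts per text, ordered like vocab."""
    rows = []
    for text in texts:
        counts = Counter(text.split())  # missing words count as 0
        rows.append([counts[word] for word in vocab])
    return rows

rows = build_rows(texts, vocab)
print(vocab)
print(rows)
```

Wrapping the result as np.array(rows, dtype=np.int32) would give an array with the same type and dtype as the X shown above.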
