Trouble understanding the LDA topic model in MLlib
I have some trouble understanding the output of the LDA model in Spark MLlib.
As far as I understand, we will get the following result:
Topic 1: term1, term2, term....
Topic 2: term1, term2, term3...
Topic n: term1, ........
Doc1 : Topic1, Topic2,...
Doc2 : Topic1, Topic2,...
Doc3 : Topic1, Topic2,...
Docn : Topic1, Topic2,...
I am applying LDA to the Spark MLlib sample data, which looks like this:
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
Subsequently, I get the following results:
topics: org.apache.spark.mllib.linalg.Matrix =
10.33743440804936 9.104197117225599 6.5583684747250395
6.342536927434482 12.486281081997593 10.171181990567925
2.1728012328444692 2.1939589470020042 7.633239820153526
17.858082227094904 9.405347532724434 12.736570240180663
13.226180094790433 3.9570395921153536 7.816780313094214
6.155778858763581 10.224730593556806 5.619490547679611
7.834725138351118 15.52628918346391 7.63898567818497
4.419396221560405 3.072221927676895 2.5083818507627
1.4984991123084432 3.5227422247618927 2.978758662929664
5.696963722524612 7.254625667071781 11.048410610403607
11.080658179168758 10.11489350657456 11.804448314256682
Each column is one topic's distribution over terms. There are 3 topics in total, and each topic is a distribution over 11 vocabulary terms.
I think there are 12 documents, each with 11 vocabulary terms. My questions are:
- How can I find the distribution of topics for each document?
- Why is each topic distributed over 11 vocabulary terms, while there are only 10 different values in the data (0-9)?
- Why is the sum of each column not 100 (which means 100% according to my understanding)?
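You can get the distribution of topics for each document by calling topicDistributions() (or javaTopicDistributions() from Java) on a DistributedLDAModel
in Spark 1.4. This will only work if your model optimizer is set to EMLDAOptimizer.
See the MLlib LDA example and the DistributedLDAModel documentation.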
In Java, it looks something like this:
LDAModel ldaModel = lda.setK(k.intValue()).run(corpus);
JavaPairRDD<Long,Vector> topic_dist_over_docs = ((DistributedLDAModel) ldaModel).javaTopicDistributions();
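Regarding the second question:
The LDA model returns, for each topic, a probability distribution over every word in the vocabulary. So you have three topics (three columns), each with 11 rows (one for each word in the vocabulary), because the size of the vocabulary is 11. Each line of the input is one document's vector of word counts, so the 11 columns are the 11 words; the values 0-9 are counts, not word IDs.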
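To make the vocabulary-size point concrete, here is a minimal sketch in plain Java (no Spark) that parses the sample rows from the question and reports how many documents and columns they contain. The class and method names are hypothetical, and it assumes each row is a whitespace-separated vector of word counts.

```java
class VocabSizeCheck {
    // Sample rows copied from the question; each row is assumed to be one document.
    static final String[] SAMPLE = {
        "1 2 6 0 2 3 1 1 0 0 3",
        "1 3 0 1 3 0 0 2 0 0 1",
        "1 4 1 0 0 4 9 0 1 2 0",
        "2 1 0 3 0 0 5 0 2 3 9",
        "3 1 1 9 3 0 2 0 0 1 3",
        "4 2 0 3 4 5 1 1 1 4 0",
        "2 1 0 3 0 0 5 0 2 2 9",
        "1 1 1 9 2 1 2 0 0 1 3",
        "4 4 0 3 4 2 1 3 0 0 0",
        "2 8 2 0 3 0 2 0 2 7 2",
        "1 1 1 9 0 2 2 0 0 3 3",
        "4 1 0 0 4 5 1 3 0 1 0"
    };

    // Parse each line into an array of counts: column index = word index, value = count.
    static int[][] parse(String[] lines) {
        int[][] docs = new int[lines.length][];
        for (int d = 0; d < lines.length; d++) {
            String[] parts = lines[d].trim().split("\\s+");
            docs[d] = new int[parts.length];
            for (int w = 0; w < parts.length; w++) {
                docs[d][w] = Integer.parseInt(parts[w]);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        int[][] docs = parse(SAMPLE);
        System.out.println("Documents: " + docs.length);
        System.out.println("Vocabulary size (columns per document): " + docs[0].length);
    }
}
```

The number of columns, not the range of values, determines how many rows the topics matrix has.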
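Why is the sum of each column not 100 (which I understand to mean 100%)?
The values in the topics matrix are unnormalized weights, so the columns do not sum to 1 (or 100). Use the describeTopics method to get each topic's distribution over the vocabulary.
For each topic, the probabilities over the vocabulary should then sum to 1.0 (approximately; because of floating-point rounding it may not be exactly 1.0).
Example code in Java: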
// Each entry pairs the term indices with their weights for one topic.
Tuple2<int[], double[]>[] topicDesces = ldaModel.describeTopics();
int topicCount = topicDesces.length;
for (int t = 0; t < topicCount; t++) {
    Tuple2<int[], double[]> topic = topicDesces[t];
    System.out.print("Topic " + t + ":");
    int[] indices = topic._1();
    double[] values = topic._2();
    double sum = 0.0d;
    int wordCount = indices.length;
    for (int w = 0; w < wordCount; w++) {
        double prob = values[w];
        System.out.format("\t%d:%f", indices[w], prob);
        sum += prob;
    }
    // Print the running total for this topic; it should be close to 1.0.
    System.out.println("(" + sum + ")");
}
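The same normalization can be checked by hand on the matrix printed in the question. The following is a minimal sketch in plain Java (no Spark), assuming the printed values are unnormalized topic-term weights; the class and method names are hypothetical and not part of MLlib.

```java
class TopicColumnNormalizer {
    // Weights copied from the question's output: 11 rows (terms) x 3 columns (topics).
    static final double[][] TOPICS = {
        {10.33743440804936, 9.104197117225599, 6.5583684747250395},
        {6.342536927434482, 12.486281081997593, 10.171181990567925},
        {2.1728012328444692, 2.1939589470020042, 7.633239820153526},
        {17.858082227094904, 9.405347532724434, 12.736570240180663},
        {13.226180094790433, 3.9570395921153536, 7.816780313094214},
        {6.155778858763581, 10.224730593556806, 5.619490547679611},
        {7.834725138351118, 15.52628918346391, 7.63898567818497},
        {4.419396221560405, 3.072221927676895, 2.5083818507627},
        {1.4984991123084432, 3.5227422247618927, 2.978758662929664},
        {5.696963722524612, 7.254625667071781, 11.048410610403607},
        {11.080658179168758, 10.11489350657456, 11.804448314256682}
    };

    // Sum of one column (one topic) of the matrix.
    static double columnSum(double[][] m, int col) {
        double sum = 0.0;
        for (double[] row : m) {
            sum += row[col];
        }
        return sum;
    }

    // Divide every entry by its column sum so each topic becomes a probability distribution.
    static double[][] normalizeColumns(double[][] m) {
        int rows = m.length;
        int cols = m[0].length;
        double[][] out = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            double sum = columnSum(m, c);
            for (int r = 0; r < rows; r++) {
                out[r][c] = m[r][c] / sum;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] probs = normalizeColumns(TOPICS);
        for (int c = 0; c < probs[0].length; c++) {
            System.out.printf("Topic %d: raw sum = %.4f, normalized sum = %.4f%n",
                    c, columnSum(TOPICS, c), columnSum(probs, c));
        }
    }
}
```

After normalization each column can be read as percentages by multiplying by 100.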