Trouble understanding the LDA topic model in MLlib

I have some trouble understanding the output of the LDA model in Spark MLlib.

As far as I understand, we will get the following result:

 Topic 1: term1, term2, term....
 Topic 2: term1, term2, term3...
 ...
 Topic n: term1, ........

 Doc1 : Topic1, Topic2,...
 Doc2 : Topic1, Topic2,...
 Doc3 : Topic1, Topic2,...
 ...
 Docn :Topic1, Topic2,...

      

I am applying LDA to sample data from Spark MLlib that looks like this:

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

      

Subsequently, I get the following results:

topics: org.apache.spark.mllib.linalg.Matrix = 

10.33743440804936   9.104197117225599   6.5583684747250395  
6.342536927434482   12.486281081997593  10.171181990567925  
2.1728012328444692  2.1939589470020042  7.633239820153526   
17.858082227094904  9.405347532724434   12.736570240180663  
13.226180094790433  3.9570395921153536  7.816780313094214   
6.155778858763581   10.224730593556806  5.619490547679611   
7.834725138351118   15.52628918346391   7.63898567818497    
4.419396221560405   3.072221927676895   2.5083818507627     
1.4984991123084432  3.5227422247618927  2.978758662929664   
5.696963722524612   7.254625667071781   11.048410610403607  
11.080658179168758  10.11489350657456   11.804448314256682  

      

Each column is one topic's distribution over terms. There are 3 topics in total, and each topic is a distribution over 11 vocabulary terms.

I think there are 12 documents, each described over the same 11-term vocabulary. My questions are:

  • How can I find the distribution of topics for each document?
  • Why is each topic distributed over 11 vocabulary terms, when the data only contains the 10 distinct values 0-9?
  • Why does each column not sum to 100 (which I take to mean 100%)?




2 answers


You can get the distribution of topics for each document by calling DistributedLDAModel.topicDistributions(), or DistributedLDAModel.javaTopicDistributions() from Java, available as of Spark 1.4. This only works if the model's optimizer is set to EMLDAOptimizer (the default).

See the Spark LDA example and the API documentation for details.

In Java, it looks something like this:



// The cast to DistributedLDAModel is valid only with the (default) EM optimizer.
LDAModel ldaModel = lda.setK(k.intValue()).run(corpus);
JavaPairRDD<Long, Vector> topicDistOverDocs =
        ((DistributedLDAModel) ldaModel).javaTopicDistributions();

      

Regarding the second question:

The LDA model returns, for each topic, a distribution over the words in the vocabulary. So you have three topics (three columns), each with 11 rows (one per vocabulary word), because the vocabulary size is 11.
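To make that shape concrete, here is a plain-Java (non-Spark) sketch that reads the matrix from the question column by column and lists each topic's highest-weight terms. The matrix values are the question's output rounded to 3 decimals; the class and method names are illustrative, not part of the Spark API.

```java
import java.util.Arrays;
import java.util.Comparator;

public class TopTermsDemo {
    // The question's topics matrix, rounded to 3 decimals:
    // 11 vocabulary terms (rows) x 3 topics (columns).
    static final double[][] TOPICS = {
        {10.337,  9.104,  6.558},
        { 6.343, 12.486, 10.171},
        { 2.173,  2.194,  7.633},
        {17.858,  9.405, 12.737},
        {13.226,  3.957,  7.817},
        { 6.156, 10.225,  5.619},
        { 7.835, 15.526,  7.639},
        { 4.419,  3.072,  2.508},
        { 1.498,  3.523,  2.979},
        { 5.697,  7.255, 11.048},
        {11.081, 10.115, 11.804}
    };

    // Indices of the k largest entries in one topic column,
    // i.e. the terms that matter most for that topic.
    static Integer[] topTerms(double[][] m, int col, int k) {
        Integer[] idx = new Integer[m.length];
        for (int i = 0; i < m.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> m[i][col]).reversed());
        return Arrays.copyOf(idx, k);
    }

    public static void main(String[] args) {
        for (int t = 0; t < TOPICS[0].length; t++) {
            System.out.println("Topic " + t + ": top terms "
                    + Arrays.toString(topTerms(TOPICS, t, 3)));
        }
    }
}
```

For this matrix, topic 0's strongest terms are indices 3, 4 and 10, which is the same ranking describeTopics would give for that topic.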





Why does each column not sum to 100 (I mean 100%, as per my understanding)?

  • Use the describeTopics method to get each topic's distribution over vocabulary terms.

  • The probabilities for each topic sum to approximately 1.0 (very close to, but because of floating-point rounding usually not exactly, 1.0).
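As a non-Spark illustration of both points: the raw column values in the question's matrix are unnormalized weights, and dividing each column by its sum yields a probability distribution whose entries add up to (approximately) 1.0. The class below is a hypothetical sketch using the question's values rounded to 3 decimals, not Spark API code.

```java
public class TopicProbabilityDemo {
    // The question's topics matrix, rounded to 3 decimals:
    // 11 vocabulary terms (rows) x 3 topics (columns).
    static final double[][] TOPICS = {
        {10.337,  9.104,  6.558},
        { 6.343, 12.486, 10.171},
        { 2.173,  2.194,  7.633},
        {17.858,  9.405, 12.737},
        {13.226,  3.957,  7.817},
        { 6.156, 10.225,  5.619},
        { 7.835, 15.526,  7.639},
        { 4.419,  3.072,  2.508},
        { 1.498,  3.523,  2.979},
        { 5.697,  7.255, 11.048},
        {11.081, 10.115, 11.804}
    };

    // Divide one column by its sum, turning raw topic-term weights
    // into a probability distribution over the vocabulary.
    static double[] normalizeColumn(double[][] m, int col) {
        double sum = 0.0;
        for (double[] row : m) sum += row[col];
        double[] probs = new double[m.length];
        for (int i = 0; i < m.length; i++) probs[i] = m[i][col] / sum;
        return probs;
    }

    public static void main(String[] args) {
        for (int t = 0; t < TOPICS[0].length; t++) {
            double total = 0.0;
            for (double p : normalizeColumn(TOPICS, t)) total += p;
            // total is extremely close to 1.0, but floating-point rounding
            // means it need not be exactly 1.0.
            System.out.printf("topic %d: probabilities sum to %.15f%n", t, total);
        }
    }
}
```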



Example code in Java:

    Tuple2<int[], double[]>[] topicDesces = ldaModel.describeTopics();
    int topicCount = topicDesces.length;

    for( int t=0; t<topicCount; t++ ){

        Tuple2<int[], double[]> topic = topicDesces[t];
        System.out.print("Topic " + t + ":");

        int[] indices = topic._1();
        double[] values = topic._2();
        double sum = 0.0d;
        int wordCount = indices.length;

        for( int w=0; w<wordCount; w++ ){

            double prob = values[w];
            System.out.format("\t%d:%f", indices[w] , prob);
            sum += prob;
        }
        System.out.println( "(" + sum + ")");
    }

      









