Posted to user@spark.apache.org by Manish Tripathi <tr...@gmail.com> on 2017/02/16 18:36:30 UTC

Latent Dirichlet Allocation in Spark

Hi

I am trying to do topic modeling in Spark using Spark's LDA package. Using
Spark 2.0.2 and pyspark API.

I ran the code as below:

from pyspark.ml.clustering import LDA

lda = LDA(featuresCol="tf_features", k=10, seed=1, optimizer="online")
ldaModel = lda.fit(tf_df)
lda_df = ldaModel.transform(tf_df)

I went through the docs to understand the output Spark generates for LDA.

I understand the ldaModel.describeTopics() method: for each topic it gives a
list of terms (as indices) and their weights.
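As a side note, describeTopics() returns term *indices*, so mapping them back
to readable words needs the vocabulary from the CountVectorizer model. A
plain-Python sketch of that mapping (the vocabulary, indices, and weights here
are made-up examples, not real output):

```python
# Hypothetical CountVectorizer vocabulary and one row of
# describeTopics() output (termIndices plus termWeights).
vocab = ["spark", "data", "model", "topic", "word"]
term_indices = [3, 0, 2]           # made-up indices into vocab
term_weights = [0.12, 0.08, 0.05]  # per-term weights for this topic

# Map the indices back to the actual terms for readability.
terms = [vocab[i] for i in term_indices]
print(list(zip(terms, term_weights)))
# [('topic', 0.12), ('spark', 0.08), ('model', 0.05)]
```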

But I am not sure I understand the method ldaModel.topicsMatrix().

It gives me this:
[screenshot of the topicsMatrix() output; the image is not preserved in the
plain-text archive]

The docs say it is the distribution of words for each topic (1184 words as
rows, 10 topics as columns, with the weights as the cell values). But these
values are not probabilities, which is what one would expect for a topic-word
distribution: many of them are greater than 1 (132.76, 3.00, and so on).
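For what it's worth, if these are unnormalized per-topic word weights rather
than probabilities, dividing each column by its sum would recover a proper
topic-word distribution. A NumPy sketch of that normalization (the matrix
values below are made up to mimic the shape of the output; this is my
assumption about the semantics, not something the docs state):

```python
import numpy as np

# Hypothetical topicsMatrix-style output: rows are words, columns are
# topics, cells are unnormalized weights (not probabilities).
topics = np.array([
    [132.76, 3.00],
    [ 10.24, 7.50],
    [  7.00, 9.50],
])

# Normalize each column by its sum so each topic's word weights
# form a probability distribution (each column sums to 1).
topic_word_dist = topics / topics.sum(axis=0, keepdims=True)

print(topic_word_dist.sum(axis=0))  # each entry is 1.0
```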

Any idea on this?

Thanks