Posted to user@spark.apache.org by moustachio <re...@gmail.com> on 2015/10/14 18:02:32 UTC

Get *document*-topic distribution from PySpark LDA model?

Hi! I already have a StackOverflow question on this (see
<https://stackoverflow.com/questions/33072449/extract-document-topic-matrix-from-pyspark-lda-model>),
but haven't received any responses, so I thought I'd try here!

Long story short, I'm working in PySpark and have successfully generated an
LDA topic model, but I can't figure out how to (or whether I can) extract
the topic distributions for each document from the model. I understand the
LDA functionality is still in development, but getting document-topic
distributions is arguably the principal use case here, and as far as I can
tell it isn't implemented in the Python API. I can easily get the
*word*-topic distribution by calling model.topicsMatrix(), but that isn't
what I need, and there don't seem to be any other useful methods in the
Python LDA model class.
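
For context, here's roughly what my setup looks like (a minimal,
illustrative sketch; the real corpus and parameters differ, and sc is the
usual SparkContext):

from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

# Toy corpus: an RDD of (document id, term-count vector) pairs
corpus = sc.parallelize([
    (0, Vectors.dense([1.0, 2.0, 0.0, 5.0])),
    (1, Vectors.dense([0.0, 1.0, 3.0, 0.0])),
    (2, Vectors.dense([4.0, 0.0, 1.0, 2.0])),
])

model = LDA.train(corpus, k=3)

# This works, but it's the topics-by-terms matrix (word-topic weights),
# not the per-document topic distributions I'm after
model.topicsMatrix()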

The only glimmer of hope came from finding the documentation for
DistributedLDAModel in the Java API, which has a topicDistributions()
method that I think is just what I need here (but I'm not 100% sure whether
the LDAModel in PySpark is in fact a DistributedLDAModel under the
hood...).
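
One way I thought of to at least check what's underneath is to peek at the
wrapped Java object directly; this leans on the private _java_model
attribute from JavaModelWrapper, so treat it as a fragile,
version-dependent sketch:

# Peek at the underlying JVM class via py4j (private attribute, so fragile);
# if this reports a DistributedLDAModel, topicDistributions() should exist on it
print(model._java_model.getClass().getName())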

In any case, I am able to indirectly call this method like so, without any
overt failures:

In [127]: model.call('topicDistributions')
Out[127]: MapPartitionsRDD[3156] at mapPartitions at
PythonMLLibAPI.scala:1480

But if I actually look at the results, all I get are placeholders telling
me that each element is actually a Scala tuple (I think):

In [128]: model.call('topicDistributions').take(5)
Out[128]:
[{u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'}]

Maybe this is generally the right approach, but is there a way to get the
actual results?
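
To be concrete about what I'm hoping for: if topicDistributions
deserialized into ordinary (docId, topic-distribution-vector) pairs, I'd
expect to be able to do something like this (purely hypothetical at this
point):

# Hypothetical: assumes the RDD comes back as (docId, topic-weight vector) pairs
doc_topics = model.call('topicDistributions')

# e.g. each document's dominant topic
dominant = doc_topics.mapValues(lambda dist: int(dist.toArray().argmax()))
dominant.take(5)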

Thanks in advance for any guidance you can offer!



