Posted to issues@spark.apache.org by "Crawdaddy (JIRA)" <ji...@apache.org> on 2016/01/14 20:51:40 UTC
[jira] [Commented] (SPARK-10809) Single-document topicDistributions method for LocalLDAModel
[ https://issues.apache.org/jira/browse/SPARK-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098737#comment-15098737 ]
Crawdaddy commented on SPARK-10809:
-----------------------------------
With a 100K document / 200K feature model with K = 250, even this single-document topicDistributions method takes 40s (!) on my 12-core X5680 Dell.
This is with Spark 1.6 built with netlib and a native BLAS library (OpenBLAS compiled for the X56xx architecture).
The killer is in the first line:
{code:title=LDAModel.scala : topicDistribution|borderStyle=solid}
val expElogbeta = exp(LDAUtils.dirichletExpectation(topicsMatrix.toBreeze.toDenseMatrix.t).t)
{code}
I don't see a reason expElogbeta can't be pre-computed outside the method, since it has nothing to do with the input Vector. I made a little method to do that:
{code}
def getExpElogbeta(): BDM[Double] = {
  exp(LDAUtils.dirichletExpectation(topicsMatrix.toBreeze.toDenseMatrix.t).t)
}
{code}
then modified topicDistribution to take it in as a method parameter:
{code}
def topicDistribution(document: Vector, expElogbeta: BDM[Double]): Vector = {...}
{code}
Now my predictions go from 40s to 150ms. That's more like it (though I hope I can make it even faster - that's still slow in my world).
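To illustrate the idea without pulling in Spark or Breeze, here is a minimal, self-contained sketch of the caching pattern. Everything in it is hypothetical: ToyTopicModel is not Spark's LocalLDAModel, the row normalisation is only a stand-in for exp(dirichletExpectation(...)), and topicScores is a simplified dot product rather than the real variational inference. The point is only that a matrix which depends on the model alone can be held in a lazy val and reused across queries.

```scala
// Hypothetical stand-in for LocalLDAModel; not Spark's actual class.
final class ToyTopicModel(topics: Array[Array[Double]]) {
  // Counts how often the expensive transform runs (for demonstration only).
  var transformCalls: Int = 0

  // Stand-in for exp(dirichletExpectation(...)). It depends only on the
  // model, never on the query document, so it is computed once on first use.
  lazy val expElogbeta: Array[Array[Double]] = {
    transformCalls += 1
    topics.map { row =>
      val logSum = math.log(row.sum)
      row.map(x => math.exp(math.log(x) - logSum))
    }
  }

  // Scores a document (term counts) against each topic using the cached
  // matrix. This is a simplified dot product, not the real inference step.
  def topicScores(document: Array[Double]): Array[Double] =
    expElogbeta.map(row => row.zip(document).map { case (w, c) => w * c }.sum)
}

object CachedExpElogbetaDemo {
  def main(args: Array[String]): Unit = {
    val model = new ToyTopicModel(Array(Array(1.0, 3.0), Array(2.0, 2.0)))
    val first = model.topicScores(Array(1.0, 0.0))
    val second = model.topicScores(Array(0.0, 1.0))
    println(first.mkString(", "))
    println(second.mkString(", "))
    // The transform should have run only once across both queries.
    println(s"transform ran ${model.transformCalls} time(s)")
  }
}
```

Compared with passing expElogbeta in as a parameter, a lazy val keeps the existing single-argument signature, at the cost of holding the dense matrix in memory for the life of the model. Which trade-off suits MLlib is a question for the reviewers.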
I'm new to Scala/Spark/MLlib so I didn't include a patch, but maybe [~yuhaoyan] can review and suggest the most versatile implementation of this idea?
> Single-document topicDistributions method for LocalLDAModel
> -----------------------------------------------------------
>
> Key: SPARK-10809
> URL: https://issues.apache.org/jira/browse/SPARK-10809
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Joseph K. Bradley
> Assignee: yuhao yang
> Priority: Minor
> Fix For: 2.0.0
>
>
> We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries which avoid RDD operations. Currently, the user must use an RDD of documents.