Posted to issues@spark.apache.org by "Crawdaddy (JIRA)" <ji...@apache.org> on 2016/01/14 20:51:40 UTC

[jira] [Commented] (SPARK-10809) Single-document topicDistributions method for LocalLDAModel

    [ https://issues.apache.org/jira/browse/SPARK-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098737#comment-15098737 ] 

Crawdaddy commented on SPARK-10809:
-----------------------------------

With a 100K-document / 200K-feature model and K = 250, even this single-document topicDistributions method takes 40s (!) on my 12-core X5680 Dell.
This is with Spark 1.6 compiled with netlib and the native BLAS library (OpenBLAS compiled for the X56xx architecture).

The killer is in the first line:
{code:title=LDAModel.scala : topicDistribution|borderStyle=solid}
val expElogbeta = exp(LDAUtils.dirichletExpectation(topicsMatrix.toBreeze.toDenseMatrix.t).t)
{code}
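
For context on why that one line dominates: as far as I understand it, dirichletExpectation applies the usual variational-LDA expectation to every entry of the K x V topic matrix. Writing the topic-word parameters as \lambda (my notation, not a name from the Spark source), a rough sketch of what gets evaluated on each call is:

```latex
\mathbb{E}[\log \beta_{kw}] = \psi(\lambda_{kw}) - \psi\Big(\sum_{w'=1}^{V} \lambda_{kw'}\Big),
\qquad
\texttt{expElogbeta}_{kw} = \exp\big(\mathbb{E}[\log \beta_{kw}]\big),
\qquad
K \cdot V = 250 \times 200{,}000 = 5 \times 10^{7} \text{ entries}
```

Here \psi is the digamma function. None of these terms involves the input document, so the roughly 50 million digamma and exp evaluations (plus the dense copy and transposes) are repeated identically for every prediction.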

I don't see a reason expElogbeta can't be pre-computed outside the method, since it depends only on the model's topicsMatrix and has nothing to do with the input Vector.  I made a little method to do that:

{code}
def getExpElogbeta(): BDM[Double] = {
    exp(LDAUtils.dirichletExpectation(topicsMatrix.toBreeze.toDenseMatrix.t).t)
}
{code}

then modified topicDistribution to take it in as a method parameter:
{code}
def topicDistribution(document: Vector, expElogbeta: BDM[Double]): Vector = {...}
{code}

Now my predictions go from 40s to 150ms.  That's more like it (though I hope I can make it even faster - that's still slow in my world).
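
An alternative that would avoid passing the matrix into every call might be to cache it on the model itself, for example with a lazy val. The following is only a minimal sketch of that caching pattern in plain Scala, not Spark code: CachedTopicModel, computeExpElogbeta, infer and the toy numbers are all hypothetical stand-ins I made up for illustration.

```scala
// Hypothetical sketch: this is NOT Spark's LocalLDAModel, only the caching pattern.
// `computeExpElogbeta` stands in for the expensive model-only step
//   exp(LDAUtils.dirichletExpectation(topicsMatrix.toBreeze.toDenseMatrix.t).t)
// and `infer` stands in for the per-document part of topicDistribution.
final class CachedTopicModel[M, D, R](
    computeExpElogbeta: () => M,
    infer: (D, M) => R) {

  // Evaluated on first access, then reused for every later call.
  lazy val expElogbeta: M = computeExpElogbeta()

  def topicDistribution(document: D): R = infer(document, expElogbeta)
}

object CachedTopicModelDemo {
  // Counts how often the expensive step actually runs.
  var computeCalls = 0

  val model = new CachedTopicModel[Array[Double], Array[Double], Double](
    () => { computeCalls += 1; Array(0.5, 0.25, 0.25) },     // toy stand-in "matrix"
    (doc, m) => doc.zip(m).map { case (d, w) => d * w }.sum  // toy stand-in inference
  )

  def main(args: Array[String]): Unit = {
    val a = model.topicDistribution(Array(1.0, 0.0, 0.0))
    val b = model.topicDistribution(Array(0.0, 2.0, 2.0))
    println(s"results: $a, $b; expensive step ran $computeCalls time(s)")
  }
}
```

With this shape the caller keeps the original one-argument topicDistribution(document) signature, and the first prediction pays the one-time cost. Whether a lazy val fits the real class (serialization, model save/load) is something I have not checked.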

I'm new to Scala/Spark/MLlib so I didn't include a patch, but maybe [~yuhaoyan] can review and suggest the most versatile implementation of this idea?


> Single-document topicDistributions method for LocalLDAModel
> -----------------------------------------------------------
>
>                 Key: SPARK-10809
>                 URL: https://issues.apache.org/jira/browse/SPARK-10809
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: yuhao yang
>            Priority: Minor
>             Fix For: 2.0.0
>
>
> We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries which avoid RDD operations.  Currently, the user must use an RDD of documents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
