Posted to issues@spark.apache.org by "Crawdaddy (JIRA)" <ji...@apache.org> on 2016/01/14 21:39:39 UTC

[jira] [Comment Edited] (SPARK-10809) Single-document topicDistributions method for LocalLDAModel

    [ https://issues.apache.org/jira/browse/SPARK-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098737#comment-15098737 ] 

Crawdaddy edited comment on SPARK-10809 at 1/14/16 8:39 PM:
------------------------------------------------------------

With a 100K-document / 200K-feature model and K = 250, even this single-document topicDistributions method takes 40s (!) on my 12-core X5680 Dell.
This is with Spark 1.6 compiled with netlib and a native BLAS (OpenBLAS built for the X56xx architecture).

The killer is in the first line:
{code:title=LDAModel.scala : topicDistribution|borderStyle=solid}
val expElogbeta = exp(LDAUtils.dirichletExpectation(topicsMatrix.toBreeze.toDenseMatrix.t).t)
{code}

I don't see a reason expElogbeta can't be pre-computed outside the method, since it doesn't depend on the input Vector at all.  I made a little method to do that:

{code}
// Depends only on the model's topicsMatrix, not on any document,
// so it only needs to be computed once per model
def getExpElogbeta(): BDM[Double] = {
  exp(LDAUtils.dirichletExpectation(topicsMatrix.toBreeze.toDenseMatrix.t).t)
}
{code}

then modified topicDistribution to take it in as a method parameter:
{code}
def topicDistribution(document: Vector, expElogbeta: BDM[Double]): Vector = {...}
{code}
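To make the shape of the refactor concrete, here's a tiny self-contained sketch of the same pattern — plain arrays standing in for Breeze's BDM[Double], and the exp() transform standing in for the real dirichletExpectation work; none of these names are the actual MLlib code:

```scala
// Toy sketch: hoist the per-model computation out of the per-document call.
object PrecomputeDemo {
  // Stand-in for exp(LDAUtils.dirichletExpectation(topicsMatrix...)):
  // expensive, but independent of any particular document.
  def getExpElogbeta(topicsMatrix: Array[Array[Double]]): Array[Array[Double]] =
    topicsMatrix.map(_.map(math.exp))

  // Takes the precomputed matrix as a parameter instead of rebuilding it.
  def topicDistribution(document: Array[Double],
                        expElogbeta: Array[Array[Double]]): Array[Double] =
    expElogbeta.map(row => row.zip(document).map { case (a, b) => a * b }.sum)

  def main(args: Array[String]): Unit = {
    val topics = Array(Array(0.0, 0.0), Array(0.0, 0.0))
    val cached = getExpElogbeta(topics)                  // once per model
    val docs   = Seq(Array(1.0, 1.0), Array(2.0, 0.0))
    val dists  = docs.map(topicDistribution(_, cached))  // reused per document
    dists.foreach(d => println(d.mkString(",")))
  }
}
```

The point is just that the caller pays the matrix cost once and threads the result through every prediction, which is exactly where my 40s-to-150ms win came from.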

Now my predictions go from 40s to 150ms.  That's more like it (though I hope I can make it even faster — 150ms is still slow in my world).

I'm new to Scala/Spark/MLlib, so I didn't include a patch, but maybe [~yuhaoyan] can review and suggest the most versatile implementation of this idea?  E.g. cache expElogbeta (or the dense matrix) as an instance variable instead, so other methods like describeTopics can take advantage of it.
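For the instance-variable variant, something like a lazy val would keep the public topicDistribution(document) signature unchanged while still paying the cost only once per model.  Again a toy sketch, not the actual LocalLDAModel code — names and the array-based math are illustrative stand-ins for the Breeze version:

```scala
// Toy sketch: cache the model-dependent matrix as a lazy val member.
class CachedModel(topicsMatrix: Array[Array[Double]]) {
  // Computed on first use, reused by every subsequent per-document call
  // (stand-in for exp(LDAUtils.dirichletExpectation(...))).
  // Scala's lazy val initialization is thread-safe.
  private lazy val expElogbeta: Array[Array[Double]] =
    topicsMatrix.map(_.map(math.exp))

  // Same one-argument signature as before; the expensive matrix is cached.
  def topicDistribution(document: Array[Double]): Array[Double] =
    expElogbeta.map(row => row.zip(document).map { case (a, b) => a * b }.sum)
}
```

One caveat: a lazy val would go stale if topicsMatrix could change after construction, so this only works if the model is effectively immutable — which I believe LocalLDAModel is.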




> Single-document topicDistributions method for LocalLDAModel
> -----------------------------------------------------------
>
>                 Key: SPARK-10809
>                 URL: https://issues.apache.org/jira/browse/SPARK-10809
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: yuhao yang
>            Priority: Minor
>             Fix For: 2.0.0
>
>
> We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries which avoid RDD operations.  Currently, the user must use an RDD of documents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
