Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/09/24 20:21:05 UTC

[jira] [Closed] (SPARK-10791) Optimize MLlib LDA topic distribution query performance

     [ https://issues.apache.org/jira/browse/SPARK-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley closed SPARK-10791.
-------------------------------------
    Resolution: Done

> Optimize MLlib LDA topic distribution query performance
> -------------------------------------------------------
>
>                 Key: SPARK-10791
>                 URL: https://issues.apache.org/jira/browse/SPARK-10791
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.5.0
>         Environment: Ubuntu 13.10, Oracle Java 8
>            Reporter: Marko Asplund
>
> I've been testing MLlib LDA training with 100 topics, a vocabulary of ~105K terms and ~3.4M documents using EMLDAOptimizer.
> Training the model took ~2.5 hours with MLlib, whereas training with Vowpal Wabbit on the same data and the same system took ~5 minutes. Loading the persisted model from disk (~2 minutes) and querying LDA model topic distributions (~4 seconds for one document) are also quite slow.
> Our application queries the LDA model topic distribution (one document at a time) as part of an end-user operation, so a ~4 second execution time is very problematic.
> The log includes the following message, which, AFAIK, should mean that netlib-java is using a machine-optimised native implementation: "com.github.fommil.jni.JniLoader - successfully loaded /tmp/jniloader4682745056459314976netlib-native_system-linux-x86_64.so"
> My test code can be found here:
> https://github.com/marko-asplund/tech-protos/blob/08e9819a2108bf6bd4d878253c4aa32510a0a9ce/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala#L56-L57
> I also tried using the OnlineLDAOptimizer, but there wasn't a noticeable change in training performance. Model loading time dropped from ~2 minutes to ~5 seconds (the model is now persisted as a LocalLDAModel). However, query / prediction time was unchanged.
> Unfortunately, this is the critical performance characteristic in our case.
> I did some profiling of my LDA prototype code that requests topic distributions from a model. According to Java Mission Control, more than 80% of execution time during the sampling interval is spent in the following methods:
> - org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
> - org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91%
> - org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50; 6.98%
> - java.lang.Double.valueOf(double); count: 31; 4.33%
> Is there any way of using the API more optimally?
> Are there any opportunities for optimising the "topicDistributions" code path in MLlib?
> My query test code essentially looks like this:
> import org.apache.spark.mllib.clustering.LocalLDAModel
> // executed once
> val model = LocalLDAModel.load(ctx, ModelFileName)
> // executed four times
> val samples = Transformers.toSparseVectors(vocabularySize, ctx.parallelize(Seq(input))) // fast
> // this call seems to take about 4 seconds to execute
> model.topicDistributions(samples.zipWithIndex.map(_.swap))
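
Editor's note: to tell one-off costs (JIT warm-up, first-job scheduling) apart from the steady-state cost of each query, the call above can be timed over several repetitions. The following is only a minimal sketch in plain Scala. The names QueryTiming, timeMillis and timeRepeated are hypothetical helpers and not part of Spark or of the reporter's code, and the workload in main is a placeholder standing in for the real query.

```scala
object QueryTiming {
  /** Runs `body` once and returns its result together with the elapsed milliseconds. */
  def timeMillis[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }

  /** Runs `body` `runs` times and returns the elapsed time of each run in milliseconds. */
  def timeRepeated[A](runs: Int)(body: => A): Seq[Double] =
    (1 to runs).map(_ => timeMillis(body)._2)

  def main(args: Array[String]): Unit = {
    // Placeholder workload (an assumption for illustration). In the reporter's
    // setting the body would instead be something like:
    //   model.topicDistributions(samples.zipWithIndex.map(_.swap)).collect()
    val timings = timeRepeated(4) {
      (1 to 1000000).map(i => math.log(i.toDouble)).sum
    }
    timings.zipWithIndex.foreach { case (ms, i) =>
      println(f"run ${i + 1}: $ms%.1f ms")
    }
  }
}
```

Because RDD transformations are lazy, the timed body should end in an action such as collect(); otherwise part of the work may not be executed inside the measured interval. Comparing the first run with later runs gives a rough indication of how much of the ~4 seconds is warm-up rather than per-query work.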



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
