Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2015/02/01 03:58:40 UTC

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4047#issuecomment-72348800
  
    *Update on tests*
    
    Summary:
    * On a small dataset (20 newsgroups), it seems to work fine (on my laptop).
    * On a big dataset (Wikipedia dump with close to 1 billion tokens), it's been hard to get it to run for more than 10 or 20 iterations (on a 16-node EC2 cluster).
    
    Details:
    
    Small dataset: You can see the output here: https://github.com/jkbradley/spark/blob/lda-tmp/20news.lda.out  The log likelihood improves with each iteration, and per-iteration running times stay about the same throughout training.  The topics divide quite cleanly among the newsgroups (though note that I ran with exactly 20 topics, one per newsgroup).  I used 100 iterations and the stopwords mentioned above.
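
    For reference, a rough sketch of that run in spark-shell (where sc is the SparkContext), assuming the Scala API in this PR: an LDA class with setK/setMaxIterations that takes the corpus as an RDD of (document ID, term-count vector) pairs and returns a model exposing logLikelihood and describeTopics.  The corpus path is hypothetical:

        import org.apache.spark.mllib.clustering.LDA
        import org.apache.spark.mllib.linalg.Vector
        import org.apache.spark.rdd.RDD

        // Hypothetical preprocessed corpus: one (docId, termCountVector) pair
        // per document, with the stopwords mentioned above already removed.
        val corpus: RDD[(Long, Vector)] =
          sc.objectFile[(Long, Vector)]("/path/to/20news-vectors")

        val model = new LDA()
          .setK(20)               // one topic per newsgroup
          .setMaxIterations(100)
          .run(corpus)

        println(s"log likelihood: ${model.logLikelihood}")
        val topics = model.describeTopics(maxTermsPerTopic = 10)  // top terms per topic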
    
    Large dataset: Even with checkpointing, it has been hard to run for many iterations, mainly because shuffle files and checkpoint files build up on disk.  I need to spend more time running tests.  Currently, the results on the Wikipedia dump do not look good: the topics are nearly all the same.  It is unclear whether this is because of poor convergence, a need for parameter tuning, a need for sparsity support as mentioned above (which might help force topics to differentiate), or a need for better initialization (since EM can have lots of trouble with the many local optima in LDA's likelihood).
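
    For anyone reproducing the large run: the checkpointing mentioned above needs a checkpoint directory set on the SparkContext, and (assuming the checkpointInterval parameter from this PR) periodic checkpoints truncate the RDD lineage so old shuffle files can eventually be cleaned up.  A sketch, with hypothetical directory and parameter values:

        // Without periodic checkpoints, each EM iteration extends the RDD
        // lineage, and shuffle/checkpoint files accumulate across iterations.
        sc.setCheckpointDir("hdfs:///tmp/lda-checkpoints")  // hypothetical path

        val model = new LDA()
          .setK(100)                   // hypothetical topic count for Wikipedia
          .setMaxIterations(50)
          .setCheckpointInterval(10)   // checkpoint every 10 iterations
          .run(corpus)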

