Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/06/22 22:24:01 UTC
[jira] [Updated] (SPARK-5560) LDA EM should scale to more iterations
[ https://issues.apache.org/jira/browse/SPARK-5560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-5560:
-------------------------------------
Remaining Estimate: 336h
Original Estimate: 336h
> LDA EM should scale to more iterations
> --------------------------------------
>
> Key: SPARK-5560
> URL: https://issues.apache.org/jira/browse/SPARK-5560
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (LDA) sometimes fails to run for many iterations on large datasets, even when it is able to run for a few iterations. It should be able to run for as many iterations as the user likes, with proper persistence and checkpointing.
> Here is an example from a test on 16 workers (EC2 r3.2xlarge) on a big Wikipedia dataset:
> * 100 topics
> * Training set size: 4072243 documents
> * Vocabulary size: 9869422 terms
> * Training set size: 1041734290 tokens
> It runs for about 10-15 iterations before failing, even with a variety of checkpointInterval values and longer timeout settings (up to 5 minutes). The failure mode varies, from worker/driver disconnections to workers running out of disk space. Based on rough calculations, I would not expect workers to run out of memory or disk space. There was some job imbalance, but not a significant amount.
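For context, the kind of run described above would be driven through the MLlib LDA API roughly as follows. This is a sketch only, not the reporter's test code: it assumes an existing SparkContext and a pre-built corpus RDD, and the checkpoint directory, topic count, and interval values are illustrative. It requires a Spark runtime and is not runnable standalone.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// `sc` and `corpus` (docId -> term-count vector) are assumed to exist;
// both are placeholders for this sketch.
def trainLda(sc: SparkContext, corpus: RDD[(Long, Vector)]) = {
  // Checkpointing only takes effect if a checkpoint directory is set
  // (the HDFS path here is an illustrative assumption).
  sc.setCheckpointDir("hdfs:///tmp/lda-checkpoints")

  new LDA()
    .setK(100)                  // 100 topics, as in the test above
    .setMaxIterations(100)      // the goal: many iterations without failure
    .setCheckpointInterval(10)  // checkpoint the EM state every 10 iterations
    .run(corpus)                // EM optimizer returns a DistributedLDAModel
}
```

Periodic checkpointing like this truncates the RDD/graph lineage that EM otherwise accumulates each iteration, which is the persistence behavior the issue asks to make robust.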
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org