Posted to user@spark.apache.org by Frank Zhang <da...@yahoo.com.INVALID> on 2016/09/20 04:19:53 UTC

LDA and Maximum Iterations

Hi all,
   I have a question about parameter setting for the LDA model. When I try to set a large number like 500 for setMaxIterations, the program always fails. There is a very straightforward LDA tutorial using an example data set in the mllib package: http://stackoverflow.com/questions/36631991/latent-dirichlet-allocation-lda-algorithm-not-printing-results-in-spark-scala. The code is here:
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("/data/mllib/sample_lda_data.txt") // you might need to change the path for the data set
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

But if I change the last line to 
val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus), the program fails.  

    I greatly appreciate your help! 
Best,
    Frank




   

Re: LDA and Maximum Iterations

Posted by Richard Garris <rl...@databricks.com>.
Hi Frank,

Two suggestions

1. I would recommend caching the corpus prior to running LDA

2. If you are using EM, I would switch to the online optimizer and tweak the
setMiniBatchFraction parameter to decrease the sample per iteration.
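
A rough, untested sketch of what those two suggestions might look like
(setMiniBatchFraction is on OnlineLDAOptimizer; the 0.05 fraction is only an
illustrative value):

import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// 1. cache the corpus and materialize the cache before running LDA
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
corpus.count()

// 2. use the online optimizer and shrink the sample drawn per iteration
val ldaModel = new LDA()
  .setK(3)
  .setMaxIterations(500)
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))
  .run(corpus)

Each iteration then only touches a fraction of the corpus, which keeps the
per-iteration work bounded even with maxIterations set to 500.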

-Richard

On Tue, Sep 20, 2016 at 10:27 AM, Frank Zhang <dataminingus@yahoo.com.invalid> wrote:

> Hi Yuhao,
>
>    Thank you so much for your great contribution to the LDA and other
> Spark modules!
>
>    I use both Spark 1.6.2 and 2.0.0. The data I used originally is very
> large, with tens of millions of documents, but for testing purposes the
> data set I mentioned earlier ("/data/mllib/sample_lda_data.txt") is good
> enough.  Please change the path below to point to the data set in your
> Spark installation and run these lines:
>
> import org.apache.spark.mllib.clustering.LDA
> import org.apache.spark.mllib.linalg.Vectors
>
> // please change the path for the data set below:
> val data = sc.textFile("/data/mllib/sample_lda_data.txt")
> val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
> val corpus = parsedData.zipWithIndex.map(_.swap).cache()
> val ldaModel = new LDA().setK(3).run(corpus)
>
>    It should work. After that, please run:
> val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus)
>
>    When I ran it, at job #90, that iteration took extremely long and then
> it stopped with an exception:
> Active Jobs (1)
>
> | Job Id | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total |
> | 90 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 22 s | 0/269 | 0/538 |
>
> Completed Jobs (90)
>
> | Job Id | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total |
> | 89 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 43 ms | 4/4 (262 skipped) | 8/8 (524 skipped) |
> | 88 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 40 ms | 4/4 (259 skipped) | 8/8 (518 skipped) |
> | 87 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:29 | 80 ms | 4/4 (256 skipped) | 8/8 (512 skipped) |
> | 86 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:29 | 41 ms | 4/4 (253 skipped) | 8/8 (506 skipped) |
>    Part of the error message:
> Driver stacktrace:
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
>   at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1046)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>   at org.apache.spark.rdd.RDD.fold(RDD.scala:1040)
>   at org.apache.spark.mllib.clustering.EMLDAOptimizer.computeGlobalTopicTotals(LDAOptimizer.scala:226)
>   at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:213)
>   at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:79)
>   at org.apache.spark.mllib.clustering.LDA.run(LDA.scala:334)
>   ... 48 elided
> Caused by: java.lang.StackOverflowError
>   at java.lang.reflect.InvocationTargetException.<init>(InvocationTargetException.java:72)
>   at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>
>    Thank you so much!
>
>    Frank
>
>
>
> ------------------------------
> From: "Yang, Yuhao" <yu...@intel.com>
> To: Frank Zhang <da...@yahoo.com>; "user@spark.apache.org" <user@spark.apache.org>
> Sent: Tuesday, September 20, 2016 9:49 AM
> Subject: RE: LDA and Maximum Iterations
>
> Hi Frank,
>
> Which version of Spark are you using? Also, can you share more information
> about the exception?
>
> If it’s not confidential, you can send the data sample to me (
> yuhao.yang@intel.com) and I can try to investigate.
>
> Regards,
> Yuhao
>
> From: Frank Zhang [mailto:dataminingus@yahoo.com.INVALID]
> Sent: Monday, September 19, 2016 9:20 PM
> To: user@spark.apache.org
> Subject: LDA and Maximum Iterations
>
> Hi all,
>
>    I have a question about parameter setting for the LDA model. When I try
> to set a large number like 500 for setMaxIterations, the program always
> fails.  There is a very straightforward LDA tutorial using an example data
> set in the mllib package:
> http://stackoverflow.com/questions/36631991/latent-dirichlet-allocation-lda-algorithm-not-printing-results-in-spark-scala.
> The code is here:
>
> import org.apache.spark.mllib.clustering.LDA
> import org.apache.spark.mllib.linalg.Vectors
> // Load and parse the data
> val data = sc.textFile("/data/mllib/sample_lda_data.txt") // you might
> need to change the path for the data set
> val parsedData = data.map(s => Vectors.dense(s.trim.split('
> ').map(_.toDouble)))
> // Index documents with unique IDs
> val corpus = parsedData.zipWithIndex.map(_.swap).cache()
> // Cluster the documents into three topics using LDA
> val ldaModel = new LDA().setK(3).run(corpus)
>
> But if I change the last line to
> val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus), the
> program fails.
>
>     I greatly appreciate your help!
>
> Best,
>
>     Frank
>
>
>
>
>
>


-- 

Richard L Garris

Solution Architect

Databricks, Inc.

richard@databricks.com

Mobile: 650.200.0840

databricks.com
<http://databricks.com/>

Re: LDA and Maximum Iterations

Posted by Frank Zhang <da...@yahoo.com.INVALID>.
Hi Yuhao,
   Thank you so much for your great contribution to the LDA and other Spark modules!
    I use both Spark 1.6.2 and 2.0.0. The data I used originally is very large, with tens of millions of documents, but for testing purposes the data set I mentioned earlier ("/data/mllib/sample_lda_data.txt") is good enough.  Please change the path below to point to the data set in your Spark installation and run these lines:
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// please change the path for the data set below:
val data = sc.textFile("/data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
val ldaModel = new LDA().setK(3).run(corpus)

   It should work. After that, please run:

val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus)

   When I ran it, at job #90, that iteration took extremely long and then it stopped with an exception:
Active Jobs (1)

| Job Id | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total |
| 90 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 22 s | 0/269 | 0/538 |


Completed Jobs (90)

| Job Id | Description | Submitted | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total |
| 89 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 43 ms | 4/4 (262 skipped) | 8/8 (524 skipped) |
| 88 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:30 | 40 ms | 4/4 (259 skipped) | 8/8 (518 skipped) |
| 87 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:29 | 80 ms | 4/4 (256 skipped) | 8/8 (512 skipped) |
| 86 | fold at LDAOptimizer.scala:226 | 2016/09/20 10:18:29 | 41 ms | 4/4 (253 skipped) | 8/8 (506 skipped) |

   Part of the error message:

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
  at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1046)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.fold(RDD.scala:1040)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.computeGlobalTopicTotals(LDAOptimizer.scala:226)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:213)
  at org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:79)
  at org.apache.spark.mllib.clustering.LDA.run(LDA.scala:334)
  ... 48 elided
Caused by: java.lang.StackOverflowError
  at java.lang.reflect.InvocationTargetException.<init>(InvocationTargetException.java:72)
  at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
   Thank you so much!
   Frank 


      From: "Yang, Yuhao" <yu...@intel.com>
 To: Frank Zhang <da...@yahoo.com>; "user@spark.apache.org" <us...@spark.apache.org> 
 Sent: Tuesday, September 20, 2016 9:49 AM
 Subject: RE: LDA and Maximum Iterations
  
Hi Frank,

Which version of Spark are you using? Also, can you share more information about the exception?

If it’s not confidential, you can send the data sample to me (yuhao.yang@intel.com) and I can try to investigate.

Regards,
Yuhao

From: Frank Zhang [mailto:dataminingus@yahoo.com.INVALID]
Sent: Monday, September 19, 2016 9:20 PM
To: user@spark.apache.org
Subject: LDA and Maximum Iterations

Hi all,

   I have a question about parameter setting for the LDA model. When I try to set a large number like 500 for setMaxIterations, the program always fails.  There is a very straightforward LDA tutorial using an example data set in the mllib package: http://stackoverflow.com/questions/36631991/latent-dirichlet-allocation-lda-algorithm-not-printing-results-in-spark-scala.  The code is here:

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("/data/mllib/sample_lda_data.txt") // you might need to change the path for the data set
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

But if I change the last line to
val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus), the program fails.

    I greatly appreciate your help!

Best,

    Frank

   

RE: LDA and Maximum Iterations

Posted by "Yang, Yuhao" <yu...@intel.com>.
Hi Frank,

Which version of Spark are you using? Also, can you share more information about the exception?

If it’s not confidential, you can send the data sample to me (yuhao.yang@intel.com) and I can try to investigate.

Regards,
Yuhao

From: Frank Zhang [mailto:dataminingus@yahoo.com.INVALID]
Sent: Monday, September 19, 2016 9:20 PM
To: user@spark.apache.org
Subject: LDA and Maximum Iterations

Hi all,

   I have a question about parameter setting for the LDA model. When I try to set a large number like 500 for setMaxIterations, the program always fails.  There is a very straightforward LDA tutorial using an example data set in the mllib package: http://stackoverflow.com/questions/36631991/latent-dirichlet-allocation-lda-algorithm-not-printing-results-in-spark-scala.  The code is here:

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("/data/mllib/sample_lda_data.txt") // you might need to change the path for the data set
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

But if I change the last line to
val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus), the program fails.

    I greatly appreciate your help!

Best,

    Frank