Posted to user@spark.apache.org by crater <cq...@ucmerced.edu> on 2014/07/14 09:15:14 UTC

Error when testing with large sparse svm

Hi,

I encountered an error when testing the SVM example on very large sparse
data. The dataset I ran on was a toy dataset with only ten examples, but each
example is a sparse vector with about 13 million features and a few thousand
non-zero entries.
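
For reference, a minimal sketch of the kind of run this refers to (loading
LIBSVM-format sparse data and training MLlib's SVM with SVMWithSGD); the input
path, app name, and iteration count here are illustrative, not the exact
example invocation:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.util.MLUtils

    val sc = new SparkContext(new SparkConf().setAppName("SparseSVMTest"))

    // Each input line is "label index1:value1 index2:value2 ..."; features stay sparse.
    val training = MLUtils.loadLibSVMFile(sc, "hdfs://master:9001/splice.small")

    // The learned weight vector is dense, so with ~13 million features it is
    // on the order of 100 MB once serialized.
    val model = SVMWithSGD.train(training, 1)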

The error is shown below. Is this a bug, or am I missing something?

14/07/13 23:59:44 INFO SecurityManager: Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
14/07/13 23:59:44 INFO SecurityManager: Changing view acls to: chengjie
14/07/13 23:59:44 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(chengjie)
14/07/13 23:59:45 INFO Slf4jLogger: Slf4jLogger started
14/07/13 23:59:45 INFO Remoting: Starting remoting
14/07/13 23:59:45 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://spark@master:53173]
14/07/13 23:59:45 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://spark@master:53173]
14/07/13 23:59:45 INFO SparkEnv: Registering MapOutputTracker
14/07/13 23:59:45 INFO SparkEnv: Registering BlockManagerMaster
14/07/13 23:59:45 INFO DiskBlockManager: Created local directory at
/tmp/spark-local-20140713235945-c78f
14/07/13 23:59:45 INFO MemoryStore: MemoryStore started with capacity 14.4
GB.
14/07/13 23:59:45 INFO ConnectionManager: Bound socket to port 37674 with id
= ConnectionManagerId(master,37674)
14/07/13 23:59:45 INFO BlockManagerMaster: Trying to register BlockManager
14/07/13 23:59:45 INFO BlockManagerInfo: Registering block manager
master:37674 with 14.4 GB RAM
14/07/13 23:59:45 INFO BlockManagerMaster: Registered BlockManager
14/07/13 23:59:45 INFO HttpServer: Starting HTTP Server
14/07/13 23:59:45 INFO HttpBroadcast: Broadcast server started at
http://10.10.255.128:41838
14/07/13 23:59:45 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-ac459d4b-a3c4-4577-bad4-576ac427d0bf
14/07/13 23:59:45 INFO HttpServer: Starting HTTP Server
14/07/13 23:59:51 INFO SparkUI: Started SparkUI at http://master:4040
14/07/13 23:59:51 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/07/13 23:59:52 INFO EventLoggingListener: Logging events to
/tmp/spark-events/binaryclassification-with-params(hdfs---master-9001-splice.small,1,1.0,svm,l1,0.1)-1405317591776
14/07/13 23:59:52 INFO SparkContext: Added JAR
file:/home/chengjie/spark-1.0.1/examples/target/scala-2.10/spark-examples-1.0.1-hadoop2.3.0.jar
at http://10.10.255.128:54689/jars/spark-examples-1.0.1-hadoop2.3.0.jar with
timestamp 1405317592653
14/07/13 23:59:52 INFO AppClient$ClientActor: Connecting to master
spark://master:7077...
14/07/14 00:00:08 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory
14/07/14 00:00:23 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory
14/07/14 00:00:38 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory
14/07/14 00:00:53 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory
Training: 10
14/07/14 00:01:09 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
14/07/14 00:01:09 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS
*Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Serialized task 20:0 was 94453098 bytes which exceeds
spark.akka.frameSize (10485760 bytes). Consider using broadcast variables
for large values.*
	at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
	at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
	at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
	at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
	at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
	at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
	at scala.Option.foreach(Option.scala:236)
	at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
	at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)





Re: Error when testing with large sparse svm

Posted by crater <cq...@ucmerced.edu>.

(1) What is the "number of partitions"? Is it the number of workers per node?
(2) I already set the driver memory pretty high, at 25g.
(3) I am running Spark 1.0.1 on a standalone cluster with 9 nodes; one of them
works as the master, the others are workers.




Re: Error when testing with large sparse svm

Posted by crater <cq...@ucmerced.edu>.
I don't really know how to create a JIRA :(

Specifically, the lines I commented out are:

    //val prediction = model.predict(test.map(_.features))
    //val predictionAndLabel = prediction.zip(test.map(_.label))
    //val prediction = model.predict(training.map(_.features))
    //val predictionAndLabel = prediction.zip(training.map(_.label))

    //val metrics = new BinaryClassificationMetrics(predictionAndLabel)

    //println(s"Test areaUnderPR = ${metrics.areaUnderPR()}.")
    //println(s"Test areaUnderROC = ${metrics.areaUnderROC()}.")

in
examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala.
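
For context, this is roughly what that section computes when left in, assuming
a trained model and a test RDD[LabeledPoint] split as in the example (a
sketch, not the full example code):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // Score the held-out examples, pair each score with its true label,
    // and compute area under the precision-recall and ROC curves.
    val prediction = model.predict(test.map(_.features))
    val predictionAndLabel = prediction.zip(test.map(_.label))
    val metrics = new BinaryClassificationMetrics(predictionAndLabel)
    println(s"Test areaUnderPR = ${metrics.areaUnderPR()}.")
    println(s"Test areaUnderROC = ${metrics.areaUnderROC()}.")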






Re: Error when testing with large sparse svm

Posted by Xiangrui Meng <me...@gmail.com>.
Then it may be a new issue. Do you mind creating a JIRA to track it? It would
be great if you could help locate the line in BinaryClassificationMetrics that
caused the problem. Thanks! -Xiangrui


Re: Error when testing with large sparse svm

Posted by crater <cq...@ucmerced.edu>.
I don't really have "my code"; I was just running the example program in
examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala.

What I did was simply try this example on 13M-feature sparse data, and I got
the error I posted.
Today I managed to run it after I commented out the prediction part.




Re: Error when testing with large sparse svm

Posted by Xiangrui Meng <me...@gmail.com>.
crater, was the error message the same as what you posted before:

14/07/14 11:32:20 ERROR TaskSchedulerImpl: Lost executor 1 on node7: remote
Akka client disassociated
14/07/14 11:32:20 WARN TaskSetManager: Lost TID 20 (task 13.0:0)
14/07/14 11:32:21 ERROR TaskSchedulerImpl: Lost executor 3 on node8: remote
Akka client disassociated
14/07/14 11:32:21 WARN TaskSetManager: Lost TID 21 (task 13.0:1)
14/07/14 11:32:23 ERROR TaskSchedulerImpl: Lost executor 6 on node3: remote
Akka client disassociated
14/07/14 11:32:23 WARN TaskSetManager: Lost TID 22 (task 13.0:0)
14/07/14 11:32:25 ERROR TaskSchedulerImpl: Lost executor 0 on node4: remote
Akka client disassociated
14/07/14 11:32:25 WARN TaskSetManager: Lost TID 23 (task 13.0:1)
14/07/14 11:32:26 ERROR TaskSchedulerImpl: Lost executor 5 on node1: remote
Akka client disassociated
14/07/14 11:32:26 WARN TaskSetManager: Lost TID 24 (task 13.0:0)
14/07/14 11:32:28 ERROR TaskSchedulerImpl: Lost executor 7 on node6: remote
Akka client disassociated
14/07/14 11:32:28 WARN TaskSetManager: Lost TID 26 (task 13.0:0)
14/07/14 11:32:28 ERROR TaskSetManager: Task 13.0:0 failed 4 times; aborting
job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 13.0:0 failed 4 times, most recent failure: TID 26 on
host node6 failed for unknown reason
Driver stacktrace:

Could you paste your code in a gist? It may help identify the problem. Thanks!

Xiangrui


Re: Error when testing with large sparse svm

Posted by crater <cq...@ucmerced.edu>.
I made a bit of progress. I think the problem is with
"BinaryClassificationMetrics": as long as I comment out all the
prediction-related metrics, I can run the SVM example with my data.
So the problem should be there, I guess.





Re: Error when testing with large sparse svm

Posted by Srikrishna S <sr...@gmail.com>.
I am running Spark 1.0.1 on a 5-node YARN cluster. I have set the
driver memory to 8G and the executor memory to about 12G.

Regards,
Krishna



Re: Error when testing with large sparse svm

Posted by Xiangrui Meng <me...@gmail.com>.
Is it on a standalone server? There are several settings worth checking:

1) number of partitions, which should match the number of cores
2) driver memory (you can see it in the executor tab of the Spark
WebUI and set it with "--driver-memory 10g")
3) the version of Spark you were running

Best,
Xiangrui
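
A minimal sketch of 1), assuming sc already exists and using the LIBSVM input
path reconstructed from the original log (for 2), the driver memory would be
passed to spark-submit exactly as quoted above):

    import org.apache.spark.mllib.util.MLUtils

    // Spread the training data over as many partitions as there are cores
    // available, so that every core gets work (suggestion 1 above).
    val training = MLUtils.loadLibSVMFile(sc, "hdfs://master:9001/splice.small")
      .repartition(sc.defaultParallelism)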


Re: Error when testing with large sparse svm

Posted by Srikrishna S <sr...@gmail.com>.
That is exactly the same error that I got. I am still having no success.

Regards,
Krishna


Re: Error when testing with large sparse svm

Posted by crater <cq...@ucmerced.edu>.
Hi Krishna,

Thanks for your help. Were you able to get your 29M-feature data running yet?
I fixed the previous problem by setting a larger spark.akka.frameSize, but now
I get some other errors, shown below. Did you get these errors before?


14/07/14 11:32:20 ERROR TaskSchedulerImpl: Lost executor 1 on node7: remote
Akka client disassociated
14/07/14 11:32:20 WARN TaskSetManager: Lost TID 20 (task 13.0:0)
14/07/14 11:32:21 ERROR TaskSchedulerImpl: Lost executor 3 on node8: remote
Akka client disassociated
14/07/14 11:32:21 WARN TaskSetManager: Lost TID 21 (task 13.0:1)
14/07/14 11:32:23 ERROR TaskSchedulerImpl: Lost executor 6 on node3: remote
Akka client disassociated
14/07/14 11:32:23 WARN TaskSetManager: Lost TID 22 (task 13.0:0)
14/07/14 11:32:25 ERROR TaskSchedulerImpl: Lost executor 0 on node4: remote
Akka client disassociated
14/07/14 11:32:25 WARN TaskSetManager: Lost TID 23 (task 13.0:1)
14/07/14 11:32:26 ERROR TaskSchedulerImpl: Lost executor 5 on node1: remote
Akka client disassociated
14/07/14 11:32:26 WARN TaskSetManager: Lost TID 24 (task 13.0:0)
14/07/14 11:32:28 ERROR TaskSchedulerImpl: Lost executor 7 on node6: remote
Akka client disassociated
14/07/14 11:32:28 WARN TaskSetManager: Lost TID 26 (task 13.0:0)
14/07/14 11:32:28 ERROR TaskSetManager: Task 13.0:0 failed 4 times; aborting
job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 13.0:0 failed 4 times, most recent failure: TID 26 on
host node6 failed for unknown reason
Driver stacktrace:
	at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
	at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
	at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
	at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
	at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
	at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
	at scala.Option.foreach(Option.scala:236)
	at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
	at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
	at akka.actor.ActorCell.invoke(ActorCell.scala:456)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
	at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)





Re: Error when testing with large sparse svm

Posted by Srikrishna S <sr...@gmail.com>.
If you use Scala, you can do:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("Logistic regression SGD fixed")
      .set("spark.akka.frameSize", "100")  // frame size is in MB
      .setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
    val sc = new SparkContext(conf)


I have been struggling with this too. I was trying to run Spark on the
KDDB dataset, which has about 29M features. It implodes and dies. Let
me know if you are able to figure out how to get things to work well
on really, really wide datasets.

Regards,
Krishna


Re: Error when testing with large sparse svm

Posted by crater <cq...@ucmerced.edu>.
Hi Xiangrui,

Where can I set "spark.akka.frameSize"?




Re: Error when testing with large sparse svm

Posted by Xiangrui Meng <me...@gmail.com>.
You need to set a larger `spark.akka.frameSize`, e.g., 128, to accommodate the
serialized weight vector. There is a JIRA about switching automatically
between sending through Akka and broadcasting:
https://issues.apache.org/jira/browse/SPARK-2361

-Xiangrui
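
As a rough sanity check, assuming the weight vector is stored as dense 8-byte
doubles: 13,000,000 features * 8 bytes is roughly 104 MB, which is in the same
ballpark as the 94,453,098-byte serialized task in the original error and far
above the 10 MB (10,485,760-byte) default frame, so a frame size of 128 MB
clears it.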
