Posted to user@spark.apache.org by Vicky Kak <vi...@gmail.com> on 2014/09/04 08:39:40 UTC

Programmatically running Spark jobs.

I have been able to submit Spark jobs using the submit script, but I
would like to do it via code.
I am unable to find anything matching my need.
I am thinking of using org.apache.spark.deploy.SparkSubmit to do so; maybe I
will have to write some utility that passes the parameters required for this
class.
I would be interested to know how the community is doing this.

Thanks,
Vicky
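
P.S. This is roughly what I have in mind - a sketch only, since
org.apache.spark.deploy.SparkSubmit is an internal class rather than a
supported public API, and the class name, master URL and jar path below are
just placeholders:

import org.apache.spark.deploy.SparkSubmit

// Sketch only: hand SparkSubmit.main() the same arguments the
// spark-submit script would receive. All values below are placeholders.
object ProgrammaticSubmit {
  def main(args: Array[String]): Unit = {
    SparkSubmit.main(Array(
      "--class", "com.example.MyJob",        // hypothetical driver class
      "--master", "spark://myhost:7077",     // standalone master URL
      "/path/to/my-job-assembly.jar"         // hypothetical application jar
    ))
  }
}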

Re: Programmatically running Spark jobs.

Posted by Vicky Kak <vi...@gmail.com>.
I don't think so.


On Thu, Sep 4, 2014 at 5:36 PM, Ruebenacker, Oliver A <
Oliver.Ruebenacker@altisource.com> wrote:

>
>
>      Hello,
>
>
>
>   Can this be used as a library from within another application?
>
>   Thanks!
>
>
>
>      Best, Oliver
>
>
>
> From: Matt Chu [mailto:mchu@kabam.com]
> Sent: Thursday, September 04, 2014 2:46 AM
> To: Vicky Kak
> Cc: user
> Subject: Re: Programmatically running Spark jobs.
>
>
>
> https://github.com/spark-jobserver/spark-jobserver
>
>
>
> Ooyala's Spark jobserver is the current de facto standard, IIUC. I just
> added it to our prototype stack, and will begin trying it out soon. Note
> that you can only do standalone or Mesos; YARN isn't quite there yet.
>
>
>
> (The repo just moved from https://github.com/ooyala/spark-jobserver, so
> don't trust Google on this one (yet); development is happening in the first
> repo.)
>
>
>
>
>
> On Wed, Sep 3, 2014 at 11:39 PM, Vicky Kak <vi...@gmail.com> wrote:
>
> I have been able to submit Spark jobs using the submit script, but I
> would like to do it via code.
>
> I am unable to find anything matching my need.
>
> I am thinking of using org.apache.spark.deploy.SparkSubmit to do so; maybe
> I will have to write some utility that passes the parameters required for
> this class.
>
> I would be interested to know how the community is doing this.
>
> Thanks,
> Vicky
>
>
>
>

RE: Programmatically running Spark jobs.

Posted by "Ruebenacker, Oliver A" <Ol...@altisource.com>.
     Hello,

  Can this be used as a library from within another application?
  Thanks!

     Best, Oliver

From: Matt Chu [mailto:mchu@kabam.com]
Sent: Thursday, September 04, 2014 2:46 AM
To: Vicky Kak
Cc: user
Subject: Re: Programmatically running Spark jobs.

https://github.com/spark-jobserver/spark-jobserver

Ooyala's Spark jobserver is the current de facto standard, IIUC. I just added it to our prototype stack, and will begin trying it out soon. Note that you can only do standalone or Mesos; YARN isn't quite there yet.

(The repo just moved from https://github.com/ooyala/spark-jobserver, so don't trust Google on this one (yet); development is happening in the first repo.)


On Wed, Sep 3, 2014 at 11:39 PM, Vicky Kak <vi...@gmail.com> wrote:
I have been able to submit Spark jobs using the submit script, but I would like to do it via code.
I am unable to find anything matching my need.
I am thinking of using org.apache.spark.deploy.SparkSubmit to do so; maybe I will have to write some utility that passes the parameters required for this class.
I would be interested to know how the community is doing this.
Thanks,
Vicky


Re: Programmatically running Spark jobs.

Posted by Vicky Kak <vi...@gmail.com>.
I get this error when I run it from the IDE:
***************************************************************************************

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

***************************************************************************************



On Fri, Sep 5, 2014 at 7:35 AM, ericacm <er...@gmail.com> wrote:

> Ahh - that probably explains an issue I am seeing.  I am a brand-new user
> and I tried running the SimpleApp class from the Quick Start page
> (http://spark.apache.org/docs/latest/quick-start.html).
>
> When I use conf.setMaster("local"), I can run the class directly from my
> IDE.  But when I point the master at my standalone cluster with
> conf.setMaster("spark://myhost:7077") and run the class from the IDE, I
> get this error in the local application (the driver running in the IDE):
>
> 14/09/01 10:56:04 ERROR scheduler.TaskSetManager: Task 0.0:0 failed 4
> times;
> aborting job
> 14/09/01 10:56:04 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0,
> whose tasks have all completed, from pool
> 14/09/01 10:56:04 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
> 14/09/01 10:56:04 INFO client.AppClient$ClientActor: Executor updated:
> app-20140901105546-0001/3 is now EXITED (Command exited with code 52)
> 14/09/01 10:56:04 INFO cluster.SparkDeploySchedulerBackend: Executor
> app-20140901105546-0001/3 removed: Command exited with code 52
> 14/09/01 10:56:04 INFO scheduler.DAGScheduler: Failed to run count at
> SimpleApp.scala:17
> Exception in thread "main" 14/09/01 10:56:04 INFO
> client.AppClient$ClientActor: Executor added: app-20140901105546-0001/4 on
> worker-20140901105055-10.0.1.5-56156 (10.0.1.5:56156) with 8 cores
> org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0.0:0 failed 4 times, most recent failure: TID 3 on host 10.0.1.5 failed
> for
> unknown reason
>
> and this error in the worker stderr:
>
> 14/09/01 10:55:54 ERROR Executor: Exception in task ID 1
> java.lang.OutOfMemoryError: Java heap space
>         at
>
> org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
>         at
> org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2378)
>         at
> org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
>         at
> org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
>         at
>
> org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:42)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
>         at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1872)
>         at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
>         at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>         at
>
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>
> This made no sense, because I gave the worker 1 GB of heap and it was only
> processing a 4 KB README.md file.  I'm guessing it must have tried to
> deserialize a bogus object because I was not submitting the job correctly
> (via spark-submit or this spark-jobserver)?
>
> Thanks,
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Programatically-running-of-the-Spark-Jobs-tp13426p13518.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Re: Programmatically running Spark jobs.

Posted by ericacm <er...@gmail.com>.
Ahh - that probably explains an issue I am seeing.  I am a brand-new user and
I tried running the SimpleApp class from the Quick Start page
(http://spark.apache.org/docs/latest/quick-start.html).

When I use conf.setMaster("local"), I can run the class directly from my
IDE.  But when I point the master at my standalone cluster with
conf.setMaster("spark://myhost:7077") and run the class from the IDE, I get
this error in the local application (the driver running in the IDE):

14/09/01 10:56:04 ERROR scheduler.TaskSetManager: Task 0.0:0 failed 4 times;
aborting job
14/09/01 10:56:04 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0,
whose tasks have all completed, from pool 
14/09/01 10:56:04 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
14/09/01 10:56:04 INFO client.AppClient$ClientActor: Executor updated:
app-20140901105546-0001/3 is now EXITED (Command exited with code 52)
14/09/01 10:56:04 INFO cluster.SparkDeploySchedulerBackend: Executor
app-20140901105546-0001/3 removed: Command exited with code 52
14/09/01 10:56:04 INFO scheduler.DAGScheduler: Failed to run count at
SimpleApp.scala:17
Exception in thread "main" 14/09/01 10:56:04 INFO
client.AppClient$ClientActor: Executor added: app-20140901105546-0001/4 on
worker-20140901105055-10.0.1.5-56156 (10.0.1.5:56156) with 8 cores
org.apache.spark.SparkException: Job aborted due to stage failure: Task
0.0:0 failed 4 times, most recent failure: TID 3 on host 10.0.1.5 failed for
unknown reason

and this error in the worker stderr:

14/09/01 10:55:54 ERROR Executor: Exception in task ID 1
java.lang.OutOfMemoryError: Java heap space
	at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
	at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2378)
	at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
	at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
	at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:42)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:601)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1872)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1777)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)

This made no sense, because I gave the worker 1 GB of heap and it was only
processing a 4 KB README.md file.  I'm guessing it must have tried to
deserialize a bogus object because I was not submitting the job correctly
(via spark-submit or this spark-jobserver)?
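
If that guess is right, then I assume the fix (short of using spark-submit or
the jobserver) is to ship the application jar to the executors explicitly - a
sketch only, with my paths and master URL as placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch, assuming the failure is just that the standalone executors never
// received the application classes: pass the assembled jar via setJars when
// the driver runs from the IDE. Paths and URLs are placeholders.
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Simple Application")
      .setMaster("spark://myhost:7077")
      .setJars(Seq("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val sc = new SparkContext(conf)
    val logData = sc.textFile("README.md").cache()
    println("Lines with a: " + logData.filter(_.contains("a")).count())
    sc.stop()
  }
}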

Thanks,



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Programatically-running-of-the-Spark-Jobs-tp13426p13518.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Programmatically running Spark jobs.

Posted by Vicky Kak <vi...@gmail.com>.
I don't want to use YARN or Mesos; I am just trying the standalone Spark cluster.
We need a way to do seamless submission with the API, which I don't see.
To my surprise I was hit by this issue when I tried running the submit from
another machine; it is crazy that I have to submit the job from the worker
node or play with the environment variables. It is anything but seamless:
http://apache-spark-user-list.1001560.n3.nabble.com/executor-failed-cannot-find-compute-classpath-sh-td859.html


On Fri, Sep 5, 2014 at 8:33 AM, Guru Medasani <gd...@outlook.com> wrote:

> I am able to run Spark jobs and Spark Streaming jobs successfully via YARN
> on a CDH cluster.
>
> When you say YARN isn't quite there yet, do you mean for submitting the jobs
> programmatically, or just in general?
>
>
> On Sep 4, 2014, at 1:45 AM, Matt Chu <mc...@kabam.com> wrote:
>
> https://github.com/spark-jobserver/spark-jobserver
>
> Ooyala's Spark jobserver is the current de facto standard, IIUC. I just
> added it to our prototype stack, and will begin trying it out soon. Note
> that you can only do standalone or Mesos; YARN isn't quite there yet.
>
> (The repo just moved from https://github.com/ooyala/spark-jobserver, so
> don't trust Google on this one (yet); development is happening in the first
> repo.)
>
>
>
> On Wed, Sep 3, 2014 at 11:39 PM, Vicky Kak <vi...@gmail.com> wrote:
>
>> I have been able to submit Spark jobs using the submit script, but I
>> would like to do it via code.
>> I am unable to find anything matching my need.
>> I am thinking of using org.apache.spark.deploy.SparkSubmit to do so; maybe
>> I will have to write some utility that passes the parameters required for
>> this class.
>> I would be interested to know how the community is doing this.
>>
>> Thanks,
>> Vicky
>>
>
>
>

Re: Programmatically running Spark jobs.

Posted by Guru Medasani <gd...@outlook.com>.
I am able to run Spark jobs and Spark Streaming jobs successfully via YARN on a CDH cluster. 

When you say YARN isn't quite there yet, do you mean for submitting the jobs programmatically, or just in general?
 

On Sep 4, 2014, at 1:45 AM, Matt Chu <mc...@kabam.com> wrote:

> https://github.com/spark-jobserver/spark-jobserver
> 
> Ooyala's Spark jobserver is the current de facto standard, IIUC. I just added it to our prototype stack, and will begin trying it out soon. Note that you can only do standalone or Mesos; YARN isn't quite there yet.
> 
> (The repo just moved from https://github.com/ooyala/spark-jobserver, so don't trust Google on this one (yet); development is happening in the first repo.)
> 
> 
> 
> On Wed, Sep 3, 2014 at 11:39 PM, Vicky Kak <vi...@gmail.com> wrote:
> I have been able to submit Spark jobs using the submit script, but I would like to do it via code.
> I am unable to find anything matching my need.
> I am thinking of using org.apache.spark.deploy.SparkSubmit to do so; maybe I will have to write some utility that passes the parameters required for this class.
> I would be interested to know how the community is doing this.
> 
> Thanks,
> Vicky
> 


Re: Programmatically running Spark jobs.

Posted by Matt Chu <mc...@kabam.com>.
https://github.com/spark-jobserver/spark-jobserver

Ooyala's Spark jobserver is the current de facto standard, IIUC. I just
added it to our prototype stack, and will begin trying it out soon. Note
that you can only do standalone or Mesos; YARN isn't quite there yet.

(The repo just moved from https://github.com/ooyala/spark-jobserver, so
don't trust Google on this one (yet); development is happening in the first
repo.)
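
Roughly, you write your job against the jobserver's SparkJob trait, upload the
assembled jar over its REST API, and then start runs by app name and class. A
sketch from memory of the project's README (trait name, method signatures and
the input.string key come from that project and may differ slightly by version):

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

import scala.util.Try

// Sketch of a jobserver job: the server owns the SparkContext and hands it to
// runJob; validate() lets the server reject bad requests before running.
object WordCountJob extends SparkJob {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    Try(config.getString("input.string"))
      .map(_ => SparkJobValid)
      .getOrElse(SparkJobInvalid("No input.string config param"))

  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(config.getString("input.string").split(" ").toSeq).countByValue()
}

After that, submitting is basically two HTTP calls: one to upload the jar under
an app name, and one to start a job naming that app and this class.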



On Wed, Sep 3, 2014 at 11:39 PM, Vicky Kak <vi...@gmail.com> wrote:

> I have been able to submit Spark jobs using the submit script, but I
> would like to do it via code.
> I am unable to find anything matching my need.
> I am thinking of using org.apache.spark.deploy.SparkSubmit to do so; maybe
> I will have to write some utility that passes the parameters required for
> this class.
> I would be interested to know how the community is doing this.
>
> Thanks,
> Vicky
>