Posted to user@spark.apache.org by Cosmin Posteuca <co...@gmail.com> on 2017/02/07 13:37:40 UTC

[Spark Context]: How to add on demand jobs to an existing spark context?

I want to run different jobs on demand using the same Spark context, but I
don't know exactly how I can do this.

I try to get the current context, but it seems to create a new Spark context
(with new executors).

I call spark-submit to add new jobs.

I run the code on Amazon EMR (3 instances, 4 cores & 16 GB RAM per instance),
with YARN as the resource manager.

My code:

val sparkContext = SparkContext.getOrCreate() // expected to reuse an existing context
val content = 1 to 40000
val result = sparkContext.parallelize(content, 5)
result.map(value => value.toString).foreach(loop)

// busy loop that only simulates some CPU-bound work for each element
def loop(x: String): Unit = {
   for (a <- 1 to 30000000) {

   }
}

spark-submit:

spark-submit --executor-cores 1 \
             --executor-memory 1g \
             --driver-memory 1g \
             --master yarn \
             --deploy-mode cluster \
             --conf spark.dynamicAllocation.enabled=true \
             --conf spark.shuffle.service.enabled=true \
             --conf spark.dynamicAllocation.minExecutors=1 \
             --conf spark.dynamicAllocation.maxExecutors=3 \
             --conf spark.dynamicAllocation.initialExecutors=3 \
             --conf spark.executor.instances=3 \

If I run spark-submit twice it creates 6 executors, but I want to run all
these jobs in the same Spark application.

How can I add jobs to an existing Spark application?

I don't understand why SparkContext.getOrCreate() doesn't return the existing
Spark context.


Thanks,

Cosmin P.

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

Posted by Cosmin Posteuca <co...@gmail.com>.
Response for vincent:

Thanks for the answer!

Yes, I need a business solution; that's the reason why I can't use the Spark
Jobserver or Livy solutions. I will look at your GitHub project to see how to
build such a system.

But I don't understand why Spark doesn't have a solution for this kind of
problem, and why I can't get the existing context and run some code on it.

Thanks

2017-02-07 19:26 GMT+02:00 vincent gromakowski <
vincent.gromakowski@gmail.com>:

> Spark jobserver or Livy server are the best options for pure technical API.
> If you want to publish business API you will probably have to build you
> own app like the one I wrote a year ago https://github.com/elppc/akka-
> spark-experiments
> It combines Akka actors and a shared Spark context to serve concurrent
> subsecond jobs
>
>
> 2017-02-07 15:28 GMT+01:00 ayan guha <gu...@gmail.com>:
>
>> I think you are loking for livy or spark  jobserver
>>
>> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <
>> cosmin.posteuca@gmail.com> wrote:
>>
>>> I want to run different jobs on demand with same spark context, but i
>>> don't know how exactly i can do this.
>>>
>>> I try to get current context, but seems it create a new spark
>>> context(with new executors).
>>>
>>> I call spark-submit to add new jobs.
>>>
>>> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance),
>>> with yarn as resource manager.
>>>
>>> My code:
>>>
>>> val sparkContext = SparkContext.getOrCreate()
>>> val content = 1 to 40000
>>> val result = sparkContext.parallelize(content, 5)
>>> result.map(value => value.toString).foreach(loop)
>>>
>>> def loop(x: String): Unit = {
>>>    for (a <- 1 to 30000000) {
>>>
>>>    }
>>> }
>>>
>>> spark-submit:
>>>
>>> spark-submit --executor-cores 1 \
>>>              --executor-memory 1g \
>>>              --driver-memory 1g \
>>>              --master yarn \
>>>              --deploy-mode cluster \
>>>              --conf spark.dynamicAllocation.enabled=true \
>>>              --conf spark.shuffle.service.enabled=true \
>>>              --conf spark.dynamicAllocation.minExecutors=1 \
>>>              --conf spark.dynamicAllocation.maxExecutors=3 \
>>>              --conf spark.dynamicAllocation.initialExecutors=3 \
>>>              --conf spark.executor.instances=3 \
>>>
>>> If i run twice spark-submit it create 6 executors, but i want to run all
>>> this jobs on same spark application.
>>>
>>> How can achieve adding jobs to an existing spark application?
>>>
>>> I don't understand why SparkContext.getOrCreate() don't get existing
>>> spark context.
>>>
>>>
>>> Thanks,
>>>
>>> Cosmin P.
>>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

Posted by Jörn Franke <jo...@gmail.com>.
Resource management in YARN cluster mode is YARN's task, so it depends on how
you have configured the queues and the scheduler there.
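
For example, assuming a queue named "jobs" has been defined on the YARN side,
a submission can be pinned to it from the application. A minimal sketch; the
queue name is illustrative, and its capacity and scheduling policy come
entirely from the YARN configuration:

import org.apache.spark.{SparkConf, SparkContext}

// Target a specific YARN queue; how applications in that queue are ordered is
// decided by the YARN scheduler (capacity/fair), not by Spark itself.
val conf = new SparkConf()
  .setAppName("on-demand-job")
  .set("spark.yarn.queue", "jobs")   // illustrative queue name
val sc = SparkContext.getOrCreate(conf)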

> On 8 Feb 2017, at 12:10, Cosmin Posteuca <co...@gmail.com> wrote:
> 
> I tried to run some test on EMR on yarn cluster mode.
> 
> I have a cluster with 16 cores(8 processors with 2 threads each). If i run one job(use 5 core) takes 90 seconds, if i run 2 jobs simultaneous, both finished in 170 seconds. If i run 3 jobs simultaneous, all three finished in 240 seconds.
> 
> If i run 6 jobs, i expect to first 3 jobs to finish simultaneous in 240 seconds, and next 3 jobs finish in 480 seconds from cluster start time. But that doesn’t happened. My firs job finished after 120 second, second finished after 180 seconds, third finished after 240 second, the fourth and the fifth finished simultaneous after 360 seconds, and the last finished after 400 seconds.
> 
> I expected to run in a FIFO mode, but that doesn’t happened. Seems to be a combination of FIFO and FAIR.
> 
> Is this the correct behavior of spark?
> 
> Thank you!
> 
> 
> 2017-02-08 9:29 GMT+02:00 Gourav Sengupta <go...@gmail.com>:
>> Hi,
>> 
>> Michael's answer will solve the problem in case you using only SQL based solution.
>> 
>> Otherwise please refer to the wonderful details mentioned here https://spark.apache.org/docs/latest/job-scheduling.html. With EMR 5.3.0 released  SPARK 2.1.0 is available in AWS.
>> 
>> (note that there is an issue with using zeppelin in it and I have raised it as an issue to AWS and they are looking into it now)
>> 
>> Regards,
>> Gourav Sengupta
>> 
>>> On Tue, Feb 7, 2017 at 10:37 PM, Michael Segel <ms...@hotmail.com> wrote:
>>> Why couldn’t you use the spark thrift server? 
>>> 
>>> 
>>>> On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca <co...@gmail.com> wrote:
>>>> 
>>>> answer for Gourav Sengupta
>>>> 
>>>> I want to use same spark application because i want to work as a FIFO scheduler. My problem is that i have many jobs(not so big) and if i run an application for every job my cluster will split resources as a FAIR scheduler(it's what i observe, maybe i'm wrong) and exist the possibility to create bottleneck effect. The start time isn't a problem for me, because it isn't a real-time application.
>>>> 
>>>> I need a business solution, that's the reason why i can't use code from github.
>>>> 
>>>> Thanks!
>>>> 
>>>> 2017-02-07 19:55 GMT+02:00 Gourav Sengupta <go...@gmail.com>:
>>>>> Hi,
>>>>> 
>>>>> May I ask the reason for using the same spark application? Is it because of the time it takes in order to start a spark context?
>>>>> 
>>>>> On another note you may want to look at the number of contributors in a github repo before choosing a solution.
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Gourav 
>>>>> 
>>>>>> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski <vi...@gmail.com> wrote:
>>>>>> Spark jobserver or Livy server are the best options for pure technical API.
>>>>>> If you want to publish business API you will probably have to build you own app like the one I wrote a year ago https://github.com/elppc/akka-spark-experiments
>>>>>> It combines Akka actors and a shared Spark context to serve concurrent subsecond jobs
>>>>>> 
>>>>>> 
>>>>>> 2017-02-07 15:28 GMT+01:00 ayan guha <gu...@gmail.com>:
>>>>>>> I think you are loking for livy or spark  jobserver
>>>>>>> 
>>>>>>>> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <co...@gmail.com> wrote:
>>>>>>>> I want to run different jobs on demand with same spark context, but i don't know how exactly i can do this.
>>>>>>>> 
>>>>>>>> I try to get current context, but seems it create a new spark context(with new executors).
>>>>>>>> 
>>>>>>>> I call spark-submit to add new jobs.
>>>>>>>> 
>>>>>>>> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance), with yarn as resource manager.
>>>>>>>> 
>>>>>>>> My code:
>>>>>>>> 
>>>>>>>> val sparkContext = SparkContext.getOrCreate()
>>>>>>>> val content = 1 to 40000
>>>>>>>> val result = sparkContext.parallelize(content, 5)
>>>>>>>> result.map(value => value.toString).foreach(loop)
>>>>>>>> 
>>>>>>>> def loop(x: String): Unit = {
>>>>>>>>    for (a <- 1 to 30000000) {
>>>>>>>> 
>>>>>>>>    }
>>>>>>>> }
>>>>>>>> spark-submit:
>>>>>>>> 
>>>>>>>> spark-submit --executor-cores 1 \
>>>>>>>>              --executor-memory 1g \
>>>>>>>>              --driver-memory 1g \
>>>>>>>>              --master yarn \
>>>>>>>>              --deploy-mode cluster \
>>>>>>>>              --conf spark.dynamicAllocation.enabled=true \
>>>>>>>>              --conf spark.shuffle.service.enabled=true \
>>>>>>>>              --conf spark.dynamicAllocation.minExecutors=1 \
>>>>>>>>              --conf spark.dynamicAllocation.maxExecutors=3 \
>>>>>>>>              --conf spark.dynamicAllocation.initialExecutors=3 \
>>>>>>>>              --conf spark.executor.instances=3 \
>>>>>>>> If i run twice spark-submit it create 6 executors, but i want to run all this jobs on same spark application.
>>>>>>>> 
>>>>>>>> How can achieve adding jobs to an existing spark application?
>>>>>>>> 
>>>>>>>> I don't understand why SparkContext.getOrCreate() don't get existing spark context.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Cosmin P.
>>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> Best Regards,
>>>>>>> Ayan Guha
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

Posted by Cosmin Posteuca <co...@gmail.com>.
Thank you very much for your answers. Now I understand better what I have
to do. Thank you!

On Wed, 8 Feb 2017 at 22:37, Gourav Sengupta <go...@gmail.com>
wrote:

> Hi,
>
> I am not quite sure of your used case here, but I would use spark-submit
> and submit sequential jobs as steps to an EMR cluster.
>
>
> Regards,
> Gourav
>
> On Wed, Feb 8, 2017 at 11:10 AM, Cosmin Posteuca <
> cosmin.posteuca@gmail.com> wrote:
>
> I tried to run some test on EMR on yarn cluster mode.
>
> I have a cluster with 16 cores(8 processors with 2 threads each). If i run
> one job(use 5 core) takes 90 seconds, if i run 2 jobs simultaneous, both
> finished in 170 seconds. If i run 3 jobs simultaneous, all three finished
> in 240 seconds.
>
> If i run 6 jobs, i expect to first 3 jobs to finish simultaneous in 240
> seconds, and next 3 jobs finish in 480 seconds from cluster start time. But
> that doesn’t happened. My firs job finished after 120 second, second
> finished after 180 seconds, third finished after 240 second, the fourth and
> the fifth finished simultaneous after 360 seconds, and the last finished
> after 400 seconds.
>
> I expected to run in a FIFO mode, but that doesn’t happened. Seems to be a
> combination of FIFO and FAIR.
>
> Is this the correct behavior of spark?
>
> Thank you!
>
> 2017-02-08 9:29 GMT+02:00 Gourav Sengupta <go...@gmail.com>:
>
> Hi,
>
> Michael's answer will solve the problem in case you using only SQL based
> solution.
>
> Otherwise please refer to the wonderful details mentioned here
> https://spark.apache.org/docs/latest/job-scheduling.html. With EMR 5.3.0
> released  SPARK 2.1.0 is available in AWS.
>
> (note that there is an issue with using zeppelin in it and I have raised
> it as an issue to AWS and they are looking into it now)
>
> Regards,
> Gourav Sengupta
>
> On Tue, Feb 7, 2017 at 10:37 PM, Michael Segel <ms...@hotmail.com>
> wrote:
>
> Why couldn’t you use the spark thrift server?
>
> On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca <co...@gmail.com>
> wrote:
>
> answer for Gourav Sengupta
>
> I want to use same spark application because i want to work as a FIFO
> scheduler. My problem is that i have many jobs(not so big) and if i run an
> application for every job my cluster will split resources as a FAIR
> scheduler(it's what i observe, maybe i'm wrong) and exist the possibility
> to create bottleneck effect. The start time isn't a problem for me, because
> it isn't a real-time application.
>
> I need a business solution, that's the reason why i can't use code from
> github.
>
> Thanks!
>
> 2017-02-07 19:55 GMT+02:00 Gourav Sengupta <go...@gmail.com>:
>
> Hi,
>
> May I ask the reason for using the same spark application? Is it because
> of the time it takes in order to start a spark context?
>
> On another note you may want to look at the number of contributors in a
> github repo before choosing a solution.
>
> Regards,
> Gourav
>
> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski <
> vincent.gromakowski@gmail.com> wrote:
>
> Spark jobserver or Livy server are the best options for pure technical API.
> If you want to publish business API you will probably have to build you
> own app like the one I wrote a year ago
> https://github.com/elppc/akka-spark-experiments
> It combines Akka actors and a shared Spark context to serve concurrent
> subsecond jobs
>
> 2017-02-07 15:28 GMT+01:00 ayan guha <gu...@gmail.com>:
>
> I think you are loking for livy or spark  jobserver
>
> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <co...@gmail.com>
> wrote:
>
> I want to run different jobs on demand with same spark context, but i
> don't know how exactly i can do this.
>
> I try to get current context, but seems it create a new spark context(with
> new executors).
>
> I call spark-submit to add new jobs.
>
> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance), with
> yarn as resource manager.
>
> My code:
>
> val sparkContext = SparkContext.getOrCreate()
> val content = 1 to 40000
> val result = sparkContext.parallelize(content, 5)
> result.map(value => value.toString).foreach(loop)
>
> def loop(x: String): Unit = {
>    for (a <- 1 to 30000000) {
>
>    }
> }
>
> spark-submit:
>
> spark-submit --executor-cores 1 \
>              --executor-memory 1g \
>              --driver-memory 1g \
>              --master yarn \
>              --deploy-mode cluster \
>              --conf spark.dynamicAllocation.enabled=true \
>              --conf spark.shuffle.service.enabled=true \
>              --conf spark.dynamicAllocation.minExecutors=1 \
>              --conf spark.dynamicAllocation.maxExecutors=3 \
>              --conf spark.dynamicAllocation.initialExecutors=3 \
>              --conf spark.executor.instances=3 \
>
> If i run twice spark-submit it create 6 executors, but i want to run all
> this jobs on same spark application.
>
> How can achieve adding jobs to an existing spark application?
>
> I don't understand why SparkContext.getOrCreate() don't get existing spark
> context.
>
> Thanks,
>
> Cosmin P.
>
> --
> Best Regards,
> Ayan Guha

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

I am not quite sure of your use case here, but I would use spark-submit
and submit sequential jobs as steps to an EMR cluster.


Regards,
Gourav

On Wed, Feb 8, 2017 at 11:10 AM, Cosmin Posteuca <co...@gmail.com>
wrote:

> I tried to run some test on EMR on yarn cluster mode.
>
> I have a cluster with 16 cores(8 processors with 2 threads each). If i run
> one job(use 5 core) takes 90 seconds, if i run 2 jobs simultaneous, both
> finished in 170 seconds. If i run 3 jobs simultaneous, all three finished
> in 240 seconds.
>
> If i run 6 jobs, i expect to first 3 jobs to finish simultaneous in 240
> seconds, and next 3 jobs finish in 480 seconds from cluster start time. But
> that doesn’t happened. My firs job finished after 120 second, second
> finished after 180 seconds, third finished after 240 second, the fourth and
> the fifth finished simultaneous after 360 seconds, and the last finished
> after 400 seconds.
>
> I expected to run in a FIFO mode, but that doesn’t happened. Seems to be a
> combination of FIFO and FAIR.
>
> Is this the correct behavior of spark?
>
> Thank you!
>
> 2017-02-08 9:29 GMT+02:00 Gourav Sengupta <go...@gmail.com>:
>
>> Hi,
>>
>> Michael's answer will solve the problem in case you using only SQL based
>> solution.
>>
>> Otherwise please refer to the wonderful details mentioned here
>> https://spark.apache.org/docs/latest/job-scheduling.html. With EMR 5.3.0
>> released  SPARK 2.1.0 is available in AWS.
>>
>> (note that there is an issue with using zeppelin in it and I have raised
>> it as an issue to AWS and they are looking into it now)
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Tue, Feb 7, 2017 at 10:37 PM, Michael Segel <msegel_hadoop@hotmail.com
>> > wrote:
>>
>>> Why couldn’t you use the spark thrift server?
>>>
>>>
>>> On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca <co...@gmail.com>
>>> wrote:
>>>
>>> answer for Gourav Sengupta
>>>
>>> I want to use same spark application because i want to work as a FIFO
>>> scheduler. My problem is that i have many jobs(not so big) and if i run an
>>> application for every job my cluster will split resources as a FAIR
>>> scheduler(it's what i observe, maybe i'm wrong) and exist the possibility
>>> to create bottleneck effect. The start time isn't a problem for me, because
>>> it isn't a real-time application.
>>>
>>> I need a business solution, that's the reason why i can't use code from
>>> github.
>>>
>>> Thanks!
>>>
>>> 2017-02-07 19:55 GMT+02:00 Gourav Sengupta <go...@gmail.com>:
>>>
>>>> Hi,
>>>>
>>>> May I ask the reason for using the same spark application? Is it
>>>> because of the time it takes in order to start a spark context?
>>>>
>>>> On another note you may want to look at the number of contributors in a
>>>> github repo before choosing a solution.
>>>>
>>>>
>>>> Regards,
>>>> Gourav
>>>>
>>>> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski <
>>>> vincent.gromakowski@gmail.com> wrote:
>>>>
>>>>> Spark jobserver or Livy server are the best options for pure technical
>>>>> API.
>>>>> If you want to publish business API you will probably have to build
>>>>> you own app like the one I wrote a year ago
>>>>> https://github.com/elppc/akka-spark-experiments
>>>>> It combines Akka actors and a shared Spark context to serve concurrent
>>>>> subsecond jobs
>>>>>
>>>>>
>>>>> 2017-02-07 15:28 GMT+01:00 ayan guha <gu...@gmail.com>:
>>>>>
>>>>>> I think you are loking for livy or spark  jobserver
>>>>>>
>>>>>> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <
>>>>>> cosmin.posteuca@gmail.com> wrote:
>>>>>>
>>>>>>> I want to run different jobs on demand with same spark context, but
>>>>>>> i don't know how exactly i can do this.
>>>>>>>
>>>>>>> I try to get current context, but seems it create a new spark
>>>>>>> context(with new executors).
>>>>>>>
>>>>>>> I call spark-submit to add new jobs.
>>>>>>>
>>>>>>> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance),
>>>>>>> with yarn as resource manager.
>>>>>>>
>>>>>>> My code:
>>>>>>>
>>>>>>> val sparkContext = SparkContext.getOrCreate()
>>>>>>> val content = 1 to 40000
>>>>>>> val result = sparkContext.parallelize(content, 5)
>>>>>>> result.map(value => value.toString).foreach(loop)
>>>>>>>
>>>>>>> def loop(x: String): Unit = {
>>>>>>>    for (a <- 1 to 30000000) {
>>>>>>>
>>>>>>>    }
>>>>>>> }
>>>>>>>
>>>>>>> spark-submit:
>>>>>>>
>>>>>>> spark-submit --executor-cores 1 \
>>>>>>>              --executor-memory 1g \
>>>>>>>              --driver-memory 1g \
>>>>>>>              --master yarn \
>>>>>>>              --deploy-mode cluster \
>>>>>>>              --conf spark.dynamicAllocation.enabled=true \
>>>>>>>              --conf spark.shuffle.service.enabled=true \
>>>>>>>              --conf spark.dynamicAllocation.minExecutors=1 \
>>>>>>>              --conf spark.dynamicAllocation.maxExecutors=3 \
>>>>>>>              --conf spark.dynamicAllocation.initialExecutors=3 \
>>>>>>>              --conf spark.executor.instances=3 \
>>>>>>>
>>>>>>> If i run twice spark-submit it create 6 executors, but i want to run
>>>>>>> all this jobs on same spark application.
>>>>>>>
>>>>>>> How can achieve adding jobs to an existing spark application?
>>>>>>>
>>>>>>> I don't understand why SparkContext.getOrCreate() don't get
>>>>>>> existing spark context.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Cosmin P.
>>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

Posted by Cosmin Posteuca <co...@gmail.com>.
I tried to run some tests on EMR in YARN cluster mode.

I have a cluster with 16 cores (8 processors with 2 threads each). If I run
one job (using 5 cores) it takes 90 seconds; if I run 2 jobs simultaneously,
both finish in 170 seconds; if I run 3 jobs simultaneously, all three finish
in 240 seconds.

If I run 6 jobs, I expect the first 3 jobs to finish simultaneously in 240
seconds and the next 3 jobs to finish 480 seconds after cluster start time,
but that doesn't happen. My first job finished after 120 seconds, the second
after 180 seconds, the third after 240 seconds, the fourth and the fifth
finished simultaneously after 360 seconds, and the last finished after 400
seconds.

I expected them to run in FIFO mode, but that doesn't happen; it seems to be
a combination of FIFO and FAIR.

Is this the correct behavior of Spark?

Thank you!
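
For reference, the strict FIFO ordering described above is what Spark
guarantees within a single application: jobs submitted from different threads
against one SparkContext are queued FIFO by default (spark.scheduler.mode).
A minimal sketch of running several on-demand jobs that way; the job bodies
are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

object FifoJobs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fifo-jobs"))

    // Each thread submits one Spark job against the shared context; with the
    // default FIFO scheduler the first job gets the resources first, then the
    // next, and so on.
    val threads = (1 to 3).map { i =>
      new Thread(new Runnable {
        def run(): Unit = {
          val n = sc.parallelize(1 to 40000, 5).map(_.toString).count()
          println(s"job $i counted $n elements")
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    sc.stop()
  }
}

Across separate YARN applications, by contrast, ordering depends on the YARN
scheduler configuration, which would explain the mix of FIFO-like and
FAIR-like timings observed above.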

2017-02-08 9:29 GMT+02:00 Gourav Sengupta <go...@gmail.com>:

> Hi,
>
> Michael's answer will solve the problem in case you using only SQL based
> solution.
>
> Otherwise please refer to the wonderful details mentioned here
> https://spark.apache.org/docs/latest/job-scheduling.html. With EMR 5.3.0
> released  SPARK 2.1.0 is available in AWS.
>
> (note that there is an issue with using zeppelin in it and I have raised
> it as an issue to AWS and they are looking into it now)
>
> Regards,
> Gourav Sengupta
>
> On Tue, Feb 7, 2017 at 10:37 PM, Michael Segel <ms...@hotmail.com>
> wrote:
>
>> Why couldn’t you use the spark thrift server?
>>
>>
>> On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca <co...@gmail.com>
>> wrote:
>>
>> answer for Gourav Sengupta
>>
>> I want to use same spark application because i want to work as a FIFO
>> scheduler. My problem is that i have many jobs(not so big) and if i run an
>> application for every job my cluster will split resources as a FAIR
>> scheduler(it's what i observe, maybe i'm wrong) and exist the possibility
>> to create bottleneck effect. The start time isn't a problem for me, because
>> it isn't a real-time application.
>>
>> I need a business solution, that's the reason why i can't use code from
>> github.
>>
>> Thanks!
>>
>> 2017-02-07 19:55 GMT+02:00 Gourav Sengupta <go...@gmail.com>:
>>
>>> Hi,
>>>
>>> May I ask the reason for using the same spark application? Is it because
>>> of the time it takes in order to start a spark context?
>>>
>>> On another note you may want to look at the number of contributors in a
>>> github repo before choosing a solution.
>>>
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski <
>>> vincent.gromakowski@gmail.com> wrote:
>>>
>>>> Spark jobserver or Livy server are the best options for pure technical
>>>> API.
>>>> If you want to publish business API you will probably have to build you
>>>> own app like the one I wrote a year ago https://github.com/elppc/akka-
>>>> spark-experiments
>>>> It combines Akka actors and a shared Spark context to serve concurrent
>>>> subsecond jobs
>>>>
>>>>
>>>> 2017-02-07 15:28 GMT+01:00 ayan guha <gu...@gmail.com>:
>>>>
>>>>> I think you are loking for livy or spark  jobserver
>>>>>
>>>>> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <
>>>>> cosmin.posteuca@gmail.com> wrote:
>>>>>
>>>>>> I want to run different jobs on demand with same spark context, but i
>>>>>> don't know how exactly i can do this.
>>>>>>
>>>>>> I try to get current context, but seems it create a new spark
>>>>>> context(with new executors).
>>>>>>
>>>>>> I call spark-submit to add new jobs.
>>>>>>
>>>>>> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance),
>>>>>> with yarn as resource manager.
>>>>>>
>>>>>> My code:
>>>>>>
>>>>>> val sparkContext = SparkContext.getOrCreate()
>>>>>> val content = 1 to 40000
>>>>>> val result = sparkContext.parallelize(content, 5)
>>>>>> result.map(value => value.toString).foreach(loop)
>>>>>>
>>>>>> def loop(x: String): Unit = {
>>>>>>    for (a <- 1 to 30000000) {
>>>>>>
>>>>>>    }
>>>>>> }
>>>>>>
>>>>>> spark-submit:
>>>>>>
>>>>>> spark-submit --executor-cores 1 \
>>>>>>              --executor-memory 1g \
>>>>>>              --driver-memory 1g \
>>>>>>              --master yarn \
>>>>>>              --deploy-mode cluster \
>>>>>>              --conf spark.dynamicAllocation.enabled=true \
>>>>>>              --conf spark.shuffle.service.enabled=true \
>>>>>>              --conf spark.dynamicAllocation.minExecutors=1 \
>>>>>>              --conf spark.dynamicAllocation.maxExecutors=3 \
>>>>>>              --conf spark.dynamicAllocation.initialExecutors=3 \
>>>>>>              --conf spark.executor.instances=3 \
>>>>>>
>>>>>> If i run twice spark-submit it create 6 executors, but i want to run
>>>>>> all this jobs on same spark application.
>>>>>>
>>>>>> How can achieve adding jobs to an existing spark application?
>>>>>>
>>>>>> I don't understand why SparkContext.getOrCreate() don't get existing
>>>>>> spark context.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Cosmin P.
>>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>>
>>
>>
>

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

Michael's answer will solve the problem in case you are using only a
SQL-based solution.

Otherwise please refer to the wonderful details mentioned here:
https://spark.apache.org/docs/latest/job-scheduling.html. With EMR 5.3.0
released, Spark 2.1.0 is available in AWS.

(Note that there is an issue with using Zeppelin in it; I have raised it as
an issue with AWS and they are looking into it now.)

Regards,
Gourav Sengupta

On Tue, Feb 7, 2017 at 10:37 PM, Michael Segel <ms...@hotmail.com>
wrote:

> Why couldn’t you use the spark thrift server?
>
>
> On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca <co...@gmail.com>
> wrote:
>
> answer for Gourav Sengupta
>
> I want to use same spark application because i want to work as a FIFO
> scheduler. My problem is that i have many jobs(not so big) and if i run an
> application for every job my cluster will split resources as a FAIR
> scheduler(it's what i observe, maybe i'm wrong) and exist the possibility
> to create bottleneck effect. The start time isn't a problem for me, because
> it isn't a real-time application.
>
> I need a business solution, that's the reason why i can't use code from
> github.
>
> Thanks!
>
> 2017-02-07 19:55 GMT+02:00 Gourav Sengupta <go...@gmail.com>:
>
>> Hi,
>>
>> May I ask the reason for using the same spark application? Is it because
>> of the time it takes in order to start a spark context?
>>
>> On another note you may want to look at the number of contributors in a
>> github repo before choosing a solution.
>>
>>
>> Regards,
>> Gourav
>>
>> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski <
>> vincent.gromakowski@gmail.com> wrote:
>>
>>> Spark jobserver or Livy server are the best options for pure technical
>>> API.
>>> If you want to publish business API you will probably have to build you
>>> own app like the one I wrote a year ago https://github.com/elppc/akka-
>>> spark-experiments
>>> It combines Akka actors and a shared Spark context to serve concurrent
>>> subsecond jobs
>>>
>>>
>>> 2017-02-07 15:28 GMT+01:00 ayan guha <gu...@gmail.com>:
>>>
>>>> I think you are loking for livy or spark  jobserver
>>>>
>>>> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <
>>>> cosmin.posteuca@gmail.com> wrote:
>>>>
>>>>> I want to run different jobs on demand with same spark context, but i
>>>>> don't know how exactly i can do this.
>>>>>
>>>>> I try to get current context, but seems it create a new spark
>>>>> context(with new executors).
>>>>>
>>>>> I call spark-submit to add new jobs.
>>>>>
>>>>> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance),
>>>>> with yarn as resource manager.
>>>>>
>>>>> My code:
>>>>>
>>>>> val sparkContext = SparkContext.getOrCreate()
>>>>> val content = 1 to 40000
>>>>> val result = sparkContext.parallelize(content, 5)
>>>>> result.map(value => value.toString).foreach(loop)
>>>>>
>>>>> def loop(x: String): Unit = {
>>>>>    for (a <- 1 to 30000000) {
>>>>>
>>>>>    }
>>>>> }
>>>>>
>>>>> spark-submit:
>>>>>
>>>>> spark-submit --executor-cores 1 \
>>>>>              --executor-memory 1g \
>>>>>              --driver-memory 1g \
>>>>>              --master yarn \
>>>>>              --deploy-mode cluster \
>>>>>              --conf spark.dynamicAllocation.enabled=true \
>>>>>              --conf spark.shuffle.service.enabled=true \
>>>>>              --conf spark.dynamicAllocation.minExecutors=1 \
>>>>>              --conf spark.dynamicAllocation.maxExecutors=3 \
>>>>>              --conf spark.dynamicAllocation.initialExecutors=3 \
>>>>>              --conf spark.executor.instances=3 \
>>>>>
>>>>> If i run twice spark-submit it create 6 executors, but i want to run
>>>>> all this jobs on same spark application.
>>>>>
>>>>> How can achieve adding jobs to an existing spark application?
>>>>>
>>>>> I don't understand why SparkContext.getOrCreate() don't get existing
>>>>> spark context.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Cosmin P.
>>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>
>
>

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

Posted by Michael Segel <ms...@hotmail.com>.
Why couldn’t you use the spark thrift server?
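
For context, a rough sketch of what using the Spark Thrift Server looks like
from a client, assuming it runs on its default HiveServer2 port (10000) and
the Hive JDBC driver is on the classpath; the host, credentials and query are
illustrative:

import java.sql.DriverManager

object ThriftServerClient {
  def main(args: Array[String]): Unit = {
    // The Spark Thrift Server speaks the HiveServer2 protocol, so a plain
    // Hive JDBC connection is enough to submit SQL to its single, long-lived
    // Spark application.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT count(*) FROM some_table")  // illustrative query
    while (rs.next()) println(rs.getLong(1))
    rs.close(); stmt.close(); conn.close()
  }
}

This only helps if the on-demand jobs can be expressed as SQL, since every
query then shares the Thrift Server's Spark context.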


On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca <co...@gmail.com> wrote:

answer for Gourav Sengupta

I want to use same spark application because i want to work as a FIFO scheduler. My problem is that i have many jobs(not so big) and if i run an application for every job my cluster will split resources as a FAIR scheduler(it's what i observe, maybe i'm wrong) and exist the possibility to create bottleneck effect. The start time isn't a problem for me, because it isn't a real-time application.

I need a business solution, that's the reason why i can't use code from github.

Thanks!

2017-02-07 19:55 GMT+02:00 Gourav Sengupta <go...@gmail.com>:
Hi,

May I ask the reason for using the same spark application? Is it because of the time it takes in order to start a spark context?

On another note you may want to look at the number of contributors in a github repo before choosing a solution.


Regards,
Gourav

On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski <vi...@gmail.com> wrote:
Spark jobserver or Livy server are the best options for pure technical API.
If you want to publish business API you will probably have to build you own app like the one I wrote a year ago https://github.com/elppc/akka-spark-experiments
It combines Akka actors and a shared Spark context to serve concurrent subsecond jobs


2017-02-07 15:28 GMT+01:00 ayan guha <gu...@gmail.com>:
I think you are loking for livy or spark  jobserver

On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <co...@gmail.com> wrote:

I want to run different jobs on demand with same spark context, but i don't know how exactly i can do this.

I try to get current context, but seems it create a new spark context(with new executors).

I call spark-submit to add new jobs.

I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance), with yarn as resource manager.

My code:

val sparkContext = SparkContext.getOrCreate()
val content = 1 to 40000
val result = sparkContext.parallelize(content, 5)
result.map(value => value.toString).foreach(loop)

def loop(x: String): Unit = {
   for (a <- 1 to 30000000) {

   }
}


spark-submit:

spark-submit --executor-cores 1 \
             --executor-memory 1g \
             --driver-memory 1g \
             --master yarn \
             --deploy-mode cluster \
             --conf spark.dynamicAllocation.enabled=true \
             --conf spark.shuffle.service.enabled=true \
             --conf spark.dynamicAllocation.minExecutors=1 \
             --conf spark.dynamicAllocation.maxExecutors=3 \
             --conf spark.dynamicAllocation.initialExecutors=3 \
             --conf spark.executor.instances=3 \


If i run twice spark-submit it create 6 executors, but i want to run all this jobs on same spark application.

How can achieve adding jobs to an existing spark application?

I don't understand why SparkContext.getOrCreate() don't get existing spark context.


Thanks,

Cosmin P.

--
Best Regards,
Ayan Guha





Re: [Spark Context]: How to add on demand jobs to an existing spark context?

Posted by Cosmin Posteuca <co...@gmail.com>.
answer for Gourav Sengupta

I want to use the same Spark application because I want it to work as a FIFO
scheduler. My problem is that I have many jobs (not so big), and if I run an
application for every job my cluster will split resources like a FAIR
scheduler (that's what I observe, maybe I'm wrong), and there is the
possibility of creating a bottleneck effect. The start time isn't a problem
for me, because it isn't a real-time application.

I need a business solution; that's the reason why I can't use code from
GitHub.

Thanks!

2017-02-07 19:55 GMT+02:00 Gourav Sengupta <go...@gmail.com>:

> Hi,
>
> May I ask the reason for using the same spark application? Is it because
> of the time it takes in order to start a spark context?
>
> On another note you may want to look at the number of contributors in a
> github repo before choosing a solution.
>
>
> Regards,
> Gourav
>
> On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski <
> vincent.gromakowski@gmail.com> wrote:
>
>> Spark jobserver or Livy server are the best options for pure technical
>> API.
>> If you want to publish business API you will probably have to build you
>> own app like the one I wrote a year ago https://github.com/elppc/akka-
>> spark-experiments
>> It combines Akka actors and a shared Spark context to serve concurrent
>> subsecond jobs
>>
>>
>> 2017-02-07 15:28 GMT+01:00 ayan guha <gu...@gmail.com>:
>>
>>> I think you are loking for livy or spark  jobserver
>>>
>>> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <
>>> cosmin.posteuca@gmail.com> wrote:
>>>
>>>> I want to run different jobs on demand with same spark context, but i
>>>> don't know how exactly i can do this.
>>>>
>>>> I try to get current context, but seems it create a new spark
>>>> context(with new executors).
>>>>
>>>> I call spark-submit to add new jobs.
>>>>
>>>> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance),
>>>> with yarn as resource manager.
>>>>
>>>> My code:
>>>>
>>>> val sparkContext = SparkContext.getOrCreate()
>>>> val content = 1 to 40000
>>>> val result = sparkContext.parallelize(content, 5)
>>>> result.map(value => value.toString).foreach(loop)
>>>>
>>>> def loop(x: String): Unit = {
>>>>    for (a <- 1 to 30000000) {
>>>>
>>>>    }
>>>> }
>>>>
>>>> spark-submit:
>>>>
>>>> spark-submit --executor-cores 1 \
>>>>              --executor-memory 1g \
>>>>              --driver-memory 1g \
>>>>              --master yarn \
>>>>              --deploy-mode cluster \
>>>>              --conf spark.dynamicAllocation.enabled=true \
>>>>              --conf spark.shuffle.service.enabled=true \
>>>>              --conf spark.dynamicAllocation.minExecutors=1 \
>>>>              --conf spark.dynamicAllocation.maxExecutors=3 \
>>>>              --conf spark.dynamicAllocation.initialExecutors=3 \
>>>>              --conf spark.executor.instances=3 \
>>>>
>>>> If i run twice spark-submit it create 6 executors, but i want to run
>>>> all this jobs on same spark application.
>>>>
>>>> How can achieve adding jobs to an existing spark application?
>>>>
>>>> I don't understand why SparkContext.getOrCreate() don't get existing
>>>> spark context.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Cosmin P.
>>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

May I ask the reason for using the same spark application? Is it because of
the time it takes in order to start a spark context?

On another note you may want to look at the number of contributors in a
github repo before choosing a solution.


Regards,
Gourav

On Tue, Feb 7, 2017 at 5:26 PM, vincent gromakowski <
vincent.gromakowski@gmail.com> wrote:

> Spark jobserver or Livy server are the best options for pure technical API.
> If you want to publish business API you will probably have to build you
> own app like the one I wrote a year ago https://github.com/elppc/akka-
> spark-experiments
> It combines Akka actors and a shared Spark context to serve concurrent
> subsecond jobs
>
>
> 2017-02-07 15:28 GMT+01:00 ayan guha <gu...@gmail.com>:
>
>> I think you are loking for livy or spark  jobserver
>>
>> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <
>> cosmin.posteuca@gmail.com> wrote:
>>
>>> I want to run different jobs on demand with same spark context, but i
>>> don't know how exactly i can do this.
>>>
>>> I try to get current context, but seems it create a new spark
>>> context(with new executors).
>>>
>>> I call spark-submit to add new jobs.
>>>
>>> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance),
>>> with yarn as resource manager.
>>>
>>> My code:
>>>
>>> val sparkContext = SparkContext.getOrCreate()
>>> val content = 1 to 40000
>>> val result = sparkContext.parallelize(content, 5)
>>> result.map(value => value.toString).foreach(loop)
>>>
>>> def loop(x: String): Unit = {
>>>    for (a <- 1 to 30000000) {
>>>
>>>    }
>>> }
>>>
>>> spark-submit:
>>>
>>> spark-submit --executor-cores 1 \
>>>              --executor-memory 1g \
>>>              --driver-memory 1g \
>>>              --master yarn \
>>>              --deploy-mode cluster \
>>>              --conf spark.dynamicAllocation.enabled=true \
>>>              --conf spark.shuffle.service.enabled=true \
>>>              --conf spark.dynamicAllocation.minExecutors=1 \
>>>              --conf spark.dynamicAllocation.maxExecutors=3 \
>>>              --conf spark.dynamicAllocation.initialExecutors=3 \
>>>              --conf spark.executor.instances=3 \
>>>
>>> If i run twice spark-submit it create 6 executors, but i want to run all
>>> this jobs on same spark application.
>>>
>>> How can achieve adding jobs to an existing spark application?
>>>
>>> I don't understand why SparkContext.getOrCreate() don't get existing
>>> spark context.
>>>
>>>
>>> Thanks,
>>>
>>> Cosmin P.
>>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

Posted by vincent gromakowski <vi...@gmail.com>.
Spark Jobserver or the Livy server are the best options for a pure technical
API. If you want to publish a business API you will probably have to build
your own app, like the one I wrote a year ago:
https://github.com/elppc/akka-spark-experiments
It combines Akka actors and a shared Spark context to serve concurrent
sub-second jobs.
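
A minimal sketch of that pattern (not the actual code from the linked
repository), assuming classic Akka actors and one long-running driver; the
message type and job body are illustrative:

import akka.actor.{Actor, ActorSystem, Props}
import org.apache.spark.{SparkConf, SparkContext}

// One long-running driver holds a single shared SparkContext; every message
// sent to the actor becomes another job inside the same Spark application.
case class RunJob(upTo: Int)

class JobActor(sc: SparkContext) extends Actor {
  def receive: Receive = {
    case RunJob(upTo) =>
      val sum = sc.parallelize(1 to upTo, 5).map(_.toLong).reduce(_ + _)
      println(s"job finished, sum = $sum")  // a real service would reply to the caller
  }
}

object SharedContextServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-context-server"))
    val system = ActorSystem("jobs")
    val jobActor = system.actorOf(Props(new JobActor(sc)), "job-actor")
    jobActor ! RunJob(40000)  // each message adds a job to the existing context
  }
}

Because every RunJob message is handled inside the same application, the jobs
queue up on the shared context instead of each spawning a new set of
executors.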


2017-02-07 15:28 GMT+01:00 ayan guha <gu...@gmail.com>:

> I think you are loking for livy or spark  jobserver
>
> On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <co...@gmail.com>
> wrote:
>
>> I want to run different jobs on demand with same spark context, but i
>> don't know how exactly i can do this.
>>
>> I try to get current context, but seems it create a new spark
>> context(with new executors).
>>
>> I call spark-submit to add new jobs.
>>
>> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance), with
>> yarn as resource manager.
>>
>> My code:
>>
>> val sparkContext = SparkContext.getOrCreate()
>> val content = 1 to 40000
>> val result = sparkContext.parallelize(content, 5)
>> result.map(value => value.toString).foreach(loop)
>>
>> def loop(x: String): Unit = {
>>    for (a <- 1 to 30000000) {
>>
>>    }
>> }
>>
>> spark-submit:
>>
>> spark-submit --executor-cores 1 \
>>              --executor-memory 1g \
>>              --driver-memory 1g \
>>              --master yarn \
>>              --deploy-mode cluster \
>>              --conf spark.dynamicAllocation.enabled=true \
>>              --conf spark.shuffle.service.enabled=true \
>>              --conf spark.dynamicAllocation.minExecutors=1 \
>>              --conf spark.dynamicAllocation.maxExecutors=3 \
>>              --conf spark.dynamicAllocation.initialExecutors=3 \
>>              --conf spark.executor.instances=3 \
>>
>> If i run twice spark-submit it create 6 executors, but i want to run all
>> this jobs on same spark application.
>>
>> How can achieve adding jobs to an existing spark application?
>>
>> I don't understand why SparkContext.getOrCreate() don't get existing
>> spark context.
>>
>>
>> Thanks,
>>
>> Cosmin P.
>>
> --
> Best Regards,
> Ayan Guha
>

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

Posted by ayan guha <gu...@gmail.com>.
I think you are looking for Livy or Spark Jobserver.
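
For illustration, a rough sketch of what submitting work to a Livy interactive
session looks like over its REST API, so that many code snippets share one
long-lived Spark context; the host, port, session id and code are illustrative,
and the session itself would be created first with a POST to /sessions:

import java.net.{HttpURLConnection, URL}

object LivyStatement {
  def main(args: Array[String]): Unit = {
    // Send a snippet to an existing Livy session; Livy runs it on that
    // session's already-running SparkContext instead of starting a new app.
    val url = new URL("http://livy-host:8998/sessions/0/statements")  // illustrative host and session id
    val body = """{"code": "sc.parallelize(1 to 40000, 5).count()"}"""
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/json")
    val out = conn.getOutputStream
    out.write(body.getBytes("UTF-8"))
    out.close()
    println(s"Livy responded with HTTP ${conn.getResponseCode}")
  }
}
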
On Wed, 8 Feb 2017 at 12:37 am, Cosmin Posteuca <co...@gmail.com>
wrote:

> I want to run different jobs on demand with same spark context, but i
> don't know how exactly i can do this.
>
> I try to get current context, but seems it create a new spark context(with
> new executors).
>
> I call spark-submit to add new jobs.
>
> I run code on Amazon EMR(3 instances, 4 core & 16GB ram / instance), with
> yarn as resource manager.
>
> My code:
>
> val sparkContext = SparkContext.getOrCreate()
> val content = 1 to 40000
> val result = sparkContext.parallelize(content, 5)
> result.map(value => value.toString).foreach(loop)
>
> def loop(x: String): Unit = {
>    for (a <- 1 to 30000000) {
>
>    }
> }
>
> spark-submit:
>
> spark-submit --executor-cores 1 \
>              --executor-memory 1g \
>              --driver-memory 1g \
>              --master yarn \
>              --deploy-mode cluster \
>              --conf spark.dynamicAllocation.enabled=true \
>              --conf spark.shuffle.service.enabled=true \
>              --conf spark.dynamicAllocation.minExecutors=1 \
>              --conf spark.dynamicAllocation.maxExecutors=3 \
>              --conf spark.dynamicAllocation.initialExecutors=3 \
>              --conf spark.executor.instances=3 \
>
> If i run twice spark-submit it create 6 executors, but i want to run all
> this jobs on same spark application.
>
> How can achieve adding jobs to an existing spark application?
>
> I don't understand why SparkContext.getOrCreate() don't get existing
> spark context.
>
>
> Thanks,
>
> Cosmin P.
>
-- 
Best Regards,
Ayan Guha