Posted to user@spark.apache.org by Philip Weaver <ph...@gmail.com> on 2015/10/03 06:02:50 UTC

Re: Limiting number of cores per job in multi-threaded driver.

You can't really say 8 cores is not much horsepower when you have no idea
what my use case is. That's silly.

On Fri, Sep 18, 2015 at 10:33 PM, Adrian Tanase <at...@adobe.com> wrote:

> Forgot to mention that you could also restrict the parallelism to 4,
> essentially using only 4 cores at any given time; however, if your job is
> complex, a stage might be broken into more than one task...
>
> Sent from my iPhone
>
> On 19 Sep 2015, at 08:30, Adrian Tanase <at...@adobe.com> wrote:
>
> Reading through the docs it seems that with a combination of FAIR
> scheduler and maybe pools you can get pretty far.
>
> However, the smallest unit of scheduled work is the task, so you probably
> need to think about the parallelism of each transformation.
>
> I'm guessing that by increasing the level of parallelism you get many
> smaller tasks that the scheduler can then run across the many jobs you
> might have - as opposed to fewer, longer tasks...
>
> Lastly, 8 cores is not that much horsepower :)
> You may consider running with beefier machines or a larger cluster, to get
> at least tens of cores.
>
> Hope this helps,
> -adrian
>
> Sent from my iPhone
>
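
A minimal sketch of the FAIR-scheduler-plus-pools setup suggested above,
assuming Spark 1.5. The pool name, allocation-file path, and numbers are
illustrative; note that minShare and weight shape how concurrent jobs
share cores, but a pool does not hard-cap the cores a lone job can take.

    // fairscheduler.xml (illustrative) might define the pool as:
    //   <allocations>
    //     <pool name="requests">
    //       <schedulingMode>FAIR</schedulingMode>
    //       <weight>1</weight>
    //       <minShare>4</minShare>
    //     </pool>
    //   </allocations>
    import org.apache.spark.{SparkConf, SparkContext}

    object FairPoolSketch extends App {
      val conf = new SparkConf()
        .setAppName("multi-job-driver")
        .set("spark.scheduler.mode", "FAIR")
        .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")
      val sc = new SparkContext(conf)

      // The pool property is thread-local, so each request-handling
      // thread tags its own jobs before triggering an action.
      val t = new Thread(new Runnable {
        def run(): Unit = {
          sc.setLocalProperty("spark.scheduler.pool", "requests")
          println(sc.parallelize(1 to 1000000, 4).distinct().count())
        }
      })
      t.start()
      t.join()
      sc.stop()
    }
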
> On 18 Sep 2015, at 18:37, Philip Weaver <ph...@gmail.com> wrote:
>
> Here's a specific example of what I want to do. My Spark application is
> running with total-executor-cores=8. A request comes in, it spawns a thread
> to handle that request, and starts a job. That job should use only 4 cores,
> not all 8 of the cores available to the cluster. When the first job is
> scheduled, it should take only 4 cores, not all 8 of the cores that are
> available to the driver.
>
> Is there any way to accomplish this? This is on Mesos.
>
> In order to support the use cases described in
> https://spark.apache.org/docs/latest/job-scheduling.html, where a Spark
> application runs for a long time and handles requests from multiple users,
> I believe what I'm asking about is a very important feature. One of the
> goals is to get lower latency for each request, but if the first request
> takes all resources and we can't guarantee any free resources for the
> second request, then that defeats the purpose. Does that make sense?
>
> Thanks in advance for any advice you can provide!
>
> - Philip
>
> On Sat, Sep 12, 2015 at 10:40 PM, Philip Weaver <ph...@gmail.com>
> wrote:
>
>> I'm playing around with dynamic allocation in spark-1.5.0, with the FAIR
>> scheduler, so I can define a long-running application capable of executing
>> multiple simultaneous Spark jobs.
>>
>> The kind of jobs that I'm running do not benefit from more than 4 cores,
>> but I want my application to be able to take several times that in order to
>> run multiple jobs at the same time.
>>
>> I suppose my question is more basic: How can I limit the number of cores
>> used to load an RDD or DataFrame? I can immediately repartition or coalesce
>> my RDD or DataFrame to 4 partitions after I load it, but that doesn't stop
>> Spark from using more cores to load it.
>>
>> Does what I'm trying to accomplish make sense, and is there any way
>> to do it?
>>
>> - Philip
>>
>>
>
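
On the narrower question above of limiting the cores used to load an RDD,
here is a hedged sketch (Spark 1.5, assuming an existing SparkContext
named sc and an illustrative path). coalesce with shuffle = false creates
a narrow dependency, so the read and the coalesce share a single stage of
4 tasks; repartition, or coalesce with shuffle = true, inserts a shuffle
boundary and would not bound the read. Worth verifying in the Spark UI.

    // Bound the load itself to 4 concurrent tasks. Each of the 4 tasks
    // reads several input splits sequentially, so the load is narrower
    // but individually slower.
    val lines = sc
      .textFile("hdfs:///data/events", minPartitions = 4) // lower bound only
      .coalesce(4, shuffle = false)
    println(lines.distinct().count())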

Re: Limiting number of cores per job in multi-threaded driver.

Posted by Philip Weaver <ph...@gmail.com>.
Since I'm running Spark on Mesos, to be fair I should give Mesos credit,
too! And I should also put some effort into describing what I'm trying to
accomplish more clearly. There are really three levels of scheduling
that I'm hoping to exploit:

- Scheduling in Mesos across all frameworks, where the kind of Spark
application I previously described is only one of many frameworks.
- Scheduling of multiple jobs (maybe that's not the right terminology?)
within the same SparkContext; this SparkContext would run in a persistent
application with an API for users to submit jobs.
- Scheduling of individual tasks within a single user-submitted Spark job.

Through some brief testing, I've found that the performance of my jobs
scales almost linearly up to about 8 cores; after that, the gains are
very small or sometimes even negative. These are small jobs that typically
count uniques across only about 40M rows.

This setup works very well, with one exception: when a user submits a job,
if there are no others running, then that job will take all of the cores
that the SparkContext has available to it. This is undesirable for two
reasons:
1.) As I mentioned above, the jobs don't scale beyond about 8 cores.
2.) The next submitted job will have to wait for resources to become
available.

- Philip
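
For the middle level (multiple jobs inside one SparkContext), a minimal
sketch of the persistent-driver pattern described above, assuming an
existing SparkContext named sc, a FAIR pool named "requests", and an
8-slot request pool (all illustrative). The thread pool bounds how many
jobs run at once, not how many cores each one takes.

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

    def handleRequest(path: String): Future[Long] = Future {
      // Scheduler properties are per-thread, so set the pool inside the
      // Future, on the thread that actually submits the Spark job.
      sc.setLocalProperty("spark.scheduler.pool", "requests")
      sc.textFile(path).distinct().count()
    }

    val results = Seq("hdfs:///data/a", "hdfs:///data/b").map(handleRequest)
    results.foreach(r => println(Await.result(r, Duration.Inf)))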



Re: Limiting number of cores per job in multi-threaded driver.

Posted by Philip Weaver <ph...@gmail.com>.
I believe I've described my use case clearly, yet its legitimacy is being
questioned. I will assert again that if you don't understand my use
case, it really doesn't make sense to make any statement about how many
resources I should need.

And I'm sorry, but I completely disagree with your logic. Your suggestion
is not simpler. The development effort that Spark saves is what you would
have to do to parallelize an algorithm from single-threaded to 4 cores, so
the big win comes from getting to 4 cores, not from taking it from 4 to 128
(though that is also nice). Everything that Spark does I could do myself,
but it would take much longer. Keep in mind I'm not just trying to reduce
the level of effort for scheduling jobs, but also for scheduling the tasks
within each job, and both of those are things Spark does really well.

- Philip



Re: Limiting number of cores per job in multi-threaded driver.

Posted by Jerry Lam <ch...@gmail.com>.
Philip, the guy is trying to help you; calling him silly is a bit too far. He might have assumed your problem is IO-bound, which might not be the case. If you only ever need 4 cores per job, there is little advantage to using Spark, in my opinion, because you can easily do this with a worker farm that takes each job and processes it on a single machine. Let the scheduler figure out which nodes in the farm are idle and spawn jobs on them until all of them are saturated. Call me silly, but this seems much simpler.

Sent from my iPhone
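
A rough sketch of the worker-farm shape Jerry describes, with no Spark
involved. The farm slots are simulated here as local threads; in a real
deployment each slot would be a machine behind a queue, and the sizes and
data are illustrative.

    import java.util.concurrent.{Executors, TimeUnit}

    // A fixed pool of workers, each processing one whole job on a single
    // machine (here, a single thread).
    val farm = Executors.newFixedThreadPool(4) // 4 illustrative slots

    def countUniques(rows: Seq[String]): Long = rows.distinct.size.toLong

    for (job <- Seq(Seq("a", "b", "a"), Seq("x", "y", "x"))) {
      farm.submit(new Runnable {
        def run(): Unit = println(s"uniques = ${countUniques(job)}")
      })
    }
    farm.shutdown()
    farm.awaitTermination(1, TimeUnit.MINUTES)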


Re: Limiting number of cores per job in multi-threaded driver.

Posted by Philip Weaver <ph...@gmail.com>.
Yes, I am sharing the cluster across many jobs, and each job only needs 8
cores (in fact, because the jobs are so small and are counting uniques, they
only get slower as you add more cores). My question is how to limit each
job to only 8 cores while keeping the entire cluster available to the
SparkContext; e.g., if I have a cluster of 128 cores, I want to limit the
SparkContext to 64 cores and each job to 8 cores, so I can run up to 8
jobs at once.
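
One approximation in Spark 1.5 on Mesos, sketched under the assumption of
coarse-grained mode: spark.cores.max caps the whole application at 64
cores, and each job is shaped to 8 partitions so it occupies at most 8
task slots (with the default spark.task.cpus = 1). Spark has no per-job
core limit, so the partition count is doing the capping; the master URL
and path are illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("mesos://zk://mesos-master:2181/mesos") // illustrative
      .setAppName("shared-driver")
      .set("spark.cores.max", "64")       // cap the SparkContext at 64 cores
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Per job: 8 partitions => at most 8 concurrent tasks => ~8 cores.
    def runJob(path: String): Long =
      sc.textFile(path).coalesce(8, shuffle = false).distinct().count()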


Re: Limiting number of cores per job in multi-threaded driver.

Posted by Adrian Tanase <at...@adobe.com>.
You are absolutely correct, I apologize.

My understanding was that you are sharing the machine across many jobs. That was the context in which I was making that comment.

-adrian

Sent from my iPhone

On 03 Oct 2015, at 07:03, Philip Weaver <ph...@gmail.com>> wrote:

You can't really say 8 cores is not much horsepower when you have no idea what my use case is. That's silly.

On Fri, Sep 18, 2015 at 10:33 PM, Adrian Tanase <at...@adobe.com>> wrote:
Forgot to mention that you could also restrict the parallelism to 4, essentially using only 4 cores at any given time, however if your job is complex, a stage might be broken into more than 1 task...

Sent from my iPhone

On 19 Sep 2015, at 08:30, Adrian Tanase <at...@adobe.com>> wrote:

Reading through the docs it seems that with a combination of FAIR scheduler and maybe pools you can get pretty far.

However the smallest unit of scheduled work is the task so probably you need to think about the parallelism of each transformation.

I'm guessing that by increasing the level of parallelism you get many smaller tasks that the scheduler can then run across the many jobs you might have - as opposed to fewer, longer tasks...

Lastly, 8 cores is not that much horsepower :)
You may consider running with beefier machines or a larger cluster, to get at least tens of cores.

Hope this helps,
-adrian

Sent from my iPhone

On 18 Sep 2015, at 18:37, Philip Weaver <ph...@gmail.com>> wrote:

Here's a specific example of what I want to do. My Spark application is running with total-executor-cores=8. A request comes in, it spawns a thread to handle that request, and starts a job. That job should use only 4 cores, not all 8 of the cores available to the cluster.. When the first job is scheduled, it should take only 4 cores, not all 8 of the cores that are available to the driver.

Is there any way to accomplish this? This is on mesos.

In order to support the use cases described in https://spark.apache.org/docs/latest/job-scheduling.html, where a spark application runs for a long time and handles requests from multiple users, I believe what I'm asking about is a very important feature. One of the goals is to get lower latency for each request, but if the first request takes all resources and we can't guarantee any free resources for the second request, then that defeats the purpose. Does that make sense?

Thanks in advance for any advice you can provide!

- Philip

On Sat, Sep 12, 2015 at 10:40 PM, Philip Weaver <ph...@gmail.com>> wrote:
I'm playing around with dynamic allocation in spark-1.5.0, with the FAIR scheduler, so I can define a long-running application capable of executing multiple simultaneous spark jobs.

The kind of jobs that I'm running do not benefit from more than 4 cores, but I want my application to be able to take several times that in order to run multiple jobs at the same time.

I suppose my question is more basic: How can I limit the number of cores used to load an RDD or DataFrame? I can immediately repartition or coalesce my RDD or DataFrame to 4 partitions after I load it, but that doesn't stop Spark from using more cores to load it.

Does it make sense what I am trying to accomplish, and is there any way to do it?

- Philip