You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Darin McBeath <dd...@yahoo.com> on 2014/07/30 21:56:20 UTC

Number of partitions and Number of concurrent tasks

I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.

I have an RDD<String> which I've repartitioned so it has 100 partitions (hoping to increase the parallelism).

When I do a transformation (such as filter) on this RDD, I can't  seem to get more than 24 tasks (my total number of cores across the 3 nodes) going at one point in time.  By tasks, I mean the number of tasks that appear under the Application UI.  I tried explicitly setting the spark.default.parallelism to 48 (hoping I would get 48 tasks concurrently running) and verified this in the Application UI for the running application but this had no effect.  Perhaps, this is ignored for a 'filter' and the default is the total number of cores available.

I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental.  Any help would be appreciated.

Thanks.

Darin.

Re: Number of partitions and Number of concurrent tasks

Posted by Daniel Siegmann <da...@velos.io>.

It is definitely possible to run multiple workers on a single node and have
each worker with the maximum number of cores (e.g. if you have 8 cores and
2 workers you'd have 16 cores per node). I don't know if it's possible with
the out of the box scripts though.

It's actually not really that difficult. You just run start-slave.sh
multiple times on the same node, with different IDs. Here is the usage:

# Usage: start-slave.sh <worker#> <master-spark-URL>

But we have custom scripts to do that. I'm not sure whether it is possible
using the standard start-all.sh script or that EC2 script. Probably not.

I haven't set up or managed such a cluster myself, so that's about the
extent of my knowledge. But I've deployed jobs to that cluster and enjoyed
the benefit of double the cores - we had a fair amount of I/O though, which
may be why it helped in our case. I recommend taking a look at the CPU
utilization on the nodes when running a flow before jumping through these
hoops.


On Fri, Aug 1, 2014 at 12:05 PM, Nicholas Chammas <
nicholas.chammas@gmail.com> wrote:

> Darin,
>
> I think the number of cores in your cluster is a hard limit on how many
> concurrent tasks you can execute at one time. If you want more parallelism,
> I think you just need more cores in your cluster--that is, bigger nodes, or
> more nodes.
>
> Daniel,
>
> Have you been able to get around this limit?
>
> Nick
>
>
>
> On Fri, Aug 1, 2014 at 11:49 AM, Daniel Siegmann <daniel.siegmann@velos.io
> > wrote:
>
>> Sorry, but I haven't used Spark on EC2 and I'm not sure what the problem
>> could be. Hopefully someone else will be able to help. The only thing I
>> could suggest is to try setting both the worker instances and the number of
>> cores (assuming spark-ec2 has such a parameter).
>>
>>
>> On Thu, Jul 31, 2014 at 3:03 PM, Darin McBeath <dd...@yahoo.com>
>> wrote:
>>
>>> Ok, I set the number of spark worker instances to 2 (below is my startup
>>> command).  But, this essentially had the effect of increasing my number of
>>> workers from 3 to 6 (which was good) but it also reduced my number of cores
>>> per worker from 8 to 4 (which was not so good).  In the end, I would still
>>> only be able to concurrently process 24 partitions in parallel.  I'm
>>> starting a stand-alone cluster using the spark provided ec2 scripts .  I
>>> tried setting the env variable for SPARK_WORKER_CORES in the spark_ec2.py
>>> but this had no effect. So, it's not clear if I could even set the
>>> SPARK_WORKER_CORES with the ec2 scripts.  Anyway, not sure if there is
>>> anything else I can try but at least wanted to document what I did try and
>>> the net effect.  I'm open to any suggestions/advice.
>>>
>>>  ./spark-ec2 -k *key* -i key.pem --hadoop-major-version=2 launch -s 3
>>> -t m3.2xlarge -w 3600 --spot-price=.08 -z us-east-1e --worker-instances=2
>>> *my-cluster*
>>>
>>>
>>>   ------------------------------
>>>  *From:* Daniel Siegmann <da...@velos.io>
>>> *To:* Darin McBeath <dd...@yahoo.com>
>>> *Cc:* Daniel Siegmann <da...@velos.io>; "user@spark.apache.org"
>>> <us...@spark.apache.org>
>>> *Sent:* Thursday, July 31, 2014 10:04 AM
>>>
>>> *Subject:* Re: Number of partitions and Number of concurrent tasks
>>>
>>> I haven't configured this myself. I'd start with setting
>>> SPARK_WORKER_CORES to a higher value, since that's a bit simpler than
>>> adding more workers. This defaults to "all available cores" according to
>>> the documentation, so I'm not sure if you can actually set it higher. If
>>> not, you can get around this by adding more worker instances; I believe
>>> simply setting SPARK_WORKER_INSTANCES to 2 would be sufficient.
>>>
>>> I don't think you *have* to set the cores if you have more workers - it
>>> will default to 8 cores per worker (in your case). But maybe 16 cores per
>>> node will be too many. You'll have to test. Keep in mind that more workers
>>> means more memory and such too, so you may need to tweak some other
>>> settings downward in this case.
>>>
>>> On a side note: I've read some people found performance was better when
>>> they had more workers with less memory each, instead of a single worker
>>> with tons of memory, because it cut down on garbage collection time. But I
>>> can't speak to that myself.
>>>
>>> In any case, if you increase the number of cores available in your
>>> cluster (whether per worker, or adding more workers per node, or of course
>>> adding more nodes) you should see more tasks running concurrently. Whether
>>> this will actually be *faster* probably depends mainly on whether the
>>> CPUs in your nodes were really being fully utilized with the current number
>>> of cores.
>>>
>>>
>>> On Wed, Jul 30, 2014 at 8:30 PM, Darin McBeath <dd...@yahoo.com>
>>> wrote:
>>>
>>> Thanks.
>>>
>>>  So to make sure I understand.  Since I'm using a 'stand-alone'
>>> cluster, I would set SPARK_WORKER_INSTANCES to something like 2
>>> (instead of the default value of 1).  Is that correct?  But, it also sounds
>>> like I need to explicitly set a value for SPARKER_WORKER_CORES (based on
>>> what the documentation states).  What would I want that value to be based
>>> on my configuration below?  Or, would I leave that alone?
>>>
>>>   ------------------------------
>>>  *From:* Daniel Siegmann <da...@velos.io>
>>> *To:* user@spark.apache.org; Darin McBeath <dd...@yahoo.com>
>>> *Sent:* Wednesday, July 30, 2014 5:58 PM
>>> *Subject:* Re: Number of partitions and Number of concurrent tasks
>>>
>>> This is correct behavior. Each "core" can execute exactly one task at a
>>> time, with each task corresponding to a partition. If your cluster only has
>>> 24 cores, you can only run at most 24 tasks at once.
>>>
>>> You could run multiple workers per node to get more executors. That
>>> would give you more cores in the cluster. But however many cores you have,
>>> each core will run only one task at a time.
>>>
>>>
>>> On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <dd...@yahoo.com>
>>> wrote:
>>>
>>>  I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>>>
>>> I have an RDD<String> which I've repartitioned so it has 100 partitions
>>> (hoping to increase the parallelism).
>>>
>>> When I do a transformation (such as filter) on this RDD, I can't  seem
>>> to get more than 24 tasks (my total number of cores across the 3 nodes)
>>> going at one point in time.  By tasks, I mean the number of tasks that
>>> appear under the Application UI.  I tried explicitly setting the
>>> spark.default.parallelism to 48 (hoping I would get 48 tasks concurrently
>>> running) and verified this in the Application UI for the running
>>> application but this had no effect.  Perhaps, this is ignored for a
>>> 'filter' and the default is the total number of cores available.
>>>
>>> I'm fairly new with Spark so maybe I'm just missing or misunderstanding
>>> something fundamental.  Any help would be appreciated.
>>>
>>> Thanks.
>>>
>>> Darin.
>>>
>>>
>>>
>>>
>>> --
>>> Daniel Siegmann, Software Developer
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>>> E: daniel.siegmann@velos.io W: www.velos.io
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Daniel Siegmann, Software Developer
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>>> E: daniel.siegmann@velos.io W: www.velos.io
>>>
>>>
>>>
>>
>>
>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>> E: daniel.siegmann@velos.io W: www.velos.io
>>
>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegmann@velos.io W: www.velos.io

Re: Number of partitions and Number of concurrent tasks

Posted by Nicholas Chammas <ni...@gmail.com>.

Darin,

I think the number of cores in your cluster is a hard limit on how many
concurrent tasks you can execute at one time. If you want more parallelism,
I think you just need more cores in your cluster--that is, bigger nodes, or
more nodes.

Daniel,

Have you been able to get around this limit?

Nick



On Fri, Aug 1, 2014 at 11:49 AM, Daniel Siegmann <da...@velos.io>
wrote:

> Sorry, but I haven't used Spark on EC2 and I'm not sure what the problem
> could be. Hopefully someone else will be able to help. The only thing I
> could suggest is to try setting both the worker instances and the number of
> cores (assuming spark-ec2 has such a parameter).
>
>
> On Thu, Jul 31, 2014 at 3:03 PM, Darin McBeath <dd...@yahoo.com>
> wrote:
>
>> Ok, I set the number of spark worker instances to 2 (below is my startup
>> command).  But, this essentially had the effect of increasing my number of
>> workers from 3 to 6 (which was good) but it also reduced my number of cores
>> per worker from 8 to 4 (which was not so good).  In the end, I would still
>> only be able to concurrently process 24 partitions in parallel.  I'm
>> starting a stand-alone cluster using the spark provided ec2 scripts .  I
>> tried setting the env variable for SPARK_WORKER_CORES in the spark_ec2.py
>> but this had no effect. So, it's not clear if I could even set the
>> SPARK_WORKER_CORES with the ec2 scripts.  Anyway, not sure if there is
>> anything else I can try but at least wanted to document what I did try and
>> the net effect.  I'm open to any suggestions/advice.
>>
>>  ./spark-ec2 -k *key* -i key.pem --hadoop-major-version=2 launch -s 3 -t
>> m3.2xlarge -w 3600 --spot-price=.08 -z us-east-1e --worker-instances=2
>> *my-cluster*
>>
>>
>>   ------------------------------
>>  *From:* Daniel Siegmann <da...@velos.io>
>> *To:* Darin McBeath <dd...@yahoo.com>
>> *Cc:* Daniel Siegmann <da...@velos.io>; "user@spark.apache.org"
>> <us...@spark.apache.org>
>> *Sent:* Thursday, July 31, 2014 10:04 AM
>>
>> *Subject:* Re: Number of partitions and Number of concurrent tasks
>>
>> I haven't configured this myself. I'd start with setting
>> SPARK_WORKER_CORES to a higher value, since that's a bit simpler than
>> adding more workers. This defaults to "all available cores" according to
>> the documentation, so I'm not sure if you can actually set it higher. If
>> not, you can get around this by adding more worker instances; I believe
>> simply setting SPARK_WORKER_INSTANCES to 2 would be sufficient.
>>
>> I don't think you *have* to set the cores if you have more workers - it
>> will default to 8 cores per worker (in your case). But maybe 16 cores per
>> node will be too many. You'll have to test. Keep in mind that more workers
>> means more memory and such too, so you may need to tweak some other
>> settings downward in this case.
>>
>> On a side note: I've read some people found performance was better when
>> they had more workers with less memory each, instead of a single worker
>> with tons of memory, because it cut down on garbage collection time. But I
>> can't speak to that myself.
>>
>> In any case, if you increase the number of cores available in your
>> cluster (whether per worker, or adding more workers per node, or of course
>> adding more nodes) you should see more tasks running concurrently. Whether
>> this will actually be *faster* probably depends mainly on whether the
>> CPUs in your nodes were really being fully utilized with the current number
>> of cores.
>>
>>
>> On Wed, Jul 30, 2014 at 8:30 PM, Darin McBeath <dd...@yahoo.com>
>> wrote:
>>
>> Thanks.
>>
>>  So to make sure I understand.  Since I'm using a 'stand-alone' cluster,
>> I would set SPARK_WORKER_INSTANCES to something like 2 (instead of the
>> default value of 1).  Is that correct?  But, it also sounds like I need to
>> explicitly set a value for SPARKER_WORKER_CORES (based on what the
>> documentation states).  What would I want that value to be based on my
>> configuration below?  Or, would I leave that alone?
>>
>>   ------------------------------
>>  *From:* Daniel Siegmann <da...@velos.io>
>> *To:* user@spark.apache.org; Darin McBeath <dd...@yahoo.com>
>> *Sent:* Wednesday, July 30, 2014 5:58 PM
>> *Subject:* Re: Number of partitions and Number of concurrent tasks
>>
>> This is correct behavior. Each "core" can execute exactly one task at a
>> time, with each task corresponding to a partition. If your cluster only has
>> 24 cores, you can only run at most 24 tasks at once.
>>
>> You could run multiple workers per node to get more executors. That would
>> give you more cores in the cluster. But however many cores you have, each
>> core will run only one task at a time.
>>
>>
>> On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <dd...@yahoo.com>
>> wrote:
>>
>>  I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>>
>> I have an RDD<String> which I've repartitioned so it has 100 partitions
>> (hoping to increase the parallelism).
>>
>> When I do a transformation (such as filter) on this RDD, I can't  seem to
>> get more than 24 tasks (my total number of cores across the 3 nodes) going
>> at one point in time.  By tasks, I mean the number of tasks that appear
>> under the Application UI.  I tried explicitly setting the
>> spark.default.parallelism to 48 (hoping I would get 48 tasks concurrently
>> running) and verified this in the Application UI for the running
>> application but this had no effect.  Perhaps, this is ignored for a
>> 'filter' and the default is the total number of cores available.
>>
>> I'm fairly new with Spark so maybe I'm just missing or misunderstanding
>> something fundamental.  Any help would be appreciated.
>>
>> Thanks.
>>
>> Darin.
>>
>>
>>
>>
>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>> E: daniel.siegmann@velos.io W: www.velos.io
>>
>>
>>
>>
>>
>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>> E: daniel.siegmann@velos.io W: www.velos.io
>>
>>
>>
>
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegmann@velos.io W: www.velos.io
>

Re: Number of partitions and Number of concurrent tasks

Posted by Daniel Siegmann <da...@velos.io>.

Sorry, but I haven't used Spark on EC2 and I'm not sure what the problem
could be. Hopefully someone else will be able to help. The only thing I
could suggest is to try setting both the worker instances and the number of
cores (assuming spark-ec2 has such a parameter).


On Thu, Jul 31, 2014 at 3:03 PM, Darin McBeath <dd...@yahoo.com> wrote:

> Ok, I set the number of spark worker instances to 2 (below is my startup
> command).  But, this essentially had the effect of increasing my number of
> workers from 3 to 6 (which was good) but it also reduced my number of cores
> per worker from 8 to 4 (which was not so good).  In the end, I would still
> only be able to concurrently process 24 partitions in parallel.  I'm
> starting a stand-alone cluster using the spark provided ec2 scripts .  I
> tried setting the env variable for SPARK_WORKER_CORES in the spark_ec2.py
> but this had no effect. So, it's not clear if I could even set the
> SPARK_WORKER_CORES with the ec2 scripts.  Anyway, not sure if there is
> anything else I can try but at least wanted to document what I did try and
> the net effect.  I'm open to any suggestions/advice.
>
>  ./spark-ec2 -k *key* -i key.pem --hadoop-major-version=2 launch -s 3 -t
> m3.2xlarge -w 3600 --spot-price=.08 -z us-east-1e --worker-instances=2
> *my-cluster*
>
>
>   ------------------------------
>  *From:* Daniel Siegmann <da...@velos.io>
> *To:* Darin McBeath <dd...@yahoo.com>
> *Cc:* Daniel Siegmann <da...@velos.io>; "user@spark.apache.org"
> <us...@spark.apache.org>
> *Sent:* Thursday, July 31, 2014 10:04 AM
>
> *Subject:* Re: Number of partitions and Number of concurrent tasks
>
> I haven't configured this myself. I'd start with setting
> SPARK_WORKER_CORES to a higher value, since that's a bit simpler than
> adding more workers. This defaults to "all available cores" according to
> the documentation, so I'm not sure if you can actually set it higher. If
> not, you can get around this by adding more worker instances; I believe
> simply setting SPARK_WORKER_INSTANCES to 2 would be sufficient.
>
> I don't think you *have* to set the cores if you have more workers - it
> will default to 8 cores per worker (in your case). But maybe 16 cores per
> node will be too many. You'll have to test. Keep in mind that more workers
> means more memory and such too, so you may need to tweak some other
> settings downward in this case.
>
> On a side note: I've read some people found performance was better when
> they had more workers with less memory each, instead of a single worker
> with tons of memory, because it cut down on garbage collection time. But I
> can't speak to that myself.
>
> In any case, if you increase the number of cores available in your cluster
> (whether per worker, or adding more workers per node, or of course adding
> more nodes) you should see more tasks running concurrently. Whether this
> will actually be *faster* probably depends mainly on whether the CPUs in
> your nodes were really being fully utilized with the current number of
> cores.
>
>
> On Wed, Jul 30, 2014 at 8:30 PM, Darin McBeath <dd...@yahoo.com>
> wrote:
>
> Thanks.
>
> So to make sure I understand.  Since I'm using a 'stand-alone' cluster, I
> would set SPARK_WORKER_INSTANCES to something like 2 (instead of the
> default value of 1).  Is that correct?  But, it also sounds like I need to
> explicitly set a value for SPARKER_WORKER_CORES (based on what the
> documentation states).  What would I want that value to be based on my
> configuration below?  Or, would I leave that alone?
>
>   ------------------------------
>  *From:* Daniel Siegmann <da...@velos.io>
> *To:* user@spark.apache.org; Darin McBeath <dd...@yahoo.com>
> *Sent:* Wednesday, July 30, 2014 5:58 PM
> *Subject:* Re: Number of partitions and Number of concurrent tasks
>
> This is correct behavior. Each "core" can execute exactly one task at a
> time, with each task corresponding to a partition. If your cluster only has
> 24 cores, you can only run at most 24 tasks at once.
>
> You could run multiple workers per node to get more executors. That would
> give you more cores in the cluster. But however many cores you have, each
> core will run only one task at a time.
>
>
> On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <dd...@yahoo.com>
> wrote:
>
>  I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>
> I have an RDD<String> which I've repartitioned so it has 100 partitions
> (hoping to increase the parallelism).
>
> When I do a transformation (such as filter) on this RDD, I can't  seem to
> get more than 24 tasks (my total number of cores across the 3 nodes) going
> at one point in time.  By tasks, I mean the number of tasks that appear
> under the Application UI.  I tried explicitly setting the
> spark.default.parallelism to 48 (hoping I would get 48 tasks concurrently
> running) and verified this in the Application UI for the running
> application but this had no effect.  Perhaps, this is ignored for a
> 'filter' and the default is the total number of cores available.
>
> I'm fairly new with Spark so maybe I'm just missing or misunderstanding
> something fundamental.  Any help would be appreciated.
>
> Thanks.
>
> Darin.
>
>
>
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegmann@velos.io W: www.velos.io
>
>
>
>
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegmann@velos.io W: www.velos.io
>
>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegmann@velos.io W: www.velos.io

Re: Number of partitions and Number of concurrent tasks

Posted by Darin McBeath <dd...@yahoo.com>.

Ok, I set the number of spark worker instances to 2 (below is my startup command).  But, this essentially had the effect of increasing my number of workers from 3 to 6 (which was good) but it also reduced my number of cores per worker from 8 to 4 (which was not so good).  In the end, I would still only be able to concurrently process 24 partitions in parallel.  I'm starting a stand-alone cluster using the spark provided ec2 scripts .  I tried setting the env variable for SPARK_WORKER_CORES in the spark_ec2.py but this had no effect. So, it's not clear if I could even set the SPARK_WORKER_CORES with the ec2 scripts.  Anyway, not sure if there is anything else I can try but at least wanted to document what I did try and the net effect.  I'm open to any suggestions/advice.

 ./spark-ec2 -k key -i key.pem --hadoop-major-version=2 launch -s 3 -t m3.2xlarge -w 3600 --spot-price=.08 -z us-east-1e --worker-instances=2 my-cluster

________________________________
 From: Daniel Siegmann <da...@velos.io>
To: Darin McBeath <dd...@yahoo.com> 
Cc: Daniel Siegmann <da...@velos.io>; "user@spark.apache.org" <us...@spark.apache.org> 
Sent: Thursday, July 31, 2014 10:04 AM
Subject: Re: Number of partitions and Number of concurrent tasks

I haven't configured this myself. I'd start with setting SPARK_WORKER_CORES to a higher value, since that's a bit simpler than adding more workers. This defaults to "all available cores" according to the documentation, so I'm not sure if you can actually set it higher. If not, you can get around this by adding more worker instances; I believe simply setting SPARK_WORKER_INSTANCES to 2 would be sufficient.

I don't think you have to set the cores if you have more workers - it will default to 8 cores per worker (in your case). But maybe 16 cores per node will be too many. You'll have to test. Keep in mind that more workers means more memory and such too, so you may need to tweak some other settings downward in this case.

On a side note: I've read some people found performance was better when they had more workers with less memory each, instead of a single worker with tons of memory, because it cut down on garbage collection time. But I can't speak to that myself.

In any case, if you increase the number of cores available in your cluster (whether per worker, or adding more workers per node, or of course adding more nodes) you should see more tasks running concurrently. Whether this will actually be faster probably depends mainly on whether the CPUs in your nodes were really being fully utilized with the current number of cores.

On Wed, Jul 30, 2014 at 8:30 PM, Darin McBeath <dd...@yahoo.com> wrote:

Thanks.
>
>
>So to make sure I understand.  Since I'm using a 'stand-alone' cluster, I would set SPARK_WORKER_INSTANCES to something like 2 (instead of the default value of 1).  Is that correct?  But, it also sounds like I need to explicitly set a value for SPARKER_WORKER_CORES (based on what the documentation states).  What would I want that value to be based on my configuration below?  Or, would I leave that alone?
>
>
>
>________________________________
> From: Daniel Siegmann <da...@velos.io>
>To: user@spark.apache.org; Darin McBeath <dd...@yahoo.com> 
>Sent: Wednesday, July 30, 2014 5:58 PM
>Subject: Re: Number of partitions and Number of concurrent tasks
> 
>
>
>This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 24 cores, you can only run at most 24 tasks at once.
>
>You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.
>
>
>
>
>On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <dd...@yahoo.com> wrote:
>
>I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>>
>>
>>I have an RDD<String> which I've repartitioned so it has 100 partitions (hoping to increase the parallelism).
>>
>>
>>When I do a transformation (such as filter) on this RDD, I can't  seem to get more than 24 tasks (my total number of cores across the 3 nodes) going at one point in time.  By tasks, I mean the number of tasks that appear under the Application UI.  I tried explicitly setting the spark.default.parallelism to 48 (hoping I would get 48 tasks concurrently running) and verified this in the Application UI for the running application but this had no effect.  Perhaps, this is ignored for a 'filter' and the default is the total number of cores available.
>>
>>
>>I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental.  Any help would be appreciated.
>>
>>
>>Thanks.
>>
>>
>>Darin.
>>
>>
>
>
>-- 
>
>Daniel Siegmann, Software Developer
>Velos
>Accelerating Machine Learning
>
>
>440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>E: daniel.siegmann@velos.ioW: www.velos.io
>
>

-- 

Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegmann@velos.ioW: www.velos.io

Re: Number of partitions and Number of concurrent tasks

Posted by Daniel Siegmann <da...@velos.io>.

I haven't configured this myself. I'd start with setting SPARK_WORKER_CORES
to a higher value, since that's a bit simpler than adding more workers.
This defaults to "all available cores" according to the documentation, so
I'm not sure if you can actually set it higher. If not, you can get around
this by adding more worker instances; I believe simply setting
SPARK_WORKER_INSTANCES to 2 would be sufficient.

I don't think you *have* to set the cores if you have more workers - it
will default to 8 cores per worker (in your case). But maybe 16 cores per
node will be too many. You'll have to test. Keep in mind that more workers
means more memory and such too, so you may need to tweak some other
settings downward in this case.

On a side note: I've read some people found performance was better when
they had more workers with less memory each, instead of a single worker
with tons of memory, because it cut down on garbage collection time. But I
can't speak to that myself.

In any case, if you increase the number of cores available in your cluster
(whether per worker, or adding more workers per node, or of course adding
more nodes) you should see more tasks running concurrently. Whether this
will actually be *faster* probably depends mainly on whether the CPUs in
your nodes were really being fully utilized with the current number of
cores.

On Wed, Jul 30, 2014 at 8:30 PM, Darin McBeath <dd...@yahoo.com> wrote:

> Thanks.
>
> So to make sure I understand.  Since I'm using a 'stand-alone' cluster, I
> would set SPARK_WORKER_INSTANCES to something like 2 (instead of the
> default value of 1).  Is that correct?  But, it also sounds like I need to
> explicitly set a value for SPARKER_WORKER_CORES (based on what the
> documentation states).  What would I want that value to be based on my
> configuration below?  Or, would I leave that alone?
>
>   ------------------------------
>  *From:* Daniel Siegmann <da...@velos.io>
> *To:* user@spark.apache.org; Darin McBeath <dd...@yahoo.com>
> *Sent:* Wednesday, July 30, 2014 5:58 PM
> *Subject:* Re: Number of partitions and Number of concurrent tasks
>
> This is correct behavior. Each "core" can execute exactly one task at a
> time, with each task corresponding to a partition. If your cluster only has
> 24 cores, you can only run at most 24 tasks at once.
>
> You could run multiple workers per node to get more executors. That would
> give you more cores in the cluster. But however many cores you have, each
> core will run only one task at a time.
>
>
> On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <dd...@yahoo.com>
> wrote:
>
>  I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>
> I have an RDD<String> which I've repartitioned so it has 100 partitions
> (hoping to increase the parallelism).
>
> When I do a transformation (such as filter) on this RDD, I can't  seem to
> get more than 24 tasks (my total number of cores across the 3 nodes) going
> at one point in time.  By tasks, I mean the number of tasks that appear
> under the Application UI.  I tried explicitly setting the
> spark.default.parallelism to 48 (hoping I would get 48 tasks concurrently
> running) and verified this in the Application UI for the running
> application but this had no effect.  Perhaps, this is ignored for a
> 'filter' and the default is the total number of cores available.
>
> I'm fairly new with Spark so maybe I'm just missing or misunderstanding
> something fundamental.  Any help would be appreciated.
>
> Thanks.
>
> Darin.
>
>
>
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegmann@velos.io W: www.velos.io
>
>
>

-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegmann@velos.io W: www.velos.io

Re: Number of partitions and Number of concurrent tasks

Posted by Darin McBeath <dd...@yahoo.com>.

Thanks.

So to make sure I understand.  Since I'm using a 'stand-alone' cluster, I would set SPARK_WORKER_INSTANCES to something like 2 (instead of the default value of 1).  Is that correct?  But, it also sounds like I need to explicitly set a value for SPARKER_WORKER_CORES (based on what the documentation states).  What would I want that value to be based on my configuration below?  Or, would I leave that alone?

________________________________
 From: Daniel Siegmann <da...@velos.io>
To: user@spark.apache.org; Darin McBeath <dd...@yahoo.com> 
Sent: Wednesday, July 30, 2014 5:58 PM
Subject: Re: Number of partitions and Number of concurrent tasks

This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 24 cores, you can only run at most 24 tasks at once.

You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.

On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <dd...@yahoo.com> wrote:

I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>
>
>I have an RDD<String> which I've repartitioned so it has 100 partitions (hoping to increase the parallelism).
>
>
>When I do a transformation (such as filter) on this RDD, I can't  seem to get more than 24 tasks (my total number of cores across the 3 nodes) going at one point in time.  By tasks, I mean the number of tasks that appear under the Application UI.  I tried explicitly setting the spark.default.parallelism to 48 (hoping I would get 48 tasks concurrently running) and verified this in the Application UI for the running application but this had no effect.  Perhaps, this is ignored for a 'filter' and the default is the total number of cores available.
>
>
>I'm fairly new with Spark so maybe I'm just missing or misunderstanding something fundamental.  Any help would be appreciated.
>
>
>Thanks.
>
>
>Darin.
>
>

-- 

Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegmann@velos.ioW: www.velos.io

Re: Number of partitions and Number of concurrent tasks

Posted by Daniel Siegmann <da...@velos.io>.

This is correct behavior. Each "core" can execute exactly one task at a
time, with each task corresponding to a partition. If your cluster only has
24 cores, you can only run at most 24 tasks at once.

You could run multiple workers per node to get more executors. That would
give you more cores in the cluster. But however many cores you have, each
core will run only one task at a time.


On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <dd...@yahoo.com> wrote:

> I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>
> I have an RDD<String> which I've repartitioned so it has 100 partitions
> (hoping to increase the parallelism).
>
> When I do a transformation (such as filter) on this RDD, I can't  seem to
> get more than 24 tasks (my total number of cores across the 3 nodes) going
> at one point in time.  By tasks, I mean the number of tasks that appear
> under the Application UI.  I tried explicitly setting the
> spark.default.parallelism to 48 (hoping I would get 48 tasks concurrently
> running) and verified this in the Application UI for the running
> application but this had no effect.  Perhaps, this is ignored for a
> 'filter' and the default is the total number of cores available.
>
> I'm fairly new with Spark so maybe I'm just missing or misunderstanding
> something fundamental.  Any help would be appreciated.
>
> Thanks.
>
> Darin.
>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegmann@velos.io W: www.velos.io