Posted to user@spark.apache.org by Jestin Ma <je...@gmail.com> on 2016/07/29 16:02:13 UTC

Tuning level of Parallelism: Increase or decrease?

I am processing ~2 TB of hdfs data using DataFrames. The size of a task is
equal to the block size specified by hdfs, which happens to be 128 MB,
leading to about 15000 tasks.

I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
I'm performing groupBy, count, and an outer-join with another DataFrame of
~200 MB size (~80 MB cached but I don't need to cache it), then saving to
disk.

Right now it takes about 55 minutes, and I've been trying to tune it.

I read in the Spark Tuning Guide that:
*In general, we recommend 2-3 tasks per CPU core in your cluster.*

This means that I should have about 30-50 tasks instead of 15000, and each
task would be much bigger in size. Is my understanding correct, and is this
suggested? I've read from different sources to decrease or increase
parallelism, or even keep it default.

Thank you for your help,
Jestin
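
For reference, here is a minimal sketch of the pipeline described above, assuming Spark 2.x; the paths, formats, and the join key "id" are illustrative assumptions, not details from the thread:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parallelism-tuning").getOrCreate()

// ~2 TB of input; one task per 128 MB HDFS block gives roughly 15000 tasks
val events = spark.read.parquet("hdfs:///data/events")  // hypothetical path

// the ~200 MB DataFrame on the other side of the join
val lookup = spark.read.parquet("hdfs:///data/lookup")  // hypothetical path

// groupBy and count, then the outer join, then save to disk
val counts = events.groupBy("id").count()
val joined = counts.join(lookup, Seq("id"), "outer")
joined.write.parquet("hdfs:///data/output")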

Re: Tuning level of Parallelism: Increase or decrease?

Posted by Sonal Goyal <so...@gmail.com>.
Hi Jestin,

Which of your actions is the bottleneck? Is it the groupBy, the count, or the
join? Or all of them? It may help to tune the most time-consuming task first.

On Monday, August 1, 2016, Nikolay Zhebet <ph...@gmail.com> wrote:

> Yes, Spark always tries to deliver the code to the data (not vice
> versa). But you should realize that if you run a groupBy or join on
> a large dataset, you always have to migrate locally grouped
> data from one worker node to another (this is the shuffle operation, as I
> understand it). At the end of the batch process you can fetch your grouped
> dataset, but under the hood there are a lot of network connections between
> worker nodes, because all of your 2 TB of data was split into 128 MB parts
> and was written to different HDFS DataNodes.
>
> As an example: suppose you analyze your workflow and realize that in most
> cases you group your data by date (YYYY-mm-dd). In that case you can store
> all the data for one day in a single Region Server (if you use the
> Spark-on-HBase DataFrame connector). Then your "group by date" operation
> can be done on the local worker node, without shuffling temporary data to
> other worker nodes. Maybe this article can be useful:
> http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
>
> 2016-08-01 18:56 GMT+03:00 Jestin Ma <jestinwith.an.e@gmail.com>:
>
>> Hi Nikolay, I'm looking at data locality improvements for Spark, and I
>> have conflicting sources on using YARN for Spark.
>>
>> Reynold said that Spark workers automatically take care of data locality
>> here:
>> https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
>>
>> However, I've read elsewhere (
>> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
>> that Spark on YARN increases data locality because YARN tries to place
>> tasks next to HDFS blocks.
>>
>> Can anyone verify/support one side or the other?
>>
>> Thank you,
>> Jestin
>>
>> On Mon, Aug 1, 2016 at 1:15 AM, Nikolay Zhebet <phpapple@gmail.com> wrote:
>>
>>> Hi.
>>> Maybe "data locality" can help you.
>>> If you use groupBy and joins, then most likely you will see a lot of
>>> network operations. This can be very slow. You can try to prepare and
>>> transform your data in a way that minimizes the transfer of temporary
>>> data between worker nodes.
>>>
>>> Try googling "Data locality in Hadoop"
>>>
>>>
>>> 2016-08-01 4:41 GMT+03:00 Jestin Ma <jestinwith.an.e@gmail.com>:
>>>
>>>> It seems that the number of tasks being this large does not matter. Each
>>>> task defaults to the HDFS block size of 128 MB, which I've heard is
>>>> fine. I've tried tuning the block (task) size to be larger and smaller,
>>>> to no avail.
>>>>
>>>> I tried coalescing to 50 but that introduced large data skew and slowed
>>>> down my job a lot.
>>>>
>>>> On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich <andrew@aehrlich.com> wrote:
>>>>
>>>>> 15000 seems like a lot of tasks for that size. Test it out with a
>>>>> .coalesce(50) placed right after loading the data. It will probably either
>>>>> run faster or crash with out of memory errors.
>>>>>
>>>>> On Jul 29, 2016, at 9:02 AM, Jestin Ma <jestinwith.an.e@gmail.com> wrote:
>>>>>
>>>>> I am processing ~2 TB of hdfs data using DataFrames. The size of a
>>>>> task is equal to the block size specified by hdfs, which happens to be 128
>>>>> MB, leading to about 15000 tasks.
>>>>>
>>>>> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
>>>>> I'm performing groupBy, count, and an outer-join with another
>>>>> DataFrame of ~200 MB size (~80 MB cached but I don't need to cache it),
>>>>> then saving to disk.
>>>>>
>>>>> Right now it takes about 55 minutes, and I've been trying to tune it.
>>>>>
>>>>> I read in the Spark Tuning Guide that:
>>>>> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>>>>>
>>>>> This means that I should have about 30-50 tasks instead of 15000, and
>>>>> each task would be much bigger in size. Is my understanding correct, and is
>>>>> this suggested? I've read from different sources to decrease or increase
>>>>> parallelism, or even keep it default.
>>>>>
>>>>> Thank you for your help,
>>>>> Jestin
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

-- 
Best Regards,
Sonal
Founder, Nube Technologies <http://www.nubetech.co>
Reifier at Strata Hadoop World <https://www.youtube.com/watch?v=eD3LkpPQIgM>
Reifier at Spark Summit 2015
<https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>

<http://in.linkedin.com/in/sonalgoyal>

Re: Tuning level of Parallelism: Increase or decrease?

Posted by Nikolay Zhebet <ph...@gmail.com>.
Yes, Spark always tries to deliver the code to the data (not vice
versa). But you should realize that if you run a groupBy or join on
a large dataset, you always have to migrate locally grouped
data from one worker node to another (this is the shuffle operation, as I
understand it). At the end of the batch process you can fetch your grouped
dataset, but under the hood there are a lot of network connections between
worker nodes, because all of your 2 TB of data was split into 128 MB parts
and was written to different HDFS DataNodes.

As an example: suppose you analyze your workflow and realize that in most
cases you group your data by date (YYYY-mm-dd). In that case you can store
all the data for one day in a single Region Server (if you use the
Spark-on-HBase DataFrame connector). Then your "group by date" operation can
be done on the local worker node, without shuffling temporary data to other
worker nodes. Maybe this article can be useful:
http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
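
To translate the idea into plain Spark, here is a hedged sketch (the column name "date" and the path are assumptions, and the HBase-specific setup from the article is not shown): shuffle once on the common key up front, so later operations keyed on it can reuse that layout.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val events = spark.read.parquet("hdfs:///data/events")  // hypothetical path

// One explicit shuffle on the grouping key...
val byDate = events.repartition(events("date"))

// ...so that subsequent wide operations on "date" can reuse the
// partitioning instead of each triggering its own shuffle.
val daily = byDate.groupBy("date").count()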

2016-08-01 18:56 GMT+03:00 Jestin Ma <je...@gmail.com>:

> Hi Nikolay, I'm looking at data locality improvements for Spark, and I
> have conflicting sources on using YARN for Spark.
>
> Reynold said that Spark workers automatically take care of data locality
> here:
> https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
>
> However, I've read elsewhere (
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
> that Spark on YARN increases data locality because YARN tries to place
> tasks next to HDFS blocks.
>
> Can anyone verify/support one side or the other?
>
> Thank you,
> Jestin
>
> On Mon, Aug 1, 2016 at 1:15 AM, Nikolay Zhebet <ph...@gmail.com> wrote:
>
>> Hi.
>> Maybe "data locality" can help you.
>> If you use groupBy and joins, then most likely you will see a lot of
>> network operations. This can be very slow. You can try to prepare and
>> transform your data in a way that minimizes the transfer of temporary
>> data between worker nodes.
>>
>> Try googling "Data locality in Hadoop"
>>
>>
>> 2016-08-01 4:41 GMT+03:00 Jestin Ma <je...@gmail.com>:
>>
>>> It seems that the number of tasks being this large does not matter. Each
>>> task defaults to the HDFS block size of 128 MB, which I've heard is
>>> fine. I've tried tuning the block (task) size to be larger and smaller,
>>> to no avail.
>>>
>>> I tried coalescing to 50 but that introduced large data skew and slowed
>>> down my job a lot.
>>>
>>> On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich <an...@aehrlich.com>
>>> wrote:
>>>
>>>> 15000 seems like a lot of tasks for that size. Test it out with a
>>>> .coalesce(50) placed right after loading the data. It will probably either
>>>> run faster or crash with out of memory errors.
>>>>
>>>> On Jul 29, 2016, at 9:02 AM, Jestin Ma <je...@gmail.com>
>>>> wrote:
>>>>
>>>> I am processing ~2 TB of hdfs data using DataFrames. The size of a task
>>>> is equal to the block size specified by hdfs, which happens to be 128 MB,
>>>> leading to about 15000 tasks.
>>>>
>>>> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
>>>> I'm performing groupBy, count, and an outer-join with another DataFrame
>>>> of ~200 MB size (~80 MB cached but I don't need to cache it), then saving
>>>> to disk.
>>>>
>>>> Right now it takes about 55 minutes, and I've been trying to tune it.
>>>>
>>>> I read in the Spark Tuning Guide that:
>>>> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>>>>
>>>> This means that I should have about 30-50 tasks instead of 15000, and
>>>> each task would be much bigger in size. Is my understanding correct, and is
>>>> this suggested? I've read from different sources to decrease or increase
>>>> parallelism, or even keep it default.
>>>>
>>>> Thank you for your help,
>>>> Jestin
>>>>
>>>>
>>>>
>>>
>>
>

Re: Tuning level of Parallelism: Increase or decrease?

Posted by Yong Zhang <ja...@hotmail.com>.
Data locality is part of the job/task scheduler's responsibility. So both links you cited originally are correct: one is for the standalone mode that comes with Spark, the other is for YARN. Both have this ability.


But YARN, as a very popular scheduling component, comes with far more features than the standalone mode. You can research it further on Google.


Yong


________________________________
From: Jestin Ma <je...@gmail.com>
Sent: Tuesday, August 2, 2016 7:11 PM
To: Jacek Laskowski
Cc: Nikolay Zhebet; Andrew Ehrlich; user
Subject: Re: Tuning level of Parallelism: Increase or decrease?

Hi Jacek,
I found this page of your book here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html


which says:  "It is therefore important to have Spark running on Hadoop YARN cluster<https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/> if the data comes from HDFS. In Spark on YARN<https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/> Spark tries to place tasks alongside HDFS blocks."


So my reasoning was that since Spark takes care of data locality when workers load data from HDFS, I can't see why running on YARN is more important.

Hope this makes my question clearer.


On Tue, Aug 2, 2016 at 3:54 PM, Jacek Laskowski <ja...@japila.pl> wrote:
On Mon, Aug 1, 2016 at 5:56 PM, Jestin Ma <je...@gmail.com> wrote:
> Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
> conflicting sources on using YARN for Spark.
>
> Reynold said that Spark workers automatically take care of data locality
> here:
> https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
>
> However, I've read elsewhere
> (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
> that Spark on YARN increases data locality because YARN tries to place tasks
> next to HDFS blocks.
>
> Can anyone verify/support one side or the other?

Hi Jestin,

I'm the author of the latter. I can't see how Reynold's answer
"conflicts" with what I wrote in the notes. Could you elaborate?

I certainly may be wrong.

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


Re: Tuning level of Parallelism: Increase or decrease?

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi Jestin,

I need to expand on this in the Spark notes.

Spark can handle data locality itself, but if the Spark nodes run on
machines separate from the HDFS DataNodes, there is always the network
between them, which makes performance worse compared to co-locating the
Spark and HDFS nodes.

These are mere details and do not really influence the main question
about the level of parallelism (though it is related).
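
As a hedged aside, the scheduler's locality behavior can be tuned via the standard spark.locality.wait setting; the value below is purely illustrative, not a recommendation from this thread:

import org.apache.spark.sql.SparkSession

// spark.locality.wait controls how long the scheduler waits for a
// node-local slot before falling back to rack-local or arbitrary
// placement; the default is 3s.
val spark = SparkSession.builder()
  .config("spark.locality.wait", "10s")  // illustrative value
  .getOrCreate()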

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Aug 3, 2016 at 1:11 AM, Jestin Ma <je...@gmail.com> wrote:
> Hi Jacek,
> I found this page of your book here:
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html
>
> which says:  "It is therefore important to have Spark running on Hadoop YARN
> cluster if the data comes from HDFS. In Spark on YARN Spark tries to place
> tasks alongside HDFS blocks."
>
>
> So my reasoning was that since Spark takes care of data locality when
> workers load data from HDFS, I can't see why running on YARN is more
> important.
>
> Hope this makes my question clearer.
>
>
> On Tue, Aug 2, 2016 at 3:54 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>>
>> On Mon, Aug 1, 2016 at 5:56 PM, Jestin Ma <je...@gmail.com>
>> wrote:
>> > Hi Nikolay, I'm looking at data locality improvements for Spark, and I
>> > have
>> > conflicting sources on using YARN for Spark.
>> >
>> > Reynold said that Spark workers automatically take care of data locality
>> > here:
>> >
>> > https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
>> >
>> > However, I've read elsewhere
>> >
>> > (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
>> > that Spark on YARN increases data locality because YARN tries to place
>> > tasks
>> > next to HDFS blocks.
>> >
>> > Can anyone verify/support one side or the other?
>>
>> Hi Jestin,
>>
>> I'm the author of the latter. I can't see how Reynold's answer
>> "conflicts" with what I wrote in the notes. Could you elaborate?
>>
>> I certainly may be wrong.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>
>



Re: Tuning level of Parallelism: Increase or decrease?

Posted by Jestin Ma <je...@gmail.com>.
Hi Jacek,
I found this page of your book here:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html

which says:  "It is therefore important to have Spark running on Hadoop
YARN cluster
<https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/> if
the data comes from HDFS. In Spark on YARN
<https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/> Spark
tries to place tasks alongside HDFS blocks."


So my reasoning was that since Spark takes care of data locality when
workers load data from HDFS, I can't see why running on YARN is more
important.

Hope this makes my question clearer.


On Tue, Aug 2, 2016 at 3:54 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> On Mon, Aug 1, 2016 at 5:56 PM, Jestin Ma <je...@gmail.com>
> wrote:
> > Hi Nikolay, I'm looking at data locality improvements for Spark, and I
> have
> > conflicting sources on using YARN for Spark.
> >
> > Reynold said that Spark workers automatically take care of data locality
> > here:
> >
> https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
> >
> > However, I've read elsewhere
> > (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/
> )
> > that Spark on YARN increases data locality because YARN tries to place
> tasks
> > next to HDFS blocks.
> >
> > Can anyone verify/support one side or the other?
>
> Hi Jestin,
>
> I'm the author of the latter. I can't see how Reynold's answer
> "conflicts" with what I wrote in the notes. Could you elaborate?
>
> I certainly may be wrong.
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>

Re: Tuning level of Parallelism: Increase or decrease?

Posted by Jacek Laskowski <ja...@japila.pl>.
On Mon, Aug 1, 2016 at 5:56 PM, Jestin Ma <je...@gmail.com> wrote:
> Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
> conflicting sources on using YARN for Spark.
>
> Reynold said that Spark workers automatically take care of data locality
> here:
> https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
>
> However, I've read elsewhere
> (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
> that Spark on YARN increases data locality because YARN tries to place tasks
> next to HDFS blocks.
>
> Can anyone verify/support one side or the other?

Hi Jestin,

I'm the author of the latter. I can't see how Reynold's answer
"conflicts" with what I wrote in the notes. Could you elaborate?

I certainly may be wrong.

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski



Re: Tuning level of Parallelism: Increase or decrease?

Posted by Jestin Ma <je...@gmail.com>.
Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
conflicting sources on using YARN for Spark.

Reynold said that Spark workers automatically take care of data locality
here:
https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS

However, I've read elsewhere (
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
that Spark on YARN increases data locality because YARN tries to place
tasks next to HDFS blocks.

Can anyone verify/support one side or the other?

Thank you,
Jestin

On Mon, Aug 1, 2016 at 1:15 AM, Nikolay Zhebet <ph...@gmail.com> wrote:

> Hi.
> Maybe "data locality" can help you.
> If you use groupBy and joins, then most likely you will see a lot of
> network operations. This can be very slow. You can try to prepare and
> transform your data in a way that minimizes the transfer of temporary
> data between worker nodes.
>
> Try googling "Data locality in Hadoop"
>
>
> 2016-08-01 4:41 GMT+03:00 Jestin Ma <je...@gmail.com>:
>
>> It seems that the number of tasks being this large does not matter. Each
>> task defaults to the HDFS block size of 128 MB, which I've heard is
>> fine. I've tried tuning the block (task) size to be larger and smaller,
>> to no avail.
>>
>> I tried coalescing to 50 but that introduced large data skew and slowed
>> down my job a lot.
>>
>> On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich <an...@aehrlich.com>
>> wrote:
>>
>>> 15000 seems like a lot of tasks for that size. Test it out with a
>>> .coalesce(50) placed right after loading the data. It will probably either
>>> run faster or crash with out of memory errors.
>>>
>>> On Jul 29, 2016, at 9:02 AM, Jestin Ma <je...@gmail.com>
>>> wrote:
>>>
>>> I am processing ~2 TB of hdfs data using DataFrames. The size of a task
>>> is equal to the block size specified by hdfs, which happens to be 128 MB,
>>> leading to about 15000 tasks.
>>>
>>> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
>>> I'm performing groupBy, count, and an outer-join with another DataFrame
>>> of ~200 MB size (~80 MB cached but I don't need to cache it), then saving
>>> to disk.
>>>
>>> Right now it takes about 55 minutes, and I've been trying to tune it.
>>>
>>> I read in the Spark Tuning Guide that:
>>> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>>>
>>> This means that I should have about 30-50 tasks instead of 15000, and
>>> each task would be much bigger in size. Is my understanding correct, and is
>>> this suggested? I've read from different sources to decrease or increase
>>> parallelism, or even keep it default.
>>>
>>> Thank you for your help,
>>> Jestin
>>>
>>>
>>>
>>
>

Re: Tuning level of Parallelism: Increase or decrease?

Posted by Nikolay Zhebet <ph...@gmail.com>.
Hi.
Maybe "data locality" can help you.
If you use groupBy and joins, then most likely you will see a lot of network
operations. This can be very slow. You can try to prepare and transform your
data in a way that minimizes the transfer of temporary data between
worker nodes.

Try googling "Data locality in Hadoop"


2016-08-01 4:41 GMT+03:00 Jestin Ma <je...@gmail.com>:

> It seems that the number of tasks being this large does not matter. Each
> task defaults to the HDFS block size of 128 MB, which I've heard is
> fine. I've tried tuning the block (task) size to be larger and smaller,
> to no avail.
>
> I tried coalescing to 50 but that introduced large data skew and slowed
> down my job a lot.
>
> On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich <an...@aehrlich.com>
> wrote:
>
>> 15000 seems like a lot of tasks for that size. Test it out with a
>> .coalesce(50) placed right after loading the data. It will probably either
>> run faster or crash with out of memory errors.
>>
>> On Jul 29, 2016, at 9:02 AM, Jestin Ma <je...@gmail.com> wrote:
>>
>> I am processing ~2 TB of hdfs data using DataFrames. The size of a task
>> is equal to the block size specified by hdfs, which happens to be 128 MB,
>> leading to about 15000 tasks.
>>
>> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
>> I'm performing groupBy, count, and an outer-join with another DataFrame
>> of ~200 MB size (~80 MB cached but I don't need to cache it), then saving
>> to disk.
>>
>> Right now it takes about 55 minutes, and I've been trying to tune it.
>>
>> I read in the Spark Tuning Guide that:
>> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>>
>> This means that I should have about 30-50 tasks instead of 15000, and
>> each task would be much bigger in size. Is my understanding correct, and is
>> this suggested? I've read from different sources to decrease or increase
>> parallelism, or even keep it default.
>>
>> Thank you for your help,
>> Jestin
>>
>>
>>
>

Re: Tuning level of Parallelism: Increase or decrease?

Posted by Jestin Ma <je...@gmail.com>.
It seems that the number of tasks being this large does not matter. Each task
defaults to the HDFS block size of 128 MB, which I've heard is fine. I've
tried tuning the block (task) size to be larger and smaller, to no avail.

I tried coalescing to 50 but that introduced large data skew and slowed
down my job a lot.
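
One hedged alternative, not something tried in this thread: repartition instead of coalesce, since repartition performs a full shuffle and spreads rows evenly, whereas coalesce only merges existing partitions and can leave them uneven.

// coalesce(n) merges existing partitions without a shuffle, so the
// merged partitions can be unevenly sized (skew).
// repartition(n) does a full shuffle and balances rows across n
// partitions, at the cost of moving the data.
val rebalanced = df.repartition(200)  // df and the 200 are illustrative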

On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich <an...@aehrlich.com> wrote:

> 15000 seems like a lot of tasks for that size. Test it out with a
> .coalesce(50) placed right after loading the data. It will probably either
> run faster or crash with out of memory errors.
>
> On Jul 29, 2016, at 9:02 AM, Jestin Ma <je...@gmail.com> wrote:
>
> I am processing ~2 TB of hdfs data using DataFrames. The size of a task is
> equal to the block size specified by hdfs, which happens to be 128 MB,
> leading to about 15000 tasks.
>
> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
> I'm performing groupBy, count, and an outer-join with another DataFrame of
> ~200 MB size (~80 MB cached but I don't need to cache it), then saving to
> disk.
>
> Right now it takes about 55 minutes, and I've been trying to tune it.
>
> I read in the Spark Tuning Guide that:
> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>
> This means that I should have about 30-50 tasks instead of 15000, and each
> task would be much bigger in size. Is my understanding correct, and is this
> suggested? I've read from difference sources to decrease or increase
> parallelism, or even keep it default.
>
> Thank you for your help,
> Jestin
>
>
>

Re: Tuning level of Parallelism: Increase or decrease?

Posted by Andrew Ehrlich <an...@aehrlich.com>.
15000 seems like a lot of tasks for that size. Test it out with a .coalesce(50) placed right after loading the data. It will probably either run faster or crash with out of memory errors.
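
Concretely, something like this (assuming an existing SparkSession named spark; the path and format are made up):

// Cut the ~15000 block-sized input partitions down to 50 immediately
// after the load, before any wide transformation runs.
val df = spark.read.parquet("hdfs:///data/events").coalesce(50)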

> On Jul 29, 2016, at 9:02 AM, Jestin Ma <je...@gmail.com> wrote:
> 
> I am processing ~2 TB of hdfs data using DataFrames. The size of a task is equal to the block size specified by hdfs, which happens to be 128 MB, leading to about 15000 tasks.
> 
> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
> I'm performing groupBy, count, and an outer-join with another DataFrame of ~200 MB size (~80 MB cached but I don't need to cache it), then saving to disk.
> 
> Right now it takes about 55 minutes, and I've been trying to tune it.
> 
> I read in the Spark Tuning Guide that:
> In general, we recommend 2-3 tasks per CPU core in your cluster.
> 
> This means that I should have about 30-50 tasks instead of 15000, and each task would be much bigger in size. Is my understanding correct, and is this suggested? I've read from different sources to decrease or increase parallelism, or even keep it default.
> 
> Thank you for your help,
> Jestin