Posted to user@spark.apache.org by unk1102 <um...@gmail.com> on 2015/08/19 21:13:34 UTC

spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

Hi, I have a Spark job which deals with a large, skewed dataset. I have
around 1000 Hive partitions to process across four different tables every
day. If I go with the default of 200 spark.sql.shuffle.partitions, I end up
with 4 * 1000 * 200 = 80000 small files in HDFS, which won't be good for the
HDFS NameNode. I have been told that if you keep creating such a large
number of small files the NameNode will crash; is that true? Please help me
understand. Anyway, to avoid creating small files I set
spark.sql.shuffle.partitions=1. It does seem to create a single output file,
but as I understand it, with only one output partition all the data has to
be shuffled to a single reducer (please correct me if I am wrong). This is
causing memory/timeout issues; how do I deal with it?
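
For illustration, a minimal sketch of the kind of job described above,
assuming a Spark 1.4-era HiveContext; the table name, date column, and
output path are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("daily-hive-job"))
    val sqlContext = new HiveContext(sc)

    // Default is 200 shuffle partitions, so every aggregated write can emit
    // up to 200 part files; 4 tables * 1000 Hive partitions * 200 = 80000 files.
    // Setting it to 1 produces a single part file per write, but funnels the
    // whole shuffle through a single reducer task.
    sqlContext.setConf("spark.sql.shuffle.partitions", "1")

    // Hypothetical query over one of the partitioned Hive tables.
    val result = sqlContext.sql(
      "SELECT key, count(*) AS cnt FROM some_table WHERE dt = '2015-08-19' GROUP BY key")
    result.write.parquet("/hypothetical/output/path")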

I also tried setting spark.shuffle.storage=0.7, but the memory still does
not seem to be enough. I have 25 GB executors with 4 cores each, and 20 such
executors, yet the Spark job still fails. Please guide.
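
For reference, the executor sizing mentioned above corresponds roughly to
the following configuration; this is only a sketch, the app name is made up,
and spark.executor.instances assumes a YARN deployment:

    import org.apache.spark.SparkConf

    // 20 executors, each with 25 GB of heap and 4 cores, as described above.
    val conf = new SparkConf()
      .setAppName("daily-hive-job")            // hypothetical app name
      .set("spark.executor.memory", "25g")     // equivalent to --executor-memory 25g
      .set("spark.executor.cores", "4")        // equivalent to --executor-cores 4
      .set("spark.executor.instances", "20")   // equivalent to --num-executors 20 on YARN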



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-shuffle-partitions-1-seems-to-be-working-fine-but-creates-timeout-for-large-skewed-data-tp24346.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

Posted by Umesh Kacha <um...@gmail.com>.
Hi Hemant, sorry for the confusion. I meant the final output part files in
the final HDFS directory; I never meant intermediate files. Thanks. My goal
is to reduce that many files, because of my use case explained with
calculations in the first email.
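
One approach not discussed in this thread, offered only as a sketch: to get
a small number of final part files without forcing the whole shuffle through
one reducer, leave spark.sql.shuffle.partitions at a reasonable value and
shrink the partition count only for the write step. This assumes a Spark
1.4+ DataFrame named result and a hypothetical output path:

    // Keep the shuffle wide enough to spread the skewed data across executors,
    // then collapse to a handful of part files only for the final write.
    val compacted = result.coalesce(4)    // or result.repartition(4) to force a full shuffle
    compacted.write.parquet("/hypothetical/output/path")
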
On Aug 20, 2015 5:59 PM, "Hemant Bhanawat" <he...@gmail.com> wrote:

> Sorry, I misread your mail. Thanks for pointing that out.
>
> BTW, are the 80000 files shuffle intermediate output and not the final
> output? I assume yes. I didn't know that you can keep intermediate output
> on HDFS and I don't think that is recommended.

Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

Posted by Hemant Bhanawat <he...@gmail.com>.
Sorry, I misread your mail. Thanks for pointing that out.

BTW, are the 80000 files shuffle intermediate output and not the final
output? I assume yes. I didn't know that you can keep intermediate output
on HDFS and I don't think that is recommended.




On Thu, Aug 20, 2015 at 2:43 PM, Hemant Bhanawat <he...@gmail.com>
wrote:

> Looks like you are using hash-based shuffling and not sort-based
> shuffling, which creates a single file per map task.

Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

Posted by Hemant Bhanawat <he...@gmail.com>.
Looks like you are using hash-based shuffling and not sort-based shuffling,
which creates a single file per map task.
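
If that diagnosis is right, the shuffle implementation can be chosen
explicitly. A minimal Spark 1.x sketch; the app name is hypothetical, and
note that sort-based shuffle has been the default since Spark 1.2:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sort-based shuffle writes one data file (plus an index file) per map task,
    // whereas hash-based shuffle writes one file per map task per reduce partition.
    val conf = new SparkConf()
      .setAppName("daily-hive-job")            // hypothetical app name
      .set("spark.shuffle.manager", "sort")    // "hash" selects the older behaviour
    val sc = new SparkContext(conf)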
