You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by t3l <t3...@threelights.de> on 2015/10/20 14:00:03 UTC

Ahhhh... Spark creates >30000 partitions... What can I do?

I have dataset consisting of 50000 binary files (each between 500kb and
2MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
cluster are also the workers for Spark. I open the files as a RDD using
sc.binaryFiles("hdfs:///path_to_directory").When I run the first action that
involves this RDD, Spark spawns a RDD with more than 30000 partitions. And
this takes ages to process these partitions even if you simply run "count".
Performing a "repartition" directly after loading does not help, because
Spark seems to insist on materializing the RDD created by binaryFiles first.

How I can get around this?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Ahhhh-Spark-creates-30000-partitions-What-can-I-do-tp25140.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

(SOLVED) Ahhhh... Spark creates >30000 partitions... What can I do?

Posted by t3l <t3...@threelights.de>.

I was able to solve this by myself. What I did is changing the way spark
computes the partitioning for binary files.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Ahhhh-Spark-creates-30000-partitions-What-can-I-do-tp25140p25170.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

Posted by Ranadip Chatterjee <ra...@gmail.com>.

T3l,

Did Sean Owen's suggestion help? If not, can you please share the behaviour?

Cheers.
On 20 Oct 2015 11:02 pm, "Lan Jiang" <lj...@gmail.com> wrote:

> I think the data file is binary per the original post. So in this case,
> sc.binaryFiles should be used. However, I still recommend against using so
> many small binary files as
>
> 1. They are not good for batch I/O
> 2. They put too many memory pressure on namenode.
>
> Lan
>
>
> On Oct 20, 2015, at 11:20 AM, Deenar Toraskar <de...@gmail.com>
> wrote:
>
> also check out wholeTextFiles
>
>
> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int)
>
> On 20 October 2015 at 15:04, Lan Jiang <lj...@gmail.com> wrote:
>
>> As Francois pointed out, you are encountering a classic small file
>> anti-pattern. One solution I used in the past is to wrap all these small
>> binary files into a sequence file or avro file. For example, the avro
>> schema can have two fields: filename: string and binaryname:byte[]. Thus
>> your file is splittable and will not create so many partitions.
>>
>> Lan
>>
>>
>> On Oct 20, 2015, at 8:03 AM, François Pelletier <
>> newsletters@francoispelletier.org> wrote:
>>
>> You should aggregate your files in larger chunks before doing anything
>> else. HDFS is not fit for small files. It will bloat it and cause you a lot
>> of performance issues. Target a few hundred MB chunks partition size and
>> then save those files back to hdfs and then delete the original ones. You
>> can read, use coalesce and the saveAsXXX on the result.
>>
>> I had the same kind of problem once and solved it in bunching 100's of
>> files together in larger ones. I used text files with bzip2 compression.
>>
>>
>>
>> Le 2015-10-20 08:42, Sean Owen a écrit :
>>
>> coalesce without a shuffle? it shouldn't be an action. It just treats
>> many partitions as one.
>>
>> On Tue, Oct 20, 2015 at 1:00 PM, t3l <t3...@threelights.de> wrote:
>>
>>>
>>> I have dataset consisting of 50000 binary files (each between 500kb and
>>> 2MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
>>> cluster are also the workers for Spark. I open the files as a RDD using
>>> sc.binaryFiles("hdfs:///path_to_directory").When I run the first action
>>> that
>>> involves this RDD, Spark spawns a RDD with more than 30000 partitions.
>>> And
>>> this takes ages to process these partitions even if you simply run
>>> "count".
>>> Performing a "repartition" directly after loading does not help, because
>>> Spark seems to insist on materializing the RDD created by binaryFiles
>>> first.
>>>
>>> How I can get around this?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Ahhhh-Spark-creates-30000-partitions-What-can-I-do-tp25140.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>>> <http://nabble.com/>.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>>
>>
>>
>>
>
>

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

Posted by Lan Jiang <lj...@gmail.com>.

I think the data file is binary per the original post. So in this case, sc.binaryFiles should be used. However, I still recommend against using so many small binary files as 

1. They are not good for batch I/O
2. They put too many memory pressure on namenode.

Lan


> On Oct 20, 2015, at 11:20 AM, Deenar Toraskar <de...@gmail.com> wrote:
> 
> also check out wholeTextFiles
> 
> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int) <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int)>
> 
> On 20 October 2015 at 15:04, Lan Jiang <ljiang2@gmail.com <ma...@gmail.com>> wrote:
> As Francois pointed out, you are encountering a classic small file anti-pattern. One solution I used in the past is to wrap all these small binary files into a sequence file or avro file. For example, the avro schema can have two fields: filename: string and binaryname:byte[]. Thus your file is splittable and will not create so many partitions.
> 
> Lan
> 
> 
>> On Oct 20, 2015, at 8:03 AM, François Pelletier <newsletters@francoispelletier.org <ma...@francoispelletier.org>> wrote:
>> 
>> You should aggregate your files in larger chunks before doing anything else. HDFS is not fit for small files. It will bloat it and cause you a lot of performance issues. Target a few hundred MB chunks partition size and then save those files back to hdfs and then delete the original ones. You can read, use coalesce and the saveAsXXX on the result.
>> 
>> I had the same kind of problem once and solved it in bunching 100's of files together in larger ones. I used text files with bzip2 compression.
>> 
>> 
>> 
>> Le 2015-10-20 08:42, Sean Owen a écrit :
>>> coalesce without a shuffle? it shouldn't be an action. It just treats many partitions as one.
>>> 
>>> On Tue, Oct 20, 2015 at 1:00 PM, t3l <t3l@threelights.de <ma...@threelights.de>> wrote:
>>> 
>>> I have dataset consisting of 50000 binary files (each between 500kb and
>>> 2MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
>>> cluster are also the workers for Spark. I open the files as a RDD using
>>> sc.binaryFiles("hdfs:///path_to_directory <>").When I run the first action that
>>> involves this RDD, Spark spawns a RDD with more than 30000 partitions. And
>>> this takes ages to process these partitions even if you simply run "count".
>>> Performing a "repartition" directly after loading does not help, because
>>> Spark seems to insist on materializing the RDD created by binaryFiles first.
>>> 
>>> How I can get around this?
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Ahhhh-Spark-creates-30000-partitions-What-can-I-do-tp25140.html <http://apache-spark-user-list.1001560.n3.nabble.com/Ahhhh-Spark-creates-30000-partitions-What-can-I-do-tp25140.html>
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com <http://nabble.com/>.
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>>> For additional commands, e-mail: user-help@spark.apache.org <ma...@spark.apache.org>
>>> 
>>> 
>> 
> 
>

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

Posted by Deenar Toraskar <de...@gmail.com>.

also check out wholeTextFiles

https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int)

On 20 October 2015 at 15:04, Lan Jiang <lj...@gmail.com> wrote:

> As Francois pointed out, you are encountering a classic small file
> anti-pattern. One solution I used in the past is to wrap all these small
> binary files into a sequence file or avro file. For example, the avro
> schema can have two fields: filename: string and binaryname:byte[]. Thus
> your file is splittable and will not create so many partitions.
>
> Lan
>
>
> On Oct 20, 2015, at 8:03 AM, François Pelletier <
> newsletters@francoispelletier.org> wrote:
>
> You should aggregate your files in larger chunks before doing anything
> else. HDFS is not fit for small files. It will bloat it and cause you a lot
> of performance issues. Target a few hundred MB chunks partition size and
> then save those files back to hdfs and then delete the original ones. You
> can read, use coalesce and the saveAsXXX on the result.
>
> I had the same kind of problem once and solved it in bunching 100's of
> files together in larger ones. I used text files with bzip2 compression.
>
>
>
> Le 2015-10-20 08:42, Sean Owen a écrit :
>
> coalesce without a shuffle? it shouldn't be an action. It just treats many
> partitions as one.
>
> On Tue, Oct 20, 2015 at 1:00 PM, t3l <t3...@threelights.de> wrote:
>
>>
>> I have dataset consisting of 50000 binary files (each between 500kb and
>> 2MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
>> cluster are also the workers for Spark. I open the files as a RDD using
>> sc.binaryFiles("hdfs:///path_to_directory").When I run the first action
>> that
>> involves this RDD, Spark spawns a RDD with more than 30000 partitions. And
>> this takes ages to process these partitions even if you simply run
>> "count".
>> Performing a "repartition" directly after loading does not help, because
>> Spark seems to insist on materializing the RDD created by binaryFiles
>> first.
>>
>> How I can get around this?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Ahhhh-Spark-creates-30000-partitions-What-can-I-do-tp25140.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>
>
>

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

Posted by Lan Jiang <lj...@gmail.com>.

As Francois pointed out, you are encountering a classic small file anti-pattern. One solution I used in the past is to wrap all these small binary files into a sequence file or avro file. For example, the avro schema can have two fields: filename: string and binaryname:byte[]. Thus your file is splittable and will not create so many partitions.

Lan


> On Oct 20, 2015, at 8:03 AM, François Pelletier <ne...@francoispelletier.org> wrote:
> 
> You should aggregate your files in larger chunks before doing anything else. HDFS is not fit for small files. It will bloat it and cause you a lot of performance issues. Target a few hundred MB chunks partition size and then save those files back to hdfs and then delete the original ones. You can read, use coalesce and the saveAsXXX on the result.
> 
> I had the same kind of problem once and solved it in bunching 100's of files together in larger ones. I used text files with bzip2 compression.
> 
> 
> 
> Le 2015-10-20 08:42, Sean Owen a écrit :
>> coalesce without a shuffle? it shouldn't be an action. It just treats many partitions as one.
>> 
>> On Tue, Oct 20, 2015 at 1:00 PM, t3l <t3l@threelights.de <ma...@threelights.de>> wrote:
>> 
>> I have dataset consisting of 50000 binary files (each between 500kb and
>> 2MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
>> cluster are also the workers for Spark. I open the files as a RDD using
>> sc.binaryFiles("hdfs:///path_to_directory").When I run the first action that
>> involves this RDD, Spark spawns a RDD with more than 30000 partitions. And
>> this takes ages to process these partitions even if you simply run "count".
>> Performing a "repartition" directly after loading does not help, because
>> Spark seems to insist on materializing the RDD created by binaryFiles first.
>> 
>> How I can get around this?
>> 
>> 
>> 
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Ahhhh-Spark-creates-30000-partitions-What-can-I-do-tp25140.html <http://apache-spark-user-list.1001560.n3.nabble.com/Ahhhh-Spark-creates-30000-partitions-What-can-I-do-tp25140.html>
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>> For additional commands, e-mail: user-help@spark.apache.org <ma...@spark.apache.org>
>> 
>> 
>

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

Posted by François Pelletier <ne...@francoispelletier.org>.

You should aggregate your files in larger chunks before doing anything
else. HDFS is not fit for small files. It will bloat it and cause you a
lot of performance issues. Target a few hundred MB chunks partition size
and then save those files back to hdfs and then delete the original
ones. You can read, use coalesce and the saveAsXXX on the result.

I had the same kind of problem once and solved it in bunching 100's of
files together in larger ones. I used text files with bzip2 compression.



Le 2015-10-20 08:42, Sean Owen a écrit :
> coalesce without a shuffle? it shouldn't be an action. It just treats
> many partitions as one.
>
> On Tue, Oct 20, 2015 at 1:00 PM, t3l <t3l@threelights.de
> <ma...@threelights.de>> wrote:
>
>
>     I have dataset consisting of 50000 binary files (each between
>     500kb and
>     2MB). They are stored in HDFS on a Hadoop cluster. The datanodes
>     of the
>     cluster are also the workers for Spark. I open the files as a RDD
>     using
>     sc.binaryFiles("hdfs:///path_to_directory").When I run the first
>     action that
>     involves this RDD, Spark spawns a RDD with more than 30000
>     partitions. And
>     this takes ages to process these partitions even if you simply run
>     "count".
>     Performing a "repartition" directly after loading does not help,
>     because
>     Spark seems to insist on materializing the RDD created by
>     binaryFiles first.
>
>     How I can get around this?
>
>
>
>     --
>     View this message in context:
>     http://apache-spark-user-list.1001560.n3.nabble.com/Ahhhh-Spark-creates-30000-partitions-What-can-I-do-tp25140.html
>     Sent from the Apache Spark User List mailing list archive at
>     Nabble.com.
>
>     ---------------------------------------------------------------------
>     To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>     <ma...@spark.apache.org>
>     For additional commands, e-mail: user-help@spark.apache.org
>     <ma...@spark.apache.org>
>
>

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

Posted by Sean Owen <so...@cloudera.com>.

coalesce without a shuffle? it shouldn't be an action. It just treats many
partitions as one.

On Tue, Oct 20, 2015 at 1:00 PM, t3l <t3...@threelights.de> wrote:

>
> I have dataset consisting of 50000 binary files (each between 500kb and
> 2MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
> cluster are also the workers for Spark. I open the files as a RDD using
> sc.binaryFiles("hdfs:///path_to_directory").When I run the first action
> that
> involves this RDD, Spark spawns a RDD with more than 30000 partitions. And
> this takes ages to process these partitions even if you simply run "count".
> Performing a "repartition" directly after loading does not help, because
> Spark seems to insist on materializing the RDD created by binaryFiles
> first.
>
> How I can get around this?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Ahhhh-Spark-creates-30000-partitions-What-can-I-do-tp25140.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>