Posted to user@spark.apache.org by unk1102 <um...@gmail.com> on 2015/10/08 20:13:02 UTC

How to increase Spark partitions for the DataFrame?

Hi, I have the following code, which reads ORC files from HDFS; it loads a
directory that contains 12 ORC files. Since the HDFS directory contains 12
files, Spark creates 12 partitions by default. The directory is huge: when
the ORC files are decompressed, the data is around 10 GB. How do I increase
the number of partitions for the code below so that my Spark job runs faster
and does not hang for a long time while shuffling 10 GB of data in only 12
partitions? Please guide.

DataFrame df =
hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
df.select().groupBy(..)




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: How to increase Spark partitions for the DataFrame?

Posted by Ted Yu <yu...@gmail.com>.
bq. contains 12 files/blocks

Looks like you have hit the limit of the parallelism these files can provide.

If you had a larger dataset, you would have more partitions.
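
A minimal sketch of the repartition approach discussed further down this
thread, assuming the Spark 1.5-era Java DataFrame API and the hiveContext
from the original post; the partition count of 100 and the column name
"someColumn" are illustrative values, not taken from the thread:

// Sketch only: 100 and "someColumn" are made-up values.
DataFrame df = hiveContext.read().format("orc")
    .load("/hdfs/path/to/orc/files/")
    .repartition(100); // spreads the 12 input partitions over 100 tasks (adds a shuffle)
df.groupBy("someColumn").count().show();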

On Thu, Oct 8, 2015 at 12:21 PM, Umesh Kacha <um...@gmail.com> wrote:

> Hi Lan thanks for the reply. I have tried to do the following but it did
> not increase partition
>
> DataFrame df = hiveContext.read().format("orc").load("/hdfs/path/to/orc/
> files/").repartition(100);
>
> Yes I have checked in namenode ui ORC files contains 12 files/blocks of
> 128 MB each and ORC files when decompressed its around 10 GB and its
> uncompressed file size is around 1 GB
>
> On Fri, Oct 9, 2015 at 12:43 AM, Lan Jiang <lj...@gmail.com> wrote:
>
>> Hmm, that’s odd.
>>
>> You can always use repartition(n) to increase the partition number, but
>> then there will be shuffle. How large is your ORC file? Have you used
>> NameNode UI to check how many HDFS blocks each ORC file has?
>>
>> Lan
>>
>>
>> On Oct 8, 2015, at 2:08 PM, Umesh Kacha <um...@gmail.com> wrote:
>>
>> Hi Lan, thanks for the response yes I know and I have confirmed in UI
>> that it has only 12 partitions because of 12 HDFS blocks and hive orc file
>> strip size is 33554432.
>>
>> On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <lj...@gmail.com> wrote:
>>
>>> The partition number should be the same as the HDFS block number instead
>>> of file number. Did you confirmed from the spark UI that only 12 partitions
>>> were created? What is your ORC orc.stripe.size?
>>>
>>> Lan
>>>
>>>
>>> > On Oct 8, 2015, at 1:13 PM, unk1102 <um...@gmail.com> wrote:
>>> >
>>> > Hi I have the following code where I read ORC files from HDFS and it
>>> loads
>>> > directory which contains 12 ORC files. Now since HDFS directory
>>> contains 12
>>> > files it will create 12 partitions by default. These directory is huge
>>> and
>>> > when ORC files gets decompressed it becomes around 10 GB how do I
>>> increase
>>> > partitions for the below code so that my Spark job runs faster and
>>> does not
>>> > hang for long time because of reading 10 GB files through shuffle in 12
>>> > partitions. Please guide.
>>> >
>>> > DataFrame df =
>>> > hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
>>> > df.select().groupby(..)
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
>>> > Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> > For additional commands, e-mail: user-help@spark.apache.org
>>> >
>>>
>>>
>>
>>
>

Re: How to increase Spark partitions for the DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Lan, thanks for the reply. I have tried the following, but it did not
increase the number of partitions:

DataFrame df = hiveContext.read().format("orc").load("/hdfs/path/to/orc/
files/").repartition(100);

Yes, I have checked in the NameNode UI: there are 12 ORC files/blocks of 128
MB each. The data is around 10 GB in total when the ORC files are
decompressed, and each uncompressed file is around 1 GB.

On Fri, Oct 9, 2015 at 12:43 AM, Lan Jiang <lj...@gmail.com> wrote:

> Hmm, that’s odd.
>
> You can always use repartition(n) to increase the partition number, but
> then there will be shuffle. How large is your ORC file? Have you used
> NameNode UI to check how many HDFS blocks each ORC file has?
>
> Lan
>
>
> On Oct 8, 2015, at 2:08 PM, Umesh Kacha <um...@gmail.com> wrote:
>
> Hi Lan, thanks for the response yes I know and I have confirmed in UI that
> it has only 12 partitions because of 12 HDFS blocks and hive orc file strip
> size is 33554432.
>
> On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <lj...@gmail.com> wrote:
>
>> The partition number should be the same as the HDFS block number instead
>> of file number. Did you confirmed from the spark UI that only 12 partitions
>> were created? What is your ORC orc.stripe.size?
>>
>> Lan
>>
>>
>> > On Oct 8, 2015, at 1:13 PM, unk1102 <um...@gmail.com> wrote:
>> >
>> > Hi I have the following code where I read ORC files from HDFS and it
>> loads
>> > directory which contains 12 ORC files. Now since HDFS directory
>> contains 12
>> > files it will create 12 partitions by default. These directory is huge
>> and
>> > when ORC files gets decompressed it becomes around 10 GB how do I
>> increase
>> > partitions for the below code so that my Spark job runs faster and does
>> not
>> > hang for long time because of reading 10 GB files through shuffle in 12
>> > partitions. Please guide.
>> >
>> > DataFrame df =
>> > hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
>> > df.select().groupby(..)
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com
>> .
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> > For additional commands, e-mail: user-help@spark.apache.org
>> >
>>
>>
>
>

Re: How to increase Spark partitions for the DataFrame?

Posted by Lan Jiang <lj...@gmail.com>.
Hmm, that’s odd. 

You can always use repartition(n) to increase the partition number, but then there will be a shuffle. How large are your ORC files? Have you used the NameNode UI to check how many HDFS blocks each ORC file has?
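
As a side note, a quick way to check the partition count programmatically
(a sketch assuming the Spark 1.5-era Java API, not something from this
thread; df is the DataFrame loaded in the original post):

// Sketch only: inspects the partitions of the DataFrame's underlying RDD.
int numPartitions = df.rdd().partitions().length;
System.out.println("partitions = " + numPartitions);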

Lan


> On Oct 8, 2015, at 2:08 PM, Umesh Kacha <um...@gmail.com> wrote:
> 
> Hi Lan, thanks for the response yes I know and I have confirmed in UI that it has only 12 partitions because of 12 HDFS blocks and hive orc file strip size is 33554432.
> 
> On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <ljiang2@gmail.com> wrote:
> The partition number should be the same as the HDFS block number instead of file number. Did you confirmed from the spark UI that only 12 partitions were created? What is your ORC orc.stripe.size?
> 
> Lan
> 
> 
> > On Oct 8, 2015, at 1:13 PM, unk1102 <umesh.kacha@gmail.com> wrote:
> >
> > Hi I have the following code where I read ORC files from HDFS and it loads
> > directory which contains 12 ORC files. Now since HDFS directory contains 12
> > files it will create 12 partitions by default. These directory is huge and
> > when ORC files gets decompressed it becomes around 10 GB how do I increase
> > partitions for the below code so that my Spark job runs faster and does not
> > hang for long time because of reading 10 GB files through shuffle in 12
> > partitions. Please guide.
> >
> > DataFrame df =
> > hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
> > df.select().groupby(..)
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> > For additional commands, e-mail: user-help@spark.apache.org
> >
> 
> 


Re: How to increase Spark partitions for the DataFrame?

Posted by Umesh Kacha <um...@gmail.com>.
Hi Lan, thanks for the response. Yes, I know, and I have confirmed in the UI
that it has only 12 partitions because of the 12 HDFS blocks; the Hive ORC
file stripe size is 33554432.

On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <lj...@gmail.com> wrote:

> The partition number should be the same as the HDFS block number instead
> of file number. Did you confirmed from the spark UI that only 12 partitions
> were created? What is your ORC orc.stripe.size?
>
> Lan
>
>
> > On Oct 8, 2015, at 1:13 PM, unk1102 <um...@gmail.com> wrote:
> >
> > Hi I have the following code where I read ORC files from HDFS and it
> loads
> > directory which contains 12 ORC files. Now since HDFS directory contains
> 12
> > files it will create 12 partitions by default. These directory is huge
> and
> > when ORC files gets decompressed it becomes around 10 GB how do I
> increase
> > partitions for the below code so that my Spark job runs faster and does
> not
> > hang for long time because of reading 10 GB files through shuffle in 12
> > partitions. Please guide.
> >
> > DataFrame df =
> > hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
> > df.select().groupby(..)
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> > For additional commands, e-mail: user-help@spark.apache.org
> >
>
>

Re: How to increase Spark partitions for the DataFrame?

Posted by Lan Jiang <lj...@gmail.com>.
The partition number should be the same as the HDFS block count, not the file count. Did you confirm from the Spark UI that only 12 partitions were created? What is your ORC orc.stripe.size?
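
For reference, block counts per file can also be checked programmatically
with the standard Hadoop FileSystem API; this is a sketch rather than
something posted in the thread, reusing the directory path from the
original post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: prints how many HDFS blocks each file in the directory occupies.
FileSystem fs = FileSystem.get(new Configuration());
for (FileStatus status : fs.listStatus(new Path("/hdfs/path/to/orc/files/"))) {
  BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
  System.out.println(status.getPath() + ": " + blocks.length + " block(s)");
}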

Lan


> On Oct 8, 2015, at 1:13 PM, unk1102 <um...@gmail.com> wrote:
> 
> Hi I have the following code where I read ORC files from HDFS and it loads
> directory which contains 12 ORC files. Now since HDFS directory contains 12
> files it will create 12 partitions by default. These directory is huge and
> when ORC files gets decompressed it becomes around 10 GB how do I increase
> partitions for the below code so that my Spark job runs faster and does not
> hang for long time because of reading 10 GB files through shuffle in 12
> partitions. Please guide. 
> 
> DataFrame df =
> hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
> df.select().groupby(..)
> 
> 
> 
> 
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-increase-Spark-partitions-for-the-DataFrame-tp24980.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org