Posted to user@spark.apache.org by Darshan Singh <da...@gmail.com> on 2016/06/23 20:46:35 UTC

Partitioning in spark

Hi,

My default parallelism is 100. When I join 2 dataframes with 20 partitions
each, the joined dataframe has 100 partitions. I want to know how to keep it
at 20 (other than repartition and coalesce).

Also, when I join these 2 dataframes I am using 4 columns as join columns.
The dataframes are partitioned on the first 2 of those columns, so in effect
each partition only needs to be joined with its corresponding partition and
not with the rest of the partitions, so why is Spark shuffling all the data?

Similarly, when my dataframe is partitioned by col1,col2 and I group by
col1,col2,col3,col4, why does it shuffle everything when it only needs to
sort each partition and do the grouping there?

A bit confusing; I am using 1.5.1.

Is this fixed in future versions?

Thanks
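
For concreteness, a minimal PySpark sketch of the situation described above,
assuming the sc and sqlContext provided by the pyspark shell (df1, df2 and the
column values are hypothetical stand-ins; the exact output count depends on
spark.sql.shuffle.partitions):

from pyspark.sql import Row

# Two DataFrames, each built from an RDD with 20 partitions.
rows = [Row(col1=i, col2=i % 4, col3='a', col4='b', val=i) for i in range(1000)]
df1 = sqlContext.createDataFrame(sc.parallelize(rows, 20))
df2 = sqlContext.createDataFrame(sc.parallelize(rows, 20))

# Equi-join on all 4 columns, expressed as a single join condition.
cond = ((df1['col1'] == df2['col1']) & (df1['col2'] == df2['col2']) &
        (df1['col3'] == df2['col3']) & (df1['col4'] == df2['col4']))
joined = df1.join(df2, cond)

print(df1.rdd.getNumPartitions())     # 20
print(joined.rdd.getNumPartitions())  # governed by spark.sql.shuffle.partitions, not 20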

Re: Partitioning in spark

Posted by Darshan Singh <da...@gmail.com>.
Thanks, but the whole point is not to set it explicitly; it should be
derived from its parent RDDs.

Thanks
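
As far as I can tell, in 1.5 the post-shuffle partition count for DataFrame
joins and aggregations is taken from spark.sql.shuffle.partitions rather than
derived from the parents, so apart from repartition/coalesce that setting is
the only handle. A sketch of changing it on an existing SQLContext (sqlContext,
df1 and df2 are assumed to already exist):

# Affects queries planned after this call; the value applies to the whole
# SQLContext, not to a single join.
sqlContext.setConf('spark.sql.shuffle.partitions', '20')

joined = df1.join(df2, (df1['col1'] == df2['col1']) & (df1['col2'] == df2['col2']))
print(joined.rdd.getNumPartitions())  # 20 (the new shuffle-partition setting)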

On Fri, Jun 24, 2016 at 6:09 AM, ayan guha <gu...@gmail.com> wrote:

> You can change the parallelism like the following:
>
> from pyspark import SparkConf, SparkContext
>
> conf = SparkConf()
> # Spark SQL shuffles into this many partitions for joins and aggregations
> conf.set('spark.sql.shuffle.partitions', 10)
> sc = SparkContext(conf=conf)
>
>
>
> On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh <da...@gmail.com>
> wrote:
>
>> Hi,
>>
>> My default parallelism is 100. When I join 2 dataframes with 20 partitions
>> each, the joined dataframe has 100 partitions. I want to know how to keep
>> it at 20 (other than repartition and coalesce).
>>
>> Also, when I join these 2 dataframes I am using 4 columns as join columns.
>> The dataframes are partitioned on the first 2 of those columns, so in effect
>> each partition only needs to be joined with its corresponding partition and
>> not with the rest of the partitions, so why is Spark shuffling all the data?
>>
>> Similarly, when my dataframe is partitioned by col1,col2 and I group by
>> col1,col2,col3,col4, why does it shuffle everything when it only needs to
>> sort each partition and do the grouping there?
>>
>> A bit confusing; I am using 1.5.1.
>>
>> Is this fixed in future versions?
>>
>> Thanks
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: Partitioning in spark

Posted by ayan guha <gu...@gmail.com>.
You can change the parallelism like the following:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
# Spark SQL shuffles into this many partitions for joins and aggregations
conf.set('spark.sql.shuffle.partitions', 10)
sc = SparkContext(conf=conf)
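
If the SQLContext is then created from that SparkContext, you can check that the
setting was picked up (a sketch; as far as I know, SQLContext reads spark.sql.*
properties from the SparkContext's configuration):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
# Prints 10 if the property was picked up, otherwise the supplied default.
print(sqlContext.getConf('spark.sql.shuffle.partitions', '200'))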



On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh <da...@gmail.com>
wrote:

> Hi,
>
> My default parallelism is 100. When I join 2 dataframes with 20 partitions
> each, the joined dataframe has 100 partitions. I want to know how to keep
> it at 20 (other than repartition and coalesce).
>
> Also, when I join these 2 dataframes I am using 4 columns as join columns.
> The dataframes are partitioned on the first 2 of those columns, so in effect
> each partition only needs to be joined with its corresponding partition and
> not with the rest of the partitions, so why is Spark shuffling all the data?
>
> Similarly, when my dataframe is partitioned by col1,col2 and I group by
> col1,col2,col3,col4, why does it shuffle everything when it only needs to
> sort each partition and do the grouping there?
>
> A bit confusing; I am using 1.5.1.
>
> Is this fixed in future versions?
>
> Thanks
>



-- 
Best Regards,
Ayan Guha