Posted to user@spark.apache.org by Rohit Verma <ro...@rokittech.com> on 2017/02/23 10:17:46 UTC

Spark join over sorted columns of dataset.

Hi

When joining two columns from different datasets, how can the join be optimized if both columns are already sorted within their datasets?
The goal is for Spark to skip the sorting phase when it does a sort merge join.

Regards
Rohit

Re: Spark join over sorted columns of dataset.

Posted by Li Jin <ic...@gmail.com>.

I am not an expert on this but here is what I think:

Catalyst maintains information on whether a plan node is ordered. If your
DataFrame is the result of an order by, Catalyst will skip the sort when it
does a sort merge join. If your DataFrame is created from storage, for
instance a ParquetRelation, then I am not sure whether there is an API that
allows the user to tell Catalyst that the ParquetRelation is ordered on
column x. If there isn't, it would probably be useful to add one.
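
To illustrate the idea of tracking ordering on a plan node, here is a
minimal sketch in plain Python. It is not Catalyst code: the Relation class,
its ordered_by field, and the sort counter are hypothetical names made up
for this example. It only shows how a join can skip its sort step when an
input already declares that it is ordered on the join column.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


@dataclass
class Relation:
    """Hypothetical stand-in for a plan node: rows plus a declared ordering."""
    rows: List[Row]
    ordered_by: Optional[str] = None  # column the rows are already sorted on


def ensure_sorted(rel: Relation, key: str, stats: Dict[str, int]) -> List[Row]:
    """Sort only when the relation does not declare ordering on `key`."""
    if rel.ordered_by == key:
        return rel.rows  # ordering is known, so the sort is skipped
    stats["sorts"] += 1
    return sorted(rel.rows, key=lambda r: r[key])


def sort_merge_join(left: Relation, right: Relation,
                    key: str) -> Tuple[List[Tuple[Row, Row]], int]:
    """Inner join on `key`; returns the matched pairs and the number of sorts run."""
    stats = {"sorts": 0}
    l = ensure_sorted(left, key, stats)
    r = ensure_sorted(right, key, stats)
    out: List[Tuple[Row, Row]] = []
    i = j = 0
    while i < len(l) and j < len(r):
        lk, rk = l[i][key], r[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(r) and r[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(l) and l[i][key] == lk:
                for m in range(j, j_end):
                    out.append((l[i], r[m]))
                i += 1
            j = j_end
    return out, stats["sorts"]
```

Calling sort_merge_join with two relations that both set ordered_by to the
join column performs zero sorts; leaving ordered_by unset on one side
triggers exactly one sort for that side. Note that the sketch trusts the
declared ordering and does not verify it.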

Li
On Fri, Mar 3, 2017 at 11:23 AM Koert Kuipers <ko...@tresata.com> wrote:

> For RDD the shuffle is already skipped but the sort is not. In
> spark-sorted we track partitioning and sorting within partitions for
> key-value RDDs and can avoid the sort. See:
> https://github.com/tresata/spark-sorted
>
> For Dataset/DataFrame such optimizations are done automatically, however
> it's currently not always working for Dataset, see:
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-19468
>
> On Mar 3, 2017 11:06 AM, "Rohit Verma" <ro...@rokittech.com> wrote:
>
> Sending it to dev’s.
> Can you please help me providing some ideas for below.
>
> Regards
> Rohit
> > On Feb 23, 2017, at 3:47 PM, Rohit Verma <ro...@rokittech.com> wrote:
> >
> > Hi
> >
> > While joining two columns of different dataset, how to optimize join if both the columns are pre sorted within the dataset.
> > So that when spark do sort merge join the sorting phase can skipped.
> >
> > Regards
> > Rohit
>
>
>

Re: Spark join over sorted columns of dataset.

Posted by Koert Kuipers <ko...@tresata.com>.

For RDDs the shuffle is already skipped, but the sort is not. In
spark-sorted we track partitioning and sorting within partitions for
key-value RDDs, so the sort can be avoided as well. See:
https://github.com/tresata/spark-sorted
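
As a rough illustration of why known partitioning plus known sorting within
partitions is enough to avoid both the shuffle and the sort, here is a small
plain-Python sketch. It does not use the spark-sorted API; all function
names and the key % n partitioner are assumptions made for the example.
Both sides are partitioned the same way and sorted by key inside each
partition, so partition i of one side only needs to be merged with
partition i of the other.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterator, List, Tuple

KV = Tuple[int, str]


def partition_sorted(pairs: List[KV], num_partitions: int) -> List[List[KV]]:
    """Place each pair in partition key % num_partitions, sorted by key inside."""
    parts: List[List[KV]] = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[k % num_partitions].append((k, v))
    return [sorted(p, key=itemgetter(0)) for p in parts]


def grouped(part: List[KV]) -> Iterator[Tuple[int, List[str]]]:
    """Collapse a key-sorted partition into (key, [values]) groups."""
    for k, g in groupby(part, key=itemgetter(0)):
        yield k, [v for _, v in g]


def merge_join_partition(left: List[KV],
                         right: List[KV]) -> Iterator[Tuple[int, str, str]]:
    """Merge two key-sorted partitions in a single pass, without sorting."""
    lg, rg = grouped(left), grouped(right)
    l, r = next(lg, None), next(rg, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(lg, None)
        elif l[0] > r[0]:
            r = next(rg, None)
        else:
            for lv in l[1]:
                for rv in r[1]:
                    yield (l[0], lv, rv)
            l, r = next(lg, None), next(rg, None)


def join_copartitioned(left_parts: List[List[KV]],
                       right_parts: List[List[KV]]) -> List[Tuple[int, str, str]]:
    """Join partition i with partition i; no data moves between partitions."""
    if len(left_parts) != len(right_parts):
        raise ValueError("both sides must use the same number of partitions")
    out: List[Tuple[int, str, str]] = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(merge_join_partition(lp, rp))
    return out
```

The result is ordered by key within each partition but not globally, since
the partitions are simply concatenated.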

For Dataset/DataFrame such optimizations are applied automatically, although
this currently does not always work for Dataset; see:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-19468

On Mar 3, 2017 11:06 AM, "Rohit Verma" <ro...@rokittech.com> wrote:

Sending it to dev’s.
Can you please help me providing some ideas for below.

Regards
Rohit
> On Feb 23, 2017, at 3:47 PM, Rohit Verma <ro...@rokittech.com> wrote:
>
> Hi
>
> While joining two columns of different dataset, how to optimize join if both the columns are pre sorted within the dataset.
> So that when spark do sort merge join the sorting phase can skipped.
>
> Regards
> Rohit

Re: Spark join over sorted columns of dataset.

Posted by Rohit Verma <ro...@rokittech.com>.

Sending this to the devs.
Could you please help me with some ideas for the question below?

Regards
Rohit
> On Feb 23, 2017, at 3:47 PM, Rohit Verma <ro...@rokittech.com> wrote:
> 
> Hi
> 
> While joining two columns of different dataset, how to optimize join if both the columns are pre sorted within the dataset.
> So that when spark do sort merge join the sorting phase can skipped.
> 
> Regards
> Rohit