Posted to user@spark.apache.org by Ken Geis <ge...@gmail.com> on 2016/02/26 06:22:25 UTC

merge join already sorted data?

I am loading data from two different databases and joining it in Spark. The
data is indexed in the database, so it is efficient to retrieve the data
ordered by a key. Can I tell Spark that my data is coming in ordered on
that key so that when I join the data sets, they will be joined with little
shuffling via a merge join?

I know that Flink supports this, but its JDBC support is pretty lacking in
general.
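
For illustration, the kind of join I have in mind looks like this when sketched in plain Python (a minimal sketch over (key, value) tuples, not Spark API): two inputs already sorted by key can be joined in one forward pass, with no re-sorting or shuffling.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs, both pre-sorted by key.

    A single forward pass over each side -- no sort, no shuffle.
    This is what already-ordered inputs make possible.
    """
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk = right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # emit every right-side row with this key (handles duplicates)
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                out.append((lk, lv, right[j2][1]))
                j2 += 1
            i += 1
    return out

# Example: both sides arrive ordered by key
a = [(1, "a"), (2, "b"), (4, "d")]
b = [(2, "x"), (3, "y"), (4, "z")]
print(merge_join(a, b))  # [(2, 'b', 'x'), (4, 'd', 'z')]
```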


Thanks,

Ken

Re: merge join already sorted data?

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

Spark SQL can internally track ordering assumptions on columns (OrderedDistribution),
but the JDBC data source does not support this; Spark has no way of knowing
how the rows loaded from a database are ordered.
Also, there is currently no API to declare that order to Spark.
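
The consequence is that when two JDBC-sourced datasets are joined, Spark must sort both sides itself before the merge pass. In plain-Python terms (a sketch of the idea, not Spark's actual implementation):

```python
def sort_merge_join(left, right):
    """What a sort-merge join must do when input order is unknown:
    pay for a full sort of each side before the cheap merge pass."""
    left = sorted(left)    # cannot be skipped, because the source
    right = sorted(right)  # carries no ordering guarantee
    out, j0 = [], 0
    for lk, lv in left:
        # advance the right-side cursor past smaller keys
        while j0 < len(right) and right[j0][0] < lk:
            j0 += 1
        j = j0
        while j < len(right) and right[j][0] == lk:
            out.append((lk, lv, right[j][1]))
            j += 1
    return out

rows_a = [(4, "d"), (1, "a"), (2, "b")]   # arrives in no particular order
rows_b = [(3, "y"), (2, "x"), (4, "z")]
print(sort_merge_join(rows_a, rows_b))  # [(2, 'b', 'x'), (4, 'd', 'z')]
```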

thanks,



On Fri, Feb 26, 2016 at 2:22 PM, Ken Geis <ge...@gmail.com> wrote:

> I am loading data from two different databases and joining it in Spark.
> The data is indexed in the database, so it is efficient to retrieve the
> data ordered by a key. Can I tell Spark that my data is coming in ordered
> on that key so that when I join the data sets, they will be joined with
> little shuffling via a merge join?
>
> I know that Flink supports this, but its JDBC support is pretty lacking in
> general.
>
>
> Thanks,
>
> Ken
>
>


-- 
---
Takeshi Yamamuro