You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Alex Nastetsky <al...@vervemobile.com> on 2015/11/04 15:37:53 UTC

Sort Merge Join from the filesystem

(this is kind of a cross-post from the user list)

Does Spark support doing a sort merge join on two datasets on the file
system that have already been partitioned the same with the same number of
partitions and sorted within each partition, without needing to
repartition/sort them again?

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)

If this is not supported in Spark, is a ticket already open for it? Does
the Spark architecture present unique difficulties to having this feature?

It is very useful to have this ability, as you can prepare dataset A to be
joined with dataset B before B even exists, by pre-processing A with a
partition/sort.

Thanks.

Re: Sort Merge Join from the filesystem

Posted by Alex Nastetsky <al...@vervemobile.com>.

Done, thanks.

On Mon, Nov 9, 2015 at 7:23 PM, Cheng, Hao <ha...@intel.com> wrote:

> Yes, we definitely need to think how to handle this case, probably even
> more common than both sorted/partitioned tables case, can you jump to the
> jira and leave comment there?
>
>
>
> *From:* Alex Nastetsky [mailto:alex.nastetsky@vervemobile.com]
> *Sent:* Tuesday, November 10, 2015 3:03 AM
> *To:* Cheng, Hao
> *Cc:* Reynold Xin; dev@spark.apache.org
> *Subject:* Re: Sort Merge Join from the filesystem
>
>
>
> Thanks for creating that ticket.
>
>
>
> Another thing I was thinking of, is doing this type of join between
> dataset A which is already partitioned/sorted on disk and dataset B, which
> gets generated during the run of the application.
>
>
>
> Dataset B would need something like repartitionAndSortWithinPartitions to
> be performed on it, using the same partitioner that was used with dataset
> A. Then dataset B could be joined with dataset A without needing to write
> it to disk first (unless it's too big to fit in memory, then it would need
> to be [partially] spilled).
>
>
>
> On Wed, Nov 4, 2015 at 7:51 PM, Cheng, Hao <ha...@intel.com> wrote:
>
> Yes, we probably need more change for the data source API if we need to
> implement it in a generic way.
>
> BTW, I create the JIRA by copy most of words from Alex. J
>
>
>
> https://issues.apache.org/jira/browse/SPARK-11512
>
>
>
>
>
> *From:* Reynold Xin [mailto:rxin@databricks.com]
> *Sent:* Thursday, November 5, 2015 1:36 AM
> *To:* Alex Nastetsky
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Sort Merge Join from the filesystem
>
>
>
> It's not supported yet, and not sure if there is a ticket for it. I don't
> think there is anything fundamentally hard here either.
>
>
>
>
>
> On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <
> alex.nastetsky@vervemobile.com> wrote:
>
> (this is kind of a cross-post from the user list)
>
>
>
> Does Spark support doing a sort merge join on two datasets on the file
> system that have already been partitioned the same with the same number of
> partitions and sorted within each partition, without needing to
> repartition/sort them again?
>
>
>
> This functionality exists in
>
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
>
> - Pig (USING 'merge')
>
> - MapReduce (CompositeInputFormat)
>
>
>
> If this is not supported in Spark, is a ticket already open for it? Does
> the Spark architecture present unique difficulties to having this feature?
>
>
>
> It is very useful to have this ability, as you can prepare dataset A to be
> joined with dataset B before B even exists, by pre-processing A with a
> partition/sort.
>
>
>
> Thanks.
>
>
>
>
>

RE: Sort Merge Join from the filesystem

Posted by "Cheng, Hao" <ha...@intel.com>.

Yes, we definitely need to think how to handle this case, probably even more common than both sorted/partitioned tables case, can you jump to the jira and leave comment there?

From: Alex Nastetsky [mailto:alex.nastetsky@vervemobile.com]
Sent: Tuesday, November 10, 2015 3:03 AM
To: Cheng, Hao
Cc: Reynold Xin; dev@spark.apache.org
Subject: Re: Sort Merge Join from the filesystem

Thanks for creating that ticket.

Another thing I was thinking of, is doing this type of join between dataset A which is already partitioned/sorted on disk and dataset B, which gets generated during the run of the application.

Dataset B would need something like repartitionAndSortWithinPartitions to be performed on it, using the same partitioner that was used with dataset A. Then dataset B could be joined with dataset A without needing to write it to disk first (unless it's too big to fit in memory, then it would need to be [partially] spilled).

On Wed, Nov 4, 2015 at 7:51 PM, Cheng, Hao <ha...@intel.com>> wrote:
Yes, we probably need more change for the data source API if we need to implement it in a generic way.
BTW, I create the JIRA by copy most of words from Alex. ☺

https://issues.apache.org/jira/browse/SPARK-11512

From: Reynold Xin [mailto:rxin@databricks.com<ma...@databricks.com>]
Sent: Thursday, November 5, 2015 1:36 AM
To: Alex Nastetsky
Cc: dev@spark.apache.org<ma...@spark.apache.org>
Subject: Re: Sort Merge Join from the filesystem

It's not supported yet, and not sure if there is a ticket for it. I don't think there is anything fundamentally hard here either.

On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <al...@vervemobile.com>> wrote:
(this is kind of a cross-post from the user list)

Does Spark support doing a sort merge join on two datasets on the file system that have already been partitioned the same with the same number of partitions and sorted within each partition, without needing to repartition/sort them again?

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)

If this is not supported in Spark, is a ticket already open for it? Does the Spark architecture present unique difficulties to having this feature?

It is very useful to have this ability, as you can prepare dataset A to be joined with dataset B before B even exists, by pre-processing A with a partition/sort.

Thanks.

Re: Sort Merge Join from the filesystem

Posted by Alex Nastetsky <al...@vervemobile.com>.

Thanks for creating that ticket.

Another thing I was thinking of, is doing this type of join between dataset
A which is already partitioned/sorted on disk and dataset B, which gets
generated during the run of the application.

Dataset B would need something like repartitionAndSortWithinPartitions to
be performed on it, using the same partitioner that was used with dataset
A. Then dataset B could be joined with dataset A without needing to write
it to disk first (unless it's too big to fit in memory, then it would need
to be [partially] spilled).

On Wed, Nov 4, 2015 at 7:51 PM, Cheng, Hao <ha...@intel.com> wrote:

> Yes, we probably need more change for the data source API if we need to
> implement it in a generic way.
>
> BTW, I create the JIRA by copy most of words from Alex. J
>
>
>
> https://issues.apache.org/jira/browse/SPARK-11512
>
>
>
>
>
> *From:* Reynold Xin [mailto:rxin@databricks.com]
> *Sent:* Thursday, November 5, 2015 1:36 AM
> *To:* Alex Nastetsky
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Sort Merge Join from the filesystem
>
>
>
> It's not supported yet, and not sure if there is a ticket for it. I don't
> think there is anything fundamentally hard here either.
>
>
>
>
>
> On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <
> alex.nastetsky@vervemobile.com> wrote:
>
> (this is kind of a cross-post from the user list)
>
>
>
> Does Spark support doing a sort merge join on two datasets on the file
> system that have already been partitioned the same with the same number of
> partitions and sorted within each partition, without needing to
> repartition/sort them again?
>
>
>
> This functionality exists in
>
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
>
> - Pig (USING 'merge')
>
> - MapReduce (CompositeInputFormat)
>
>
>
> If this is not supported in Spark, is a ticket already open for it? Does
> the Spark architecture present unique difficulties to having this feature?
>
>
>
> It is very useful to have this ability, as you can prepare dataset A to be
> joined with dataset B before B even exists, by pre-processing A with a
> partition/sort.
>
>
>
> Thanks.
>
>
>

RE: Sort Merge Join from the filesystem

Posted by "Cheng, Hao" <ha...@intel.com>.

Yes, we probably need more change for the data source API if we need to implement it in a generic way.
BTW, I create the JIRA by copy most of words from Alex. ☺

https://issues.apache.org/jira/browse/SPARK-11512

From: Reynold Xin [mailto:rxin@databricks.com]
Sent: Thursday, November 5, 2015 1:36 AM
To: Alex Nastetsky
Cc: dev@spark.apache.org
Subject: Re: Sort Merge Join from the filesystem

It's not supported yet, and not sure if there is a ticket for it. I don't think there is anything fundamentally hard here either.

On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <al...@vervemobile.com>> wrote:
(this is kind of a cross-post from the user list)

Does Spark support doing a sort merge join on two datasets on the file system that have already been partitioned the same with the same number of partitions and sorted within each partition, without needing to repartition/sort them again?

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)

If this is not supported in Spark, is a ticket already open for it? Does the Spark architecture present unique difficulties to having this feature?

It is very useful to have this ability, as you can prepare dataset A to be joined with dataset B before B even exists, by pre-processing A with a partition/sort.

Thanks.

Re: Sort Merge Join from the filesystem

Posted by Reynold Xin <rx...@databricks.com>.

It's not supported yet, and not sure if there is a ticket for it. I don't
think there is anything fundamentally hard here either.


On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <
alex.nastetsky@vervemobile.com> wrote:

> (this is kind of a cross-post from the user list)
>
> Does Spark support doing a sort merge join on two datasets on the file
> system that have already been partitioned the same with the same number of
> partitions and sorted within each partition, without needing to
> repartition/sort them again?
>
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)
>
> If this is not supported in Spark, is a ticket already open for it? Does
> the Spark architecture present unique difficulties to having this feature?
>
> It is very useful to have this ability, as you can prepare dataset A to be
> joined with dataset B before B even exists, by pre-processing A with a
> partition/sort.
>
> Thanks.
>