Posted to user@spark.apache.org by Aakash Basu <aa...@gmail.com> on 2016/11/17 09:47:17 UTC

Join Query

Hi,




Conceptually I understand the Spark joins below, but I can't find much
information on Google about how to implement them. Please help me with
code or pseudocode for these joins using Spark with Java or Scala.



*Replication Join:*

Given two datasets, one of which is small enough to fit into memory,
perform a replicated join using Spark.

Note: I need a program showing that this case fits a replicated join.



*Semi-Join:*

Given a huge dataset, do a semi-join using Spark. Note that, with a
semi-join, one dataset must be filtered and projected so that it fits
into the cache.

Note: I need a program showing that this case fits a semi-join.





*Composite Join:*

Given a dataset that is still too big after filtering and cannot fit
into memory, perform a composite join on pre-sorted and pre-partitioned
data using Spark.

Note: I need a program showing that this case fits a composite join.





*Repartition join:*

Join two datasets by performing a repartition join in Spark.

Note: I need a program showing that this case fits a repartition join.






Thanks,

Aakash.

RE: Join Query

Posted by Shreya Agarwal <sh...@microsoft.com>.
Replication join = broadcast join. Search for that term on Google; there are many examples.
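
For illustration, a minimal Scala sketch of a broadcast join with the DataFrame API might look like the following. It assumes Spark 2.x running with a local master. The object name BroadcastJoinSketch, the table contents, and the column names are made up for this example; only SparkSession and functions.broadcast come from Spark itself. Note that broadcast() is a hint to the planner, not a guarantee.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

// Hypothetical example object; the data and column names are illustrative only.
object BroadcastJoinSketch {

  // Joins a larger "orders" table with a small "countries" lookup table,
  // hinting that the small side should be replicated to every executor.
  def joinWithBroadcast(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Stand-in for the large dataset.
    val orders = Seq(
      (1, "IN", 250.0),
      (2, "US", 100.0),
      (3, "IN", 75.5)
    ).toDF("order_id", "country_code", "amount")

    // Stand-in for the dataset that is small enough to fit in memory.
    val countries = Seq(
      ("IN", "India"),
      ("US", "United States")
    ).toDF("country_code", "country_name")

    // broadcast() marks the small side as the one to replicate.
    orders.join(broadcast(countries), Seq("country_code"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-join-sketch")
      .getOrCreate()

    val joined = joinWithBroadcast(spark)
    joined.explain() // inspect the physical plan for a broadcast hash join
    joined.show()
    spark.stop()
  }
}
```

Calling explain() on the result is one way to check whether the planner actually chose a broadcast hash join rather than a shuffle-based join.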

Semi join can be done on DataFrames/Datasets by passing "leftsemi" (also accepted as "left_semi") as the join type, the third parameter of the join function.
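
As a rough sketch of how that could look in Scala, assuming Spark 2.x: the second dataset is first filtered and projected down to its join key, and the large dataset is then semi-joined against it. The object name SemiJoinSketch, the sample rows, the column names, and the filter condition are all hypothetical and only there to make the example runnable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example object; the data and column names are illustrative only.
object SemiJoinSketch {

  // Returns only the orders whose customer_id appears in the filtered
  // customer set. No customer columns are added to the output.
  def ordersWithMatchingCustomers(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Stand-in for the huge dataset.
    val orders = Seq(
      (1, 101, 250.0),
      (2, 102, 100.0),
      (3, 999, 75.5)
    ).toDF("order_id", "customer_id", "amount")

    val customers = Seq(
      (101, "Asha", "IN"),
      (102, "Ben", "US"),
      (103, "Chen", "CN")
    ).toDF("customer_id", "name", "country")

    // Filter and project the second dataset so only the join key remains.
    val customerKeys = customers
      .filter($"country" =!= "CN")
      .select("customer_id")

    // "leftsemi" keeps the left rows that have at least one match on the right.
    orders.join(
      customerKeys,
      orders("customer_id") === customerKeys("customer_id"),
      "leftsemi"
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("semi-join-sketch")
      .getOrCreate()

    ordersWithMatchingCustomers(spark).show()
    spark.stop()
  }
}
```

The result contains only the columns of the left-hand dataset, which is what distinguishes a semi join from an inner join.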

Not sure about the other two.

