Posted to user@spark.apache.org by Aakash Basu <aa...@gmail.com> on 2016/11/17 09:47:17 UTC
Join Query
Hi,
Conceptually I understand the Spark joins below, but when it comes to
implementation I can't find much information on Google. Please help me with
code or pseudocode for these joins using java-spark or scala-spark.
*Replication Join:*
Given two datasets, where one is small enough to fit into
memory, perform a replication join using Spark.
Note: I need a program that shows why this case suits a replication join.
*Semi-Join:*
Given a huge dataset, do a semi-join using Spark. Note
that, with a semi-join, one dataset must be filtered and projected so that
it fits into the cache.
Note: I need a program that shows why this case suits a semi-join.
*Composite Join:*
Given a dataset that is still too big after filtering and
cannot fit into memory, perform a composite join on pre-sorted and
pre-partitioned data using Spark.
Note: I need a program that shows why this case suits a composite join.
*Repartition join:*
Join two datasets by performing a repartition join in Spark.
Note: I need a program that shows why this case suits a repartition join.
Thanks,
Aakash.
RE: Join Query
Posted by Shreya Agarwal <sh...@microsoft.com>.
Replication join = broadcast join. Search for that term on Google; there are many examples.
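As a minimal sketch of a broadcast join in scala-spark (Spark 2.x DataFrame API assumed): the `broadcast` function from `org.apache.spark.sql.functions` hints that the small side should be copied to every executor, so the large side does not need to be shuffled. The object name, the `replicatedJoin` helper, and the sample tables and columns are all made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Inner-joins `large` to `small` on `key`, hinting Spark to broadcast `small`.
  // Assumption: `small` really does fit in the memory of each executor.
  def replicatedJoin(large: DataFrame, small: DataFrame, key: String): DataFrame =
    large.join(broadcast(small), Seq(key))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: `orders` stands in for the big dataset,
    // `countries` for the small lookup table.
    val orders = Seq((1, "US", 10.0), (2, "IN", 25.0), (3, "US", 7.5))
      .toDF("order_id", "country_code", "amount")
    val countries = Seq(("US", "United States"), ("IN", "India"))
      .toDF("country_code", "country_name")

    val joined = replicatedJoin(orders, countries, "country_code")

    // Inspect the physical plan to check which join strategy Spark picked.
    joined.explain()
    joined.show()

    spark.stop()
  }
}
```

Spark can also choose a broadcast join on its own when it estimates one side to be smaller than `spark.sql.autoBroadcastJoinThreshold`; the explicit hint is mainly useful when that estimate is unavailable or too conservative.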
Semi join can be done on DataFrames/Datasets by passing "leftsemi" as the third parameter (the join type) to the join function.
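A rough sketch of the semi join along those lines, again assuming the Spark 2.x DataFrame API. The right-hand side is first filtered and projected down to its distinct keys, as the question describes, and then used with the "leftsemi" join type, which returns only the left side's columns for rows that have a match. The `semiJoin` helper, the sample data, and the `amount > 10.0` filter are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SemiJoinSketch {
  // Keeps the rows of `left` whose `key` value appears in `right`.
  // Only the columns of `left` are returned.
  def semiJoin(left: DataFrame, right: DataFrame, key: String): DataFrame = {
    // Projection: only the distinct join keys of the right side are needed.
    val keys = right.select(key).distinct()
    left.join(keys, left(key) === keys(key), "leftsemi")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("semi-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: `customers` stands in for the huge dataset.
    val customers = Seq((1, "Ann"), (2, "Bob"), (3, "Cid"))
      .toDF("customer_id", "name")
    val orders = Seq((1, 50.0), (1, 5.0), (3, 20.0), (2, 3.0))
      .toDF("customer_id", "amount")

    // Filter first, so that the key set is as small as possible.
    val bigOrders = orders.filter(orders("amount") > 10.0)

    // Customers that have at least one order above the threshold.
    val result = semiJoin(customers, bigOrders, "customer_id")
    result.show()

    spark.stop()
  }
}
```

If the filtered key set is small enough, it could additionally be wrapped in `broadcast(...)` from `org.apache.spark.sql.functions`, which combines this with the broadcast approach.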
Not sure about the other two.