Posted to dev@spark.apache.org by tanejagagan <ta...@yahoo.com> on 2014/12/24 09:48:30 UTC

Re: Support for Hive buckets

(for example, you might be 
able to avoid a shuffle when doing joins on tables that are already 
bucketed by exposing more metastore information to the planner). 

Can you provide more input on how to implement this functionality, so that I
can speed up a join between two Hive tables, each with a few billion rows?





Re: Support for Hive buckets

Posted by welder404 <wa...@gmail.com>.
https://issues.apache.org/jira/browse/SPARK-19256 is the active umbrella JIRA
for this feature.

But as of 2.2, you can already use the DataFrameWriter API to bucket
DataFrames as they are written out to Hive metastore tables.

If you invoke

import org.apache.spark.sql.functions.col

val bucketCount = 100

// bucketBy, sortBy, and saveAsTable live on DataFrameWriter, so go through .write
df1
  .repartition(bucketCount, col("a"), col("b"))
  .write
  .bucketBy(bucketCount, "a", "b")
  .sortBy("a", "b")
  .saveAsTable("default.table_1")

df2
  .repartition(bucketCount, col("a"), col("b"))
  .write
  .bucketBy(bucketCount, "a", "b")
  .sortBy("a", "b")
  .saveAsTable("default.table_2")

Then, if you join table_1 with table_2 on "a" and "b", you'll find that the
query plan involves no sort or exchange, only a SortMergeJoin.
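
Here's a minimal sketch of how to check that (assuming a SparkSession in
scope as spark, as in spark-shell):

// Read the bucketed tables back and join on the bucketing columns.
val t1 = spark.table("default.table_1")
val t2 = spark.table("default.table_2")

val joined = t1.join(t2, Seq("a", "b"))

// Print the physical plan; with matching bucket counts and sort columns
// on both sides you should see a SortMergeJoin with no Exchange operators.
joined.explain()

The repartition before each write also matters: it aligns the data with the
bucket spec, so each bucket is written as one file instead of one file per
bucket per task.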



