Posted to dev@spark.apache.org by tanejagagan <ta...@yahoo.com> on 2014/12/24 09:48:30 UTC
Re: Support for Hive buckets
(for example, you might be
able to avoid a shuffle when doing joins on tables that are already
bucketed by exposing more metastore information to the planner).
Can you provide more input on how to implement this functionality, so that I
can speed up a join between two Hive tables, each with a few billion rows?
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Support-for-Hive-buckets-tp8421p9905.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org
Re: Support for Hive buckets
Posted by welder404 <wa...@gmail.com>.
https://issues.apache.org/jira/browse/SPARK-19256 tracks this as an active
umbrella feature.
But as of Spark 2.2, you can already bucket DataFrames when writing them out
as Hive-compatible tables, using the DataFrameWriter API.
If you invoke
val bucketCount = 100
df1
  .repartition(bucketCount, col("a"), col("b"))
  .write
  .bucketBy(bucketCount, "a", "b")
  .sortBy("a", "b")
  .saveAsTable("default.table_1")
df2
  .repartition(bucketCount, col("a"), col("b"))
  .write
  .bucketBy(bucketCount, "a", "b")
  .sortBy("a", "b")
  .saveAsTable("default.table_2")
Then, if you join table_1 with table_2 on "a" and "b", you'll find that your
query plan involves no Exchange (shuffle) or extra Sort, only a SortMergeJoin.
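For concreteness, here is a sketch of that follow-up join, assuming a
SparkSession named spark and the two bucketed tables saved as above (the
variable names are illustrative):

```scala
// Read the tables back; bucketing and sort metadata come from the metastore.
val t1 = spark.table("default.table_1")
val t2 = spark.table("default.table_2")

// Join on the full set of bucketing columns.
val joined = t1.join(t2, Seq("a", "b"))

// Inspect the physical plan: with matching bucket counts and sort columns on
// both sides, it should show a SortMergeJoin with no Exchange or extra Sort.
joined.explain()
```

Note that if the bucket counts differ between the two tables, or you join on
only a subset of the bucketing columns, Spark falls back to shuffling one or
both sides.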
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/