Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/28 08:58:01 UTC

[GitHub] [spark] habren opened a new pull request #24726: [SPARK-27865][SQL] Support 1:N sort merge bucket join without shuffle

habren opened a new pull request #24726: [SPARK-27865][SQL] Support 1:N sort merge bucket join without shuffle
URL: https://github.com/apache/spark/pull/24726
 
 
   ## Support 1:N sort merge bucket join without shuffle
   
   
   ## Test
   The following code verifies the change:
   ```scala
   import org.apache.spark.sql.SparkSession

   // A 1-byte threshold keeps Spark from picking a broadcast join for these tiny tables.
   val spark = SparkSession.builder()
     .master("local[*]")
     .appName("TestBucketJoin")
     .config("spark.sql.autoBroadcastJoinThreshold", 1)
     .getOrCreate()

   // tbl1 and tbl2 have 4 buckets each; tbl3 has 12 buckets (a multiple of 4).
   spark.sql(
     """create table tbl1(a int, b int)
       |using csv
       |clustered by (a)
       |sorted by (a)
       |into 4 buckets
       |""".stripMargin)
   spark.sql(
     """create table tbl2(a int, b int)
       |using csv
       |clustered by (a)
       |sorted by (a)
       |into 4 buckets
       |""".stripMargin)
   spark.sql(
     """create table tbl3(a int, b int)
       |using csv
       |clustered by (a)
       |sorted by (a)
       |into 12 buckets
       |""".stripMargin)

   // Load the same 12 rows (values 0 to 11) into every table.
   import spark.implicits._
   val data = spark.sparkContext.parallelize(0 until 12, 1)
   spark.createDataset(data).createOrReplaceTempView("data")

   spark.sql("insert overwrite table tbl1 select value, value from data")
   spark.sql("insert overwrite table tbl2 select value, value from data")
   spark.sql("insert overwrite table tbl3 select value, value from data")

   // Join the 4-bucket table with the 12-bucket table on the bucket column.
   spark.sql("select * from tbl1 join tbl3 on tbl1.a = tbl3.a").show()
   ```
   
   For the join in the last line, this change ensures that a sort merge bucket join is used, without a shuffle, to join the two tables, which have 4 and 12 buckets respectively.
   
   
   ![1 N bucket join DAG](https://user-images.githubusercontent.com/3096874/58465000-5b9bd800-8169-11e9-9d0c-6031b7dc20d0.png)
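   
   A minimal sketch of why bucket counts of 4 and 12 can line up without a shuffle, assuming a bucket id is a non-negative hash of the key modulo the number of buckets. The object `BucketAlignment` and its helpers `bucketId` and `matchingLargeBuckets` are hypothetical names used only for illustration, and `hashCode` stands in for Spark's own hash function. How this pull request actually pairs buckets is not shown in this message; the sketch covers only the arithmetic.
   
   ```scala
   object BucketAlignment {
     // Assumed bucketing rule: non-negative (hash mod numBuckets).
     // key.hashCode is a dependency-free stand-in for the real hash.
     def bucketId(key: Int, numBuckets: Int): Int = {
       val m = key.hashCode % numBuckets
       if (m < 0) m + numBuckets else m
     }

     // Buckets of the larger table that can hold keys from one bucket of the
     // smaller table, valid when `large` is a multiple of `small`.
     def matchingLargeBuckets(smallBucket: Int, small: Int, large: Int): Seq[Int] = {
       require(large % small == 0, "large must be a multiple of small")
       (0 until large).filter(_ % small == smallBucket)
     }

     def main(args: Array[String]): Unit = {
       val (small, large) = (4, 12)
       // For every key, the 12-bucket id reduced mod 4 equals the 4-bucket id.
       val aligned = (-100 to 100).forall { k =>
         bucketId(k, large) % small == bucketId(k, small)
       }
       println(s"bucket ids aligned for all sampled keys: $aligned")
       (0 until small).foreach { b =>
         val matches = matchingLargeBuckets(b, small, large).mkString(", ")
         println(s"tbl1 bucket $b <-> tbl3 buckets $matches")
       }
     }
   }
   ```
   
   Under that assumption, each bucket of the 4-bucket table only needs to be merged with 3 buckets of the 12-bucket table, which is the 1:N relationship named in the title.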
   
   
