You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "szehon-ho (via GitHub)" <gi...@apache.org> on 2023/05/31 03:47:11 UTC

[GitHub] [spark] szehon-ho opened a new pull request, #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

szehon-ho opened a new pull request, #41398:
URL: https://github.com/apache/spark/pull/41398

   ### What changes were proposed in this pull request?
   Add support for shuffle-hash join for following scenarios:
   
   * left outer join with left-side build
   * right outer join with right-side build
   The algorithm is similar to SPARK-32399, which supports shuffle-hash join for full outer join.
   
   The same methods fullOuterJoinWithUniqueKey and fullOuterJoinWithNonUniqueKey are improved to support the new case. These methods are called after the HashedRelation is already constructed of the build side, and do these two iterations:
   
   1.  Iterate Stream side.
     a. If find match on build side, mark.
     b. If no match on build side, join with null build-side row and add to result
   2. Iterate build side.
     a. If find marked for match, add joined row to result
     b. If no match marked, join with null stream-side row
   
   The left outer join with left-side build, and right outer join with right-side build, need only a subset of these logics, namely replacing 1b above with a no-op.
   
   Codegen is left for a follow-up PR.
   
   ### Why are the changes needed?
   For joins of these types, shuffle-hash join can be more performant than sort-merge join, especially if the big table is large, as it skips an expensive sort of the big table.
   
   ### Does this PR introduce any user-facing change?
   No
   
   ### How was this patch tested?
   Unit test in JoinSuite.scala


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213344488


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -98,16 +105,21 @@ case class ShuffledHashJoinExec(
     streamedPlan.execute().zipPartitions(buildPlan.execute()) { (streamIter, buildIter) =>
       val hashed = buildHashedRelation(buildIter)
       joinType match {
-        case FullOuter => fullOuterJoin(streamIter, hashed, numOutputRows)
+        case FullOuter => buildSideOuterJoin(streamIter, hashed, numOutputRows, full = true)
+        case LeftOuter if buildSide.equals(BuildLeft) =>
+          buildSideOuterJoin(streamIter, hashed, numOutputRows, full = false)
+        case RightOuter if buildSide.equals(BuildRight) =>
+          buildSideOuterJoin(streamIter, hashed, numOutputRows, full = false)
         case _ => join(streamIter, hashed, numOutputRows)
       }
     }
   }
 
-  private def fullOuterJoin(
+  private def buildSideOuterJoin(
       streamIter: Iterator[InternalRow],
       hashedRelation: HashedRelation,
-      numOutputRows: SQLMetric): Iterator[InternalRow] = {
+      numOutputRows: SQLMetric,
+      full: Boolean): Iterator[InternalRow] = {

Review Comment:
   Shall we also change this to `isFullOuterJoin`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1212568294


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -145,31 +154,38 @@ case class ShuffledHashJoinExec(
   }
 
   /**
-   * Full outer shuffled hash join with unique join keys:
+   * Shuffled hash join with unique join keys, where an outer side is the build side.
    * 1. Process rows from stream side by looking up hash relation.
    *    Mark the matched rows from build side be looked up.
    *    A bit set is used to track matched rows with key index.
    * 2. Process rows from build side by iterating hash relation.
    *    Filter out rows from build side being matched already,
    *    by checking key index from bit set.
    */
-  private def fullOuterJoinWithUniqueKey(
+  private def buildSideOuterJoinUniqueKey(
       streamIter: Iterator[InternalRow],
       hashedRelation: HashedRelation,
       joinKeys: UnsafeProjection,
       joinRowWithStream: InternalRow => JoinedRow,
       joinRowWithBuild: InternalRow => JoinedRow,
       streamNullJoinRowWithBuild: => InternalRow => JoinedRow,
-      buildNullRow: GenericInternalRow): Iterator[InternalRow] = {
+      buildNullRow: GenericInternalRow,
+      full: Boolean): Iterator[InternalRow] = {

Review Comment:
   ```suggestion
         isFullOuterJoin: Boolean): Iterator[InternalRow] = {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213318755


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -83,8 +85,10 @@ case class ShuffledHashJoinExec(
       iter,
       buildBoundKeys,
       taskMemoryManager = context.taskMemoryManager(),
-      // Full outer join needs support for NULL key in HashedRelation.
-      allowsNullKey = joinType == FullOuter,
+      // build-side outer join needs support for NULL key in HashedRelation.

Review Comment:
   Good catch, removed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #41398:
URL: https://github.com/apache/spark/pull/41398#issuecomment-1569555999

   Also, cc @viirya , @huaxingao , @sunchao , too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao closed pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao closed pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join
URL: https://github.com/apache/spark/pull/41398


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1215049402


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala:
##########
@@ -489,10 +489,16 @@ class JoinHintSuite extends PlanTest with SharedSparkSession with AdaptiveSparkP
         assertShuffleHashJoin(
           sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil)), BuildLeft)
         assertShuffleHashJoin(
-          sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil, "left")), BuildRight)
+          sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil, "left")), BuildLeft)

Review Comment:
   Yes, I meant that the original test coverage (`BuildRight`) is removed.



##########
sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala:
##########
@@ -489,10 +489,16 @@ class JoinHintSuite extends PlanTest with SharedSparkSession with AdaptiveSparkP
         assertShuffleHashJoin(
           sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil)), BuildLeft)
         assertShuffleHashJoin(
-          sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil, "left")), BuildRight)
+          sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil, "left")), BuildLeft)

Review Comment:
   Yes, I meant that the original test coverage (`BuildRight`) is removed and lost.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #41398:
URL: https://github.com/apache/spark/pull/41398#issuecomment-1574901969

   Thanks everyone for the warm welcome to Spark, and really fast reviews!  
   
   As I'm out of town, I will look at any follow up improvements when I'm back.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1214604146


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1249,6 +1249,83 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
     }
   }
 
+  test("SPARK-36612: Support left outer join build left or right outer join build right in " +
+    "shuffled hash join") {
+    val inputDFs = Seq(
+      // Test unique join key
+      (spark.range(10).selectExpr("id as k1"),
+        spark.range(30).selectExpr("id as k2"),
+        $"k1" === $"k2"),
+      // Test non-unique join key
+      (spark.range(10).selectExpr("id % 5 as k1"),
+        spark.range(30).selectExpr("id % 5 as k2"),
+        $"k1" === $"k2"),
+      // Test empty build side
+      (spark.range(10).selectExpr("id as k1").filter("k1 < -1"),
+        spark.range(30).selectExpr("id as k2"),
+        $"k1" === $"k2"),
+      // Test empty stream side
+      (spark.range(10).selectExpr("id as k1"),
+        spark.range(30).selectExpr("id as k2").filter("k2 < -1"),
+        $"k1" === $"k2"),
+      // Test empty build and stream side
+      (spark.range(10).selectExpr("id as k1").filter("k1 < -1"),
+        spark.range(30).selectExpr("id as k2").filter("k2 < -1"),
+        $"k1" === $"k2"),
+      // Test string join key
+      (spark.range(10).selectExpr("cast(id * 3 as string) as k1"),
+        spark.range(30).selectExpr("cast(id as string) as k2"),
+        $"k1" === $"k2"),
+      // Test build side at right
+      (spark.range(30).selectExpr("cast(id / 3 as string) as k1"),
+        spark.range(10).selectExpr("cast(id as string) as k2"),
+        $"k1" === $"k2"),
+      // Test NULL join key
+      (spark.range(10).map(i => if (i % 2 == 0) i else null).selectExpr("value as k1"),
+        spark.range(30).map(i => if (i % 4 == 0) i else null).selectExpr("value as k2"),
+        $"k1" === $"k2"),
+      (spark.range(10).map(i => if (i % 3 == 0) i else null).selectExpr("value as k1"),
+        spark.range(30).map(i => if (i % 5 == 0) i else null).selectExpr("value as k2"),
+        $"k1" === $"k2"),
+      // Test multiple join keys
+      (spark.range(10).map(i => if (i % 2 == 0) i else null).selectExpr(
+        "value as k1", "cast(value % 5 as short) as k2", "cast(value * 3 as long) as k3"),
+        spark.range(30).map(i => if (i % 3 == 0) i else null).selectExpr(
+          "value as k4", "cast(value % 5 as short) as k5", "cast(value * 3 as long) as k6"),
+        $"k1" === $"k4" && $"k2" === $"k5" && $"k3" === $"k6")
+    )
+
+    // test left outer with left side build
+    inputDFs.foreach { case (df1, df2, joinExprs) =>
+      val smjDF = df1.hint("SHUFFLE_MERGE").join(df2, joinExprs, "leftouter")
+      assert(collect(smjDF.queryExecution.executedPlan) {
+        case _: SortMergeJoinExec => true }.size === 1)
+      val smjResult = smjDF.collect()
+
+      val shjDF = df1.hint("SHUFFLE_HASH").join(df2, joinExprs, "leftouter")
+      assert(collect(shjDF.queryExecution.executedPlan) {
+        case _: ShuffledHashJoinExec => true
+      }.size === 1)
+      // Same result between shuffled hash join and sort merge join
+      checkAnswer(shjDF, smjResult)
+    }
+
+    // test right outer with right side build
+    inputDFs.foreach { case (df2, df1, joinExprs) =>
+      val smjDF = df2.join(df1.hint("SHUFFLE_MERGE"), joinExprs, "rightouter")
+      assert(collect(smjDF.queryExecution.executedPlan) {
+        case _: SortMergeJoinExec => true }.size === 1)
+      val smjResult = smjDF.collect()
+
+      val shjDF = df2.join(df1.hint("SHUFFLE_HASH"), joinExprs, "rightouter")
+      assert(collect(shjDF.queryExecution.executedPlan) {
+        case _: ShuffledHashJoinExec => true
+      }.size === 1)

Review Comment:
   Do we need to verify build side of `ShuffledHashJoinExec` is `BuildLeft` here? Or `hint` is always working?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1212452822


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala:
##########
@@ -622,28 +632,23 @@ class JoinHintSuite extends PlanTest with SharedSparkSession with AdaptiveSparkP
         withLogAppender(hintAppender, level = Some(Level.WARN)) {
           assertBroadcastHashJoin(
             df1.join(df2.hint("BROADCAST"), $"a1" === $"b1", joinType), BuildRight)
-          assertShuffleHashJoin(
-            df1.join(df2.hint("SHUFFLE_HASH"), $"a1" === $"b1", joinType), BuildRight)
         }
 
         val logs = hintAppender.loggingEvents.map(_.getMessage.getFormattedMessage)
           .filter(_.contains("is not supported in the query:"))
         assert(logs.isEmpty)
       }
 
-      Seq("right_outer").foreach { joinType =>
+      Seq("left_semi", "left_anti").foreach { joinType =>
         val hintAppender = new LogAppender(s"join hint build side check for $joinType")
         withLogAppender(hintAppender, level = Some(Level.WARN)) {
-          assertShuffleMergeJoin(
-            df1.join(df2.hint("BROADCAST"), $"a1" === $"b1", joinType))
-          assertShuffleMergeJoin(
-            df1.join(df2.hint("SHUFFLE_HASH"), $"a1" === $"b1", joinType))

Review Comment:
   Shall we still keep the `hint("BROADCAST")` test and change the `hint("SHUFFLE_HASH")` to `assertShuffleHashJoin`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213345816


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -219,14 +238,15 @@ case class ShuffledHashJoinExec(
    * the value indices of its tuples will be 0, 1 and 2.
    * Note that value indices of tuples with different keys are incomparable.
    */
-  private def fullOuterJoinWithNonUniqueKey(
+  private def buildSideOuterJoinNonUniqueKey(
       streamIter: Iterator[InternalRow],
       hashedRelation: HashedRelation,
       joinKeys: UnsafeProjection,
       joinRowWithStream: InternalRow => JoinedRow,
       joinRowWithBuild: InternalRow => JoinedRow,
       streamNullJoinRowWithBuild: => InternalRow => JoinedRow,
-      buildNullRow: GenericInternalRow): Iterator[InternalRow] = {
+      buildNullRow: GenericInternalRow,
+      full: Boolean): Iterator[InternalRow] = {

Review Comment:
   `full` -> `isFullOuterJoin`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on PR #41398:
URL: https://github.com/apache/spark/pull/41398#issuecomment-1573145056

   @szehon-ho Could you re-trigger the failed CI pipeline?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on PR #41398:
URL: https://github.com/apache/spark/pull/41398#issuecomment-1574071397

   > Only two minor comments left, otherwise looks good to me.
   
   @viirya Thanks a lot for taking a look! Since these are minor comments for tests, I will merge this PR first, we will follow up after @szehon-ho comes back from vacation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1212566624


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -57,6 +57,8 @@ case class ShuffledHashJoinExec(
 
   override def outputOrdering: Seq[SortOrder] = joinType match {
     case FullOuter => Nil
+    case LeftOuter if buildSide == BuildLeft => Nil
+    case RightOuter if buildSide == BuildRight => Nil

Review Comment:
   can we add some comments to explain why the ordering can't be preserved?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213341195


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -145,31 +157,38 @@ case class ShuffledHashJoinExec(
   }
 
   /**
-   * Full outer shuffled hash join with unique join keys:
+   * Shuffled hash join with unique join keys, where an outer side is the build side.
    * 1. Process rows from stream side by looking up hash relation.
    *    Mark the matched rows from build side be looked up.
    *    A bit set is used to track matched rows with key index.
    * 2. Process rows from build side by iterating hash relation.
    *    Filter out rows from build side being matched already,
    *    by checking key index from bit set.
    */
-  private def fullOuterJoinWithUniqueKey(
-      streamIter: Iterator[InternalRow],
-      hashedRelation: HashedRelation,
-      joinKeys: UnsafeProjection,
-      joinRowWithStream: InternalRow => JoinedRow,
-      joinRowWithBuild: InternalRow => JoinedRow,
-      streamNullJoinRowWithBuild: => InternalRow => JoinedRow,
-      buildNullRow: GenericInternalRow): Iterator[InternalRow] = {
+  private def buildSideOuterJoinUniqueKey(
+                                           streamIter: Iterator[InternalRow],

Review Comment:
   Oh I think my intellij auto-formatted it,let me change it back



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213338142


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -145,31 +157,38 @@ case class ShuffledHashJoinExec(
   }
 
   /**
-   * Full outer shuffled hash join with unique join keys:
+   * Shuffled hash join with unique join keys, where an outer side is the build side.
    * 1. Process rows from stream side by looking up hash relation.
    *    Mark the matched rows from build side be looked up.
    *    A bit set is used to track matched rows with key index.
    * 2. Process rows from build side by iterating hash relation.
    *    Filter out rows from build side being matched already,
    *    by checking key index from bit set.
    */
-  private def fullOuterJoinWithUniqueKey(
-      streamIter: Iterator[InternalRow],
-      hashedRelation: HashedRelation,
-      joinKeys: UnsafeProjection,
-      joinRowWithStream: InternalRow => JoinedRow,
-      joinRowWithBuild: InternalRow => JoinedRow,
-      streamNullJoinRowWithBuild: => InternalRow => JoinedRow,
-      buildNullRow: GenericInternalRow): Iterator[InternalRow] = {
+  private def buildSideOuterJoinUniqueKey(
+                                           streamIter: Iterator[InternalRow],

Review Comment:
   nit: 4 space indentation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213334184


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala:
##########
@@ -507,8 +507,6 @@ class JoinHintSuite extends PlanTest with SharedSparkSession with AdaptiveSparkP
           BuildLeft)
 
         // Shuffle-hash hint specified but not doable
-        assertBroadcastHashJoin(
-          sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1)" :: Nil, "left")), BuildRight)

Review Comment:
   You are right, done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sunchao commented on pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "sunchao (via GitHub)" <gi...@apache.org>.
sunchao commented on PR #41398:
URL: https://github.com/apache/spark/pull/41398#issuecomment-1574165244

   cc @c21 too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213994574


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -83,8 +88,10 @@ case class ShuffledHashJoinExec(
       iter,
       buildBoundKeys,
       taskMemoryManager = context.taskMemoryManager(),
-      // Full outer join needs support for NULL key in HashedRelation.
-      allowsNullKey = joinType == FullOuter,
+      // build-side outer join needs support for NULL key in HashedRelation.

Review Comment:
   Hi @viirya , i changed from :
   
   ```// Full outer join needs support for NULL key in HashedRelation.```
   to
   ```// build-side outer join needs support for NULL key in HashedRelation.```
   
   as now its a more general use-case that need the null key in HashedRelation, not just full outer join. 
   
   Ie, full outer join is one case of build-side outer join, which I mean,  join where outer-side is build-side



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1214043949


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -83,8 +88,10 @@ case class ShuffledHashJoinExec(
       iter,
       buildBoundKeys,
       taskMemoryManager = context.taskMemoryManager(),
-      // Full outer join needs support for NULL key in HashedRelation.
-      allowsNullKey = joinType == FullOuter,
+      // build-side outer join needs support for NULL key in HashedRelation.

Review Comment:
   I see how it may be read that way, let me clarify it.  I was originally hoping to find one phrase to capture full-outer join, right-outer + buildRight , left-outer + buildLeft 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho closed pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho closed pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join
URL: https://github.com/apache/spark/pull/41398


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1214000680


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -83,8 +88,10 @@ case class ShuffledHashJoinExec(
       iter,
       buildBoundKeys,
       taskMemoryManager = context.taskMemoryManager(),
-      // Full outer join needs support for NULL key in HashedRelation.
-      allowsNullKey = joinType == FullOuter,
+      // build-side outer join needs support for NULL key in HashedRelation.

Review Comment:
   `build-side outer join` sounds like left-outer or right-outer join specially. For full outer join, right/left side both can be build side, no?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213994574


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -83,8 +88,10 @@ case class ShuffledHashJoinExec(
       iter,
       buildBoundKeys,
       taskMemoryManager = context.taskMemoryManager(),
-      // Full outer join needs support for NULL key in HashedRelation.
-      allowsNullKey = joinType == FullOuter,
+      // build-side outer join needs support for NULL key in HashedRelation.

Review Comment:
   Hi @viirya , i changed from :
   
   ```// Full outer join needs support for NULL key in HashedRelation.```
   to
   ```// build-side outer join needs support for NULL key in HashedRelation.```
   
   as now its a more general use-case that need the null key in HashedRelation, not just full outer join



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1214052274


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -83,8 +88,10 @@ case class ShuffledHashJoinExec(
       iter,
       buildBoundKeys,
       taskMemoryManager = context.taskMemoryManager(),
-      // Full outer join needs support for NULL key in HashedRelation.
-      allowsNullKey = joinType == FullOuter,
+      // build-side outer join needs support for NULL key in HashedRelation.

Review Comment:
   OK i changed all the methods, and comments , hope it is more clear.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1214597485


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala:
##########
@@ -489,10 +489,16 @@ class JoinHintSuite extends PlanTest with SharedSparkSession with AdaptiveSparkP
         assertShuffleHashJoin(
           sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil)), BuildLeft)
         assertShuffleHashJoin(
-          sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil, "left")), BuildRight)
+          sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil, "left")), BuildLeft)

Review Comment:
   Why change this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1214605656


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -83,8 +88,10 @@ case class ShuffledHashJoinExec(
       iter,
       buildBoundKeys,
       taskMemoryManager = context.taskMemoryManager(),
-      // Full outer join needs support for NULL key in HashedRelation.
-      allowsNullKey = joinType == FullOuter,
+      // build-side outer join needs support for NULL key in HashedRelation.

Review Comment:
   Thank you.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1212455882


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -83,8 +85,10 @@ case class ShuffledHashJoinExec(
       iter,
       buildBoundKeys,
       taskMemoryManager = context.taskMemoryManager(),
-      // Full outer join needs support for NULL key in HashedRelation.
-      allowsNullKey = joinType == FullOuter,
+      // build-side outer join needs support for NULL key in HashedRelation.

Review Comment:
   We probably want to change or remove this line
   https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L130



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on PR #41398:
URL: https://github.com/apache/spark/pull/41398#issuecomment-1573236791

   Yea I couldn't reproduce errors, trying again.
   ```
   2023-06-01T19:34:30.5788895Z [error] 	org.apache.spark.sql.errors.QueryCompilationErrorsSuite
   2023-06-01T19:34:30.5791234Z [error] 	org.apache.spark.sql.errors.QueryExecutionErrorsSuite
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #41398:
URL: https://github.com/apache/spark/pull/41398#issuecomment-1571331319

   cc @maryannxue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on PR #41398:
URL: https://github.com/apache/spark/pull/41398#issuecomment-1572310899

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on PR #41398:
URL: https://github.com/apache/spark/pull/41398#issuecomment-1574104932

   Merged to master. Thanks @szehon-ho and et al.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1214045306


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -83,8 +88,10 @@ case class ShuffledHashJoinExec(
       iter,
       buildBoundKeys,
       taskMemoryManager = context.taskMemoryManager(),
-      // Full outer join needs support for NULL key in HashedRelation.
-      allowsNullKey = joinType == FullOuter,
+      // build-side outer join needs support for NULL key in HashedRelation.

Review Comment:
   Maybe "build side or full outer join"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213320729


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -145,31 +154,38 @@ case class ShuffledHashJoinExec(
   }
 
   /**
-   * Full outer shuffled hash join with unique join keys:
+   * Shuffled hash join with unique join keys, where an outer side is the build side.
    * 1. Process rows from stream side by looking up hash relation.
    *    Mark the matched rows from build side be looked up.
    *    A bit set is used to track matched rows with key index.
    * 2. Process rows from build side by iterating hash relation.
    *    Filter out rows from build side being matched already,
    *    by checking key index from bit set.
    */
-  private def fullOuterJoinWithUniqueKey(
+  private def buildSideOuterJoinUniqueKey(
       streamIter: Iterator[InternalRow],
       hashedRelation: HashedRelation,
       joinKeys: UnsafeProjection,
       joinRowWithStream: InternalRow => JoinedRow,
       joinRowWithBuild: InternalRow => JoinedRow,
       streamNullJoinRowWithBuild: => InternalRow => JoinedRow,
-      buildNullRow: GenericInternalRow): Iterator[InternalRow] = {
+      buildNullRow: GenericInternalRow,
+      full: Boolean): Iterator[InternalRow] = {

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho closed pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho closed pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join
URL: https://github.com/apache/spark/pull/41398


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on PR #41398:
URL: https://github.com/apache/spark/pull/41398#issuecomment-1571166360

   @szehon-ho Thanks for the PR! The change looks reasonable to me. I have left a few minor comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1215024738


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala:
##########
@@ -489,10 +489,16 @@ class JoinHintSuite extends PlanTest with SharedSparkSession with AdaptiveSparkP
         assertShuffleHashJoin(
           sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil)), BuildLeft)
         assertShuffleHashJoin(
-          sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil, "left")), BuildRight)
+          sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1, t2)" :: Nil, "left")), BuildLeft)

Review Comment:
   In this situation, t1 is smaller than t2, so it now picks t1.  Before it was not possible to pick t1 and so t2 was picked.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #41398:
URL: https://github.com/apache/spark/pull/41398#issuecomment-1569467725

   Thank you for making a PR, @szehon-ho .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1212437084


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala:
##########
@@ -507,8 +507,6 @@ class JoinHintSuite extends PlanTest with SharedSparkSession with AdaptiveSparkP
           BuildLeft)
 
         // Shuffle-hash hint specified but not doable
-        assertBroadcastHashJoin(
-          sql(equiJoinQueryWithHint("SHUFFLE_HASH(t1)" :: Nil, "left")), BuildRight)

Review Comment:
   Shall we still keep this `shuffle-hash` hint test but change the assert to `assertShuffleHashJoin`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213347419


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -219,14 +238,15 @@ case class ShuffledHashJoinExec(
    * the value indices of its tuples will be 0, 1 and 2.
    * Note that value indices of tuples with different keys are incomparable.
    */
-  private def fullOuterJoinWithNonUniqueKey(
+  private def buildSideOuterJoinNonUniqueKey(
       streamIter: Iterator[InternalRow],
       hashedRelation: HashedRelation,
       joinKeys: UnsafeProjection,
       joinRowWithStream: InternalRow => JoinedRow,
       joinRowWithBuild: InternalRow => JoinedRow,
       streamNullJoinRowWithBuild: => InternalRow => JoinedRow,
-      buildNullRow: GenericInternalRow): Iterator[InternalRow] = {
+      buildNullRow: GenericInternalRow,
+      full: Boolean): Iterator[InternalRow] = {

Review Comment:
   yep, changed as well



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213340830


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala:
##########
@@ -622,28 +632,23 @@ class JoinHintSuite extends PlanTest with SharedSparkSession with AdaptiveSparkP
         withLogAppender(hintAppender, level = Some(Level.WARN)) {
           assertBroadcastHashJoin(
             df1.join(df2.hint("BROADCAST"), $"a1" === $"b1", joinType), BuildRight)
-          assertShuffleHashJoin(
-            df1.join(df2.hint("SHUFFLE_HASH"), $"a1" === $"b1", joinType), BuildRight)
         }
 
         val logs = hintAppender.loggingEvents.map(_.getMessage.getFormattedMessage)
           .filter(_.contains("is not supported in the query:"))
         assert(logs.isEmpty)
       }
 
-      Seq("right_outer").foreach { joinType =>
+      Seq("left_semi", "left_anti").foreach { joinType =>
         val hintAppender = new LogAppender(s"join hint build side check for $joinType")
         withLogAppender(hintAppender, level = Some(Level.WARN)) {
-          assertShuffleMergeJoin(
-            df1.join(df2.hint("BROADCAST"), $"a1" === $"b1", joinType))
-          assertShuffleMergeJoin(
-            df1.join(df2.hint("SHUFFLE_HASH"), $"a1" === $"b1", joinType))

Review Comment:
   I think I mistyped after a long day, and accidentally got rid of the right_outer test, I reverted it.  
   
   But I took a look at the SPARK issue (SPARK-35221) and as it is testing whether there are warning messages, if we change to assertShuffleHashJoin, the warming msg check further down will not pass.  Let me know what you think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] szehon-ho commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213333365


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -57,6 +57,8 @@ case class ShuffledHashJoinExec(
 
   override def outputOrdering: Seq[SortOrder] = joinType match {
     case FullOuter => Nil
+    case LeftOuter if buildSide == BuildLeft => Nil
+    case RightOuter if buildSide == BuildRight => Nil

Review Comment:
   Sure, added a comment as per my understanding (please let me know if I misunderstand something)
   
   My thought was, because the second iteration on the build-side (for outer-join semantic) is on a hashedRelation, the result cannot be in order.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] huaxingao commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213356047


##########
sql/core/src/test/scala/org/apache/spark/sql/JoinHintSuite.scala:
##########
@@ -622,28 +632,23 @@ class JoinHintSuite extends PlanTest with SharedSparkSession with AdaptiveSparkP
         withLogAppender(hintAppender, level = Some(Level.WARN)) {
           assertBroadcastHashJoin(
             df1.join(df2.hint("BROADCAST"), $"a1" === $"b1", joinType), BuildRight)
-          assertShuffleHashJoin(
-            df1.join(df2.hint("SHUFFLE_HASH"), $"a1" === $"b1", joinType), BuildRight)
         }
 
         val logs = hintAppender.loggingEvents.map(_.getMessage.getFormattedMessage)
           .filter(_.contains("is not supported in the query:"))
         assert(logs.isEmpty)
       }
 
-      Seq("right_outer").foreach { joinType =>
+      Seq("left_semi", "left_anti").foreach { joinType =>
         val hintAppender = new LogAppender(s"join hint build side check for $joinType")
         withLogAppender(hintAppender, level = Some(Level.WARN)) {
-          assertShuffleMergeJoin(
-            df1.join(df2.hint("BROADCAST"), $"a1" === $"b1", joinType))
-          assertShuffleMergeJoin(
-            df1.join(df2.hint("SHUFFLE_HASH"), $"a1" === $"b1", joinType))

Review Comment:
   I see. Then no need to keep this. Thanks for checking the original issue!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a diff in pull request #41398: [SPARK-36612][SQL] Support left outer join build left or right outer join build right in shuffled hash join

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #41398:
URL: https://github.com/apache/spark/pull/41398#discussion_r1213992676


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -83,8 +88,10 @@ case class ShuffledHashJoinExec(
       iter,
       buildBoundKeys,
       taskMemoryManager = context.taskMemoryManager(),
-      // Full outer join needs support for NULL key in HashedRelation.
-      allowsNullKey = joinType == FullOuter,
+      // build-side outer join needs support for NULL key in HashedRelation.

Review Comment:
   Isn't the original comment still valid?



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala:
##########
@@ -83,8 +88,10 @@ case class ShuffledHashJoinExec(
       iter,
       buildBoundKeys,
       taskMemoryManager = context.taskMemoryManager(),
-      // Full outer join needs support for NULL key in HashedRelation.
-      allowsNullKey = joinType == FullOuter,
+      // build-side outer join needs support for NULL key in HashedRelation.

Review Comment:
   Isn't the original comment still valid? Why remove it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org