You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/04 20:10:16 UTC

[GitHub] [spark] sririshindra opened a new pull request #31477: [SPARK-34369][WEBUI] Track number of pairs processed out of Join.

sririshindra opened a new pull request #31477:
URL: https://github.com/apache/spark/pull/31477


   ### What changes were proposed in this pull request?
   Adding a new "number of matched pairs metric" to track number of pairs processed out of join.
   
   ### Why are the changes needed?
   Often users face a scenario where even a modest skew in a join can lead to tasks appearing to be stuck, due to the O(n^2) nature of a join considering all pairs of rows with matching keys. When this happens users think that spark has gotten deadlocked. If there is a bound condition, the "number of output rows" metric may look typical. Other metrics may look very modest (eg: shuffle read). In those cases, it is very hard to understand what the problem is. There is no conclusive proof without getting a heap dump and looking at some internal data structures.
   
   It would be much better if spark had a metric(which we propose be titled “number of matched pairs” as a companion to “number of output rows”) which showed the user how many pairs were being processed in the join. This would get updated in the live UI (when metrics get collected during heartbeats), so the user could easily see what was going on.
   
   This would even help in cases where there was some other cause of a stuck executor (eg. network issues) just to disprove this theory. For example, you may have 100k records with the same key on each side of a join. That probably won't really show up as extreme skew in task input data. But it'll become 10B join pairs that spark works through, in one task. See  the Jira description for an example.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, users will be able to see the new "number of matched pairs" metric whenever they look at a join node in the UI.
   
   
   ### How was this patch tested?
   Added unit tests
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-780046515


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39760/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570709608



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedPairs = longMetric("numMatchedPairs")
+
+  @transient private lazy val boundCondition: InternalRow => Boolean =
     if (condition.isDefined) {

Review comment:
       Ah, I see. What is the merit of being able to see this number for a user? I read the PR description and, at first, I thought this PR mainly targets at the case where a `condition` is given.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773819000


   **[Test build #134914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134914/testReport)** for PR 31477 at commit [`90f9b44`](https://github.com/apache/spark/commit/90f9b443dd4349df4b754ef4634b41898725844b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r578044382



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedRows = longMetric("numMatchedRows")

Review comment:
       Done. Fixed in latest commit.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -137,11 +144,20 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
   protected def streamSideKeyGenerator(): UnsafeProjection =
     UnsafeProjection.create(streamedBoundKeys)
 
-  @transient protected[this] lazy val boundCondition = if (condition.isDefined) {
-    Predicate.create(condition.get, streamedPlan.output ++ buildPlan.output).eval _
-  } else {
-    (r: InternalRow) => true
-  }
+  private val numMatchedRows = longMetric("numMatchedRows")

Review comment:
       Done. Fixed in latest commit.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-780967265


   Adding this metric seems fine, but I'm a bit worried if users can understand what `"number of matched rows"` exactly means for each join type. The current user's docs don't have a page to explain about each the SQL metrics in the web UI? I looked around the docs, but I couldn't find them.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r578055334



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -513,12 +527,17 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
          |boolean $conditionPassed = true;
          |${eval.trim}
          |if ($matched != null) {
+         |  $numMatched.add(1);
          |  ${ev.code}
          |  $conditionPassed = !${ev.isNull} && ${ev.value};
          |}
        """.stripMargin
     } else {
-      s"final boolean $conditionPassed = true;"
+      s"""
+         |if ($matched != null) {
+         |  $numMatched.add(1);
+         |}
+       """.stripMargin + s"final boolean $conditionPassed = true;"

Review comment:
       ```
         s"""
            |if ($matched != null) {
            |  $numMatched.add(1);
            |}
            |final boolean $conditionPassed = true;
          """.stripMargin
   ```
   ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r577480847



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedRows = longMetric("numMatchedRows")

Review comment:
       shouldn't this be `lazy`? otherwise `metrics` becomes non lazy...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773883741


   **[Test build #134905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134905/testReport)** for PR 31477 at commit [`d73561c`](https://github.com/apache/spark/commit/d73561c3c9f4862e13d565c556b192b0c7a355cb).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787631581


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135569/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773692087


   **[Test build #134895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134895/testReport)** for PR 31477 at commit [`e2b2a81`](https://github.com/apache/spark/commit/e2b2a811a9384d253d2c5ba7776d3443f11b2a46).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773972339


   **[Test build #134914 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134914/testReport)** for PR 31477 at commit [`90f9b44`](https://github.com/apache/spark/commit/90f9b443dd4349df4b754ef4634b41898725844b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-788462725


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135607/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r576938141



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -137,10 +144,19 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
   protected def streamSideKeyGenerator(): UnsafeProjection =
     UnsafeProjection.create(streamedBoundKeys)
 
-  @transient protected[this] lazy val boundCondition = if (condition.isDefined) {
-    Predicate.create(condition.get, streamedPlan.output ++ buildPlan.output).eval _
-  } else {
-    (r: InternalRow) => true
+  private val numMatchedRows = longMetric("numMatchedRows")
+
+  @transient protected[this] lazy val boundCondition: InternalRow => Boolean =
+    if (condition.isDefined) {
+      (r: InternalRow) => {
+        numMatchedRows += 1
+        Predicate.create(condition.get, streamedPlan.output ++ buildPlan.output).eval(r)
+      }
+    } else {
+      (_: InternalRow) => {
+        numMatchedRows += 1
+        true
+      }
   }

Review comment:
       Done. Fixed in the latest commit.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -433,6 +450,7 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
     } else {
       ""
     }
+    val checkCondition = s"$numMatched.add(1);\n$conditionDef"

Review comment:
       Done. Fixed in the latest commit.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -716,6 +741,7 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
     } else {
       s"$existsVar = true;"
     }
+    val checkCondition = s"$numMatched.add(1);\n$conditionDef"

Review comment:
       @dongjoon-hyun I am slightly confused about this part. I added `$numMatched.add(1);` within the checkCondition itself. I did this by adding it directly  before line 742 and line 737 directly. That way every time checkCondition is invoked the metric is incremented. I hope that is what you are looking for. If not, could you please elaborate where I am making a mistake. 

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
##########
@@ -277,20 +277,34 @@ class SQLMetricsSuite extends SharedSparkSession with SQLMetricsTestUtils
     withTempView("testDataForJoin") {
       // Assume the execution plan is
       // ... -> SortMergeJoin(nodeId = 1) -> TungstenProject(nodeId = 0)
-      val query = "SELECT * FROM testData2 JOIN testDataForJoin ON testData2.a = testDataForJoin.a"
+      val query1 = "SELECT * FROM testData2 JOIN testDataForJoin ON testData2.a = testDataForJoin.a"
       Seq((0L, 2L, false), (1L, 4L, true)).foreach { case (nodeId1, nodeId2, enableWholeStage) =>
-        val df = spark.sql(query)
+        val df = spark.sql(query1)
         testSparkPlanMetrics(df, 1, Map(
           nodeId1 -> (("SortMergeJoin", Map(
             // It's 4 because we only read 3 rows in the first partition and 1 row in the second one
-            "number of output rows" -> 4L))),
+            "number of output rows" -> 4L,
+            "number of matched rows" -> 4L))),
           nodeId2 -> (("Exchange", Map(
             "records read" -> 4L,
             "local blocks read" -> 2L,
             "remote blocks read" -> 0L,
             "shuffle records written" -> 2L)))),
           enableWholeStage
         )
+
+        val query2 = "SELECT * FROM testData2 JOIN testDataForJoin ON " +
+          "testData2.a = testDataForJoin.a AND testData2.b <= testDataForJoin.b"
+        val df2 = spark.sql(query2)
+        Seq(false, true).foreach { case  enableWholeStage =>
+          testSparkPlanMetrics(df2, 1, Map(
+            0L -> ("SortMergeJoin", Map(
+              "number of output rows" -> 3L,
+              "number of matched rows" -> 4L))),
+            enableWholeStage
+          )
+        }
+

Review comment:
       Done. Fixed in the latest commit.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
##########
@@ -277,20 +277,34 @@ class SQLMetricsSuite extends SharedSparkSession with SQLMetricsTestUtils
     withTempView("testDataForJoin") {
       // Assume the execution plan is
       // ... -> SortMergeJoin(nodeId = 1) -> TungstenProject(nodeId = 0)
-      val query = "SELECT * FROM testData2 JOIN testDataForJoin ON testData2.a = testDataForJoin.a"
+      val query1 = "SELECT * FROM testData2 JOIN testDataForJoin ON testData2.a = testDataForJoin.a"
       Seq((0L, 2L, false), (1L, 4L, true)).foreach { case (nodeId1, nodeId2, enableWholeStage) =>
-        val df = spark.sql(query)
+        val df = spark.sql(query1)
         testSparkPlanMetrics(df, 1, Map(
           nodeId1 -> (("SortMergeJoin", Map(
             // It's 4 because we only read 3 rows in the first partition and 1 row in the second one
-            "number of output rows" -> 4L))),
+            "number of output rows" -> 4L,
+            "number of matched rows" -> 4L))),
           nodeId2 -> (("Exchange", Map(
             "records read" -> 4L,
             "local blocks read" -> 2L,
             "remote blocks read" -> 0L,
             "shuffle records written" -> 2L)))),
           enableWholeStage
         )
+
+        val query2 = "SELECT * FROM testData2 JOIN testDataForJoin ON " +

Review comment:
       Yes, that was a mistake. Fixed in the latest commit.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
##########
@@ -303,18 +317,35 @@ class SQLMetricsSuite extends SharedSparkSession with SQLMetricsTestUtils
     withTempView("testDataForJoin") {
       // Assume the execution plan is
       // ... -> SortMergeJoin(nodeId = 1) -> TungstenProject(nodeId = 0)
-      val leftJoinQuery = "SELECT * FROM testData2 left JOIN testDataForJoin ON " +
+      val query1 = "SELECT * FROM testData2 LEFT JOIN testDataForJoin ON " +
         "testData2.a = testDataForJoin.a"
-      val rightJoinQuery = "SELECT * FROM testDataForJoin right JOIN testData2 ON " +
+      val query2 = "SELECT * FROM testDataForJoin RIGHT JOIN testData2 ON " +
         "testData2.a = testDataForJoin.a"
-
-      Seq((leftJoinQuery, false), (leftJoinQuery, true), (rightJoinQuery, false),
-        (rightJoinQuery, true)).foreach { case (query, enableWholeStage) =>
+      val query3 = "SELECT * FROM testData2 RIGHT JOIN testDataForJoin ON " +
+        "testData2.a = testDataForJoin.a"
+      val query4 = "SELECT * FROM testData2 FULL OUTER JOIN testDataForJoin ON " +
+        "testData2.a = testDataForJoin.a"
+      val boundCondition1 = " AND testData2.b >= testDataForJoin.b"
+      val boundCondition2 = " AND testData2.a >= testDataForJoin.b"
+
+      Seq((query1, 8L, false),
+        (query1 + boundCondition1, 7L, false),
+        (query1 + boundCondition1, 7L, true),
+        (query3 + boundCondition2, 3L, false),
+        (query3 + boundCondition2, 3L, true),
+        (query4, 8L, false),
+        (query4, 8L, true),
+        (query4 + boundCondition1, 7L, false),
+        (query4 + boundCondition1, 7L, true),
+        (query1, 8L, true),
+        (query2, 8L, false),
+        (query2, 8L, true)).foreach { case (query, rows, enableWholeStage) =>
         val df = spark.sql(query)
         testSparkPlanMetrics(df, 1, Map(
           0L -> (("SortMergeJoin", Map(
             // It's 8 because we read 6 rows in the left and 2 row in the right one

Review comment:
       Yes, I removed the comment entirely as there are multiple queries and the comment is no longer valid for all the queries.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
##########
@@ -324,13 +355,22 @@ class SQLMetricsSuite extends SharedSparkSession with SQLMetricsTestUtils
   test("BroadcastHashJoin metrics") {
     val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
     val df2 = Seq((1, "1"), (2, "2"), (3, "3"), (4, "4")).toDF("key", "value")
-    // Assume the execution plan is
-    // ... -> BroadcastHashJoin(nodeId = 1) -> TungstenProject(nodeId = 0)
-    Seq((1L, false), (2L, true)).foreach { case (nodeId, enableWholeStage) =>
-      val df = df1.join(broadcast(df2), "key")
+    val df3 = Seq((1, 1), (2, 4)).toDF("key", "value1")
+    val df4 = Seq((1, 1), (2, 2), (3, 3), (4, 4)).toDF("key", "value2")
+
+    Seq((false, df1, df2, 1L, 2L, false),
+      (false, df1, df2, 2L, 2L, true),
+      (true, df3, df4, 2L, 1L, true),
+      (true, df3, df4, 1L, 1L, false)
+    ).foreach { case (boundCondition, dfLeft, dfRight, nodeId, rows, enableWholeStage) =>
+      var df = dfLeft.join(broadcast(dfRight), "key")
+      if (boundCondition) {
+        df = df.filter("value1 > value2")
+      }
       testSparkPlanMetrics(df, 2, Map(
         nodeId -> (("BroadcastHashJoin", Map(
-          "number of output rows" -> 2L)))),
+          "number of output rows" -> rows,
+          "number of matched rows" -> 2L)))),

Review comment:
       Done. Fixed in the latest commit.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
##########
@@ -369,23 +409,38 @@ class SQLMetricsSuite extends SharedSparkSession with SQLMetricsTestUtils
     }
   }
 
-  test("ShuffledHashJoin(left, outer) metrics") {
-    val leftDf = Seq((1, "1"), (2, "2")).toDF("key", "value")
-    val rightDf = (1 to 10).map(i => (i, i.toString)).toSeq.toDF("key2", "value")
-    Seq((0L, "right_outer", leftDf, rightDf, 10L, false),
-      (0L, "left_outer", rightDf, leftDf, 10L, false),
-      (1L, "right_outer", leftDf, rightDf, 10L, true),
-      (1L, "left_outer", rightDf, leftDf, 10L, true),
-      (2L, "left_anti", rightDf, leftDf, 8L, true),
-      (2L, "left_semi", rightDf, leftDf, 2L, true),
-      (1L, "left_anti", rightDf, leftDf, 8L, false),
-      (1L, "left_semi", rightDf, leftDf, 2L, false))
-      .foreach { case (nodeId, joinType, leftDf, rightDf, rows, enableWholeStage) =>
+  test("ShuffledHashJoin(left/right/outer/semi/anti, outer) metrics") {

Review comment:
       Done! fixed in latest commit.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra edited a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra edited a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781029644


   > Adding this metric seems fine, but I'm a bit worried if users can understand what `"number of matched rows"` exactly means for each join type. The current user's docs don't have a page to explain about each the SQL metrics in the web UI? I looked around the docs, but I couldn't find them.
   
   I agree that it might be confusing to the users. Would you like me to add some documentation here [(monitoring).](https://spark.apache.org/docs/latest/monitoring.html) explaining what the metric measures for each join type? cc: @maropu 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773797582


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39487/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-780046515


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39760/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-799612845


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40654/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773752117


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39481/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] closed pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #31477:
URL: https://github.com/apache/spark/pull/31477


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773773216


   **[Test build #134905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134905/testReport)** for PR 31477 at commit [`d73561c`](https://github.com/apache/spark/commit/d73561c3c9f4862e13d565c556b192b0c7a355cb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787529338


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135564/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] github-actions[bot] commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-867237181


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r575607547



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
##########
@@ -369,23 +409,38 @@ class SQLMetricsSuite extends SharedSparkSession with SQLMetricsTestUtils
     }
   }
 
-  test("ShuffledHashJoin(left, outer) metrics") {
-    val leftDf = Seq((1, "1"), (2, "2")).toDF("key", "value")
-    val rightDf = (1 to 10).map(i => (i, i.toString)).toSeq.toDF("key2", "value")
-    Seq((0L, "right_outer", leftDf, rightDf, 10L, false),
-      (0L, "left_outer", rightDf, leftDf, 10L, false),
-      (1L, "right_outer", leftDf, rightDf, 10L, true),
-      (1L, "left_outer", rightDf, leftDf, 10L, true),
-      (2L, "left_anti", rightDf, leftDf, 8L, true),
-      (2L, "left_semi", rightDf, leftDf, 2L, true),
-      (1L, "left_anti", rightDf, leftDf, 8L, false),
-      (1L, "left_semi", rightDf, leftDf, 2L, false))
-      .foreach { case (nodeId, joinType, leftDf, rightDf, rows, enableWholeStage) =>
+  test("ShuffledHashJoin(left/right/outer/semi/anti, outer) metrics") {

Review comment:
       Please add `SPARK-34369: ` test prefix.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781143321


   **[Test build #135216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135216/testReport)** for PR 31477 at commit [`79bdb82`](https://github.com/apache/spark/commit/79bdb822a0da7d413456367180e021d205f31253).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r594431673



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedRows" -> SQLMetrics.createMetric(sparkContext, "number of matched rows"))

Review comment:
       Update: In my most recent commit I fixed a perf issue that is causing a significant degradation for query-16 & quer-94. However there still seems to be some perf degradation that I cannot fix. Maybe it is because adding the new metric itself or maybe there is something wrong with the way that I am running the tests. I will run the benchmark again on a bare metal machine and see if it makes any difference. I will also see if there is anything else that can be done to improve the performance. My most recent runs are shared in this doc. https://docs.google.com/spreadsheets/d/1A0jCx5BL0wcN28oDQDOs73QhVZ71uNbKDg9Q8S0IE6g/edit?usp=sharing




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787496437


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40145/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-799594659


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40654/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-779970107


   **[Test build #135179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135179/testReport)** for PR 31477 at commit [`8910350`](https://github.com/apache/spark/commit/89103509b32c8b6ededc1600f6ecdcc9ef1f08d7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r585673885



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedRows" -> SQLMetrics.createMetric(sparkContext, "number of matched rows"))

Review comment:
       This seems like a "progress bar" instead of a metrics. Most likely people don't care about it because it's usually `num output rows of left` * `num output rows of right`.
   
   Have we measured the perf overhead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773897291


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134905/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787490828


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40145/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r585808859



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedRows" -> SQLMetrics.createMetric(sparkContext, "number of matched rows"))

Review comment:
       I am running the TPC-DS benchmark right now. I will post the results shortly.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787528981


   **[Test build #135564 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135564/testReport)** for PR 31477 at commit [`0d85ec6`](https://github.com/apache/spark/commit/0d85ec626273d18a143895ae140e3720f8217163).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787569162


   **[Test build #135569 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135569/testReport)** for PR 31477 at commit [`4f0817c`](https://github.com/apache/spark/commit/4f0817ce49ad75e5bc6f128d983f49668d017db6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787595910


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40150/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781053133


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39798/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-780982376


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39789/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-788236717


   **[Test build #135607 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135607/testReport)** for PR 31477 at commit [`4f0817c`](https://github.com/apache/spark/commit/4f0817ce49ad75e5bc6f128d983f49668d017db6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570672502



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedPairs = longMetric("numMatchedPairs")
+
+  @transient private lazy val boundCondition: InternalRow => Boolean =
     if (condition.isDefined) {

Review comment:
       If the condition is empty, then `numOutputRows == numMatchedPairs` only in the case of InnerJoin. It is not necessarily the case for other types of join, because numMatchedPairs is only measuring the records when there is a common key on both the build side and the stream side.  For instance, in the case of leftOuterJoin `numOutputRows  = numMatchedPairs  + (number of non matched pairs on the left side of the join)`.
   Moreover, if we define a metric and fail to populate it for some cases, then the existing unit tests in the SQLMetricsSuite will fail because of the way the unit tests fetch the actual metrics. 

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       Sure, do you also want me to change the display name as well, `number of matched pairs` -> `number of joined rows`? I feel that `number of matched pairs` might be more precise and less confusing to the user. But if you think that  `number of joined rows` better I will change the name of the metric accordingly.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedPairs = longMetric("numMatchedPairs")
+
+  @transient private lazy val boundCondition: InternalRow => Boolean =
     if (condition.isDefined) {

Review comment:
       If the condition is empty, then `numOutputRows == numMatchedPairs` only in the case of InnerJoin. It is not necessarily the case for other types of join, because numMatchedPairs is only measuring the records when there is a common key on both the build side and the stream side.  For instance, in the case of leftOuterJoin `numOutputRows  = numMatchedPairs  + (number of non matched pairs on the left side of the join)`.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -415,24 +431,26 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
       ctx: CodegenContext,
       input: Seq[ExprCode]): (String, String, Seq[ExprCode]) = {
     val matched = ctx.freshName("matched")
+    val numMatched = metricTerm(ctx, "numMatchedPairs")
     val buildVars = genBuildSideVars(ctx, matched)
-    val checkCondition = if (condition.isDefined) {
-      val expr = condition.get
-      // evaluate the variables from build side that used by condition
-      val eval = evaluateRequiredVariables(buildPlan.output, buildVars, expr.references)
-      // filter the output via condition
-      ctx.currentVars = input ++ buildVars
-      val ev =
-        BindReferences.bindReference(expr, streamedPlan.output ++ buildPlan.output).genCode(ctx)
-      val skipRow = s"${ev.isNull} || !${ev.value}"
-      s"""
-         |$eval
-         |${ev.code}
-         |if (!($skipRow))
-       """.stripMargin
-    } else {
-      ""
-    }
+    val checkCondition = s"$numMatched.add(1);\n" +
+      (if (condition.isDefined) {
+        val expr = condition.get
+        // evaluate the variables from build side that used by condition
+        val eval = evaluateRequiredVariables(buildPlan.output, buildVars, expr.references)
+        // filter the output via condition
+        ctx.currentVars = input ++ buildVars
+        val ev =
+          BindReferences.bindReference(expr, streamedPlan.output ++ buildPlan.output).genCode(ctx)
+        val skipRow = s"${ev.isNull} || !${ev.value}"
+        s"""
+           |$eval
+           |${ev.code}
+           |if (!($skipRow))
+         """.stripMargin
+      } else {
+        ""
+      })

Review comment:
       Fixed in latest commit.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       Sure, do you also want me to change the display name as well, `number of matched pairs` -> `number of joined rows`? I feel that `number of matched pairs` might be more precise and less confusing to the user. But if you think that  `number of joined rows` is better I will change the name of the metric accordingly.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       How about `number of matched rows` instead?  I wanted highlight the fact that this metric measures the number of keys matched between the stream side and the build side.  I also felt that users might not recognize the difference between `number of joined rows` and `number of output rows`. But `number of joined rows` is fine as well. If you think that users will understand that better I will change the name of the metric to `number of joined rows`

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedPairs = longMetric("numMatchedPairs")
+
+  @transient private lazy val boundCondition: InternalRow => Boolean =
     if (condition.isDefined) {

Review comment:
       - Yes, the metric is very useful especially when there is a condition. It helps the users to recognize skew. But even if there is no condition, it could help the users to see the number of matched rows. For instance, in case of outer joins users will be able to see just how many rows matched between the stream side and the build side without explicitly doing another inner join.   
   
   - I also faced an implementation problem when I originally tried to show the metric only when there is a condition. If there is a metric defined, but if the metric is not used in some cases (For example if we don't increment the metric when there is no condition), then the tests in SQLMetricsSuite will fail because the accumulatorIds corresponding the metrics would get messed up. The size of metric values in `val metricValues = statusStore.executionMetrics(executionId)` and the metrics in `node.metrics` would become inconsistent. 
   
   - It might also confuse the users to see the metric in some cases and not see the metric in other cases. It would probably be best to show the metric everytime there is a join node in the UI to avoid any confusion.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       Updated the name of the metric to `number of matched rows` in the latest commit.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773845253


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134898/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-788236717


   **[Test build #135607 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135607/testReport)** for PR 31477 at commit [`4f0817c`](https://github.com/apache/spark/commit/4f0817ce49ad75e5bc6f128d983f49668d017db6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570720842



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedPairs = longMetric("numMatchedPairs")
+
+  @transient private lazy val boundCondition: InternalRow => Boolean =
     if (condition.isDefined) {

Review comment:
       - Yes, the metric is very useful especially when there is a condition. It helps the users to recognize skew. But even if there is no condition, it could help the users to see the number of matched rows. For instance, in case of outer joins users will be able to see just how many rows matched between the stream side and the build side without explicitly doing another inner join.   
   
   - I also faced an implementation problem when I originally tried to show the metric only when there is a condition. If there is a metric defined, but if the metric is not used in some cases (For example if we don't increment the metric when there is no condition), then the tests in SQLMetricsSuite will fail because the accumulatorIds corresponding the metrics would get messed up. The size of metric values in `val metricValues = statusStore.executionMetrics(executionId)` and the metrics in `node.metrics` would become inconsistent. 
   
   - It might also confuse the users to see the metric in some cases and not see the metric in other cases. It would probably be best to show the metric everytime there is a join node in the UI to avoid any confusion.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570672502



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedPairs = longMetric("numMatchedPairs")
+
+  @transient private lazy val boundCondition: InternalRow => Boolean =
     if (condition.isDefined) {

Review comment:
       If the condition is empty, then `numOutputRows == numMatchedPairs` only in the case of InnerJoin. It is not necessarily the case for other types of join, because numMatchedPairs is only measuring the records when there is a common key on both the build side and the stream side.  For instance, in the case of leftOuterJoin `numOutputRows  = numMatchedPairs  + (number of non matched pairs on the left side of the join)`.
   Moreover, if we define a metric and fail to populate it for some cases, then the existing unit tests in the SQLMetricsSuite will fail because of the way the unit tests fetch the actual metrics. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570703692



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       >  I feel that number of matched pairs might be more precise and less confusing to the user.
   
   Any reason? At first, I was not sure about what `pairs` means though.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773844202


   **[Test build #134898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134898/testReport)** for PR 31477 at commit [`e2b2a81`](https://github.com/apache/spark/commit/e2b2a811a9384d253d2c5ba7776d3443f11b2a46).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570673933



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       Sure, do you also want me to change the display name as well, `number of matched pairs` -> `number of joined rows`? I feel that `number of matched pairs` might be more precise and less confusing to the user. But if you think that  `number of joined rows` is better I will change the name of the metric accordingly.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-788462725


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135607/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787499889


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40145/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773993042


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134914/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra edited a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra edited a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781029644


   > Adding this metric seems fine, but I'm a bit worried if users can understand what `"number of matched rows"` exactly means for each join type. The current user's docs don't have a page to explain about each the SQL metrics in the web UI? I looked around the docs, but I couldn't find them.
   
   I agree that it might be confusing to the users. Would you like me to add some documentation here [(monitoring).](https://spark.apache.org/docs/latest/monitoring.html) explaining what the metric does for each join type? cc: @maropu 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773867180


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39497/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
bersprockets commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-788229686


   jenkins, retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r575606176



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -137,10 +144,19 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
   protected def streamSideKeyGenerator(): UnsafeProjection =
     UnsafeProjection.create(streamedBoundKeys)
 
-  @transient protected[this] lazy val boundCondition = if (condition.isDefined) {
-    Predicate.create(condition.get, streamedPlan.output ++ buildPlan.output).eval _
-  } else {
-    (r: InternalRow) => true
+  private val numMatchedRows = longMetric("numMatchedRows")
+
+  @transient protected[this] lazy val boundCondition: InternalRow => Boolean =
+    if (condition.isDefined) {
+      (r: InternalRow) => {
+        numMatchedRows += 1
+        Predicate.create(condition.get, streamedPlan.output ++ buildPlan.output).eval(r)
+      }
+    } else {
+      (_: InternalRow) => {
+        numMatchedRows += 1
+        true
+      }
   }

Review comment:
       Indentation?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781190930


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135216/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-788230752


   **[Test build #135610 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135610/testReport)** for PR 31477 at commit [`4f0817c`](https://github.com/apache/spark/commit/4f0817ce49ad75e5bc6f128d983f49668d017db6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781036234


   **[Test build #135216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135216/testReport)** for PR 31477 at commit [`79bdb82`](https://github.com/apache/spark/commit/79bdb822a0da7d413456367180e021d205f31253).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773742165


   **[Test build #134898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134898/testReport)** for PR 31477 at commit [`e2b2a81`](https://github.com/apache/spark/commit/e2b2a811a9384d253d2c5ba7776d3443f11b2a46).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787569162


   **[Test build #135569 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135569/testReport)** for PR 31477 at commit [`4f0817c`](https://github.com/apache/spark/commit/4f0817ce49ad75e5bc6f128d983f49668d017db6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787552221


   > A new suite is preferred here. Creating a new suite for Join-specific SQLMetrics is preferred here, because it has to cover various different cases, including different join types, different join algorithms, whole-stage on/off, and different runtime shortcuts. Also, stream-stream and batch-stream joins should also be tested.
   
   @gatorsmile I refactored the existing test Suite and created a new Suite that contains all the join specifc tests.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
bersprockets commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-788209359


   Retest this please.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773813062






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-788230752


   **[Test build #135610 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135610/testReport)** for PR 31477 at commit [`4f0817c`](https://github.com/apache/spark/commit/4f0817ce49ad75e5bc6f128d983f49668d017db6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781078105


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39798/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570727456



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       Updated the name of the metric to `number of matched rows` in the latest commit.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787631581


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135569/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773692087






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r588352442



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedRows" -> SQLMetrics.createMetric(sparkContext, "number of matched rows"))

Review comment:
       Update: I figured out why the perf degradation is occurring for query-16 & query-94 and fixed it. I ran the benchmark for a couple of iterations and looks like my fix is working. I am running the benchmark with 25 iterations to identify if any other query has been effected.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra edited a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra edited a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781061053


   > > Would you like me to add some documentation here (monitoring). explaining what the metric measures for each join type? cc: @maropu
   > 
   > Yea, I think It would be very helpful if you could do it . Since these metrics are SQL-specific, so it might be better to write them somewhere in [the Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html), then add a link to it in the monitoring page. cc: @gatorsmile @HyukjinKwon
   
   Ok, I will try to get it done tomorrow.
   
   Update: @maropu Sorry, I didn't get time to finish adding the documentation you asked for on the spark-website. I am trying to finish it today. Sorry for the delay.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787484008


   **[Test build #135564 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135564/testReport)** for PR 31477 at commit [`0d85ec6`](https://github.com/apache/spark/commit/0d85ec626273d18a143895ae140e3720f8217163).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781029644


   > Adding this metric seems fine, but I'm a bit worried if users can understand what `"number of matched rows"` exactly means for each join type. The current user's docs don't have a page to explain about each the SQL metrics in the web UI? I looked around the docs, but I couldn't find them.
   
   I agree that it might be confusing to the users. Would you like me to add some documentation here [(monitoring).](https://spark.apache.org/docs/latest/monitoring.html) explaining what the metric does for each join type?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570673933



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       Sure, do you also want me to change the display name as well, `number of matched pairs` -> `number of joined rows`? I feel that `number of matched pairs` might be more precise and less confusing to the user. But if you think that  `number of joined rows` better I will change the name of the metric accordingly.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773692087






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781008314


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39789/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787499889


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40145/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r585673885



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedRows" -> SQLMetrics.createMetric(sparkContext, "number of matched rows"))

Review comment:
       This seems like a "progress bar" instead of a metrics. Most likely people don't care about it because it's usually `num output rows of left` * `num output rows of right`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][WEBUI][WIP] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773574298


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781031652


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135208/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r578113277



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -513,12 +527,17 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
          |boolean $conditionPassed = true;
          |${eval.trim}
          |if ($matched != null) {
+         |  $numMatched.add(1);
          |  ${ev.code}
          |  $conditionPassed = !${ev.isNull} && ${ev.value};
          |}
        """.stripMargin
     } else {
-      s"final boolean $conditionPassed = true;"
+      s"""
+         |if ($matched != null) {
+         |  $numMatched.add(1);
+         |}
+       """.stripMargin + s"final boolean $conditionPassed = true;"

Review comment:
       Done. Fixed in latest commit.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773727734


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39481/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773787447


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39487/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-788460553


   **[Test build #135607 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135607/testReport)** for PR 31477 at commit [`4f0817c`](https://github.com/apache/spark/commit/4f0817ce49ad75e5bc6f128d983f49668d017db6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781008314


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39789/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781061053


   
   
   
   > > Would you like me to add some documentation here (monitoring). explaining what the metric measures for each join type? cc: @maropu
   > 
   > Yea, I think It would be very helpful if you could do it . Since these metrics are SQL-specific, so it might be better to write them somewhere in [the Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html), then add a link to it in the monitoring page. cc: @gatorsmile @HyukjinKwon
   
   Ok, I will try to get it done tomorrow.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773681938


   ok to test


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-780129404


   **[Test build #135179 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135179/testReport)** for PR 31477 at commit [`8910350`](https://github.com/apache/spark/commit/89103509b32c8b6ededc1600f6ecdcc9ef1f08d7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] c21 commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
c21 commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r577343739



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -48,6 +48,13 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
     s"$nodeName $joinType ${buildSide} ($opId)".trim
   }
 
+  /**
+   * When overriding metrics be sure to include numOutputRows & numMatchedRows
+   */
+  override lazy val metrics = Map(
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedRows" -> SQLMetrics.createMetric(sparkContext, "number of matched rows"))

Review comment:
       I feel it's kind of weird that we need to double define `metrics` here in `HashJoin`, and also its subclass `BroadcastHashJoinExec` and `ShuffledHashJoinExec`. It might introduce unexpected bug later. Could we pass the `numMatchedRows` to `HashJoin`, similar to what we did for `numOutputRows` before?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-780957202


   **[Test build #135208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135208/testReport)** for PR 31477 at commit [`fdfeadb`](https://github.com/apache/spark/commit/fdfeadbfe17e6231448db74c7a2013eb05c107cc).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra edited a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra edited a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787552318


   > @sririshindra could you check if no performance (e.g., TPCDS queries) changes, too?
   
   Sure, I am working that right now. cc: @maropu 
   
   Update: I ran the TPCDS queries both with and without this PR and I did not see any performance degradation in the query execution times. I will run it on a much larger dataset and with more iterations and if that causes any perf degradation.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-780996768


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39789/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
sririshindra commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773638159


   Could you please take a look at this PR. cc: @maropu @dongjoon-hyun 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r585976985



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedRows" -> SQLMetrics.createMetric(sparkContext, "number of matched rows"))

Review comment:
       I ran the benchmark and I see a clear performance degradation for query-16 and query-94. All the other queries seem to be fine. I will debug why there is a performance degradation for  query-16 & query-94. https://docs.google.com/document/d/16_cjk3g3_7mMH2C_T9bsFroxaH_CihJkXINYirt-mtg/edit?usp=sharing 
   cc: @cloud-fan 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-799612845


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40654/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-799574445


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40654/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773867180


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39497/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787628988


   **[Test build #135569 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135569/testReport)** for PR 31477 at commit [`4f0817c`](https://github.com/apache/spark/commit/4f0817ce49ad75e5bc6f128d983f49668d017db6).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra edited a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra edited a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787552318


   > @sririshindra could you check if no performance (e.g., TPCDS queries) changes, too?
   
   Sure, I am working that right now. cc: @maropu 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787484008


   **[Test build #135564 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135564/testReport)** for PR 31477 at commit [`0d85ec6`](https://github.com/apache/spark/commit/0d85ec626273d18a143895ae140e3720f8217163).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r577493123



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -48,6 +48,13 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
     s"$nodeName $joinType ${buildSide} ($opId)".trim
   }
 
+  /**
+   * When overriding metrics be sure to include numOutputRows & numMatchedRows
+   */
+  override lazy val metrics = Map(
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedRows" -> SQLMetrics.createMetric(sparkContext, "number of matched rows"))

Review comment:
       But do we need to define `metrics` here at all? If `numMatchedRows` becomes `lazy` as it should be, then probably we don't need to have `metrics` here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773819000


   **[Test build #134914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134914/testReport)** for PR 31477 at commit [`90f9b44`](https://github.com/apache/spark/commit/90f9b443dd4349df4b754ef4634b41898725844b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781031652


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135208/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r575607066



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
##########
@@ -277,20 +277,34 @@ class SQLMetricsSuite extends SharedSparkSession with SQLMetricsTestUtils
     withTempView("testDataForJoin") {
       // Assume the execution plan is
       // ... -> SortMergeJoin(nodeId = 1) -> TungstenProject(nodeId = 0)
-      val query = "SELECT * FROM testData2 JOIN testDataForJoin ON testData2.a = testDataForJoin.a"
+      val query1 = "SELECT * FROM testData2 JOIN testDataForJoin ON testData2.a = testDataForJoin.a"
       Seq((0L, 2L, false), (1L, 4L, true)).foreach { case (nodeId1, nodeId2, enableWholeStage) =>
-        val df = spark.sql(query)
+        val df = spark.sql(query1)
         testSparkPlanMetrics(df, 1, Map(
           nodeId1 -> (("SortMergeJoin", Map(
             // It's 4 because we only read 3 rows in the first partition and 1 row in the second one
-            "number of output rows" -> 4L))),
+            "number of output rows" -> 4L,
+            "number of matched rows" -> 4L))),
           nodeId2 -> (("Exchange", Map(
             "records read" -> 4L,
             "local blocks read" -> 2L,
             "remote blocks read" -> 0L,
             "shuffle records written" -> 2L)))),
           enableWholeStage
         )
+
+        val query2 = "SELECT * FROM testData2 JOIN testDataForJoin ON " +

Review comment:
       Why this is inside the `foreach` of line 281?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r575606910



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
##########
@@ -277,20 +277,34 @@ class SQLMetricsSuite extends SharedSparkSession with SQLMetricsTestUtils
     withTempView("testDataForJoin") {
       // Assume the execution plan is
       // ... -> SortMergeJoin(nodeId = 1) -> TungstenProject(nodeId = 0)
-      val query = "SELECT * FROM testData2 JOIN testDataForJoin ON testData2.a = testDataForJoin.a"
+      val query1 = "SELECT * FROM testData2 JOIN testDataForJoin ON testData2.a = testDataForJoin.a"
       Seq((0L, 2L, false), (1L, 4L, true)).foreach { case (nodeId1, nodeId2, enableWholeStage) =>
-        val df = spark.sql(query)
+        val df = spark.sql(query1)
         testSparkPlanMetrics(df, 1, Map(
           nodeId1 -> (("SortMergeJoin", Map(
             // It's 4 because we only read 3 rows in the first partition and 1 row in the second one
-            "number of output rows" -> 4L))),
+            "number of output rows" -> 4L,
+            "number of matched rows" -> 4L))),
           nodeId2 -> (("Exchange", Map(
             "records read" -> 4L,
             "local blocks read" -> 2L,
             "remote blocks read" -> 0L,
             "shuffle records written" -> 2L)))),
           enableWholeStage
         )
+
+        val query2 = "SELECT * FROM testData2 JOIN testDataForJoin ON " +
+          "testData2.a = testDataForJoin.a AND testData2.b <= testDataForJoin.b"
+        val df2 = spark.sql(query2)
+        Seq(false, true).foreach { case  enableWholeStage =>
+          testSparkPlanMetrics(df2, 1, Map(
+            0L -> ("SortMergeJoin", Map(
+              "number of output rows" -> 3L,
+              "number of matched rows" -> 4L))),
+            enableWholeStage
+          )
+        }
+

Review comment:
       Shall we remove this empty line?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773773216


   **[Test build #134905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134905/testReport)** for PR 31477 at commit [`d73561c`](https://github.com/apache/spark/commit/d73561c3c9f4862e13d565c556b192b0c7a355cb).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773692087


   **[Test build #134895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134895/testReport)** for PR 31477 at commit [`e2b2a81`](https://github.com/apache/spark/commit/e2b2a811a9384d253d2c5ba7776d3443f11b2a46).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781078105


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39798/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570672502



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedPairs = longMetric("numMatchedPairs")
+
+  @transient private lazy val boundCondition: InternalRow => Boolean =
     if (condition.isDefined) {

Review comment:
       If the condition is empty, then `numOutputRows == numMatchedPairs` only in the case of InnerJoin. It is not necessarily the case for other types of join, because numMatchedPairs is only measuring the records when there is a common key on both the build side and the stream side.  For instance, in the case of leftOuterJoin `numOutputRows  = numMatchedPairs  + (number of non matched pairs on the left side of the join)`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-779970107


   **[Test build #135179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135179/testReport)** for PR 31477 at commit [`8910350`](https://github.com/apache/spark/commit/89103509b32c8b6ededc1600f6ecdcc9ef1f08d7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r575607421



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
##########
@@ -324,13 +355,22 @@ class SQLMetricsSuite extends SharedSparkSession with SQLMetricsTestUtils
   test("BroadcastHashJoin metrics") {
     val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
     val df2 = Seq((1, "1"), (2, "2"), (3, "3"), (4, "4")).toDF("key", "value")
-    // Assume the execution plan is
-    // ... -> BroadcastHashJoin(nodeId = 1) -> TungstenProject(nodeId = 0)
-    Seq((1L, false), (2L, true)).foreach { case (nodeId, enableWholeStage) =>
-      val df = df1.join(broadcast(df2), "key")
+    val df3 = Seq((1, 1), (2, 4)).toDF("key", "value1")
+    val df4 = Seq((1, 1), (2, 2), (3, 3), (4, 4)).toDF("key", "value2")
+
+    Seq((false, df1, df2, 1L, 2L, false),
+      (false, df1, df2, 2L, 2L, true),
+      (true, df3, df4, 2L, 1L, true),
+      (true, df3, df4, 1L, 1L, false)
+    ).foreach { case (boundCondition, dfLeft, dfRight, nodeId, rows, enableWholeStage) =>
+      var df = dfLeft.join(broadcast(dfRight), "key")
+      if (boundCondition) {
+        df = df.filter("value1 > value2")
+      }
       testSparkPlanMetrics(df, 2, Map(
         nodeId -> (("BroadcastHashJoin", Map(
-          "number of output rows" -> 2L)))),
+          "number of output rows" -> rows,
+          "number of matched rows" -> 2L)))),

Review comment:
       Is it always `2`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787551554


   @sririshindra could you check if no performance (e.g., TPCDS queries) changes, too?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787552318


   > @sririshindra could you check if no performance (e.g., TPCDS queries) changes, too?
   
   Sure, I am working that right now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781023323


   **[Test build #135208 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135208/testReport)** for PR 31477 at commit [`fdfeadb`](https://github.com/apache/spark/commit/fdfeadbfe17e6231448db74c7a2013eb05c107cc).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-788454083


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135610/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-780141388


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135179/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r578113170



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedRows = longMetric("numMatchedRows")

Review comment:
       Actually making it lazy is causing a NullPointerException in SqlQuerySuite, ExistenceJoinSuite, OuterJoinSuite.  Somehow the sqlContext object in line 67 is turning out to be null. I am still trying to understand why that is the case. For now I reverted the change.
   `java.lang.NullPointerException
   	at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:67)
   	at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.metrics$lzycompute(BroadcastNestedLoopJoinExec.scala:42)
   	at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.metrics(BroadcastNestedLoopJoinExec.scala:41)
   	at org.apache.spark.sql.execution.SparkPlan.longMetric(SparkPlan.scala:135)
   	at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.numMatchedRows$lzycompute(BroadcastNestedLoopJoinExec.scala:93)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
sririshindra commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773638159


   Could you please take a look at this PR. cc: @maropu @dongjoon-hyun 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gatorsmile commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787117709


   Also cc @viirya I remember we saw the perf regressions when you adding new metrics for Aggregate. Maybe you can also take a look at this


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
peter-toth commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r577481635



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -137,11 +144,20 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
   protected def streamSideKeyGenerator(): UnsafeProjection =
     UnsafeProjection.create(streamedBoundKeys)
 
-  @transient protected[this] lazy val boundCondition = if (condition.isDefined) {
-    Predicate.create(condition.get, streamedPlan.output ++ buildPlan.output).eval _
-  } else {
-    (r: InternalRow) => true
-  }
+  private val numMatchedRows = longMetric("numMatchedRows")

Review comment:
       ditto




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-786984661


   @maropu I added some documentation briefly explaining the metric's definition and created a PR in the spark website. Please find the link to that [PR here ](https://github.com/apache/spark-website/pull/317). Please let me know if this is satisfactory and if any more changes are needed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r575606542



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -716,6 +741,7 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
     } else {
       s"$existsVar = true;"
     }
+    val checkCondition = s"$numMatched.add(1);\n$conditionDef"

Review comment:
       ditto. We had better add this before line 712 and line 717 directly.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773897291


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134905/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-788442656


   **[Test build #135610 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135610/testReport)** for PR 31477 at commit [`4f0817c`](https://github.com/apache/spark/commit/4f0817ce49ad75e5bc6f128d983f49668d017db6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570709820



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       sgtm




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773993042


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134914/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][WEBUI][WIP] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773574298






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773813062






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787485609


   > Can we improve the test coverage? Build a new suite instead of modify the existing one.
   > 
   > cc @cloud-fan @maryannxue @rednaxelafx
   
   @gatorsmile Could you please elaborate on what do you mean by a new Suite? Do mean to say that there should be a separate suite for the metric that is being proposed in this PR or are you suggesting that the SQLMetrics Suite itself should be refactored into multiple test Suites (maybe a seperate test Suite for Join based tests, maybe further divide each join based tests with its own test Suite). Also, do you think this new Suite should be done as part of this PR or do you think that a separate follow up PR would be more appropriate?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gatorsmile commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787502960


   A new suite is preferred here. Creating a new suite for Join-specific SQLMetrics is preferred here, because it has to cover various different cases, including different join types, different join algorithms, whole-stage on/off, and different runtime shortcuts. Also, stream-stream and batch-stream joins should also be tested.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra edited a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra edited a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-786984661


   @maropu I added some documentation briefly explaining the metric's definition and created a PR in the spark website. Please find the link to that [PR here ](https://github.com/apache/spark-website/pull/317). Please let me know if this is satisfactory and if any more changes are needed.
   
   Update: @maropu I updated the current PR adding additional documentation explaining the new metric to the spark documentation.  There could be more metrics that can be added to the list. But I felt that any additional metrics should be added to the documentation in a separate PR. I tested the new documentation by manually building as follows. Please let me know if anything else needs to be done. Thanks.
   ```bash
   cd docs
   SKIP_API=1 jekyll build
   open _site/index.html
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-780957202


   **[Test build #135208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135208/testReport)** for PR 31477 at commit [`fdfeadb`](https://github.com/apache/spark/commit/fdfeadbfe17e6231448db74c7a2013eb05c107cc).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773752117


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39481/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781190930


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135216/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-788454083


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135610/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra edited a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra edited a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781061053


   > > Would you like me to add some documentation here (monitoring). explaining what the metric measures for each join type? cc: @maropu
   > 
   > Yea, I think It would be very helpful if you could do it . Since these metrics are SQL-specific, so it might be better to write them somewhere in [the Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html), then add a link to it in the monitoring page. cc: @gatorsmile @HyukjinKwon
   
   Ok, I will try to get it done tomorrow.
   
   Update: @maropu Sorry, I didn't get time to finish adding the documentation you asked for on the spark-website. I am trying to finish it today. Sorry for the delay.
   
   Update: I had been a little busy and was not able to finish updating the spark-webiste. I am almost done though. I will publish a PR for the spark-website with the added documentation and add a link to that PR here. Hopefully I will be able to get it done by end of this week (Feb 26,2021).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570624874



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       `numMatchedPairs` -> `numJoinedRows`?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -415,24 +431,26 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
       ctx: CodegenContext,
       input: Seq[ExprCode]): (String, String, Seq[ExprCode]) = {
     val matched = ctx.freshName("matched")
+    val numMatched = metricTerm(ctx, "numMatchedPairs")
     val buildVars = genBuildSideVars(ctx, matched)
-    val checkCondition = if (condition.isDefined) {
-      val expr = condition.get
-      // evaluate the variables from build side that used by condition
-      val eval = evaluateRequiredVariables(buildPlan.output, buildVars, expr.references)
-      // filter the output via condition
-      ctx.currentVars = input ++ buildVars
-      val ev =
-        BindReferences.bindReference(expr, streamedPlan.output ++ buildPlan.output).genCode(ctx)
-      val skipRow = s"${ev.isNull} || !${ev.value}"
-      s"""
-         |$eval
-         |${ev.code}
-         |if (!($skipRow))
-       """.stripMargin
-    } else {
-      ""
-    }
+    val checkCondition = s"$numMatched.add(1);\n" +
+      (if (condition.isDefined) {
+        val expr = condition.get
+        // evaluate the variables from build side that used by condition
+        val eval = evaluateRequiredVariables(buildPlan.output, buildVars, expr.references)
+        // filter the output via condition
+        ctx.currentVars = input ++ buildVars
+        val ev =
+          BindReferences.bindReference(expr, streamedPlan.output ++ buildPlan.output).genCode(ctx)
+        val skipRow = s"${ev.isNull} || !${ev.value}"
+        s"""
+           |$eval
+           |${ev.code}
+           |if (!($skipRow))
+         """.stripMargin
+      } else {
+        ""
+      })

Review comment:
       Could you avoid unnecessary changes? For example;
   ```
       val checkCondition = if (condition.isDefined) {
         val expr = condition.get
         ...
       } else {
           ""
       }
       val some_val = s"$numMatched.add(1);\n$checkCondition"
       (matched, some_val, buildVars)
   ```

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedPairs = longMetric("numMatchedPairs")
+
+  @transient private lazy val boundCondition: InternalRow => Boolean =
     if (condition.isDefined) {

Review comment:
       If `condition` is empty, `numOutputRows` == `numMatchedPairs `? If so, could we only show this new metric in the web UI when `condition` defined?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       >  I feel that number of matched pairs might be more precise and less confusing to the user.
   
   Any reason? At first, I was not sure about what `pairs` means though.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedPairs = longMetric("numMatchedPairs")
+
+  @transient private lazy val boundCondition: InternalRow => Boolean =
     if (condition.isDefined) {

Review comment:
       Ah, I see. What is the merit of being able to see this number for a user? I read the PR description and, at first, I thought this PR mainly targets at the case where a `condition` is given.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       sgtm




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781065739


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39798/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r575607421



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
##########
@@ -324,13 +355,22 @@ class SQLMetricsSuite extends SharedSparkSession with SQLMetricsTestUtils
   test("BroadcastHashJoin metrics") {
     val df1 = Seq((1, "1"), (2, "2")).toDF("key", "value")
     val df2 = Seq((1, "1"), (2, "2"), (3, "3"), (4, "4")).toDF("key", "value")
-    // Assume the execution plan is
-    // ... -> BroadcastHashJoin(nodeId = 1) -> TungstenProject(nodeId = 0)
-    Seq((1L, false), (2L, true)).foreach { case (nodeId, enableWholeStage) =>
-      val df = df1.join(broadcast(df2), "key")
+    val df3 = Seq((1, 1), (2, 4)).toDF("key", "value1")
+    val df4 = Seq((1, 1), (2, 2), (3, 3), (4, 4)).toDF("key", "value2")
+
+    Seq((false, df1, df2, 1L, 2L, false),
+      (false, df1, df2, 2L, 2L, true),
+      (true, df3, df4, 2L, 1L, true),
+      (true, df3, df4, 1L, 1L, false)
+    ).foreach { case (boundCondition, dfLeft, dfRight, nodeId, rows, enableWholeStage) =>
+      var df = dfLeft.join(broadcast(dfRight), "key")
+      if (boundCondition) {
+        df = df.filter("value1 > value2")
+      }
       testSparkPlanMetrics(df, 2, Map(
         nodeId -> (("BroadcastHashJoin", Map(
-          "number of output rows" -> 2L)))),
+          "number of output rows" -> rows,
+          "number of matched rows" -> 2L)))),

Review comment:
       Is it always `2`? Can we make `df3` and `df4` more complex in order to have a different `number of matched rows`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570708920



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       How about `number of matched rows` instead?  I wanted highlight the fact that this metric measures the number of keys matched between the stream side and the build side.  I also felt that users might not recognize the difference between `number of joined rows` and `number of output rows`. But `number of joined rows` is fine as well. If you think that users will understand that better I will change the name of the metric to `number of joined rows`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773795208


   **[Test build #134895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134895/testReport)** for PR 31477 at commit [`e2b2a81`](https://github.com/apache/spark/commit/e2b2a811a9384d253d2c5ba7776d3443f11b2a46).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r578044301



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -48,6 +48,13 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
     s"$nodeName $joinType ${buildSide} ($opId)".trim
   }
 
+  /**
+   * When overriding metrics be sure to include numOutputRows & numMatchedRows
+   */
+  override lazy val metrics = Map(
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedRows" -> SQLMetrics.createMetric(sparkContext, "number of matched rows"))

Review comment:
       I removed the metrics from HashJoin in the latest commit and made the numMatchedRows lazy.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r575607211



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
##########
@@ -303,18 +317,35 @@ class SQLMetricsSuite extends SharedSparkSession with SQLMetricsTestUtils
     withTempView("testDataForJoin") {
       // Assume the execution plan is
       // ... -> SortMergeJoin(nodeId = 1) -> TungstenProject(nodeId = 0)
-      val leftJoinQuery = "SELECT * FROM testData2 left JOIN testDataForJoin ON " +
+      val query1 = "SELECT * FROM testData2 LEFT JOIN testDataForJoin ON " +
         "testData2.a = testDataForJoin.a"
-      val rightJoinQuery = "SELECT * FROM testDataForJoin right JOIN testData2 ON " +
+      val query2 = "SELECT * FROM testDataForJoin RIGHT JOIN testData2 ON " +
         "testData2.a = testDataForJoin.a"
-
-      Seq((leftJoinQuery, false), (leftJoinQuery, true), (rightJoinQuery, false),
-        (rightJoinQuery, true)).foreach { case (query, enableWholeStage) =>
+      val query3 = "SELECT * FROM testData2 RIGHT JOIN testDataForJoin ON " +
+        "testData2.a = testDataForJoin.a"
+      val query4 = "SELECT * FROM testData2 FULL OUTER JOIN testDataForJoin ON " +
+        "testData2.a = testDataForJoin.a"
+      val boundCondition1 = " AND testData2.b >= testDataForJoin.b"
+      val boundCondition2 = " AND testData2.a >= testDataForJoin.b"
+
+      Seq((query1, 8L, false),
+        (query1 + boundCondition1, 7L, false),
+        (query1 + boundCondition1, 7L, true),
+        (query3 + boundCondition2, 3L, false),
+        (query3 + boundCondition2, 3L, true),
+        (query4, 8L, false),
+        (query4, 8L, true),
+        (query4 + boundCondition1, 7L, false),
+        (query4 + boundCondition1, 7L, true),
+        (query1, 8L, true),
+        (query2, 8L, false),
+        (query2, 8L, true)).foreach { case (query, rows, enableWholeStage) =>
         val df = spark.sql(query)
         testSparkPlanMetrics(df, 1, Map(
           0L -> (("SortMergeJoin", Map(
             // It's 8 because we read 6 rows in the left and 2 row in the right one

Review comment:
       Please update this comments accordingly because this is invalid after this PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773574298






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781036234


   **[Test build #135216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135216/testReport)** for PR 31477 at commit [`79bdb82`](https://github.com/apache/spark/commit/79bdb822a0da7d413456367180e021d205f31253).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-781041575


   > Would you like me to add some documentation here (monitoring). explaining what the metric measures for each join type? cc: @maropu
   
   Yea, I think It would be very helpful if you could do it . Since these metrics are SQL-specific, so it might be better to write them somewhere in [the Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html), then add a link to it in the monitoring page. cc: @gatorsmile @HyukjinKwon 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gatorsmile commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787117208


   Can we improve the test coverage? Build a new suite instead of modify the existing one.
   
   cc @cloud-fan @maryannxue @rednaxelafx 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r575606419



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -433,6 +450,7 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
     } else {
       ""
     }
+    val checkCondition = s"$numMatched.add(1);\n$conditionDef"

Review comment:
       Could you add `$numMatched.add(1);` directly before line 446 and line 451?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773738583


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39481/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773845253


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/134898/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-799543295


   **[Test build #136071 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136071/testReport)** for PR 31477 at commit [`4c35275`](https://github.com/apache/spark/commit/4c35275fa7c670cd18c16548f36cd6184ec155ff).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773742165


   **[Test build #134898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/134898/testReport)** for PR 31477 at commit [`e2b2a81`](https://github.com/apache/spark/commit/e2b2a811a9384d253d2c5ba7776d3443f11b2a46).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773574298


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-780141388


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135179/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] sririshindra commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
sririshindra commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570684802



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -415,24 +431,26 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
       ctx: CodegenContext,
       input: Seq[ExprCode]): (String, String, Seq[ExprCode]) = {
     val matched = ctx.freshName("matched")
+    val numMatched = metricTerm(ctx, "numMatchedPairs")
     val buildVars = genBuildSideVars(ctx, matched)
-    val checkCondition = if (condition.isDefined) {
-      val expr = condition.get
-      // evaluate the variables from build side that used by condition
-      val eval = evaluateRequiredVariables(buildPlan.output, buildVars, expr.references)
-      // filter the output via condition
-      ctx.currentVars = input ++ buildVars
-      val ev =
-        BindReferences.bindReference(expr, streamedPlan.output ++ buildPlan.output).genCode(ctx)
-      val skipRow = s"${ev.isNull} || !${ev.value}"
-      s"""
-         |$eval
-         |${ev.code}
-         |if (!($skipRow))
-       """.stripMargin
-    } else {
-      ""
-    }
+    val checkCondition = s"$numMatched.add(1);\n" +
+      (if (condition.isDefined) {
+        val expr = condition.get
+        // evaluate the variables from build side that used by condition
+        val eval = evaluateRequiredVariables(buildPlan.output, buildVars, expr.references)
+        // filter the output via condition
+        ctx.currentVars = input ++ buildVars
+        val ev =
+          BindReferences.bindReference(expr, streamedPlan.output ++ buildPlan.output).genCode(ctx)
+        val skipRow = s"${ev.isNull} || !${ev.value}"
+        s"""
+           |$eval
+           |${ev.code}
+           |if (!($skipRow))
+         """.stripMargin
+      } else {
+        ""
+      })

Review comment:
       Fixed in latest commit.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-773681938


   ok to test


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787595910


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40150/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maropu commented on a change in pull request #31477: [SPARK-34369][SQL][WEBUI] Track number of pairs processed out of Join.

Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #31477:
URL: https://github.com/apache/spark/pull/31477#discussion_r570624874



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala
##########
@@ -57,7 +57,8 @@ case class BroadcastHashJoinExec(
   }
 
   override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
+    "numMatchedPairs" -> SQLMetrics.createMetric(sparkContext, "number of matched pairs"))

Review comment:
       `numMatchedPairs` -> `numJoinedRows`?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala
##########
@@ -415,24 +431,26 @@ trait HashJoin extends BaseJoinExec with CodegenSupport {
       ctx: CodegenContext,
       input: Seq[ExprCode]): (String, String, Seq[ExprCode]) = {
     val matched = ctx.freshName("matched")
+    val numMatched = metricTerm(ctx, "numMatchedPairs")
     val buildVars = genBuildSideVars(ctx, matched)
-    val checkCondition = if (condition.isDefined) {
-      val expr = condition.get
-      // evaluate the variables from build side that used by condition
-      val eval = evaluateRequiredVariables(buildPlan.output, buildVars, expr.references)
-      // filter the output via condition
-      ctx.currentVars = input ++ buildVars
-      val ev =
-        BindReferences.bindReference(expr, streamedPlan.output ++ buildPlan.output).genCode(ctx)
-      val skipRow = s"${ev.isNull} || !${ev.value}"
-      s"""
-         |$eval
-         |${ev.code}
-         |if (!($skipRow))
-       """.stripMargin
-    } else {
-      ""
-    }
+    val checkCondition = s"$numMatched.add(1);\n" +
+      (if (condition.isDefined) {
+        val expr = condition.get
+        // evaluate the variables from build side that used by condition
+        val eval = evaluateRequiredVariables(buildPlan.output, buildVars, expr.references)
+        // filter the output via condition
+        ctx.currentVars = input ++ buildVars
+        val ev =
+          BindReferences.bindReference(expr, streamedPlan.output ++ buildPlan.output).genCode(ctx)
+        val skipRow = s"${ev.isNull} || !${ev.value}"
+        s"""
+           |$eval
+           |${ev.code}
+           |if (!($skipRow))
+         """.stripMargin
+      } else {
+        ""
+      })

Review comment:
       Could you avoid unnecessary changes? For example;
   ```
       val checkCondition = if (condition.isDefined) {
         val expr = condition.get
         ...
       } else {
           ""
       }
       val some_val = s"$numMatched.add(1);\n$checkCondition"
       (matched, some_val, buildVars)
   ```

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala
##########
@@ -89,13 +90,20 @@ case class BroadcastNestedLoopJoinExec(
     }
   }
 
-  @transient private lazy val boundCondition = {
+  private val numMatchedPairs = longMetric("numMatchedPairs")
+
+  @transient private lazy val boundCondition: InternalRow => Boolean =
     if (condition.isDefined) {

Review comment:
       If `condition` is empty, `numOutputRows` == `numMatchedPairs `? If so, could we only show this new metric in the web UI when `condition` defined?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31477: [SPARK-34369][SQL] Track number of pairs processed out of Join

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31477:
URL: https://github.com/apache/spark/pull/31477#issuecomment-787529338


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135564/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org