You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/12 00:22:12 UTC

[GitHub] [spark] karuppayya opened a new pull request #28804: SPARK-31973: Add ability to disable Sort,Spill in Partial aggregation

karuppayya opened a new pull request #28804:
URL: https://github.com/apache/spark/pull/28804


   ### What changes were proposed in this pull request?
   In case of HashAggregation, a partial aggregation(update) is done followed by final aggregation(merge) 
   
   During partial aggregation we sort and spill to disk every-time       fby, when the fast Map(when enabled) and  UnsafeFixedWidthAggregationMap gets exhausted
   
   **When the cardinality of grouping column is close to the total number of records being processed, the sorting of data spilling to disk is not required, since it is kind of no-op and we can directly use in Final aggregation.**
   
   When the user is aware of nature of data, currently he has no control over disabling this sort, spill operation.
   
   This is similar to following issues in Hive:
   https://issues.apache.org/jira/browse/HIVE-223
   https://issues.apache.org/jira/browse/HIVE-291
   
   In this PR, the abilty to disable sort/spill during partial aggregation is added
    
   ### Why are the changes needed?
   This improvement can improve the performance of queries
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   This patch was tested manually


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645847550






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648620793


   **[Test build #124468 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124468/testReport)** for PR 28804 at commit [`d2873a3`](https://github.com/apache/spark/commit/d2873a3fb26280f2a81bd4180debf538c707484d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646818315






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-659047829






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670629807


   **[Test build #127210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127210/testReport)** for PR 28804 at commit [`c9a415d`](https://github.com/apache/spark/commit/c9a415de201a784756a3e3ef1a39ef069b7f0b4e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670646425






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467464857



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,34 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+
+  val SKIP_PARTIAL_AGGREGATE_MINROWS =
+  buildConf("spark.sql.aggregate.skipPartialAggregate.minNumRows")
+      .internal()
+      .doc("Number of records after which aggregate operator checks if " +
+        "partial aggregation phase can be avoided")
+      .longConf
+      .createWithDefault(100000)
+
+  val SKIP_PARTIAL_AGGREGATE_AGGREGATE_RATIO =
+  buildConf("spark.sql.aggregate.skipPartialAggregate.aggregateRatio")
+      .internal()
+      .doc("Ratio of number of records present in map of Aggregate operator" +
+        "to the total number of records processed by the Aggregate operator")

Review comment:
       Please describe not how to compute the ratio, but what this means.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670707728


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-659049598


   Updated the description with the benchmarks, after the latest changes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646935525


   **[Test build #124304 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124304/testReport)** for PR 28804 at commit [`56c95e2`](https://github.com/apache/spark/commit/56c95e242126d7aacdb4862adc5e094b4e29561b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-659094426


   **[Test build #125924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125924/testReport)** for PR 28804 at commit [`26a2fd6`](https://github.com/apache/spark/commit/26a2fd63c9410d274a8ed26aeccff0e0a6a4f79f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-642999484


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r439144332



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
##########
@@ -165,6 +166,26 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession
     }
   }
 
+  test("SPARK-: Avoid spill in partial aggregation " +
+    "when spark.sql.aggregate.spill.partialaggregate.disabled is set") {
+    withSQLConf((SQLConf.SPILL_PARTIAL_AGGREGATE_DISABLED.key, "true"),

Review comment:
       This is not sufficient since the rows still get added to `UnsafeFixedWidthAggregationMap` , need to figure out a way to avoid adding elements to `UnsafeFixedWidthAggregationMap`. Please advise @cloud-fan  @gatorsmile @maropu 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467472877



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -838,13 +880,17 @@ case class HashAggregateExec(
        |  long $beforeAgg = System.nanoTime();
        |  $doAggFuncName();
        |  $aggTime.add((System.nanoTime() - $beforeAgg) / $NANOS_PER_MILLIS);
+       |  if (shouldStop()) return;
        |}
+       |$genCodePostInitCode
        |// output the result
        |$outputFromFastHashMap
        |$outputFromRegularHashMap
      """.stripMargin
   }
 
+  override def needStopCheck: Boolean = skipPartialAggregateEnabled

Review comment:
       If `skipPartialAggregateEnabled` = true but #rows/cardinality don't go over the threshold, partial aggregates are not skipped. Even in that case, we  set true to `needStopCheck`?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r443119807



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,13 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(false)

Review comment:
       Could you check if all the existing tests can pass when setting true to this config?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r446720097



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,13 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)

Review comment:
       That meant a ratio of a distinct row count and total row count in group-by key column stats. For example, if a number `distinctCount / rowCount` is close to 1.0, you apply the optimization; otherwise, you don't.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466760439



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")
+    }
+    val childrenConsumed = ctx.
+      addMutableState(CodeGenerator.JAVA_BOOLEAN, "childrenConsumed")
+    rowCountTerm = ctx.
+      addMutableState(CodeGenerator.JAVA_LONG, "rowCount")

Review comment:
       Because of the historical reason, we avoid generating unnecesary mutable states where poissible. See: https://issues.apache.org/jira/browse/SPARK-18016
   I think it is best to generate the same generated code if `skipPartialAggregate=false`.
   Did you suggest it is impossible to do so?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655707361


   **[Test build #125393 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125393/testReport)** for PR 28804 at commit [`3ca81ae`](https://github.com/apache/spark/commit/3ca81ae8d381509f33c49dd3a81f57856e5bd264).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645734522


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124189/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645866010


   **[Test build #124209 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124209/testReport)** for PR 28804 at commit [`2b3704b`](https://github.com/apache/spark/commit/2b3704b1bd96e0c64890a0d730f840c0ab7da36e).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670659564


   **[Test build #127214 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127214/testReport)** for PR 28804 at commit [`ceaa4e5`](https://github.com/apache/spark/commit/ceaa4e52d558d21964e5ea84a236519680202115).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670725493


   **[Test build #127218 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127218/testReport)** for PR 28804 at commit [`0a186f0`](https://github.com/apache/spark/commit/0a186f0eb5d71732ec3abc6b42a12dae6594277f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-711714582


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-744028264


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132721/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670630615


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645779303


   @maropu @cloud-fan Can you please add me to the whitelist to trigger the tests?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648635483


   **[Test build #124468 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124468/testReport)** for PR 28804 at commit [`d2873a3`](https://github.com/apache/spark/commit/d2873a3fb26280f2a81bd4180debf538c707484d).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655708568


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125393/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670725860






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-653932405






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r446725910



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,13 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)

Review comment:
       Thanks @maropu for explaining, I will make this change




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646799809


   **[Test build #124294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124294/testReport)** for PR 28804 at commit [`9a10261`](https://github.com/apache/spark/commit/9a10261324238f20fa6edb2554f90b965c829d89).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467464951



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,34 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+
+  val SKIP_PARTIAL_AGGREGATE_MINROWS =
+  buildConf("spark.sql.aggregate.skipPartialAggregate.minNumRows")
+      .internal()
+      .doc("Number of records after which aggregate operator checks if " +
+        "partial aggregation phase can be avoided")
+      .longConf
+      .createWithDefault(100000)
+
+  val SKIP_PARTIAL_AGGREGATE_AGGREGATE_RATIO =
+  buildConf("spark.sql.aggregate.skipPartialAggregate.aggregateRatio")
+      .internal()
+      .doc("Ratio of number of records present in map of Aggregate operator" +
+        "to the total number of records processed by the Aggregate operator")

Review comment:
       Also, please add `.version` and `.checkValue`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646816065


   **[Test build #124295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124295/testReport)** for PR 28804 at commit [`1c0399a`](https://github.com/apache/spark/commit/1c0399aa3147a6e461b41de85412b6c3a404009f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467464647



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -409,6 +411,12 @@ case class HashAggregateExec(
   private var fastHashMapTerm: String = _
   private var isFastHashMapEnabled: Boolean = false
 
+  private var avoidSpillInPartialAggregateTerm: String = _
+  private val skipPartialAggregate = sqlContext.conf.skipPartialAggregate &&
+    AggUtils.areAggExpressionsPartial(modes) && find(_.isInstanceOf[ExpandExec]).isEmpty

Review comment:
       I have no smart idea for that now. Hive doesn't have that kind of plans?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670248312






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646821885






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646802503


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646854580


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124298/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646974298


   > When the cardinality of grouping column is close to the total number of records being processed
   
   Ur, one more question; how do we know that the cardinality is close the #records being processed  when processing aggregates?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655708033






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655725403


   **[Test build #125394 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125394/testReport)** for PR 28804 at commit [`8850777`](https://github.com/apache/spark/commit/8850777f41dbf6d0a526a8409c426f156e7f7fa7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-673460228


   btw, could this optimization be implemented on the adaptive execution framework (`AdaptiveSparkPlanExec`)? In the initial discussion (https://github.com/apache/spark/pull/28804/files#r447158417), it was pointed out that accurate statistics could not be collected. But, I think we might be able to collect the stats based on the framework. Yea, as we know, we need to look for a light-weight way to compute cardinality on shuffle output. If we find it, I think we can simply drop partial aggregate for high cardinality cases.
   
   Have you already considered this approach? What I'm worried about now is that the current implementation makes the code complicated and it is limited to hash aggregates w/codegen only.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648859722


   **[Test build #124489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124489/testReport)** for PR 28804 at commit [`d2873a3`](https://github.com/apache/spark/commit/d2873a3fb26280f2a81bd4180debf538c707484d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-650552251


   > it is more of a manual step and can be used only if the user knows the nature of data upfront.Like in my benchmark, where we expect the the all but few grouping keys to be different.
   A user will be able to find the nature of data by looking at the metrics in Spark UI where the number of output rows from previous stage is same/almost same as the number of output rows from HashAggregate. If the the user expects the new data to have this nature in his subsequent runs(say a partitioned table with new data every hour/day), he can enable the config.
   
   
   hm...., if `SKIP_PARTIAL_AGGREGATE_ENABLED=true` and the cardinality is **not** the same with the number of rows, a query returns a wrong aggregated answer, right?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645781069






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645729114


   ok to test


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r441932845



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2173,6 +2173,13 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SPILL_PARTIAL_AGGREGATE_DISABLED =
+    buildConf("spark.sql.aggregate.spill.partialaggregate.disabled")

Review comment:
       `disabled` -> `enabled` to follow the other config naming.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] c21 commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

c21 commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-801664504


   Any update for the PR? Thanks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

karuppayya commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-675060833


   @cloud-fan 
   We observed this behaviour(partial aggregation not helping) in one of our customers.
   Initially, I had disabled the partial aggregation completely by making the Aggregate mode to `org.apache.spark.sql.catalyst.expressions.aggregate.Complete`
   But later found the Hive's optimization for handling this scenario.
   I have used the Hive's heuristic(with default for minRows of 100000 to be sampled) in this PR.
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466748030



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -409,6 +411,12 @@ case class HashAggregateExec(
   private var fastHashMapTerm: String = _
   private var isFastHashMapEnabled: Boolean = false
 
+  private var avoidSpillInPartialAggregateTerm: String = _
+  private val skipPartialAggregate = sqlContext.conf.skipPartialAggregate &&
+    AggUtils.areAggExpressionsPartial(modes) && find(_.isInstanceOf[ExpandExec]).isEmpty

Review comment:
       This is required to avoid this optimization for Query with more than one distinct.
   `org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates` takes cares of rewriting aggregates with more than one distinct.
   The rule assumes that map side aggregation has taken care of performing distinct operation.
   With my change this will result in wrong results.
   For example:
   For the first example given as part of comments in the rule
   ```
    * First example: query without filter clauses (in scala):
    * {{{
    *   val data = Seq(
    *     ("a", "ca1", "cb1", 10),
    *     ("a", "ca1", "cb2", 5),
    *     ("b", "ca1", "cb1", 13))
    *     .toDF("key", "cat1", "cat2", "value")
    *   data.createOrReplaceTempView("data")
    *
    *   val agg = data.groupBy($"key")
    *     .agg(
    *       countDistinct($"cat1").as("cat1_cnt"),
    *       countDistinct($"cat2").as("cat2_cnt"),
    *       sum($"value").as("total"))
    * }}}
    *
    * This translates to the following (pseudo) logical plan:
    * {{{
    * Aggregate(
    *    key = ['key]
    *    functions = [COUNT(DISTINCT 'cat1),
    *                 COUNT(DISTINCT 'cat2),
    *                 sum('value)]
    *    output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
    *   LocalTableScan [...]
    * }}}
    *
    * This rule rewrites this logical plan to the following (pseudo) logical plan:
    * {{{
    * Aggregate(
    *    key = ['key]
    *    functions = [count(if (('gid = 1)) 'cat1 else null),
    *                 count(if (('gid = 2)) 'cat2 else null),
    *                 first(if (('gid = 0)) 'total else null) ignore nulls]
    *    output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
    *   Aggregate(
    *      key = ['key, 'cat1, 'cat2, 'gid]
    *      functions = [sum('value)]
    *      output = ['key, 'cat1, 'cat2, 'gid, 'total])
    *     Expand(
    *        projections = [('key, null, null, 0, cast('value as bigint)),
    *                       ('key, 'cat1, null, 1, null),
    *                       ('key, null, 'cat2, 2, null)]
    *        output = ['key, 'cat1, 'cat2, 'gid, 'value])
    *       LocalTableScan [...]
    * }}}
    *
   ```
   Say the following are the two records in the dataset 
   ```
   rec 1: (“key1“, “cat1“, “cat1“, 1)
   rec 2: (“key1“, “cat1“, “cat1“, 1)
   ```
   With my change
   After expand:
   ```
   (“key1“, “null“, “null“, 0, 1)
   
   (“key1“, “cat1“, “null“, 1, null)
   
   (“key1“, “null“, “cat2“, 2, null)
   
   (“key1“, “null“, “null“, 0, 1)
   
   (“key1“, “cat1“, “null“, 1, null)
   
   (“key1“, “null“, “cat1“, 2, null)
   ```
   After partial aggregation
   ```
   (“key1“, “null“, “null“, 0, 1)
   
   (“key1“, “cat1“, “null“, 1, null)
   
   (“key1“, “null“, “cat2“, 2, null)
   
   (“key2“, “null“, “null“, 0, 1)
   
   (“key2“, “cat2“, “null“, 1, null)
   
   (“key2“, “null“, “cat2“, 2, null)
   ```
   Reducer side aggregation result: **(key1, 2, 2, 2)**
   But the correct answer is: **(key1, 1, 1, 2)**
   
   Hence checking for the presence of expand node to avoid this skipping partial aggregation
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645800594


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645800603


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124199/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646893545


   **[Test build #124304 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124304/testReport)** for PR 28804 at commit [`56c95e2`](https://github.com/apache/spark/commit/56c95e242126d7aacdb4862adc5e094b4e29561b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-654412769


   **[Test build #125118 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125118/testReport)** for PR 28804 at commit [`75125d9`](https://github.com/apache/spark/commit/75125d95ee1e011dd2348273dd84de453a93cd92).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amoghmargoor commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

amoghmargoor commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r447158417



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,13 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)

Review comment:
       @maropu This is very useful suggestion. One issue is columns stats are rarely computed. We came across this work in HIVE https://issues.apache.org/jira/browse/HIVE-291. They turn off map side aggregate (i.e., partial aggregate will be pass through) in Physical operator (i.e., Group-By operator)  if map-side aggregation reduce the entries by at least half and they look at 100000 rows to do that (ref: patch https://issues.apache.org/jira/secure/attachment/12400257/291.1.txt). Should we do something similar in HashAggregateExec here ? Any thoughts on this ? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-650636794


   > No, The Final aggregation will take care giving the right results.
   This is like more like setting the Aggregation mode to org.apache.spark.sql.catalyst.expressions.aggregate.Complete
   
   Ah, I see. We cannot just rewrite a plan from the two-phase aggregate (w/ partial mode) to a single-phase one (w/ complete mode) for the case during optimizing plans?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670807546


   **[Test build #127218 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127218/testReport)** for PR 28804 at commit [`0a186f0`](https://github.com/apache/spark/commit/0a186f0eb5d71732ec3abc6b42a12dae6594277f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645730604






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-647051698


   > Ur, one more question; how do we know that the cardinality is close the #records being processed when processing aggregates?
   @maropu
   - it is more of a manual step and can be used only if the user knows the nature of data upfront.Like in my benchmark, where we expect the the all but few grouping keys to be different.
   - A user will be able to find the nature of data by looking at the metrics in Spark UI where the number of output rows from previous stage is same/almost same as the number of output rows from HashAggregate. If the the user expects the new data to have this nature in his subsequent runs(say a partitioned table with new data every hour/day), he can enable the config.
   
   If this can be done at runtime, without any config that would be the ideal solution. This PR is the first step towards it.
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670659760


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646802511


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124294/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646854506


   **[Test build #124298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124298/testReport)** for PR 28804 at commit [`7952aa7`](https://github.com/apache/spark/commit/7952aa707a91bac1df9e84ad17acacd006c7e850).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646893874






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646800237






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466760439



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")
+    }
+    val childrenConsumed = ctx.
+      addMutableState(CodeGenerator.JAVA_BOOLEAN, "childrenConsumed")
+    rowCountTerm = ctx.
+      addMutableState(CodeGenerator.JAVA_LONG, "rowCount")

Review comment:
       Because of the historical reason, we avoid generating unnecesary mutable states where poissible. See: https://issues.apache.org/jira/browse/SPARK-18016
   I think it is the best to generate the same generated code with that in the master if `skipPartialAggregate=false`.
   Did you suggest it is impossible to do so?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648621296






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-672296315


   **[Test build #127340 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127340/testReport)** for PR 28804 at commit [`11572a1`](https://github.com/apache/spark/commit/11572a105ef870c9e95b4302e6613b7f0e73d0de).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646831732


   **[Test build #124296 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124296/testReport)** for PR 28804 at commit [`73de4c8`](https://github.com/apache/spark/commit/73de4c8b54161116565a495e7cf9a9a65844adce).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655708560






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r468725811



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -838,13 +880,17 @@ case class HashAggregateExec(
        |  long $beforeAgg = System.nanoTime();
        |  $doAggFuncName();
        |  $aggTime.add((System.nanoTime() - $beforeAgg) / $NANOS_PER_MILLIS);
+       |  if (shouldStop()) return;
        |}
+       |$genCodePostInitCode
        |// output the result
        |$outputFromFastHashMap
        |$outputFromRegularHashMap
      """.stripMargin
   }
 
+  override def needStopCheck: Boolean = skipPartialAggregateEnabled

Review comment:
       When the ration of  cardinality to numRows,
    **doesnot go beyond threshold** -  it implies that the optimization has not kicked in yet. In which case `org.apache.spark.sql.execution.CodegenSupport#shouldStopCheckCode`, returns false. And continues with the iterating over remaining items of the  iterator.
    **goes beyond threshold** - We add the item(since the addition to Map is skipped) to the org.apache.spark.sql.execution.BufferedRowIterator#currentRows, which gets consumed by the parent.
   
   Since it is inexpensive operation and has been used at many places in HashAggregateExec and didnt see any performance penalties, this approached seemed ok to me .
   
   Please let me know if you have any other suggestions.
   Generated code:
   ```
   private void agg_doAggregateWithKeys_0() throws java.io.IOException {
   /* 318 */     while ( localtablescan_input_0.hasNext()) {
   /* 319 */       InternalRow localtablescan_row_0 = (InternalRow) localtablescan_input_0.next();
   /* 320 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[7] /* numOutputRows */).add(1);
   /* 321 */       boolean localtablescan_isNull_0 = localtablescan_row_0.isNullAt(0);
   /* 322 */       UTF8String localtablescan_value_0 = localtablescan_isNull_0 ?
   /* 323 */       null : (localtablescan_row_0.getUTF8String(0));
   /* 324 */       int localtablescan_value_1 = localtablescan_row_0.getInt(1);
   /* 325 */
   /* 326 */       agg_doConsume_0(localtablescan_value_0, localtablescan_isNull_0, localtablescan_value_1);
   /* 327 */       if (shouldStop()) return; // code added as part of needStopCheck
   /* 328 */     }
   /* 329 */
   /* 330 */     agg_childrenConsumed_0 = true;
   /* 331 */
   /* 332 */     agg_fastHashMapIter_0 = agg_fastHashMap_0.rowIterator();
   /* 333 */     agg_mapIter_0 = ((org.apache.spark.sql.execution.aggregate.HashAggregateExec) references[0] /* plan */).finishAggregate(agg_hashMap_0, agg_sorter_0, ((org.apache.spark.sql.execution.metric.SQLMetric) references[3] /* peakMemory */), ((org.apache.spark.sql.execution.metric.SQLMetric) references[4] /* spillSize */), ((org.apache.spark.sql.execution.metric.SQLMetric) references[5] /* avgHashProbe */));
   /* 334 */
   /* 335 */   }
   ```
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-653931995


   **[Test build #124950 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124950/testReport)** for PR 28804 at commit [`7766401`](https://github.com/apache/spark/commit/7766401f81dbde6d2941cacd69b51ad9acc5d855).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670250093


   **[Test build #127160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127160/testReport)** for PR 28804 at commit [`c49f106`](https://github.com/apache/spark/commit/c49f106b205f215812af79f82d2a89e03015143b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-659204777






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r455403105



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,25 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)
+
+  val SKIP_PARTIAL_AGGREGATE_THRESHOLD =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.threshold")
+      .internal()
+      .longConf
+      .createWithDefault(100000)
+
+  val SKIP_PARTIAL_AGGREGATE_RATIO =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.ratio")

Review comment:
       updated. Can you please check if its explanotory?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645734520






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645846710


   add to whitelist


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645734498


   **[Test build #124189 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124189/testReport)** for PR 28804 at commit [`5f05aa7`](https://github.com/apache/spark/commit/5f05aa7e2631a56b5cc97206f2799f64151b1789).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645866077


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124209/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466440577



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,29 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sorter(sort/spill) during partial aggregation")

Review comment:
       Please check the other doc descriptions, e.g., https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2111-L2113




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-672297178






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467173370



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -353,4 +353,8 @@ object AggUtils {
 
     finalAndCompleteAggregate :: Nil
   }
+
+  def areAggExpressionsPartial(modes: Seq[AggregateMode]): Boolean = {
+    modes.nonEmpty && modes.forall(_ == Partial)

Review comment:
       I was not aware of this case, Removed the check and made it inline in its callers




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670645866


   **[Test build #127214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127214/testReport)** for PR 28804 at commit [`ceaa4e5`](https://github.com/apache/spark/commit/ceaa4e52d558d21964e5ea84a236519680202115).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-642999484


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-672091607


   **[Test build #127340 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127340/testReport)** for PR 28804 at commit [`11572a1`](https://github.com/apache/spark/commit/11572a105ef870c9e95b4302e6613b7f0e73d0de).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-647031491


   **[Test build #124327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124327/testReport)** for PR 28804 at commit [`43237ba`](https://github.com/apache/spark/commit/43237ba31354f1816df7b98a881e52863d1c65a9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

karuppayya commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-854090367


   @c21 Please go ahead 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466756945



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {

Review comment:
       Ah, I see.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466748648



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")

Review comment:
       Without the semicolon, the auto generated codegen breaks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r455402409



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,25 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)
+
+  val SKIP_PARTIAL_AGGREGATE_THRESHOLD =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.threshold")
+      .internal()
+      .longConf
+      .createWithDefault(100000)

Review comment:
       @cloud-fan we skip partial aggregartion only when the aggragation was not able to cut down records by 50%(define by spark.sql.aggregate.partialaggregate.skip.ratio). In this case it will not kick in.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646816065


   **[Test build #124295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124295/testReport)** for PR 28804 at commit [`1c0399a`](https://github.com/apache/spark/commit/1c0399aa3147a6e461b41de85412b6c3a404009f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r443119418



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -879,42 +901,53 @@ case class HashAggregateExec(
 
     val oomeClassName = classOf[SparkOutOfMemoryError].getName
 
+
+

Review comment:
       Plz revert the unnecessary changes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466757897



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -409,6 +411,12 @@ case class HashAggregateExec(
   private var fastHashMapTerm: String = _
   private var isFastHashMapEnabled: Boolean = false
 
+  private var avoidSpillInPartialAggregateTerm: String = _
+  private val skipPartialAggregate = sqlContext.conf.skipPartialAggregate &&
+    AggUtils.areAggExpressionsPartial(modes) && find(_.isInstanceOf[ExpandExec]).isEmpty

Review comment:
       hm, I see... But, I think the current approach looks dangerous because we might add a new plan having the same assumption in future.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467472877



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -838,13 +880,17 @@ case class HashAggregateExec(
        |  long $beforeAgg = System.nanoTime();
        |  $doAggFuncName();
        |  $aggTime.add((System.nanoTime() - $beforeAgg) / $NANOS_PER_MILLIS);
+       |  if (shouldStop()) return;
        |}
+       |$genCodePostInitCode
        |// output the result
        |$outputFromFastHashMap
        |$outputFromRegularHashMap
      """.stripMargin
   }
 
+  override def needStopCheck: Boolean = skipPartialAggregateEnabled

Review comment:
       If `skipPartialAggregateEnabled` = true but #rows/cardinality don't go over the threshold, I think partial aggregates are not skipped. In that case, we need to set true to `needStopCheck`?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-653948000


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya closed pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya closed pull request #28804:
URL: https://github.com/apache/spark/pull/28804


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-654413059






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-654412769


   **[Test build #125118 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125118/testReport)** for PR 28804 at commit [`75125d9`](https://github.com/apache/spark/commit/75125d95ee1e011dd2348273dd84de453a93cd92).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670630615






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466749198



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")
+    }
+    val childrenConsumed = ctx.
+      addMutableState(CodeGenerator.JAVA_BOOLEAN, "childrenConsumed")
+    rowCountTerm = ctx.
+      addMutableState(CodeGenerator.JAVA_LONG, "rowCount")

Review comment:
       I didnot add these behind the flag since it was resulting in addition of few if..else checks. Hence avoided them since these variables are harmless.
   Reg initialization, since they are field members, they get initialized to `false` and `0` respectively




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648859722


   **[Test build #124489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124489/testReport)** for PR 28804 at commit [`d2873a3`](https://github.com/apache/spark/commit/d2873a3fb26280f2a81bd4180debf538c707484d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670275642


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127160/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648860368






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648635587


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670807750






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467172000



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -353,4 +353,8 @@ object AggUtils {
 
     finalAndCompleteAggregate :: Nil
   }
+
+  def areAggExpressionsPartial(modes: Seq[AggregateMode]): Boolean = {

Review comment:
       Removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646839649






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-711714582






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648621296






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648590889


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670707738


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127216/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-711713840


   **[Test build #129988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129988/testReport)** for PR 28804 at commit [`11572a1`](https://github.com/apache/spark/commit/11572a105ef870c9e95b4302e6613b7f0e73d0de).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-647031491


   **[Test build #124327 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124327/testReport)** for PR 28804 at commit [`43237ba`](https://github.com/apache/spark/commit/43237ba31354f1816df7b98a881e52863d1c65a9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648935212


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124489/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648552714






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670269420


   @karuppayya I updated the title a bit because I feel it is ambigous. Feel free to update it again if you have a better title.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-647044774






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467465148



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
##########
@@ -51,6 +52,42 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession
     assert(df.collect() === Array(Row(9, 4.5)))
   }
 
+  test("Avoid spill in partial aggregation" ) {
+    withSQLConf((SQLConf.SKIP_PARTIAL_AGGREGATE_ENABLED.key, "true"),
+      (SQLConf.SKIP_PARTIAL_AGGREGATE_MINROWS.key, "2")) {
+      // Create Dataframes
+      val data = Seq(("James", 1), ("James", 1), ("Phil", 1))
+      val aggDF = data.toDF("name", "values").groupBy("name").sum("values")
+      val partAggNode = aggDF.queryExecution.executedPlan.find {
+        case h: HashAggregateExec =>
+          val modes = h.aggregateExpressions.map(_.mode)
+          modes.nonEmpty && modes.forall(_ == Partial)
+        case _ => false
+      }
+
+      checkAnswer(aggDF, Seq(Row("James", 2), Row("Phil", 1)))
+      assert(partAggNode.isDefined,
+      "No HashAggregate node with partial aggregate expression found")
+      assert(partAggNode.get.metrics("partialAggSkipped").value == data.size,
+      "Partial aggregation got triggered in partial hash aggregate node")
+    }
+  }
+
+  test(s"Distinct: Partial aggregation should happen for" +
+    s" HashAggregate nodes performing partial Aggregate operations " ) {
+    withSQLConf((SQLConf.SKIP_PARTIAL_AGGREGATE_ENABLED.key, "true")) {

Review comment:
       nit: `SQLConf.SKIP_PARTIAL_AGGREGATE_ENABLED.key -> "true"`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646818326


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124295/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-659204777


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670275561


   **[Test build #127160 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127160/testReport)** for PR 28804 at commit [`c49f106`](https://github.com/apache/spark/commit/c49f106b205f215812af79f82d2a89e03015143b).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r446571681



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,13 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)

Review comment:
       Could we use a threadhold + column stats instead of this boolean config?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655798990






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r447512443



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,13 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)

Review comment:
       >  They turn off map side aggregate (i.e., partial aggregate will be pass through) in Physical operator (i.e., Group-By operator) if map-side aggregation reduce the entries by at least half and they look at 100000 rows to do that 
   
   Looks whether that approach improves performance depends on IO performance, but looks interesting to me. WDYT? @cloud-fan 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r447512443



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,13 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)

Review comment:
       >  They turn off map side aggregate (i.e., partial aggregate will be pass through) in Physical operator (i.e., Group-By operator) if map-side aggregation reduce the entries by at least half and they look at 100000 rows to do that 
   
   I think whether that approach improves performance depends on IO performance, but the idea looks interesting to me. WDYT? @cloud-fan 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-659094426


   **[Test build #125924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125924/testReport)** for PR 28804 at commit [`26a2fd6`](https://github.com/apache/spark/commit/26a2fd63c9410d274a8ed26aeccff0e0a6a4f79f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-744023642


   **[Test build #132721 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132721/testReport)** for PR 28804 at commit [`11572a1`](https://github.com/apache/spark/commit/11572a105ef870c9e95b4302e6613b7f0e73d0de).
    * This patch **fails Spark unit tests**.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670630439






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-649138531


   **[Test build #124505 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124505/testReport)** for PR 28804 at commit [`afc2903`](https://github.com/apache/spark/commit/afc2903e4a327d6caef518e6d3f0dc431424ac7c).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-654508911


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655708033






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646854576






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467472877



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -838,13 +880,17 @@ case class HashAggregateExec(
        |  long $beforeAgg = System.nanoTime();
        |  $doAggFuncName();
        |  $aggTime.add((System.nanoTime() - $beforeAgg) / $NANOS_PER_MILLIS);
+       |  if (shouldStop()) return;
        |}
+       |$genCodePostInitCode
        |// output the result
        |$outputFromFastHashMap
        |$outputFromRegularHashMap
      """.stripMargin
   }
 
+  override def needStopCheck: Boolean = skipPartialAggregateEnabled

Review comment:
       If `skipPartialAggregateEnabled` = true but #rows and cardinality doesn't go over the threshold, I think partial aggregates are not skipped. In that case, we need to set true to `needStopCheck`?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648590750


   **[Test build #124450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124450/testReport)** for PR 28804 at commit [`99c1d22`](https://github.com/apache/spark/commit/99c1d2226d170f789dfa534ffc658f7fc430c38d).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655708553


   **[Test build #125393 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125393/testReport)** for PR 28804 at commit [`3ca81ae`](https://github.com/apache/spark/commit/3ca81ae8d381509f33c49dd3a81f57856e5bd264).
    * This patch **fails Scala style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648859029


   Jenkins, retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467171812



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -409,6 +411,12 @@ case class HashAggregateExec(
   private var fastHashMapTerm: String = _
   private var isFastHashMapEnabled: Boolean = false
 
+  private var avoidSpillInPartialAggregateTerm: String = _
+  private val skipPartialAggregate = sqlContext.conf.skipPartialAggregate &&
+    AggUtils.areAggExpressionsPartial(modes) && find(_.isInstanceOf[ExpandExec]).isEmpty

Review comment:
       True.. Any suggestions on handling this in a better way? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645850639


   **[Test build #124209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124209/testReport)** for PR 28804 at commit [`2b3704b`](https://github.com/apache/spark/commit/2b3704b1bd96e0c64890a0d730f840c0ab7da36e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646816500






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-744005260


   **[Test build #132721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132721/testReport)** for PR 28804 at commit [`11572a1`](https://github.com/apache/spark/commit/11572a105ef870c9e95b4302e6613b7f0e73d0de).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-744005260


   **[Test build #132721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132721/testReport)** for PR 28804 at commit [`11572a1`](https://github.com/apache/spark/commit/11572a105ef870c9e95b4302e6613b7f0e73d0de).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670659771


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127214/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466760439



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")
+    }
+    val childrenConsumed = ctx.
+      addMutableState(CodeGenerator.JAVA_BOOLEAN, "childrenConsumed")
+    rowCountTerm = ctx.
+      addMutableState(CodeGenerator.JAVA_LONG, "rowCount")

Review comment:
       Because of the historical reason, we avoid generating unnecesary mutable states where poissible. See: https://issues.apache.org/jira/browse/SPARK-18016
   I think it is best to generate the same generated code with that in the master if `skipPartialAggregate=false`.
   Did you suggest it is impossible to do so?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-659203150


   **[Test build #125924 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125924/testReport)** for PR 28804 at commit [`26a2fd6`](https://github.com/apache/spark/commit/26a2fd63c9410d274a8ed26aeccff0e0a6a4f79f).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467466366



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")

Review comment:
       Still not sure, we cannot use `skipPartialAggregate=true`, `skipPartialAggregate.minNumRows=0` , and `skipPartialAggregate.aggregateRatio=0.0` for testing?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-672092513






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648590889






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646799809


   **[Test build #124294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124294/testReport)** for PR 28804 at commit [`9a10261`](https://github.com/apache/spark/commit/9a10261324238f20fa6edb2554f90b965c829d89).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467173370



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -353,4 +353,8 @@ object AggUtils {
 
     finalAndCompleteAggregate :: Nil
   }
+
+  def areAggExpressionsPartial(modes: Seq[AggregateMode]): Boolean = {
+    modes.nonEmpty && modes.forall(_ == Partial)

Review comment:
       In this case, the reducer side also does not have any aggregate function and we might end up not aggregating the data




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645800594






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645730148


   **[Test build #124189 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124189/testReport)** for PR 28804 at commit [`5f05aa7`](https://github.com/apache/spark/commit/5f05aa7e2631a56b5cc97206f2799f64151b1789).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r441974438



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -72,6 +74,8 @@ case class HashAggregateExec(
     "peakMemory" -> SQLMetrics.createSizeMetric(sparkContext, "peak memory"),
     "spillSize" -> SQLMetrics.createSizeMetric(sparkContext, "spill size"),
     "aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "time in aggregation build"),
+    "partialAggSkipped" -> SQLMetrics.createMetric(sparkContext, "Num records" +
+      " skipped partial aggregation skipped"),

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645730604






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467172474



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")
+    }
+    val childrenConsumed = ctx.
+      addMutableState(CodeGenerator.JAVA_BOOLEAN, "childrenConsumed")
+    rowCountTerm = ctx.
+      addMutableState(CodeGenerator.JAVA_LONG, "rowCount")

Review comment:
       I have made the changes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646893874






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648934831


   **[Test build #124489 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124489/testReport)** for PR 28804 at commit [`d2873a3`](https://github.com/apache/spark/commit/d2873a3fb26280f2a81bd4180debf538c707484d).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646831856


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124296/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645847550






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646935915






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645850639


   **[Test build #124209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124209/testReport)** for PR 28804 at commit [`2b3704b`](https://github.com/apache/spark/commit/2b3704b1bd96e0c64890a0d730f840c0ab7da36e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670630622


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127210/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-654508078


   **[Test build #125118 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125118/testReport)** for PR 28804 at commit [`75125d9`](https://github.com/apache/spark/commit/75125d95ee1e011dd2348273dd84de453a93cd92).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670630600


   **[Test build #127210 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127210/testReport)** for PR 28804 at commit [`c9a415d`](https://github.com/apache/spark/commit/c9a415de201a784756a3e3ef1a39ef069b7f0b4e).
    * This patch **fails Scala style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645781069






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r451880238



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,25 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)
+
+  val SKIP_PARTIAL_AGGREGATE_THRESHOLD =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.threshold")
+      .internal()
+      .longConf
+      .createWithDefault(100000)
+
+  val SKIP_PARTIAL_AGGREGATE_RATIO =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.ratio")
+      .internal()
+      .doubleConf
+      .createWithDefault(0.5)

Review comment:
       Also, could you check performance numbers by varying the params?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-649241132


   @maropu I have fixed the tests with the flag enabled. Please take a look. Should I go ahead change the default value of the config back to `false`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645778573






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-744028264


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/132721/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646818315


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467465104



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
##########
@@ -51,6 +52,42 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession
     assert(df.collect() === Array(Row(9, 4.5)))
   }
 
+  test("Avoid spill in partial aggregation" ) {
+    withSQLConf((SQLConf.SKIP_PARTIAL_AGGREGATE_ENABLED.key, "true"),
+      (SQLConf.SKIP_PARTIAL_AGGREGATE_MINROWS.key, "2")) {
+      // Create Dataframes
+      val data = Seq(("James", 1), ("James", 1), ("Phil", 1))
+      val aggDF = data.toDF("name", "values").groupBy("name").sum("values")
+      val partAggNode = aggDF.queryExecution.executedPlan.find {
+        case h: HashAggregateExec =>
+          val modes = h.aggregateExpressions.map(_.mode)
+          modes.nonEmpty && modes.forall(_ == Partial)
+        case _ => false
+      }
+
+      checkAnswer(aggDF, Seq(Row("James", 2), Row("Phil", 1)))
+      assert(partAggNode.isDefined,
+      "No HashAggregate node with partial aggregate expression found")
+      assert(partAggNode.get.metrics("partialAggSkipped").value == data.size,
+      "Partial aggregation got triggered in partial hash aggregate node")
+    }
+  }
+
+  test(s"Distinct: Partial aggregation should happen for" +
+    s" HashAggregate nodes performing partial Aggregate operations " ) {

Review comment:
       please remove `s` in the head.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655725538






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646802503






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646973316


   > When the cardinality of grouping column is close to the total number of records being processed, the sorting of data spilling to disk is not required, since it is kind of no-op and we can directly use rows in Final aggregation.
   
   I do not look into the code yet, but one question I have; does this optimization get benefits only when codegen enabled? When I read the description above, I thought this was more general one though.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r441932524



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -72,6 +74,8 @@ case class HashAggregateExec(
     "peakMemory" -> SQLMetrics.createSizeMetric(sparkContext, "peak memory"),
     "spillSize" -> SQLMetrics.createSizeMetric(sparkContext, "spill size"),
     "aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "time in aggregation build"),
+    "partialAggSkipped" -> SQLMetrics.createMetric(sparkContext, "Num records" +
+      " skipped partial aggregation skipped"),

Review comment:
       skipped ... skipped? How about `number of skipped records for partial aggregates`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466767625



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -353,4 +353,8 @@ object AggUtils {
 
     finalAndCompleteAggregate :: Nil
   }
+
+  def areAggExpressionsPartial(modes: Seq[AggregateMode]): Boolean = {
+    modes.nonEmpty && modes.forall(_ == Partial)

Review comment:
       We canno apply this optimization if the empty case? e.g., 
   ```
   scala> sql("select k from t group by k").explain()
   == Physical Plan ==
   *(2) HashAggregate(keys=[k#29], functions=[])
   +- Exchange hashpartitioning(k#29, 200), true, [id=#47]
      +- *(1) HashAggregate(keys=[k#29], functions=[])
         +- *(1) ColumnarToRow
            +- FileScan parquet default.t[k#29] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int>
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r446718013



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,13 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)

Review comment:
       I didnt get the threshold part. Can you pleas elaborate




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-650585266


   > > it is more of a manual step and can be used only if the user knows the nature of data upfront.Like in my benchmark, where we expect the the all but few grouping keys to be different.
   > > A user will be able to find the nature of data by looking at the metrics in Spark UI where the number of output rows from previous stage is same/almost same as the number of output rows from HashAggregate. If the the user expects the new data to have this nature in his subsequent runs(say a partitioned table with new data every hour/day), he can enable the config.
   > 
   > hm...., if `SKIP_PARTIAL_AGGREGATE_ENABLED=true` and the cardinality is **not** the same with the number of rows, a query returns a wrong aggregated answer, right?
   
   No, The Final aggregation will take care giving the right results.
   This is like more like setting the Aggregation mode to `org.apache.spark.sql.catalyst.expressions.aggregate.Complete`
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-653947829


   **[Test build #124950 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124950/testReport)** for PR 28804 at commit [`7766401`](https://github.com/apache/spark/commit/7766401f81dbde6d2941cacd69b51ad9acc5d855).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670248312






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-649221849






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] c21 commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

c21 commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-854089520


   @karuppayya - we did a similar change internally and rolled out to production already in facebook. We made some change on top of this (e.g. only skip partial aggregate when the map needs spill), and fixed several bugs. Do you mind if we submit a separate PR (list you as co-author) and help move the feature forward? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467466366



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")

Review comment:
       Still not sure, we cannot use `skipPartialAggregate=true`, `skipPartialAggregate.minNumRows=0` , and `skipPartialAggregate.aggregateRatio=0.0` for testing? What I'm worried about is that we add tests for the code paths that are not executed under non-testing (normal) conditions.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646841673


   **[Test build #124298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124298/testReport)** for PR 28804 at commit [`7952aa7`](https://github.com/apache/spark/commit/7952aa707a91bac1df9e84ad17acacd006c7e850).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670707640


   **[Test build #127216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127216/testReport)** for PR 28804 at commit [`2ae5525`](https://github.com/apache/spark/commit/2ae5525294718ea43cee419bc095a3d382b6c085).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655707361


   **[Test build #125393 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125393/testReport)** for PR 28804 at commit [`3ca81ae`](https://github.com/apache/spark/commit/3ca81ae8d381509f33c49dd3a81f57856e5bd264).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-647031604






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

karuppayya commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-854090367


   @c21 Please go ahead 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-653932405






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646816500






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-669329149


   @cloud-fan @maropu Can you help review these changes?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670707728






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646854576


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670275635






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r455403785



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,25 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)
+
+  val SKIP_PARTIAL_AGGREGATE_THRESHOLD =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.threshold")
+      .internal()
+      .longConf
+      .createWithDefault(100000)
+
+  val SKIP_PARTIAL_AGGREGATE_RATIO =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.ratio")
+      .internal()
+      .doubleConf
+      .createWithDefault(0.5)

Review comment:
       @maropu I have borrowed this heuristic from Hive. We can merge them into one. Any suggestions here? 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466765488



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -353,4 +353,8 @@ object AggUtils {
 
     finalAndCompleteAggregate :: Nil
   }
+
+  def areAggExpressionsPartial(modes: Seq[AggregateMode]): Boolean = {

Review comment:
       If we don't reuse it in the main codebase, I think its better to move this func into HashAggregateExec.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467175169



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")

Review comment:
       The aggregation map memory needs to be exhausted for the optimization to kick in.
   To bypass that for testing, using the `isTesting` method to trigger the optimization in HashAggregate operator




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670694952






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r452388582



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,25 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")

Review comment:
       so this only works for hash aggregate but not the sort based aggregate?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648935199






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r455402886



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,25 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")

Review comment:
       I beleive this heuristic can be applied for sort based aggregation as well. I started with Hash based aggregate, I will create a new PR for sort based aggregation.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645780769


   **[Test build #124199 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124199/testReport)** for PR 28804 at commit [`2b3704b`](https://github.com/apache/spark/commit/2b3704b1bd96e0c64890a0d730f840c0ab7da36e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670694952






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466757897



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -409,6 +411,12 @@ case class HashAggregateExec(
   private var fastHashMapTerm: String = _
   private var isFastHashMapEnabled: Boolean = false
 
+  private var avoidSpillInPartialAggregateTerm: String = _
+  private val skipPartialAggregate = sqlContext.conf.skipPartialAggregate &&
+    AggUtils.areAggExpressionsPartial(modes) && find(_.isInstanceOf[ExpandExec]).isEmpty

Review comment:
       hm, I see... But, I think the current approach looks dangerous because we might add a new plan having the same assumption in feature.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645800455


   **[Test build #124199 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124199/testReport)** for PR 28804 at commit [`2b3704b`](https://github.com/apache/spark/commit/2b3704b1bd96e0c64890a0d730f840c0ab7da36e).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-653948002


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124950/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-711644766


   **[Test build #129988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129988/testReport)** for PR 28804 at commit [`11572a1`](https://github.com/apache/spark/commit/11572a105ef870c9e95b4302e6613b7f0e73d0de).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648635587






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646821390


   **[Test build #124296 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124296/testReport)** for PR 28804 at commit [`73de4c8`](https://github.com/apache/spark/commit/73de4c8b54161116565a495e7cf9a9a65844adce).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670725860






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655725538






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648590893


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124450/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646821390


   **[Test build #124296 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124296/testReport)** for PR 28804 at commit [`73de4c8`](https://github.com/apache/spark/commit/73de4c8b54161116565a495e7cf9a9a65844adce).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648552423


   **[Test build #124450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124450/testReport)** for PR 28804 at commit [`99c1d22`](https://github.com/apache/spark/commit/99c1d2226d170f789dfa534ffc658f7fc430c38d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655798322


   **[Test build #125394 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125394/testReport)** for PR 28804 at commit [`8850777`](https://github.com/apache/spark/commit/8850777f41dbf6d0a526a8409c426f156e7f7fa7).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466765987



##########
File path: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationMap.java
##########
@@ -63,6 +63,14 @@
    */
   private final UnsafeRow currentAggregationBuffer;
 
+  /**
+   * Number of rows that were added to the map
+   * This includes the elements that were passed on sorter
+   * using {@link #destructAndCreateExternalSorter()}
+   *

Review comment:
       nit: remove this blank.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670630439






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670725493


   **[Test build #127218 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127218/testReport)** for PR 28804 at commit [`0a186f0`](https://github.com/apache/spark/commit/0a186f0eb5d71732ec3abc6b42a12dae6594277f).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466758218



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")

Review comment:
       I didn't mean so, why do we need this flag here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466377228



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")
+    }
+    val childrenConsumed = ctx.
+      addMutableState(CodeGenerator.JAVA_BOOLEAN, "childrenConsumed")
+    rowCountTerm = ctx.
+      addMutableState(CodeGenerator.JAVA_LONG, "rowCount")

Review comment:
       We need the two variables even if `skipPartialAggregate`=false?

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
##########
@@ -142,52 +142,57 @@ class SQLMetricsSuite extends SharedSparkSession with SQLMetricsTestUtils
   }
 
   test("Aggregate metrics: track avg probe") {
-    // The executed plan looks like:
-    // HashAggregate(keys=[a#61], functions=[count(1)], output=[a#61, count#71L])
-    // +- Exchange hashpartitioning(a#61, 5)
-    //    +- HashAggregate(keys=[a#61], functions=[partial_count(1)], output=[a#61, count#76L])
-    //       +- Exchange RoundRobinPartitioning(1)
-    //          +- LocalTableScan [a#61]
-    //
-    // Assume the execution plan with node id is:
-    // Wholestage disabled:
-    // HashAggregate(nodeId = 0)
-    //   Exchange(nodeId = 1)
-    //     HashAggregate(nodeId = 2)
-    //       Exchange (nodeId = 3)
-    //         LocalTableScan(nodeId = 4)
-    //
-    // Wholestage enabled:
-    // WholeStageCodegen(nodeId = 0)
-    //   HashAggregate(nodeId = 1)
-    //     Exchange(nodeId = 2)
-    //       WholeStageCodegen(nodeId = 3)
-    //         HashAggregate(nodeId = 4)
-    //           Exchange(nodeId = 5)
-    //             LocalTableScan(nodeId = 6)
-    Seq(true, false).foreach { enableWholeStage =>
-      val df = generateRandomBytesDF().repartition(1).groupBy('a).count()
-      val nodeIds = if (enableWholeStage) {
-        Set(4L, 1L)
-      } else {
-        Set(2L, 0L)
-      }
-      val metrics = getSparkPlanMetrics(df, 1, nodeIds, enableWholeStage).get
-      nodeIds.foreach { nodeId =>
-        val probes = metrics(nodeId)._2("avg hash probe bucket list iters").toString
-        if (!probes.contains("\n")) {
-          // It's a single metrics value
-          assert(probes.toDouble > 1.0)
+    if (spark.sessionState.conf.getConf(SQLConf.SKIP_PARTIAL_AGGREGATE_ENABLED)) {
+      logInfo("Skipping, since partial Aggregation is disabled")
+    } else {
+      // The executed plan looks like:
+      // HashAggregate(keys=[a#61], functions=[count(1)], output=[a#61, count#71L])
+      // +- Exchange hashpartitioning(a#61, 5)
+      //    +- HashAggregate(keys=[a#61], functions=[partial_count(1)], output=[a#61, count#76L])
+      //       +- Exchange RoundRobinPartitioning(1)
+      //          +- LocalTableScan [a#61]
+      //
+      // Assume the execution plan with node id is:
+      // Wholestage disabled:
+      // HashAggregate(nodeId = 0)
+      //   Exchange(nodeId = 1)
+      //     HashAggregate(nodeId = 2)
+      //       Exchange (nodeId = 3)
+      //         LocalTableScan(nodeId = 4)
+      //
+      // Wholestage enabled:
+      // WholeStageCodegen(nodeId = 0)
+      //   HashAggregate(nodeId = 1)
+      //     Exchange(nodeId = 2)
+      //       WholeStageCodegen(nodeId = 3)
+      //         HashAggregate(nodeId = 4)
+      //           Exchange(nodeId = 5)
+      //             LocalTableScan(nodeId = 6)
+      Seq(true, false).foreach { enableWholeStage =>
+        val df = generateRandomBytesDF().repartition(1).groupBy('a).count()
+        val nodeIds = if (enableWholeStage) {
+          Set(4L, 1L)
         } else {
-          val mainValue = probes.split("\n").apply(1).stripPrefix("(").stripSuffix(")")
-          // Extract min, med, max from the string and strip off everthing else.
-          val index = mainValue.indexOf(" (", 0)
-          mainValue.slice(0, index).split(", ").foreach {
-            probe => assert(probe.toDouble > 1.0)
+          Set(2L, 0L)
+        }
+        val metrics = getSparkPlanMetrics(df, 1, nodeIds, enableWholeStage).get
+        nodeIds.foreach { nodeId =>
+          val probes = metrics(nodeId)._2("avg hash probe bucket list iters").toString
+          if (!probes.contains("\n")) {
+            // It's a single metrics value
+            assert(probes.toDouble > 1.0)
+          } else {
+            val mainValue = probes.split("\n").apply(1).stripPrefix("(").stripSuffix(")")
+            // Extract min, med, max from the string and strip off everthing else.
+            val index = mainValue.indexOf(" (", 0)
+            mainValue.slice(0, index).split(", ").foreach {
+              probe => assert(probe.toDouble > 1.0)
+            }
           }
         }
       }
     }
+

Review comment:
       nit: remove this blank.

##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
##########
@@ -51,6 +51,52 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession
     assert(df.collect() === Array(Row(9, 4.5)))
   }
 
+  test(s"Avoid spill in partial aggregation" ) {

Review comment:
       drop `s` in the head.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -877,50 +903,111 @@ case class HashAggregateExec(
       ("true", "true", "", "")
     }
 
+    val skipPartialAggregateThreshold = sqlContext.conf.skipPartialAggregateThreshold
+    val skipPartialAggRatio = sqlContext.conf.skipPartialAggregateRatio
+
+    val countTerm = ctx.addMutableState(CodeGenerator.JAVA_LONG, "count")
     val oomeClassName = classOf[SparkOutOfMemoryError].getName
+    val findOrInsertRegularHashMap: String = {
+      def getAggBufferFromMap = {
+        s"""
+           |// generate grouping key
+           |${unsafeRowKeyCode.code}
+           |int $unsafeRowKeyHash = ${unsafeRowKeyCode.value}.hashCode();
+           |if ($checkFallbackForBytesToBytesMap) {
+           |  // try to get the buffer from hash map
+           |  $unsafeRowBuffer =
+           |    $hashMapTerm.getAggregationBufferFromUnsafeRow($unsafeRowKeys, $unsafeRowKeyHash);
+           |}
+          """.stripMargin
+      }
 
-    val findOrInsertRegularHashMap: String =
-      s"""
-         |// generate grouping key
-         |${unsafeRowKeyCode.code}
-         |int $unsafeRowKeyHash = ${unsafeRowKeyCode.value}.hashCode();
-         |if ($checkFallbackForBytesToBytesMap) {
-         |  // try to get the buffer from hash map
-         |  $unsafeRowBuffer =
-         |    $hashMapTerm.getAggregationBufferFromUnsafeRow($unsafeRowKeys, $unsafeRowKeyHash);
-         |}
-         |// Can't allocate buffer from the hash map. Spill the map and fallback to sort-based
-         |// aggregation after processing all input rows.
-         |if ($unsafeRowBuffer == null) {
-         |  if ($sorterTerm == null) {
-         |    $sorterTerm = $hashMapTerm.destructAndCreateExternalSorter();
-         |  } else {
-         |    $sorterTerm.merge($hashMapTerm.destructAndCreateExternalSorter());
-         |  }
-         |  $resetCounter
-         |  // the hash map had be spilled, it should have enough memory now,
-         |  // try to allocate buffer again.
-         |  $unsafeRowBuffer = $hashMapTerm.getAggregationBufferFromUnsafeRow(
-         |    $unsafeRowKeys, $unsafeRowKeyHash);
-         |  if ($unsafeRowBuffer == null) {
-         |    // failed to allocate the first page
-         |    throw new $oomeClassName("No enough memory for aggregation");
-         |  }
-         |}
+      def addToSorter: String = {
+        s"""
+          |if ($sorterTerm == null) {
+          |  $sorterTerm = $hashMapTerm.destructAndCreateExternalSorter();
+          |} else {
+          |  $sorterTerm.merge($hashMapTerm.destructAndCreateExternalSorter());
+          |}
+          |$resetCounter
+          |// the hash map had be spilled, it should have enough memory now,
+          |// try to allocate buffer again.
+          |$unsafeRowBuffer = $hashMapTerm.getAggregationBufferFromUnsafeRow(
+          |  $unsafeRowKeys, $unsafeRowKeyHash);
+          |if ($unsafeRowBuffer == null) {
+          |  // failed to allocate the first page
+          |  throw new $oomeClassName("No enough memory for aggregation");
+          |}""".stripMargin
+      }
+
+      def getHeuristicToAvoidAgg: String = {
+        s"""
+          |!($rowCountTerm < $skipPartialAggregateThreshold) &&
+          |      ((float)$countTerm/$rowCountTerm) > $skipPartialAggRatio;
+          |""".stripMargin
+      }
+
+      if (skipPartialAggregate) {
+        s"""

Review comment:
       How about this?
   ```
         if (skipPartialAggregate) {
           val checkIfPartialAggSkipped = "!($rowCountTerm < $skipPartialAggregateThreshold) && " +
             "((float)$countTerm/$rowCountTerm) > $skipPartialAggRatio"
           s"""
              |if (!$avoidSpillInPartialAggregateTerm) {
              |  $getAggBufferFromMap
              |  ...
              |  if ($unsafeRowBuffer == null && !$avoidSpillInPartialAggregateTerm) {
              |    $countTerm = $countTerm + $hashMapTerm.getNumRows();
              |    if (checkIfPartialAggSkipped) {
   ```

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")

Review comment:
       `${Utils.isTesting}`?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -929,6 +1016,27 @@ case class HashAggregateExec(
       } else {
         findOrInsertRegularHashMap
       }
+      def createEmptyAggBufferAndUpdateMetrics: String = {
+        if (skipPartialAggregate) {
+          val partTerm = metricTerm(ctx, "partialAggSkipped")

Review comment:
       `partTerm`->`numAggSkippedRows `?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -72,6 +72,8 @@ case class HashAggregateExec(
     "peakMemory" -> SQLMetrics.createSizeMetric(sparkContext, "peak memory"),
     "spillSize" -> SQLMetrics.createSizeMetric(sparkContext, "spill size"),
     "aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "time in aggregation build"),
+    "partialAggSkipped" -> SQLMetrics.createMetric(sparkContext,
+      "number of skipped records for partial aggregates"),

Review comment:
       I think this metric is only meaningful for aggregates with a partial mode, so could we show it only in the mode?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -409,6 +411,12 @@ case class HashAggregateExec(
   private var fastHashMapTerm: String = _
   private var isFastHashMapEnabled: Boolean = false
 
+  private var avoidSpillInPartialAggregateTerm: String = _
+  private val skipPartialAggregate = sqlContext.conf.skipPartialAggregate &&

Review comment:
       `skipPartialAggregate` -> `skipPartialAggregateEnabled`?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -750,18 +768,19 @@ case class HashAggregateExec(
       finishRegularHashMap
     }
 
+    outputFunc = generateResultFunction(ctx)
     val doAggFuncName = ctx.addNewFunction(doAgg,
       s"""
          |private void $doAgg() throws java.io.IOException {
          |  ${child.asInstanceOf[CodegenSupport].produce(ctx, this)}
+         |  $childrenConsumed = true;

Review comment:
       Why do we need this variable?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -750,18 +768,19 @@ case class HashAggregateExec(
       finishRegularHashMap
     }
 
+    outputFunc = generateResultFunction(ctx)
     val doAggFuncName = ctx.addNewFunction(doAgg,
       s"""
          |private void $doAgg() throws java.io.IOException {
          |  ${child.asInstanceOf[CodegenSupport].produce(ctx, this)}
+         |  $childrenConsumed = true;
          |  $finishHashMap
          |}
        """.stripMargin)
 
     // generate code for output
     val keyTerm = ctx.freshName("aggKey")
     val bufferTerm = ctx.freshName("aggBuffer")
-    val outputFunc = generateResultFunction(ctx)

Review comment:
       Why did you move this line into the line 771?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -409,6 +411,12 @@ case class HashAggregateExec(
   private var fastHashMapTerm: String = _
   private var isFastHashMapEnabled: Boolean = false
 
+  private var avoidSpillInPartialAggregateTerm: String = _
+  private val skipPartialAggregate = sqlContext.conf.skipPartialAggregate &&
+    AggUtils.areAggExpressionsPartial(modes) && find(_.isInstanceOf[ExpandExec]).isEmpty

Review comment:
       Why do we need this check `find(_.isInstanceOf[ExpandExec]).isEmpty`?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
##########
@@ -353,4 +353,8 @@ object AggUtils {
 
     finalAndCompleteAggregate :: Nil
   }
+
+  def areAggExpressionsPartial(modes: Seq[AggregateMode]): Boolean = {

Review comment:
       Could you inline this method in `HashAggregateExec`?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {

Review comment:
       Why did you apply this optimization only for the with-key case?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -838,13 +857,20 @@ case class HashAggregateExec(
        |  long $beforeAgg = System.nanoTime();
        |  $doAggFuncName();
        |  $aggTime.add((System.nanoTime() - $beforeAgg) / $NANOS_PER_MILLIS);
+       |  if (shouldStop()) return;
+       |}
+       |if (!$childrenConsumed) {
+       |  $doAggFuncName();
+       |  if (shouldStop()) return;

Review comment:
       What does this block mean? Looks like the method call in the line 858 sets true to `childrenConsumed ` then this block not executed?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -877,50 +903,111 @@ case class HashAggregateExec(
       ("true", "true", "", "")
     }
 
+    val skipPartialAggregateThreshold = sqlContext.conf.skipPartialAggregateThreshold
+    val skipPartialAggRatio = sqlContext.conf.skipPartialAggregateRatio
+
+    val countTerm = ctx.addMutableState(CodeGenerator.JAVA_LONG, "count")
     val oomeClassName = classOf[SparkOutOfMemoryError].getName
+    val findOrInsertRegularHashMap: String = {
+      def getAggBufferFromMap = {
+        s"""
+           |// generate grouping key
+           |${unsafeRowKeyCode.code}
+           |int $unsafeRowKeyHash = ${unsafeRowKeyCode.value}.hashCode();
+           |if ($checkFallbackForBytesToBytesMap) {
+           |  // try to get the buffer from hash map
+           |  $unsafeRowBuffer =
+           |    $hashMapTerm.getAggregationBufferFromUnsafeRow($unsafeRowKeys, $unsafeRowKeyHash);
+           |}
+          """.stripMargin
+      }
 
-    val findOrInsertRegularHashMap: String =
-      s"""
-         |// generate grouping key
-         |${unsafeRowKeyCode.code}
-         |int $unsafeRowKeyHash = ${unsafeRowKeyCode.value}.hashCode();
-         |if ($checkFallbackForBytesToBytesMap) {
-         |  // try to get the buffer from hash map
-         |  $unsafeRowBuffer =
-         |    $hashMapTerm.getAggregationBufferFromUnsafeRow($unsafeRowKeys, $unsafeRowKeyHash);
-         |}
-         |// Can't allocate buffer from the hash map. Spill the map and fallback to sort-based
-         |// aggregation after processing all input rows.
-         |if ($unsafeRowBuffer == null) {
-         |  if ($sorterTerm == null) {
-         |    $sorterTerm = $hashMapTerm.destructAndCreateExternalSorter();
-         |  } else {
-         |    $sorterTerm.merge($hashMapTerm.destructAndCreateExternalSorter());
-         |  }
-         |  $resetCounter
-         |  // the hash map had be spilled, it should have enough memory now,
-         |  // try to allocate buffer again.
-         |  $unsafeRowBuffer = $hashMapTerm.getAggregationBufferFromUnsafeRow(
-         |    $unsafeRowKeys, $unsafeRowKeyHash);
-         |  if ($unsafeRowBuffer == null) {
-         |    // failed to allocate the first page
-         |    throw new $oomeClassName("No enough memory for aggregation");
-         |  }
-         |}
+      def addToSorter: String = {
+        s"""
+          |if ($sorterTerm == null) {
+          |  $sorterTerm = $hashMapTerm.destructAndCreateExternalSorter();
+          |} else {
+          |  $sorterTerm.merge($hashMapTerm.destructAndCreateExternalSorter());
+          |}
+          |$resetCounter
+          |// the hash map had be spilled, it should have enough memory now,
+          |// try to allocate buffer again.
+          |$unsafeRowBuffer = $hashMapTerm.getAggregationBufferFromUnsafeRow(
+          |  $unsafeRowKeys, $unsafeRowKeyHash);
+          |if ($unsafeRowBuffer == null) {
+          |  // failed to allocate the first page
+          |  throw new $oomeClassName("No enough memory for aggregation");
+          |}""".stripMargin
+      }
+
+      def getHeuristicToAvoidAgg: String = {
+        s"""
+          |!($rowCountTerm < $skipPartialAggregateThreshold) &&
+          |      ((float)$countTerm/$rowCountTerm) > $skipPartialAggRatio;
+          |""".stripMargin
+      }
+
+      if (skipPartialAggregate) {
+        s"""
+           |if (!$avoidSpillInPartialAggregateTerm) {
+           |  $getAggBufferFromMap
+           |  // Can't allocate buffer from the hash map.
+           |  // Check if we can avoid partial aggregation.
+           |  //  Otherwise, Spill the map and fallback to sort-based
+           |  // aggregation after processing all input rows.
+           |  if ($unsafeRowBuffer == null && !$avoidSpillInPartialAggregateTerm) {
+           |    $countTerm = $countTerm + $hashMapTerm.getNumRows();
+           |    boolean skipPartAgg = $getHeuristicToAvoidAgg
+           |    if (skipPartAgg) {
+           |      // Aggregation buffer is created later
+           |      $avoidSpillInPartialAggregateTerm = true;
+           |    } else {
+           |      $addToSorter
+           |    }
+           |  }
+           |}
+       """.stripMargin
+      } else {
+        s"""
+           |  $getAggBufferFromMap
+           |  // Can't allocate buffer from the hash map. Spill the map and fallback to sort-based
+           |  // aggregation after processing all input rows.
+           |  if ($unsafeRowBuffer == null) {
+           |    $addToSorter
+           |  }

Review comment:
       nit: please remove the indents.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -72,6 +72,8 @@ case class HashAggregateExec(
     "peakMemory" -> SQLMetrics.createSizeMetric(sparkContext, "peak memory"),
     "spillSize" -> SQLMetrics.createSizeMetric(sparkContext, "spill size"),
     "aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "time in aggregation build"),
+    "partialAggSkipped" -> SQLMetrics.createMetric(sparkContext,
+      "number of skipped records for partial aggregates"),

Review comment:
       `partialAggSkipped` -> `numAggSkippedRows`?

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -680,6 +688,16 @@ case class HashAggregateExec(
 
   private def doProduceWithKeys(ctx: CodegenContext): String = {
     val initAgg = ctx.addMutableState(CodeGenerator.JAVA_BOOLEAN, "initAgg")
+    if (skipPartialAggregate) {
+      avoidSpillInPartialAggregateTerm = ctx.
+        addMutableState(CodeGenerator.JAVA_BOOLEAN,
+          "avoidPartialAggregate",
+          term => s"$term = ${Utils.isTesting};")
+    }
+    val childrenConsumed = ctx.
+      addMutableState(CodeGenerator.JAVA_BOOLEAN, "childrenConsumed")
+    rowCountTerm = ctx.
+      addMutableState(CodeGenerator.JAVA_LONG, "rowCount")

Review comment:
       Also, I think we need to initialize the variables.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -877,50 +903,111 @@ case class HashAggregateExec(
       ("true", "true", "", "")
     }
 
+    val skipPartialAggregateThreshold = sqlContext.conf.skipPartialAggregateThreshold
+    val skipPartialAggRatio = sqlContext.conf.skipPartialAggregateRatio
+
+    val countTerm = ctx.addMutableState(CodeGenerator.JAVA_LONG, "count")
     val oomeClassName = classOf[SparkOutOfMemoryError].getName
+    val findOrInsertRegularHashMap: String = {
+      def getAggBufferFromMap = {
+        s"""
+           |// generate grouping key
+           |${unsafeRowKeyCode.code}
+           |int $unsafeRowKeyHash = ${unsafeRowKeyCode.value}.hashCode();
+           |if ($checkFallbackForBytesToBytesMap) {
+           |  // try to get the buffer from hash map
+           |  $unsafeRowBuffer =
+           |    $hashMapTerm.getAggregationBufferFromUnsafeRow($unsafeRowKeys, $unsafeRowKeyHash);
+           |}
+          """.stripMargin
+      }
 
-    val findOrInsertRegularHashMap: String =
-      s"""
-         |// generate grouping key
-         |${unsafeRowKeyCode.code}
-         |int $unsafeRowKeyHash = ${unsafeRowKeyCode.value}.hashCode();
-         |if ($checkFallbackForBytesToBytesMap) {
-         |  // try to get the buffer from hash map
-         |  $unsafeRowBuffer =
-         |    $hashMapTerm.getAggregationBufferFromUnsafeRow($unsafeRowKeys, $unsafeRowKeyHash);
-         |}
-         |// Can't allocate buffer from the hash map. Spill the map and fallback to sort-based
-         |// aggregation after processing all input rows.
-         |if ($unsafeRowBuffer == null) {
-         |  if ($sorterTerm == null) {
-         |    $sorterTerm = $hashMapTerm.destructAndCreateExternalSorter();
-         |  } else {
-         |    $sorterTerm.merge($hashMapTerm.destructAndCreateExternalSorter());
-         |  }
-         |  $resetCounter
-         |  // the hash map had be spilled, it should have enough memory now,
-         |  // try to allocate buffer again.
-         |  $unsafeRowBuffer = $hashMapTerm.getAggregationBufferFromUnsafeRow(
-         |    $unsafeRowKeys, $unsafeRowKeyHash);
-         |  if ($unsafeRowBuffer == null) {
-         |    // failed to allocate the first page
-         |    throw new $oomeClassName("No enough memory for aggregation");
-         |  }
-         |}
+      def addToSorter: String = {
+        s"""
+          |if ($sorterTerm == null) {
+          |  $sorterTerm = $hashMapTerm.destructAndCreateExternalSorter();
+          |} else {
+          |  $sorterTerm.merge($hashMapTerm.destructAndCreateExternalSorter());
+          |}
+          |$resetCounter
+          |// the hash map had be spilled, it should have enough memory now,
+          |// try to allocate buffer again.
+          |$unsafeRowBuffer = $hashMapTerm.getAggregationBufferFromUnsafeRow(
+          |  $unsafeRowKeys, $unsafeRowKeyHash);
+          |if ($unsafeRowBuffer == null) {
+          |  // failed to allocate the first page
+          |  throw new $oomeClassName("No enough memory for aggregation");
+          |}""".stripMargin
+      }
+
+      def getHeuristicToAvoidAgg: String = {
+        s"""
+          |!($rowCountTerm < $skipPartialAggregateThreshold) &&
+          |      ((float)$countTerm/$rowCountTerm) > $skipPartialAggRatio;
+          |""".stripMargin
+      }
+
+      if (skipPartialAggregate) {
+        s"""
+           |if (!$avoidSpillInPartialAggregateTerm) {
+           |  $getAggBufferFromMap
+           |  // Can't allocate buffer from the hash map.
+           |  // Check if we can avoid partial aggregation.
+           |  //  Otherwise, Spill the map and fallback to sort-based
+           |  // aggregation after processing all input rows.
+           |  if ($unsafeRowBuffer == null && !$avoidSpillInPartialAggregateTerm) {

Review comment:
       The second condition `!$avoidSpillInPartialAggregateTerm` has already been checked in the line 952? https://github.com/apache/spark/pull/28804/files#diff-2eb948516b5beaeb746aadac27fbd5b4R952

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,29 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sorter(sort/spill) during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)
+
+  val SKIP_PARTIAL_AGGREGATE_THRESHOLD =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.threshold")

Review comment:
       I think we need to assign more meaningful param names, e.g.,
   ```
   
   spark.sql.aggregate.skipPartialAggregate
   spark.sql.aggregate.skipPartialAggregate.minNumRows
   spark.sql.aggregate.skipPartialAggregate.aggregateRatio
   ```

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,29 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sorter(sort/spill) during partial aggregation")

Review comment:
       Could you describe more in this doc? For example, this PR intends to only support hash-based aggregate with codegen enabled?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-673478644


   > I don't think AQE can help here. This is partial aggregate and usually there won't be a shuffle right before the partial agg.
   
   Hm, I see. Even so, `BasicStatsPlanVisitor` cannot propatate somewhat accurate input stats (of shuffle output) into partial aggreates?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655708560


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-672092513






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu edited a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu edited a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-650552251


   > it is more of a manual step and can be used only if the user knows the nature of data upfront.Like in my benchmark, where we expect the the all but few grouping keys to be different.
   A user will be able to find the nature of data by looking at the metrics in Spark UI where the number of output rows from previous stage is same/almost same as the number of output rows from HashAggregate. If the the user expects the new data to have this nature in his subsequent runs(say a partitioned table with new data every hour/day), he can enable the config.
   
   
   hm...., if `SKIP_PARTIAL_AGGREGATE_ENABLED=true` and the cardinality is **not** the same with the number of rows, a query returns a wrong aggregated answer?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r439160476



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
##########
@@ -165,6 +166,26 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession
     }
   }
 
+  test("SPARK-: Avoid spill in partial aggregation " +
+    "when spark.sql.aggregate.spill.partialaggregate.disabled is set") {
+    withSQLConf((SQLConf.SPILL_PARTIAL_AGGREGATE_DISABLED.key, "true"),

Review comment:
       Could you show us performance numbers?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-649221158


   **[Test build #124505 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124505/testReport)** for PR 28804 at commit [`afc2903`](https://github.com/apache/spark/commit/afc2903e4a327d6caef518e6d3f0dc431424ac7c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646818290


   **[Test build #124295 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124295/testReport)** for PR 28804 at commit [`1c0399a`](https://github.com/apache/spark/commit/1c0399aa3147a6e461b41de85412b6c3a404009f).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL][WIP] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r441798723



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala
##########
@@ -165,6 +166,26 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession
     }
   }
 
+  test("SPARK-: Avoid spill in partial aggregation " +
+    "when spark.sql.aggregate.spill.partialaggregate.disabled is set") {
+    withSQLConf((SQLConf.SPILL_PARTIAL_AGGREGATE_DISABLED.key, "true"),

Review comment:
       @maropu 
   I have added the benchmark to the description.
   I have also figured out a way to add UT for this.
   @maropu  @cloud-fan  @gatorsmile Can you please help review the change
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-642994396


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-649221849






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-670629807


   **[Test build #127210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127210/testReport)** for PR 28804 at commit [`c9a415d`](https://github.com/apache/spark/commit/c9a415de201a784756a3e3ef1a39ef069b7f0b4e).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-653948000






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r449790519



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -877,44 +901,61 @@ case class HashAggregateExec(
       ("true", "true", "", "")
     }
 
-    val oomeClassName = classOf[SparkOutOfMemoryError].getName
+    val skipPartialAggregateThreshold = sqlContext.conf.skipPartialAggregateThreshold
+    val skipPartialAggRatio = sqlContext.conf.skipPartialAggregateRatio
 
+    val oomeClassName = classOf[SparkOutOfMemoryError].getName
+    val countTerm = ctx.addMutableState(CodeGenerator.JAVA_LONG, "count")
     val findOrInsertRegularHashMap: String =
       s"""
-         |// generate grouping key
-         |${unsafeRowKeyCode.code}
-         |int $unsafeRowKeyHash = ${unsafeRowKeyCode.value}.hashCode();
-         |if ($checkFallbackForBytesToBytesMap) {
-         |  // try to get the buffer from hash map
-         |  $unsafeRowBuffer =
-         |    $hashMapTerm.getAggregationBufferFromUnsafeRow($unsafeRowKeys, $unsafeRowKeyHash);
-         |}
-         |// Can't allocate buffer from the hash map. Spill the map and fallback to sort-based
-         |// aggregation after processing all input rows.
-         |if ($unsafeRowBuffer == null) {
-         |  if ($sorterTerm == null) {
-         |    $sorterTerm = $hashMapTerm.destructAndCreateExternalSorter();
-         |  } else {
-         |    $sorterTerm.merge($hashMapTerm.destructAndCreateExternalSorter());
+         |if (!$avoidSpillInPartialAggregateTerm) {
+         |  // generate grouping key
+         |  ${unsafeRowKeyCode.code}
+         |  int $unsafeRowKeyHash = ${unsafeRowKeyCode.value}.hashCode();
+         |  if ($checkFallbackForBytesToBytesMap) {
+         |    // try to get the buffer from hash map
+         |    $unsafeRowBuffer =
+         |      $hashMapTerm.getAggregationBufferFromUnsafeRow($unsafeRowKeys, $unsafeRowKeyHash);
          |  }
-         |  $resetCounter
-         |  // the hash map had be spilled, it should have enough memory now,
-         |  // try to allocate buffer again.
-         |  $unsafeRowBuffer = $hashMapTerm.getAggregationBufferFromUnsafeRow(
-         |    $unsafeRowKeys, $unsafeRowKeyHash);
-         |  if ($unsafeRowBuffer == null) {
-         |    // failed to allocate the first page
-         |    throw new $oomeClassName("No enough memory for aggregation");
+         |  // Can't allocate buffer from the hash map. Spill the map and fallback to sort-based
+         |  // aggregation after processing all input rows.
+         |  if ($unsafeRowBuffer == null && !$avoidSpillInPartialAggregateTerm) {

Review comment:
       @prakharjain09 
   The first check will kick in once we have decided to avoid partail aggregation, in whih case we wont attempt to fetch from Map.
   Second check will kick only when the Map is exhausted completely, and this is when we decide whether we have to skip partil aggregation.
   I think both checks are required.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-645730148


   **[Test build #124189 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124189/testReport)** for PR 28804 at commit [`5f05aa7`](https://github.com/apache/spark/commit/5f05aa7e2631a56b5cc97206f2799f64151b1789).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-647051371


   > I do not look into the code yet, but one question I have; does this optimization get benefits only when codegen enabled? When I read the description above, I thought this was more general one though.
   
   Yes, this is not specific to codegen. I will also make corresponding chnages in `doExecute`, I can do it in a different
   PR and keep this change focused on codegen?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r451879710



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2196,6 +2196,25 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SKIP_PARTIAL_AGGREGATE_ENABLED =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.enabled")
+      .internal()
+      .doc("Avoid sort/spill to disk during partial aggregation")
+      .booleanConf
+      .createWithDefault(true)
+
+  val SKIP_PARTIAL_AGGREGATE_THRESHOLD =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.threshold")
+      .internal()
+      .longConf
+      .createWithDefault(100000)
+
+  val SKIP_PARTIAL_AGGREGATE_RATIO =
+    buildConf("spark.sql.aggregate.partialaggregate.skip.ratio")
+      .internal()
+      .doubleConf
+      .createWithDefault(0.5)

Review comment:
       Why we need the two params for this optimiation? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-649138765






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-711644766


   **[Test build #129988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129988/testReport)** for PR 28804 at commit [`11572a1`](https://github.com/apache/spark/commit/11572a105ef870c9e95b4302e6613b7f0e73d0de).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466750508



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -877,50 +903,111 @@ case class HashAggregateExec(
       ("true", "true", "", "")
     }
 
+    val skipPartialAggregateThreshold = sqlContext.conf.skipPartialAggregateThreshold
+    val skipPartialAggRatio = sqlContext.conf.skipPartialAggregateRatio
+
+    val countTerm = ctx.addMutableState(CodeGenerator.JAVA_LONG, "count")
     val oomeClassName = classOf[SparkOutOfMemoryError].getName
+    val findOrInsertRegularHashMap: String = {
+      def getAggBufferFromMap = {
+        s"""
+           |// generate grouping key
+           |${unsafeRowKeyCode.code}
+           |int $unsafeRowKeyHash = ${unsafeRowKeyCode.value}.hashCode();
+           |if ($checkFallbackForBytesToBytesMap) {
+           |  // try to get the buffer from hash map
+           |  $unsafeRowBuffer =
+           |    $hashMapTerm.getAggregationBufferFromUnsafeRow($unsafeRowKeys, $unsafeRowKeyHash);
+           |}
+          """.stripMargin
+      }
 
-    val findOrInsertRegularHashMap: String =
-      s"""
-         |// generate grouping key
-         |${unsafeRowKeyCode.code}
-         |int $unsafeRowKeyHash = ${unsafeRowKeyCode.value}.hashCode();
-         |if ($checkFallbackForBytesToBytesMap) {
-         |  // try to get the buffer from hash map
-         |  $unsafeRowBuffer =
-         |    $hashMapTerm.getAggregationBufferFromUnsafeRow($unsafeRowKeys, $unsafeRowKeyHash);
-         |}
-         |// Can't allocate buffer from the hash map. Spill the map and fallback to sort-based
-         |// aggregation after processing all input rows.
-         |if ($unsafeRowBuffer == null) {
-         |  if ($sorterTerm == null) {
-         |    $sorterTerm = $hashMapTerm.destructAndCreateExternalSorter();
-         |  } else {
-         |    $sorterTerm.merge($hashMapTerm.destructAndCreateExternalSorter());
-         |  }
-         |  $resetCounter
-         |  // the hash map had be spilled, it should have enough memory now,
-         |  // try to allocate buffer again.
-         |  $unsafeRowBuffer = $hashMapTerm.getAggregationBufferFromUnsafeRow(
-         |    $unsafeRowKeys, $unsafeRowKeyHash);
-         |  if ($unsafeRowBuffer == null) {
-         |    // failed to allocate the first page
-         |    throw new $oomeClassName("No enough memory for aggregation");
-         |  }
-         |}
+      def addToSorter: String = {
+        s"""
+          |if ($sorterTerm == null) {
+          |  $sorterTerm = $hashMapTerm.destructAndCreateExternalSorter();
+          |} else {
+          |  $sorterTerm.merge($hashMapTerm.destructAndCreateExternalSorter());
+          |}
+          |$resetCounter
+          |// the hash map had be spilled, it should have enough memory now,
+          |// try to allocate buffer again.
+          |$unsafeRowBuffer = $hashMapTerm.getAggregationBufferFromUnsafeRow(
+          |  $unsafeRowKeys, $unsafeRowKeyHash);
+          |if ($unsafeRowBuffer == null) {
+          |  // failed to allocate the first page
+          |  throw new $oomeClassName("No enough memory for aggregation");
+          |}""".stripMargin
+      }
+
+      def getHeuristicToAvoidAgg: String = {
+        s"""
+          |!($rowCountTerm < $skipPartialAggregateThreshold) &&
+          |      ((float)$countTerm/$rowCountTerm) > $skipPartialAggRatio;
+          |""".stripMargin
+      }
+
+      if (skipPartialAggregate) {
+        s"""
+           |if (!$avoidSpillInPartialAggregateTerm) {
+           |  $getAggBufferFromMap
+           |  // Can't allocate buffer from the hash map.
+           |  // Check if we can avoid partial aggregation.
+           |  //  Otherwise, Spill the map and fallback to sort-based
+           |  // aggregation after processing all input rows.
+           |  if ($unsafeRowBuffer == null && !$avoidSpillInPartialAggregateTerm) {

Review comment:
       Removed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-674921032


   AQE doesn't provide column stats, and column stats propagation can be incorrect if we have many operators.
   
   IIUC the current approach is: sample the first 100000 rows, if they can't reduce data by half (which means one key has 2 values by average), then we skip the partial aggregate.
   
   This sounds reasonable, but it's hard to tell how to pick the config values. @karuppayya do you have any experience of using it in practice?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646821885






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r441974540



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -2173,6 +2173,13 @@ object SQLConf {
       .checkValue(bit => bit >= 10 && bit <= 30, "The bit value must be in [10, 30].")
       .createWithDefault(16)
 
+  val SPILL_PARTIAL_AGGREGATE_DISABLED =
+    buildConf("spark.sql.aggregate.spill.partialaggregate.disabled")

Review comment:
       I have renamed the config. Can you please check




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646831853






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646831853


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-654508917


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125118/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-671397804


   Can we have a short description of the approach in the PR description? Seems like we are making hash aggregate adaptive to do pass-through if it can't reduce data size in the first n rows.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-655725403


   **[Test build #125394 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125394/testReport)** for PR 28804 at commit [`8850777`](https://github.com/apache/spark/commit/8850777f41dbf6d0a526a8409c426f156e7f7fa7).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-646893545


   **[Test build #124304 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124304/testReport)** for PR 28804 at commit [`56c95e2`](https://github.com/apache/spark/commit/56c95e242126d7aacdb4862adc5e094b4e29561b).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-648552423


   **[Test build #124450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124450/testReport)** for PR 28804 at commit [`99c1d22`](https://github.com/apache/spark/commit/99c1d2226d170f789dfa534ffc658f7fc430c38d).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r467472877



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -838,13 +880,17 @@ case class HashAggregateExec(
        |  long $beforeAgg = System.nanoTime();
        |  $doAggFuncName();
        |  $aggTime.add((System.nanoTime() - $beforeAgg) / $NANOS_PER_MILLIS);
+       |  if (shouldStop()) return;
        |}
+       |$genCodePostInitCode
        |// output the result
        |$outputFromFastHashMap
        |$outputFromRegularHashMap
      """.stripMargin
   }
 
+  override def needStopCheck: Boolean = skipPartialAggregateEnabled

Review comment:
       If `skipPartialAggregateEnabled` = true but #rows/cardinality don't go over the threshold, partial aggregates are not skipped. In that case, we need to set true to `needStopCheck`?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karuppayya commented on a change in pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation

Posted by GitBox <gi...@apache.org>.

karuppayya commented on a change in pull request #28804:
URL: https://github.com/apache/spark/pull/28804#discussion_r466749506



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
##########
@@ -750,18 +768,19 @@ case class HashAggregateExec(
       finishRegularHashMap
     }
 
+    outputFunc = generateResultFunction(ctx)
     val doAggFuncName = ctx.addNewFunction(doAgg,
       s"""
          |private void $doAgg() throws java.io.IOException {
          |  ${child.asInstanceOf[CodegenSupport].produce(ctx, this)}
+         |  $childrenConsumed = true;

Review comment:
       Before my change, All the records of the the dataset were processed and HashMap populated(+ spilt to sorter) at once.
   With my change, all the records are not processed at once due to the change at line#878
   Hence we need to know if all the records of the dataset is processed, otherwise process them.
   This variable serves the purpose.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org