Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/07/08 09:02:51 UTC

[GitHub] spark pull request #18570: [SPARK-21100][SQL][Followup] cleanup code and add...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/18570

    [SPARK-21100][SQL][Followup] cleanup code and add more comments for Dataset.summary

    ## What changes were proposed in this pull request?
    
    Clean up the code and add comments to make it more readable. Also change how the result rows are generated so that the logic is clearer.
    
    ## How was this patch tested?
    
    existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark summary

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18570.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18570
    
----
commit 3f2b41bbcde29c1016bcab401de515cf67b8f246
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-07-08T08:59:50Z

    cleanup code and add more comments

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    LGTM




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    **[Test build #79385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79385/testReport)** for PR 18570 at commit [`d57edd4`](https://github.com/apache/spark/commit/d57edd444dcab2ee4f148a2f5ebe4830fa970e26).




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    cc @gatorsmile 




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    Thanks! Merging to master.




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    cc @aray




[GitHub] spark pull request #18570: [SPARK-21100][SQL][Followup] cleanup code and add...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18570#discussion_r126320362
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala ---
    @@ -228,90 +229,71 @@ object StatFunctions extends Logging {
         val defaultStatistics = Seq("count", "mean", "stddev", "min", "25%", "50%", "75%", "max")
         val selectedStatistics = if (statistics.nonEmpty) statistics else defaultStatistics
     
    -    val hasPercentiles = selectedStatistics.exists(_.endsWith("%"))
    -    val (percentiles, percentileNames, remainingAggregates) = if (hasPercentiles) {
    -      val (pStrings, rest) = selectedStatistics.partition(a => a.endsWith("%"))
    -      val percentiles = pStrings.map { p =>
    -        try {
    -          p.stripSuffix("%").toDouble / 100.0
    -        } catch {
    -          case e: NumberFormatException =>
    -            throw new IllegalArgumentException(s"Unable to parse $p as a percentile", e)
    -        }
    +    val percentiles = selectedStatistics.filter(a => a.endsWith("%")).map { p =>
    +      try {
    +        p.stripSuffix("%").toDouble / 100.0
    +      } catch {
    +        case e: NumberFormatException =>
    +          throw new IllegalArgumentException(s"Unable to parse $p as a percentile", e)
           }
    -      require(percentiles.forall(p => p >= 0 && p <= 1), "Percentiles must be in the range [0, 1]")
    -      (percentiles, pStrings, rest)
    -    } else {
    -      (Seq(), Seq(), selectedStatistics)
    -    }
    -
    -
    -    // The list of summary statistics to compute, in the form of expressions.
    -    val availableStatistics = Map[String, Expression => Expression](
    -      "count" -> ((child: Expression) => Count(child).toAggregateExpression()),
    -      "mean" -> ((child: Expression) => Average(child).toAggregateExpression()),
    -      "stddev" -> ((child: Expression) => StddevSamp(child).toAggregateExpression()),
    -      "min" -> ((child: Expression) => Min(child).toAggregateExpression()),
    -      "max" -> ((child: Expression) => Max(child).toAggregateExpression()))
    -
    -    val statisticFns = remainingAggregates.map { agg =>
    -      require(availableStatistics.contains(agg), s"$agg is not a recognised statistic")
    -      agg -> availableStatistics(agg)
         }
    +    require(percentiles.forall(p => p >= 0 && p <= 1), "Percentiles must be in the range [0, 1]")
     
    -    def percentileAgg(child: Expression): Expression =
    -      new ApproximatePercentile(child, CreateArray(percentiles.map(Literal(_))))
    -        .toAggregateExpression()
    -
    -    val outputCols = ds.aggregatableColumns.map(usePrettyExpression(_).sql).toList
    -
    -    val ret: Seq[Row] = if (outputCols.nonEmpty) {
    -      var aggExprs = statisticFns.toList.flatMap { case (_, colToAgg) =>
    -        outputCols.map(c => Column(Cast(colToAgg(Column(c).expr), StringType)).as(c))
    -      }
    -      if (hasPercentiles) {
    -        aggExprs = outputCols.map(c => Column(percentileAgg(Column(c).expr)).as(c)) ++ aggExprs
    +    var percentileIndex = 0
    +    val statisticFns = selectedStatistics.map { stats =>
    --- End diff --
    
    How about making it case insensitive, since our function resolution is always case insensitive?
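
One way to read the suggestion above is to normalize each requested statistic name before it is matched. The following is a minimal sketch in plain Scala, not the code that was merged: the object `CaseInsensitiveStats`, the `supported` set, and the `normalize` helper are hypothetical names introduced for illustration, and only the statistic names and the error message are taken from the diff.

```scala
import java.util.Locale

object CaseInsensitiveStats {
  // Non-percentile statistic names that appear in the diff above.
  val supported: Set[String] = Set("count", "mean", "stddev", "min", "max")

  // Hypothetical helper: lower-case the requested name so that "Mean" and
  // "MEAN" both resolve to "mean". Percentile names such as "25%" contain
  // no letters, so they pass through unchanged.
  def normalize(name: String): String = {
    val lowered = name.toLowerCase(Locale.ROOT)
    require(
      lowered.endsWith("%") || supported.contains(lowered),
      s"$name is not a recognised statistic")
    lowered
  }
}
```

Using `Locale.ROOT` keeps the lower-casing independent of the JVM's default locale, which matters for names containing letters such as `I`.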




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    **[Test build #79380 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79380/testReport)** for PR 18570 at commit [`3f2b41b`](https://github.com/apache/spark/commit/3f2b41bbcde29c1016bcab401de515cf67b8f246).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    **[Test build #79385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79385/testReport)** for PR 18570 at commit [`d57edd4`](https://github.com/apache/spark/commit/d57edd444dcab2ee4f148a2f5ebe4830fa970e26).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #18570: [SPARK-21100][SQL][Followup] cleanup code and add...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18570#discussion_r126280881
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala ---
    @@ -228,90 +229,71 @@ object StatFunctions extends Logging {
         val defaultStatistics = Seq("count", "mean", "stddev", "min", "25%", "50%", "75%", "max")
         val selectedStatistics = if (statistics.nonEmpty) statistics else defaultStatistics
     
    -    val hasPercentiles = selectedStatistics.exists(_.endsWith("%"))
    -    val (percentiles, percentileNames, remainingAggregates) = if (hasPercentiles) {
    -      val (pStrings, rest) = selectedStatistics.partition(a => a.endsWith("%"))
    -      val percentiles = pStrings.map { p =>
    -        try {
    -          p.stripSuffix("%").toDouble / 100.0
    -        } catch {
    -          case e: NumberFormatException =>
    -            throw new IllegalArgumentException(s"Unable to parse $p as a percentile", e)
    -        }
    +    val percentiles = selectedStatistics.filter(a => a.endsWith("%")).map { p =>
    +      try {
    +        p.stripSuffix("%").toDouble / 100.0
    +      } catch {
    +        case e: NumberFormatException =>
    +          throw new IllegalArgumentException(s"Unable to parse $p as a percentile", e)
           }
    -      require(percentiles.forall(p => p >= 0 && p <= 1), "Percentiles must be in the range [0, 1]")
    -      (percentiles, pStrings, rest)
    -    } else {
    -      (Seq(), Seq(), selectedStatistics)
    -    }
    -
    -
    -    // The list of summary statistics to compute, in the form of expressions.
    -    val availableStatistics = Map[String, Expression => Expression](
    -      "count" -> ((child: Expression) => Count(child).toAggregateExpression()),
    -      "mean" -> ((child: Expression) => Average(child).toAggregateExpression()),
    -      "stddev" -> ((child: Expression) => StddevSamp(child).toAggregateExpression()),
    -      "min" -> ((child: Expression) => Min(child).toAggregateExpression()),
    -      "max" -> ((child: Expression) => Max(child).toAggregateExpression()))
    -
    -    val statisticFns = remainingAggregates.map { agg =>
    -      require(availableStatistics.contains(agg), s"$agg is not a recognised statistic")
    -      agg -> availableStatistics(agg)
         }
    +    require(percentiles.forall(p => p >= 0 && p <= 1), "Percentiles must be in the range [0, 1]")
     
    -    def percentileAgg(child: Expression): Expression =
    -      new ApproximatePercentile(child, CreateArray(percentiles.map(Literal(_))))
    -        .toAggregateExpression()
    -
    -    val outputCols = ds.aggregatableColumns.map(usePrettyExpression(_).sql).toList
    -
    -    val ret: Seq[Row] = if (outputCols.nonEmpty) {
    -      var aggExprs = statisticFns.toList.flatMap { case (_, colToAgg) =>
    -        outputCols.map(c => Column(Cast(colToAgg(Column(c).expr), StringType)).as(c))
    -      }
    -      if (hasPercentiles) {
    -        aggExprs = outputCols.map(c => Column(percentileAgg(Column(c).expr)).as(c)) ++ aggExprs
    +    var percentileIndex = 0
    +    val statisticFns = selectedStatistics.map { stats =>
    +      if (stats.endsWith("%")) {
    +        val index = percentileIndex
    +        percentileIndex += 1
    +        (child: Expression) =>
    +          GetArrayItem(
    +            new ApproximatePercentile(child, Literal.create(percentiles)).toAggregateExpression(),
    --- End diff --
    
    The aggregate operator in Spark SQL only executes duplicated aggregate expressions once, so it's ok to have duplicated `ApproximatePercentile` in aggregate expressions.
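
In the diff above, every percentile statistic builds an `ApproximatePercentile` over the same `percentiles` array and then picks a single element with `GetArrayItem`. The index bookkeeping behind that can be sketched in plain Scala without any Spark dependency. This is an illustration under assumptions, not the merged code: `PercentileIndexing`, `parsePercentile`, and `percentileSlots` are hypothetical names, and the range check is applied per value here rather than once over the whole list.

```scala
object PercentileIndexing {
  // Parse a name such as "25%" into the fraction 0.25, using the same
  // error messages as the diff.
  def parsePercentile(p: String): Double = {
    val value =
      try {
        p.stripSuffix("%").toDouble / 100.0
      } catch {
        case e: NumberFormatException =>
          throw new IllegalArgumentException(s"Unable to parse $p as a percentile", e)
      }
    require(value >= 0 && value <= 1, "Percentiles must be in the range [0, 1]")
    value
  }

  // Pair every requested statistic with the position its value would occupy
  // in the shared percentile array, or None for non-percentile statistics.
  def percentileSlots(stats: Seq[String]): Seq[(String, Option[Int])] = {
    var next = 0
    stats.map { s =>
      if (s.endsWith("%")) {
        val index = next
        next += 1
        (s, Some(index))
      } else {
        (s, None)
      }
    }
  }
}
```

For the request `Seq("count", "25%", "max", "75%")`, `percentileSlots` assigns index 0 to `"25%"` and index 1 to `"75%"`, so the output rows can follow the order in which the statistics were requested.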




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    **[Test build #79433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79433/testReport)** for PR 18570 at commit [`d82475f`](https://github.com/apache/spark/commit/d82475f755530c7c4caea03435d4ebc682945ea2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    **[Test build #79433 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79433/testReport)** for PR 18570 at commit [`d82475f`](https://github.com/apache/spark/commit/d82475f755530c7c4caea03435d4ebc682945ea2).




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79380/
    Test FAILed.




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79433/
    Test PASSed.




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79385/
    Test PASSed.




[GitHub] spark issue #18570: [SPARK-21100][SQL][Followup] cleanup code and add more c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/18570
  
    **[Test build #79380 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79380/testReport)** for PR 18570 at commit [`3f2b41b`](https://github.com/apache/spark/commit/3f2b41bbcde29c1016bcab401de515cf67b8f246).




[GitHub] spark pull request #18570: [SPARK-21100][SQL][Followup] cleanup code and add...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/18570

