You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2015/02/13 17:20:30 UTC

[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/4592

    [SPARK-5799][SQL] Compute aggregation function on specified numeric columns

    Compute aggregation function on specified numeric columns. For example:
    
        val df = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d")).toDataFrame("key", "value1", "value2", "rest")
        df.groupBy("key").min("value2")
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 specific_cols_agg

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4592.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4592
    
----
commit 371a3f722c2c4c92275c30bfe060b353c719a216
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2015-02-13T16:15:33Z

    Compute aggregation function on specified numeric columns.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74418479
  
    @rxin @davies these issues are addressed in new commits. Please take a look.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24719554
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -714,30 +714,45 @@ def count(self):
             [Row(age=2, count=1), Row(age=5, count=1)]
             """
     
    -    @dfapi
    -    def mean(self):
    +    def mean(self, *cols):
    --- End diff --
    
    Indeed, will do it later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24720621
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -664,6 +664,18 @@ def _api(self):
         return _api
     
     
    +def df_varargs_api(f, *args):
    +    def _api(self):
    --- End diff --
    
    okay, fix it later. but it still works with current version?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74493864
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27549/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74480130
  
      [Test build #27549 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27549/consoleFull) for   PR 4592 at commit [`9446896`](https://github.com/apache/spark/commit/94468966d5595bf97b545c0a127a787b23566fd0).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74410977
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27511/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24748159
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala ---
    @@ -39,13 +39,14 @@ class GroupedData protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expressio
           df.sqlContext, Aggregate(groupingExprs, namedGroupingExprs ++ aggExprs, df.logicalPlan))
       }
     
    -  private[this] def aggregateNumericColumns(f: Expression => Expression): Seq[NamedExpression] = {
    -    df.numericColumns.map { c =>
    -      val a = f(c)
    -      Alias(a, a.toString)()
    -    }
    +  private[this] def aggregateNumericColumns(colNames: String*)
    --- End diff --
    
    That's good but we are exposing implementation details of df to GroupedData...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24716974
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -714,30 +714,45 @@ def count(self):
             [Row(age=2, count=1), Row(age=5, count=1)]
             """
     
    -    @dfapi
    -    def mean(self):
    +    def mean(self, *cols):
    --- End diff --
    
    @davies take a look at the python changes here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74365752
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27479/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24766338
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala ---
    @@ -88,12 +88,12 @@ private[sql] class DataFrameImpl protected[sql](
         }
       }
     
    -  protected[sql] def numericColumns: Seq[Expression] = {
    +  protected[sql] def numericColumns(): Seq[Expression] = {
    --- End diff --
    
    why adding the parentheses here? it's ok i will merge it and fix it.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74415013
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27519/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74412125
  
      [Test build #27519 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27519/consoleFull) for   PR 4592 at commit [`353fad7`](https://github.com/apache/spark/commit/353fad714409fb96a99455825fd6b44c56de43ee).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74404099
  
      [Test build #27504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27504/consoleFull) for   PR 4592 at commit [`b079e6b`](https://github.com/apache/spark/commit/b079e6bfd0efcc947f23283e66b5692dffcdd5bd).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24713647
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala ---
    @@ -93,7 +93,14 @@ private[sql] class DataFrameImpl protected[sql](
           queryExecution.analyzed.resolve(n.name, sqlContext.analyzer.resolver).get
         }
       }
    -
    + 
    +  protected[sql] def numericColumns(colNames: String*): Seq[Expression] = {
    --- End diff --
    
    Do you mean not to filter columns in  numericColumns but in aggregateNumericColumns? Is it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74387286
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27491/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74365698
  
    Thanks for working on this. I would like to actually put this into 1.3, because it makes more sense than the current API. If you can make the changes I suggested quickly, that would be great. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24713028
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala ---
    @@ -46,6 +46,15 @@ class GroupedData protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expressio
         }
       }
     
    +  @scala.annotation.varargs
    +  private[this] def aggregateNumericColumns(colName: String, colNames: String*)
    --- End diff --
    
    can you just remove the other aggregateNumericColumns and leave this one?
    
    Also for this one there is no need to use the annotation because it is not a public API.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24731788
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala ---
    @@ -39,13 +39,14 @@ class GroupedData protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expressio
           df.sqlContext, Aggregate(groupingExprs, namedGroupingExprs ++ aggExprs, df.logicalPlan))
       }
     
    -  private[this] def aggregateNumericColumns(f: Expression => Expression): Seq[NamedExpression] = {
    -    df.numericColumns.map { c =>
    -      val a = f(c)
    -      Alias(a, a.toString)()
    -    }
    +  private[this] def aggregateNumericColumns(colNames: String*)
    --- End diff --
    
    also the error message can be very confusing when you just say "Cannot resolve column name" even though the column obviously exist.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24713032
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -154,6 +154,18 @@ class DataFrameSuite extends QueryTest {
           testData2.agg(sum('b)),
           Row(9)
         )
    +
    +    val df1 = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d"))
    +      .toDataFrame("key", "value1", "value2", "rest")
    +
    +    checkAnswer(
    +      df1.groupBy("key").min(),
    +      df1.groupBy("key").min("value1", "value2").collect
    --- End diff --
    
    add () to collect


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74408838
  
      [Test build #27511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27511/consoleFull) for   PR 4592 at commit [`54ed0c4`](https://github.com/apache/spark/commit/54ed0c43b1850c349a59a702af325b4dc5946971).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24713033
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -154,6 +154,18 @@ class DataFrameSuite extends QueryTest {
           testData2.agg(sum('b)),
           Row(9)
         )
    +
    +    val df1 = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d"))
    +      .toDataFrame("key", "value1", "value2", "rest")
    +
    +    checkAnswer(
    +      df1.groupBy("key").min(),
    +      df1.groupBy("key").min("value1", "value2").collect
    +    )
    +    checkAnswer(
    +      df1.groupBy("key").min("value2"),
    +      Seq(Row("a",0), Row("b",4))
    --- End diff --
    
    add space after comma


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24713616
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala ---
    @@ -93,7 +93,14 @@ private[sql] class DataFrameImpl protected[sql](
           queryExecution.analyzed.resolve(n.name, sqlContext.analyzer.resolver).get
         }
       }
    -
    + 
    +  protected[sql] def numericColumns(colNames: String*): Seq[Expression] = {
    --- End diff --
    
    I was suggesting removing this, and just resolve all the columns the user specified. I don't think we should silently drop a column.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74415009
  
      [Test build #27519 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27519/consoleFull) for   PR 4592 at commit [`353fad7`](https://github.com/apache/spark/commit/353fad714409fb96a99455825fd6b44c56de43ee).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74280213
  
      [Test build #27442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27442/consoleFull) for   PR 4592 at commit [`371a3f7`](https://github.com/apache/spark/commit/371a3f722c2c4c92275c30bfe060b353c719a216).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74380692
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27488/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74373239
  
      [Test build #27480 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27480/consoleFull) for   PR 4592 at commit [`b1a24fc`](https://github.com/apache/spark/commit/b1a24fc0275b895376e3de0af1cd7b3e2c72048b).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74380690
  
      [Test build #27488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27488/consoleFull) for   PR 4592 at commit [`4c63a01`](https://github.com/apache/spark/commit/4c63a01695ca68eeb003b642d8553c5916868012).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74406090
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27504/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74380639
  
      [Test build #27488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27488/consoleFull) for   PR 4592 at commit [`4c63a01`](https://github.com/apache/spark/commit/4c63a01695ca68eeb003b642d8553c5916868012).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24720234
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -664,6 +664,18 @@ def _api(self):
         return _api
     
     
    +def df_varargs_api(f, *args):
    +    def _api(self):
    --- End diff --
    
    This is wrong, it should be:
    ```
    def df_varargs_api(f):
          def _api(self, *cols):
                pass
    ....
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74406088
  
      [Test build #27504 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27504/consoleFull) for   PR 4592 at commit [`b079e6b`](https://github.com/apache/spark/commit/b079e6bfd0efcc947f23283e66b5692dffcdd5bd).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74292424
  
      [Test build #27442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27442/consoleFull) for   PR 4592 at commit [`371a3f7`](https://github.com/apache/spark/commit/371a3f722c2c4c92275c30bfe060b353c719a216).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74493853
  
      [Test build #27549 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27549/consoleFull) for   PR 4592 at commit [`9446896`](https://github.com/apache/spark/commit/94468966d5595bf97b545c0a127a787b23566fd0).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `public class JavaStatefulNetworkWordCount `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74383474
  
      [Test build #27491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27491/consoleFull) for   PR 4592 at commit [`880c2ac`](https://github.com/apache/spark/commit/880c2acdc6beb9a73515371e4356bc37329b3693).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24720203
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala ---
    @@ -88,12 +88,24 @@ private[sql] class DataFrameImpl protected[sql](
         }
       }
     
    -  protected[sql] def numericColumns: Seq[Expression] = {
    -    schema.fields.filter(_.dataType.isInstanceOf[NumericType]).map { n =>
    -      queryExecution.analyzed.resolve(n.name, sqlContext.analyzer.resolver).get
    +  protected[sql] def numericColumns(colNames: String*): Seq[Expression] = {
    +    val allNumbericCols = schema.fields.filter(_.dataType.isInstanceOf[NumericType]).map(_.name)
    +    val diff = colNames.diff(allNumbericCols)
    +    if (diff.nonEmpty) {
    --- End diff --
    
    i think an easier way is to just traverse colNames, and throw an exception if it is a not a numeric column. Really no need to be so complicated to get the diff ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74387281
  
      [Test build #27491 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27491/consoleFull) for   PR 4592 at commit [`880c2ac`](https://github.com/apache/spark/commit/880c2acdc6beb9a73515371e4356bc37329b3693).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24703110
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala ---
    @@ -149,28 +157,70 @@ class GroupedData protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expressio
        * The resulting [[DataFrame]] will also contain the grouping columns.
        */
       def mean(): DataFrame = aggregateNumericColumns(Average)
    + 
    +  /**
    +   * Compute the average value for given numeric columns for each group. This is an alias for `avg`.
    +   * The resulting [[DataFrame]] will also contain the grouping columns.
    +   */
    +  def mean(colName: String, colNames: String*): DataFrame = {
    +    aggregateNumericColumns(colName, colNames:_*)(Average)
    +  }
     
       /**
        * Compute the max value for each numeric columns for each group.
        * The resulting [[DataFrame]] will also contain the grouping columns.
        */
       def max(): DataFrame = aggregateNumericColumns(Max)
    + 
    +  /**
    +   * Compute the max value for given numeric columns for each group.
    +   * The resulting [[DataFrame]] will also contain the grouping columns.
    +   */
    +  def max(colName: String, colNames: String*): DataFrame = {
    --- End diff --
    
    seems to me you can combine the two max functions, and then run aggregate all on columns if the vararg is empty. 
    
    also make sure you annotate the varargs


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24713599
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala ---
    @@ -93,7 +93,14 @@ private[sql] class DataFrameImpl protected[sql](
           queryExecution.analyzed.resolve(n.name, sqlContext.analyzer.resolver).get
         }
       }
    -
    + 
    +  protected[sql] def numericColumns(colNames: String*): Seq[Expression] = {
    --- End diff --
    
    It is used to obtain these numeric columns we want if they are in the df. So we can just aggregate on them instead of all numeric columns.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24731797
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala ---
    @@ -39,13 +39,14 @@ class GroupedData protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expressio
           df.sqlContext, Aggregate(groupingExprs, namedGroupingExprs ++ aggExprs, df.logicalPlan))
       }
     
    -  private[this] def aggregateNumericColumns(f: Expression => Expression): Seq[NamedExpression] = {
    -    df.numericColumns.map { c =>
    -      val a = f(c)
    -      Alias(a, a.toString)()
    -    }
    +  private[this] def aggregateNumericColumns(colNames: String*)
    --- End diff --
    
    btw i just wrote the above in an editor. haven't tried compiling it it so it probably wouldn't compile.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24713040
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -154,6 +154,18 @@ class DataFrameSuite extends QueryTest {
           testData2.agg(sum('b)),
           Row(9)
         )
    +
    +    val df1 = Seq(("a", 1, 0, "b"), ("b", 2, 4, "c"), ("a", 2, 3, "d"))
    +      .toDataFrame("key", "value1", "value2", "rest")
    --- End diff --
    
    this is now toDF


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24713035
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala ---
    @@ -93,7 +93,14 @@ private[sql] class DataFrameImpl protected[sql](
           queryExecution.analyzed.resolve(n.name, sqlContext.analyzer.resolver).get
         }
       }
    -
    + 
    +  protected[sql] def numericColumns(colNames: String*): Seq[Expression] = {
    --- End diff --
    
    i don't think you'd want this. if an user explicitly specifies a column, we should just throw an error rather than silently dropping the column from numeric aggregation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24719364
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -714,30 +714,45 @@ def count(self):
             [Row(age=2, count=1), Row(age=5, count=1)]
             """
     
    -    @dfapi
    -    def mean(self):
    +    def mean(self, *cols):
    --- End diff --
    
    There are lots of duplicated codes here, you could update the `dfapi` to take any number of cols as arguments, then have a different implementation for `count`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4592


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74373241
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27480/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74364348
  
      [Test build #27479 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27479/consoleFull) for   PR 4592 at commit [`27069c3`](https://github.com/apache/spark/commit/27069c39b84ae6312efd80b7a2965c45a17348da).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74425215
  
    LGTM, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24713729
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala ---
    @@ -93,7 +93,14 @@ private[sql] class DataFrameImpl protected[sql](
           queryExecution.analyzed.resolve(n.name, sqlContext.analyzer.resolver).get
         }
       }
    -
    + 
    +  protected[sql] def numericColumns(colNames: String*): Seq[Expression] = {
    --- End diff --
    
    okay, i think i know what you meant.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74365750
  
      [Test build #27479 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27479/consoleFull) for   PR 4592 at commit [`27069c3`](https://github.com/apache/spark/commit/27069c39b84ae6312efd80b7a2965c45a17348da).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74410972
  
      [Test build #27511 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27511/consoleFull) for   PR 4592 at commit [`54ed0c4`](https://github.com/apache/spark/commit/54ed0c43b1850c349a59a702af325b4dc5946971).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24720617
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameImpl.scala ---
    @@ -88,12 +88,24 @@ private[sql] class DataFrameImpl protected[sql](
         }
       }
     
    -  protected[sql] def numericColumns: Seq[Expression] = {
    -    schema.fields.filter(_.dataType.isInstanceOf[NumericType]).map { n =>
    -      queryExecution.analyzed.resolve(n.name, sqlContext.analyzer.resolver).get
    +  protected[sql] def numericColumns(colNames: String*): Seq[Expression] = {
    +    val allNumbericCols = schema.fields.filter(_.dataType.isInstanceOf[NumericType]).map(_.name)
    +    val diff = colNames.diff(allNumbericCols)
    +    if (diff.nonEmpty) {
    --- End diff --
    
    One reason to filter all wrong given columns out is we can reports them at once.  I will fix it for the concern.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24720235
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -664,6 +664,18 @@ def _api(self):
         return _api
     
     
    +def df_varargs_api(f, *args):
    +    def _api(self):
    +        jargs = ListConverter().convert(list(args),
    --- End diff --
    
    `list()` is not necessary


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74292435
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27442/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24720200
  
    --- Diff: python/pyspark/sql/dataframe.py ---
    @@ -714,28 +726,28 @@ def count(self):
             [Row(age=2, count=1), Row(age=5, count=1)]
             """
     
    -    @dfapi
    -    def mean(self):
    +    @df_varargs_api
    --- End diff --
    
    can u add some tests for python? thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4592#discussion_r24731783
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala ---
    @@ -39,13 +39,14 @@ class GroupedData protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expressio
           df.sqlContext, Aggregate(groupingExprs, namedGroupingExprs ++ aggExprs, df.logicalPlan))
       }
     
    -  private[this] def aggregateNumericColumns(f: Expression => Expression): Seq[NamedExpression] = {
    -    df.numericColumns.map { c =>
    -      val a = f(c)
    -      Alias(a, a.toString)()
    -    }
    +  private[this] def aggregateNumericColumns(colNames: String*)
    --- End diff --
    
    can you change this to
    
    ```scala
      private[this] def aggregateNumericColumns(colNames: String*)(f: Expression => Expression)
        : Seq[NamedExpression] = {
    
        val columnExprs = if (colNames.isEmpty) {
          // No columns specified. Use all numeric columns.
          df.numericColumns
        } else {
          // Make sure all specified columns are numeric
          colNames.map { colName =>
            val namedExpr = df.resolve(colName)
            if (!namedExpr.dataType.isInstanceOf[NumericType]) {
              throw new AnalysisException(
                s""""$colName" is not a numeric column. """ +
                "Aggregation function can only be performed on a numeric column.")
            }
            namedExpr
          }
        }
        columnExprs.map { c =>
          val a = f(c)
          Alias(a, a.toString)()
        }
      }
    ```
    
    and just remove the old one? There is actually a bug in the old one, because it doesn't use a resolver to compare column names. On top of that, it is not uncommon tables have thousands of columns. You have a operation that is actually O(n^2)...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5799][SQL] Compute aggregation function...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4592#issuecomment-74371781
  
      [Test build #27480 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27480/consoleFull) for   PR 4592 at commit [`b1a24fc`](https://github.com/apache/spark/commit/b1a24fc0275b895376e3de0af1cd7b3e2c72048b).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org