Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/20 17:29:16 UTC

[GitHub] [spark] tools4origins opened a new pull request #30107: [WIP][SPARK-33196][SQL] Expose filtered aggregations in spark.sql.functions

tools4origins opened a new pull request #30107:
URL: https://github.com/apache/spark/pull/30107


   ### What changes were proposed in this pull request?
   Spark already supports filtered aggregations (explanation of their behavior [here](https://modern-sql.com/feature/filter)):
   
   ```scala
   scala> val df = spark.range(100)
   scala> df.createOrReplaceTempView("df")
   scala> spark.sql("select count(1) as classic_cnt, count(1) FILTER (WHERE id < 50) from df").show()
   +-----------+-------------------------------------------------+ 
   |classic_cnt|count(1) FILTER (WHERE (id < CAST(50 AS BIGINT)))|
   +-----------+-------------------------------------------------+
   |        100|                                               50|
   +-----------+-------------------------------------------------+
   ```
   
   But this syntax is not exposed when manipulating columns and functions.
   
   This PR adds the `filtered` function and allows the following syntax, with the same behaviour as above:
   ```scala
   df.select(filtered(count(lit(1)), where=df("id") < 50)).show()
   ```
   
   ### Why are the changes needed?
   These aggregations are especially useful when filtering on overlapping datasets (where a pivot would not work):
   
   ```sql
   SELECT 
    AVG(revenue) FILTER (WHERE age < 25),
    AVG(revenue) FILTER (WHERE age < 35),
    AVG(revenue) FILTER (WHERE age < 45)
   FROM people;
   ```
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes: a new function and a simpler way to define filtered aggregations (see above).
   
   ### How was this patch tested?
   A test was added.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #30107: [WIP][SPARK-33196][SQL] Expose filtered aggregations in spark.sql.functions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30107:
URL: https://github.com/apache/spark/pull/30107#issuecomment-713023160


   Can one of the admins verify this patch?




[GitHub] [spark] HyukjinKwon commented on pull request #30107: [WIP][SPARK-33196][SQL] Expose filtered aggregations in spark.sql.functions

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #30107:
URL: https://github.com/apache/spark/pull/30107#issuecomment-713225028


   I agree with @zero323. It was added to Spark SQL mainly because of the ANSI standard (see SPARK-27986); the DSL wouldn't necessarily have to have it.




[GitHub] [spark] tools4origins commented on pull request #30107: [WIP][SPARK-33196][SQL] Expose filtered aggregations in spark.sql.functions

Posted by GitBox <gi...@apache.org>.
tools4origins commented on pull request #30107:
URL: https://github.com/apache/spark/pull/30107#issuecomment-713022363


   It is still a WIP as I guess there will be feedback (it does change the API).
   
   It is at least missing the exposure of the function in R and Python, but I would like to confirm the implementation before going further.




[GitHub] [spark] tools4origins closed pull request #30107: [WIP][SPARK-33196][SQL] Expose filtered aggregations in spark.sql.functions

Posted by GitBox <gi...@apache.org>.
tools4origins closed pull request #30107:
URL: https://github.com/apache/spark/pull/30107


   




[GitHub] [spark] zero323 edited a comment on pull request #30107: [WIP][SPARK-33196][SQL] Expose filtered aggregations in spark.sql.functions

Posted by GitBox <gi...@apache.org>.
zero323 edited a comment on pull request #30107:
URL: https://github.com/apache/spark/pull/30107#issuecomment-713155178


   I have mixed feelings about this. At the moment we can easily use `when` to handle this use case:
   
   ```scala
   count(when(df("id") < 50, 1))
   ```
   
   It doesn't require extensions of the DSL, and it is in practice as concise and (arguably) as readable as the proposed addition.
   
   **Edit**
   
   It is also straightforward to use FILTER within `expr`:
   
   ```scala
   expr("count(1) FILTER (WHERE id < 50)")
   ```




[GitHub] [spark] AmplabJenkins removed a comment on pull request #30107: [WIP][SPARK-33196][SQL] Expose filtered aggregations in spark.sql.functions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #30107:
URL: https://github.com/apache/spark/pull/30107#issuecomment-713022471


   Can one of the admins verify this patch?




[GitHub] [spark] tools4origins commented on pull request #30107: [WIP][SPARK-33196][SQL] Expose filtered aggregations in spark.sql.functions

Posted by GitBox <gi...@apache.org>.
tools4origins commented on pull request #30107:
URL: https://github.com/apache/spark/pull/30107#issuecomment-713454654


   Thank you @zero323 and all for your feedback. I agree with you too; I did not have your solution in mind. The impact on the DSL is indeed high, as it introduces a new API pattern (a function that only applies to aggregations).
   
   For completeness, in case someone looks at this issue in the future, here is how to handle filtered aggregations with this approach:
   
   - `count(1)`: `count(when(df("id") < 50, 1))`
   - `count(*)`: `count(when(df("id") < 50, 1))`  (as `when` does not support `*`)
   - `count(id)`: `count(when(df("id") < 50, df("id")))`
   - Other aggregations, e.g. `avg(id)`: `avg(when(df("id") < 50, df("id")))`
   
   I was wondering whether the same approach would work with distinct aggregations. I think in that case `expr()` is needed, but it does the job: `expr("stddev(distinct colName)")`.
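   
   As a side note, for distinct counts specifically the `when` trick still works: `when` without an `otherwise` yields null for non-matching rows, and distinct aggregates ignore nulls. A sketch, assuming the same `df` with an `id` column as above:
   
   ```scala
   import org.apache.spark.sql.functions._
   
   // Filtered distinct count: rows failing the predicate become null
   // and are therefore excluded from the distinct aggregate.
   df.select(countDistinct(when(df("id") < 50, df("id")))).show()
   
   // Distinct aggregates without a DSL variant still need expr:
   df.select(expr("stddev(DISTINCT id)")).show()
   ```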




[GitHub] [spark] AmplabJenkins commented on pull request #30107: [WIP][SPARK-33196][SQL] Expose filtered aggregations in spark.sql.functions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #30107:
URL: https://github.com/apache/spark/pull/30107#issuecomment-713022471


   Can one of the admins verify this patch?




[GitHub] [spark] maropu commented on pull request #30107: [WIP][SPARK-33196][SQL] Expose filtered aggregations in spark.sql.functions

Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #30107:
URL: https://github.com/apache/spark/pull/30107#issuecomment-713401723


   +1 with @zero323, too.




[GitHub] [spark] zero323 commented on pull request #30107: [WIP][SPARK-33196][SQL] Expose filtered aggregations in spark.sql.functions

Posted by GitBox <gi...@apache.org>.
zero323 commented on pull request #30107:
URL: https://github.com/apache/spark/pull/30107#issuecomment-713155178


   I have mixed feelings about this. At the moment we can easily use `when` to handle this use case:
   
   ```scala
   count(when(df("id") < 50, 1))
   ```
   
   It doesn't require extensions of the DSL, and it is in practice as concise and (arguably) as readable as the proposed addition.

