You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Erwang Guyomarc'h (Jira)" <ji...@apache.org> on 2020/10/20 17:15:00 UTC

[jira] [Created] (SPARK-33196) Expose filtered aggregation API

Erwang Guyomarc'h created SPARK-33196:
-----------------------------------------

             Summary: Expose filtered aggregation API
                 Key: SPARK-33196
                 URL: https://issues.apache.org/jira/browse/SPARK-33196
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Erwang Guyomarc'h


Spark currently supports filtered aggregation but does not expose API allowing to use them when using the `spark.sql.functions` package.

It is possible to use them when writing directly SQL:
{code:scala}
scala> val df = spark.range(100)
scala> df.registerTempTable("df")
scala> spark.sql("select count(1) as classic_cnt, count(1) FILTER (WHERE id < 50) from df").show()
+-----------+-------------------------------------------------+ 
|classic_cnt|count(1) FILTER (WHERE (id < CAST(50 AS BIGINT)))|
+-----------+-------------------------------------------------+
|        100|                                               50|
+-----------+-------------------------------------------------+{code}
These aggregations are especially useful when filtering on overlapping datasets (where a pivot would not work):
{code:sql}
SELECT 
 AVG(revenue) FILTER (WHERE age < 25),
 AVG(revenue) FILTER (WHERE age < 35),
 AVG(revenue) FILTER (WHERE age < 45)
FROM people;{code}

I did not find an issue tracking this, hence I am creating this one and I will join a PR to illustrate a possible implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org