Posted to issues@spark.apache.org by "Feng Zhu (JIRA)" <ji...@apache.org> on 2017/09/09 12:33:00 UTC
[jira] [Created] (SPARK-21964) Enable splitting the Aggregate (on
Expand) into a number of Aggregates for grouping analytics
Feng Zhu created SPARK-21964:
--------------------------------
Summary: Enable splitting the Aggregate (on Expand) into a number of Aggregates for grouping analytics
Key: SPARK-21964
URL: https://issues.apache.org/jira/browse/SPARK-21964
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.2.0, 2.1.1, 2.1.0, 2.0.2, 2.0.0
Reporter: Feng Zhu
In current versions, Spark SQL implements grouping analytics clauses (i.e., CUBE, ROLLUP, and GROUPING SETS) as a single Aggregate operator on top of a single Expand operator. With this implementation, the table is read only once.
However, in many scenarios (e.g., a high-dimensional cube), the Expand operator becomes too "heavy": it emits a large number of projections, resulting in vast shuffle writes.
In our production environment, we have encountered many such cases, leading to poor performance or even OOM errors in direct buffer memory.
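As a minimal illustration (the table {{t}} and columns {{a}}, {{b}}, {{x}} are hypothetical), a cube over two columns is equivalent to four grouping sets, so the Expand operator replicates every input row four times, once per projection:

{code:sql}
-- A two-column cube...
SELECT a, b, SUM(x) FROM t GROUP BY a, b WITH CUBE;

-- ...is equivalent to four grouping sets, i.e. 2^2 Expand projections:
SELECT a, b, SUM(x) FROM t
GROUP BY a, b GROUPING SETS ((a, b), (a), (b), ());
{code}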
We demonstrate the issue with the following real-world query of a 6-dimensional cube.
{code:sql}
SELECT CASE WHEN grouping(iFrom) = 1 THEN -1 ELSE iFrom END AS iFrom,
       CASE WHEN grouping(iSrcId) = 1 THEN -1 ELSE iSrcId END AS iSrcId,
       CASE WHEN grouping(sgametype) = 1 THEN '-1' ELSE sgametype END AS sgametype,
       CASE WHEN grouping(iOperId) = 1 THEN -1 ELSE iOperId END AS iOperId,
       CASE WHEN grouping(igameid) = 1 THEN -1 ELSE igameid END AS igameid,
       CASE WHEN grouping(iacttypeid) = 1 THEN -1 ELSE iacttypeid END AS iacttypeid,
       SUM(iclickcnt) AS iclickcnt
FROM p_day_advert
WHERE statedate = 20170810
GROUP BY iFrom, iSrcId, sgametype, iOperId, igameid, iacttypeid WITH CUBE
{code}
For this query, the Expand operator generates 64 (= 2^6) projections. Although the query reads only about 3GB of data, it produces about 250GB of shuffle write. In our environment, the first stage takes about 2 hours, and the second stage easily hits an OOM error unless we enlarge some configurations.
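The projection count can be checked with quick arithmetic: CUBE over n columns produces one projection per subset of the grouping columns, i.e. 2^n in total, so every input row is written to shuffle 2^n times. A small sketch (not Spark code, just the enumeration):

```python
from itertools import combinations

def cube_grouping_sets(cols):
    """Enumerate the grouping sets produced by CUBE over `cols`:
    every subset of the grouping columns, from the full set down
    to the empty (grand total) set."""
    return [subset
            for r in range(len(cols), -1, -1)
            for subset in combinations(cols, r)]

dims = ["iFrom", "iSrcId", "sgametype", "iOperId", "igameid", "iacttypeid"]
sets = cube_grouping_sets(dims)
print(len(sets))  # 64 projections, matching 2^6
```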
Therefore, we propose to provide an alternative that splits the heavyweight Aggregate into a number of lightweight Aggregates, one per grouping set. In effect, it implements grouping analytics as a Union and executes the aggregates one by one. Although this reads the data multiple times, it can still achieve much better overall performance.
With this implementation, the query above completes in about 20 minutes, with each aggregate taking 1~4 minutes.
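A sketch of the proposed rewrite, again for the hypothetical two-column cube above: each grouping set becomes an ordinary aggregate (with the absent grouping columns replaced by NULL), and the results are combined with UNION ALL:

{code:sql}
SELECT a, b, SUM(x) AS x FROM t GROUP BY a, b
UNION ALL
SELECT a, NULL AS b, SUM(x) AS x FROM t GROUP BY a
UNION ALL
SELECT NULL AS a, b, SUM(x) AS x FROM t GROUP BY b
UNION ALL
SELECT NULL AS a, NULL AS b, SUM(x) AS x FROM t;
{code}

Each branch is a plain GROUP BY aggregate with no Expand, so its shuffle carries only the input rows once, at the cost of re-reading the table per branch.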
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)