Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2018/06/26 07:39:00 UTC
[jira] [Commented] (SPARK-24650) GroupingSet
[ https://issues.apache.org/jira/browse/SPARK-24650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523351#comment-16523351 ]
Hyukjin Kwon commented on SPARK-24650:
--------------------------------------
Please avoid setting the priority to Blocker, which is usually reserved for committers.
> GroupingSet
> -----------
>
> Key: SPARK-24650
> URL: https://issues.apache.org/jira/browse/SPARK-24650
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Environment: CDH 5.X, Spark 2.3
> Reporter: Mihir Sahu
> Priority: Major
> Labels: Grouping, Sets
>
> If a grouping set is used in Spark SQL, the plan does not perform optimally.
> If the input to a grouping set is x rows and the grouping set has y groups, then the number of rows processed is currently x*y.
> Example: let a DataFrame have columns col1, col2, col3 and col4, and let the number of rows be rowNo.
> Let the grouping sets be: (1) col1, col2, col3 (2) col2, col4 (3) col1, col2
> The volume of data processed in this case is 3 * (rowNo * size of each row).
> However, is this the optimal way of processing the data?
> If the y groups are derivable from each other, can we reduce the volume processed by removing columns as we progress to the lower dimensions of processing?
> Currently, when computing percentiles, a lot of data seems to be processed, causing performance issues.
> We need to look at whether this can be optimised.
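The row multiplication described in the report can be illustrated with a small sketch in plain Python (not Spark code). It assumes each input row is copied once per grouping set, with the columns outside that set nulled out. The helper names `expand` and `roll_up_counts` and the sample rows are hypothetical and chosen only for illustration. The second helper shows the "derivable" idea for a simple count, where the (col1, col2) result is obtained from the already aggregated (col1, col2, col3) result instead of from the raw rows.

```python
from collections import Counter

COLUMNS = ("col1", "col2", "col3", "col4")
# The three grouping sets from the report.
GROUPING_SETS = [("col1", "col2", "col3"), ("col2", "col4"), ("col1", "col2")]


def expand(rows, grouping_sets):
    """Yield one copy of each row per grouping set.

    Columns outside the grouping set are replaced by None, and the index
    of the grouping set is appended so copies stay distinguishable.
    """
    for row in rows:
        for gid, gset in enumerate(grouping_sets):
            yield tuple(row[c] if c in gset else None for c in COLUMNS) + (gid,)


def count_by(rows, columns):
    """Count rows per distinct combination of the given columns."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def roll_up_counts(fine_counts, fine_set, coarse_set):
    """Derive counts for coarse_set from counts keyed by fine_set.

    Assumes every column in coarse_set also appears in fine_set.
    """
    positions = [fine_set.index(c) for c in coarse_set]
    coarse = Counter()
    for key, n in fine_counts.items():
        coarse[tuple(key[i] for i in positions)] += n
    return coarse


# Hypothetical sample data: x = 4 input rows.
rows = [
    {"col1": "a", "col2": "x", "col3": 1, "col4": "p"},
    {"col1": "a", "col2": "x", "col3": 2, "col4": "q"},
    {"col1": "b", "col2": "y", "col3": 1, "col4": "p"},
    {"col1": "a", "col2": "x", "col3": 1, "col4": "p"},
]

expanded = list(expand(rows, GROUPING_SETS))
print(len(rows), "input rows ->", len(expanded), "expanded rows")

fine = count_by(rows, GROUPING_SETS[0])  # keyed by (col1, col2, col3)
derived = roll_up_counts(fine, GROUPING_SETS[0], GROUPING_SETS[2])
direct = count_by(rows, GROUPING_SETS[2])  # keyed by (col1, col2)
print(derived == direct)
```

Note that this roll-up only works for aggregates that can be combined from partial results, such as counts and sums. An exact percentile cannot be derived this way from per-group percentiles, so the percentile case in the report would need a different approach.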
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)