Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2017/07/30 13:32:00 UTC
[jira] [Resolved] (SPARK-21577) Issue is handling too many aggregations
[ https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-21577.
----------------------------------
Resolution: Invalid
I don't think this JIRA describes a particular issue or suggestion; it reads more like a question or a request for investigation. Let's start this on the mailing list. I am resolving this.
> Issue is handling too many aggregations
> ----------------------------------------
>
> Key: SPARK-21577
> URL: https://issues.apache.org/jira/browse/SPARK-21577
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Environment: Cloudera CDH 1.8.3
> Spark 1.6.0
> Reporter: Kannan Subramanian
>
> My requirement is to read a table from Hive (size: around 1.6 TB) and perform more than 200 aggregation operations, mostly avg, sum, and stddev. The Spark application's total execution time is more than 12 hours. To optimize the code I tried shuffle partitioning, memory tuning, and so on, but it was not helpful. Please note that I ran the same query in Hive on MapReduce, and the MR job completed in only around 5 hours. Kindly let me know if there is any way to optimize this, or a more efficient way of handling multiple aggregation operations.
>
> val inputDataDF = hiveContext.read.parquet("/inputparquetData")
> inputDataDF.groupBy("seq_no", "year", "month", "radius")
>   .agg(count($"Dseq"), avg($"Emp"), avg($"Ntw"), avg($"Age"),
>        avg($"DAll"), avg($"PAll"), avg($"DSum"), avg($"dol"),
>        sum($"sl"), sum($"PA"), sum($"DS"), ... like 200 columns)
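For readers following the archive: the query above computes many aggregates per group in a single groupBy/agg pass, which is generally the right shape (one shuffle, all aggregates at once). The pattern can be sketched in plain Scala without Spark; the column names (seq_no, Emp, sl) are taken from the issue, everything else here is a hypothetical illustration, not the reporter's actual code.

```scala
// Single-pass multi-aggregation sketch (no Spark): group rows by a key and
// compute several aggregates per group in one traversal of each group.
case class Rec(seqNo: String, emp: Double, sl: Double)

def aggregate(rows: Seq[Rec]): Map[String, (Long, Double, Double)] =
  rows.groupBy(_.seqNo).map { case (key, group) =>
    // all aggregates for this group come from the same grouped data,
    // mirroring how Spark's agg(...) evaluates many expressions per key
    val count  = group.size.toLong
    val avgEmp = group.map(_.emp).sum / count   // avg($"Emp")
    val sumSl  = group.map(_.sl).sum            // sum($"sl")
    (key, (count, avgEmp, sumSl))
  }
```

For example, aggregate(Seq(Rec("a", 1.0, 2.0), Rec("a", 3.0, 4.0), Rec("b", 5.0, 6.0))) yields, for key "a", a count of 2, an average emp of 2.0, and a sl sum of 6.0.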
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org