Posted to issues@spark.apache.org by "lichenglin (JIRA)" <ji...@apache.org> on 2016/03/18 07:29:33 UTC

[jira] [Created] (SPARK-13999) Run 'group by' before building cube

lichenglin created SPARK-13999:
----------------------------------

             Summary: Run 'group by'  before building cube
                 Key: SPARK-13999
                 URL: https://issues.apache.org/jira/browse/SPARK-13999
             Project: Spark
          Issue Type: Improvement
            Reporter: lichenglin


When I tried to build a cube with 7 dimensions on a data set of about 1 billion rows,
the job took a whole day to finish with 16 cores.

Then I ran 'select count(1) from table group by A,B,C,D,E,F,G' first
and built the cube from the 'group by' result set,
using the same dimensions as the 'group by' and summing the 'count' column.
That took only 45 minutes.

The group by reduces the data set from billions of rows to millions;
the exact reduction depends on the cardinality of the dimensions.
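The equivalence behind this trick can be sketched in plain Python (not Spark code; the row data and three-dimension setup below are made up for illustration): building the cube from pre-aggregated (row, count) pairs and summing the counts yields exactly the same cube as building it from the raw rows.

```python
from itertools import combinations
from collections import Counter

# Toy rows with three dimensions (A, B, C); the real data set has
# billions of rows and seven dimensions.
rows = [("a1", "b1", "c1"), ("a1", "b1", "c1"),
        ("a1", "b2", "c1"), ("a2", "b1", "c2")]
dims = range(3)

def cube_counts(records):
    """Count rows for every subset of dimensions. Each cube cell marks
    rolled-up dimensions with None; each record carries a weight."""
    out = Counter()
    for rec, weight in records:
        for r in range(len(dims) + 1):
            for kept in combinations(dims, r):
                key = tuple(rec[d] if d in kept else None for d in dims)
                out[key] += weight
    return out

# Naive: build the cube straight from the raw rows (weight 1 each).
naive = cube_counts([(r, 1) for r in rows])

# Optimized: 'group by' first (here a Counter), then build the cube over
# the much smaller pre-aggregated set, summing the pre-computed counts.
pre = Counter(rows)                       # group by A, B, C -> count
optimized = cube_counts(list(pre.items()))

assert naive == optimized
```

The cube still enumerates all 2^7 groupings, but over millions of pre-aggregated rows instead of billions of raw ones, which is where the speedup comes from.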

We could try this optimization in a new version.

Averaging is more complex: the group by should keep both the sum and the count, so the cube can compute sum/count afterwards.
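A minimal Python sketch of why the sum and count must both survive the group by (the values are made up): averaging the per-group averages weights every group equally and gives the wrong answer, while carrying (sum, count) and dividing once at the end recovers the true average.

```python
# Values for one cube cell, arriving in two pre-aggregated groups.
group1 = [10.0, 20.0]   # group by produced sum=30.0, count=2
group2 = [40.0]         # group by produced sum=40.0, count=1

# Wrong: averaging the per-group averages weights the groups equally.
avg_of_avgs = (sum(group1) / len(group1)
               + sum(group2) / len(group2)) / 2          # 27.5

# Right: carry (sum, count) through the group by, divide once at the end.
total_sum = sum(group1) + sum(group2)                    # 70.0
total_count = len(group1) + len(group2)                  # 3
true_avg = total_sum / total_count                       # 23.33...

assert abs(true_avg - sum(group1 + group2) / 3) < 1e-9
```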

  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org