Posted to dev@hive.apache.org by "Adam Kramer (JIRA)" <ji...@apache.org> on 2011/08/10 00:53:27 UTC

[jira] [Created] (HIVE-2363) Implicitly CLUSTER BY when dynamically partitioning

Implicitly CLUSTER BY when dynamically partitioning
---------------------------------------------------

                 Key: HIVE-2363
                 URL: https://issues.apache.org/jira/browse/HIVE-2363
             Project: Hive
          Issue Type: Improvement
          Components: Query Processor
            Reporter: Adam Kramer
            Priority: Critical


When a query creates partitions dynamically, the underlying implementation scans the output data and writes rows to a single file for as long as the partition column values stay the same; when a partition column value changes, it closes that file and opens a new one. If the rows belonging to one partition are not contiguous, this can generate far too many files.
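To illustrate the effect, here is a minimal Python sketch that models the behavior described above. It is not Hive code: the function name count_output_files and the sample partition values are hypothetical, and the model assumes only what the description states, namely that a new file is opened every time the partition value differs from the previous row's value.

```python
def count_output_files(partition_values):
    """Count files opened by a writer that starts a new file whenever
    the partition value differs from the previous row's value."""
    files = 0
    previous = object()  # sentinel that equals no real partition value
    for value in partition_values:
        if value != previous:
            files += 1
            previous = value
    return files


# Hypothetical output of one reducer: partition column values in arrival order.
interleaved = ["2011-08-01", "2011-08-02", "2011-08-01", "2011-08-02"]
contiguous = sorted(interleaved)

print(count_output_files(interleaved))  # 4 files for 2 partitions
print(count_output_files(contiguous))   # 2 files for 2 partitions
```

In this model the same four rows produce twice as many files when the two partition values alternate as when they arrive grouped together.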

The solution is to ensure that all rows for a given partition value appear contiguously and on the same reducer, i.e., to cluster by the partitioning columns on the way out.
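The following Python sketch models why clustering helps. It is an illustration under stated assumptions, not Hive's implementation: cluster_by and files_written are hypothetical helpers, the round-robin split stands in for an arbitrary unclustered distribution of rows, and clustering is modeled the way CLUSTER BY behaves in HiveQL (distribute rows by key so each key reaches one reducer, then sort within each reducer).

```python
from collections import defaultdict
from itertools import groupby


def files_written(rows_per_reducer):
    """Total files when every reducer opens a new file on each change
    of partition value (one file per run of equal values)."""
    return sum(sum(1 for _ in groupby(rows)) for rows in rows_per_reducer)


def cluster_by(rows, num_reducers):
    """Send each partition value to exactly one reducer, then sort
    within the reducer so equal values are contiguous."""
    reducers = defaultdict(list)
    for value in rows:
        reducers[hash(value) % num_reducers].append(value)
    return [sorted(bucket) for bucket in reducers.values()]


# Hypothetical partition column values in the order the query emits them.
rows = ["us", "uk", "us", "de", "uk", "de", "us", "uk"]

# Unclustered: rows are dealt to two reducers round-robin.
unclustered = [rows[0::2], rows[1::2]]
clustered = cluster_by(rows, 2)

print(files_written(unclustered))
print(files_written(clustered))
```

With clustering, the file count in this model always equals the number of distinct partition values, because each value lands on one reducer and forms a single contiguous run there.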

This improvement is to detect whether a query already clusters by the eventual partition columns and, if it does not, to add that clustering as an additional step at the end of the query. Reducing the number of files this way could save a lot of space.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira