Posted to dev@hive.apache.org by "Adam Kramer (JIRA)" <ji...@apache.org> on 2011/08/10 00:53:27 UTC
[jira] [Created] (HIVE-2363) Implicitly CLUSTER BY when dynamically partitioning
Implicitly CLUSTER BY when dynamically partitioning
---------------------------------------------------
Key: HIVE-2363
URL: https://issues.apache.org/jira/browse/HIVE-2363
Project: Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Adam Kramer
Priority: Critical
When a query creates partitions dynamically, the underlying implementation reads the output rows in order and writes them to one file for as long as the partition column values stay the same; when the partition column values change, it closes that file and opens a new one. If the rows for a partition are scattered through the output, this can generate far too many files.
The solution is to ensure that all rows for a given partition appear contiguously and on the same reducer, i.e., to cluster by the partitioning columns on the way out.
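
To illustrate the effect, here is a minimal Python simulation, not Hive code. It assumes a writer that opens a new file every time the partition key changes, and it models clustering as "hash each key to one reducer, then sort within the reducer". The names count_files and cluster_by and the sample keys are hypothetical.

```python
from itertools import groupby


def count_files(partition_keys):
    """Count the files opened by a writer that starts a new file
    whenever the partition key differs from the previous row's key."""
    # groupby yields one group per run of equal adjacent keys.
    return sum(1 for _ in groupby(partition_keys))


def cluster_by(partition_keys, num_reducers):
    """Model of clustering (an assumption for illustration): send every row
    with the same key to the same reducer, then sort each reducer's rows."""
    reducers = [[] for _ in range(num_reducers)]
    for key in partition_keys:
        reducers[hash(key) % num_reducers].append(key)
    return [sorted(rows) for rows in reducers]


# Hypothetical partition key values, interleaved in the output.
keys = ["2011-08-01", "2011-08-02", "2011-08-01",
        "2011-08-03", "2011-08-02", "2011-08-01"]

files_unclustered = count_files(keys)
files_clustered = sum(count_files(rows) for rows in cluster_by(keys, 2))

print(files_unclustered, files_clustered)
```

In this toy input no two adjacent rows share a key, so the unclustered writer opens one file per row, while the clustered version opens one file per distinct key regardless of the number of reducers.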
This improvement is to detect whether a query already clusters by the eventual partition columns and, if it does not, to add that clustering as a final step of the query. This could save a lot of space.
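
The detection step might look like the following Python sketch. It is not taken from the Hive query processor. The function name needs_implicit_cluster is hypothetical, and the rule it applies is a deliberately conservative assumption: clustering counts as already done only when the clustering columns are exactly the partition columns, in any order.

```python
def needs_implicit_cluster(cluster_cols, partition_cols):
    """Return True if an extra clustering step should be appended.

    Assumption for illustration: the query is treated as already clustered
    only when its clustering columns equal the dynamic partition columns as
    a set. Clustering on a subset or superset is treated as not sufficient.
    """
    if not partition_cols:
        # No dynamic partition columns, so nothing to cluster by.
        return False
    # Hive column names are case-insensitive, so compare in lower case.
    clustered = {c.lower() for c in cluster_cols}
    partitioned = {c.lower() for c in partition_cols}
    return clustered != partitioned


# Hypothetical column names.
print(needs_implicit_cluster([], ["ds", "country"]))
print(needs_implicit_cluster(["country", "DS"], ["ds", "country"]))
```

The first call reports that clustering is needed, since the query does none. The second reports that it is not, since the query already clusters by both partition columns.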
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira