Posted to dev@hive.apache.org by "E. Sammer (JIRA)" <ji...@apache.org> on 2010/02/11 18:22:30 UTC

[jira] Commented: (HIVE-50) Tag columns as partitioning columns

    [ https://issues.apache.org/jira/browse/HIVE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832576#action_12832576 ] 

E. Sammer commented on HIVE-50:
-------------------------------

Rather than having two places to define columns, it seems like it would be nicer to specify all columns once and then reference them in the PARTITIONED BY clause.

Ex:

CREATE TABLE events ( year int, month int, day int, event_type int, user_id int ) PARTITIONED BY ( year, month, day );

One side effect of this is that the column output order is defined solely by the column definition. As of 0.4.1, partition columns always come after "normal" columns, which is annoying when you want the layout of query output to match the source files the query ran against, without needing external knowledge of the ordering. In other words, partition columns should only be special to the query parser / optimizer and the directory structure. Today, partitioning imposes a requirement on field ordering in files, which violates the notion that there is no Hive file format and means needlessly reformatting files just to partition them. Partition columns should be able to appear anywhere in the file layout.
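
To make the ordering point concrete, here is a minimal sketch contrasting the existing syntax with the proposed one. The table name events_current is hypothetical and used only for illustration. The second statement uses the syntax proposed in this comment, which Hive does not accept today, so the output order described for it is the intended behavior rather than an observed one.

```sql
-- Existing syntax: partition columns are declared, with their types,
-- only in PARTITIONED BY and are not repeated in the main column list.
CREATE TABLE events_current (
  event_type INT,
  user_id    INT
)
PARTITIONED BY (year INT, month INT, day INT);

-- Partition columns come after the "normal" columns in the output:
--   event_type, user_id, year, month, day
SELECT * FROM events_current;

-- Proposed syntax (hypothetical): every column is declared once, and
-- PARTITIONED BY only names the columns to partition on.
CREATE TABLE events (
  year       INT,
  month      INT,
  day        INT,
  event_type INT,
  user_id    INT
)
PARTITIONED BY (year, month, day);

-- Under the proposal the output would follow the declaration order:
--   year, month, day, event_type, user_id
SELECT * FROM events;
```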

> Tag columns as partitioning columns
> -----------------------------------
>
>                 Key: HIVE-50
>                 URL: https://issues.apache.org/jira/browse/HIVE-50
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Venky Iyer
>
>     CREATE TABLE tname (INT cname1, INT pcol PARTITIONING )
>     COMMENT 'This is a table' 
>     PARTITIONED BY(dt STRING) 
>     STORED AS SEQUENCEFILE; 
> The goal here is to annotate a column as being a "partitioning" column. Consider pcol in the above example. It is annotated with 'PARTITIONING', which implies that the create table
> has 
> PARTITIONED BY (dt, pcol)
> and every write to this table has implicitly
> INSERT OVERWRITE tname PARTITION (pcol='X')
> WHERE output.pcol = 'X'
> for every distinct value X that pcol takes.
> This is ideally an addition on top of the explicit partitioning that is already in the syntax, so that if I said
> INSERT OVERWRITE tname PARTITION (dt='D')
> it would still go into the partition (dt='D', pcol='Y') when the value of pcol is Y.
> It would be up to the user to make sure the cardinality of these columns is reasonable, and that enough data goes into each partition that there is some net benefit (just as it is in the explicit case).
