You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Travis Crawford (JIRA)" <ji...@apache.org> on 2012/06/07 23:44:23 UTC
[jira] [Commented] (HIVE-2950) Hive should store the full table schema in partition storage descriptors

    [ https://issues.apache.org/jira/browse/HIVE-2950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291338#comment-13291338 ] 

Travis Crawford commented on HIVE-2950:
---------------------------------------

Updated the branch against trunk and reran the tests.

https://travis.ci.cloudbees.com/job/HIVE-2950_partition_full_schema/1/

There were some test failures in CI so I ran these locally and they passed:

{code}
ant clean package test -Dtestcase=TestCliDriver -Dqfile=create_view_partitioned.q
ant clean package test -Dtestcase=TestNegativeCliDriver -Dqfile=create_or_replace_view1.q
ant clean package test -Dtestcase=TestNegativeCliDriver -Dqfile=create_or_replace_view2.q
ant clean package test -Dtestcase=TestHiveServerSessions
{code}
                
> Hive should store the full table schema in partition storage descriptors
> ------------------------------------------------------------------------
>
>                 Key: HIVE-2950
>                 URL: https://issues.apache.org/jira/browse/HIVE-2950
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Travis Crawford
>            Assignee: Travis Crawford
>         Attachments: HIVE-2950.D2769.1.patch
>
>
> Hive tables have a schema, which is copied into the partition storage descriptor when adding a partition. Currently only columns stored in the table storage descriptor are copied - columns that are reported by the serde are not copied. Instead of copying the table storage descriptor columns into the partition columns, the full table schema should be copied.
> DETAILS
> This is a little long but is necessary to show 3 things: current behavior when explicitly listing columns, behavior with HIVE-2941 patched in and serde reported columns, and finally the behavior with this patch (full table schema copied into the partition storage descriptor).
> Here's an example of what currently happens. Note the following:
> * the two manually-defined fields defined for the table are listed in the table storage descriptor.
> * both fields are present in the partition storage descriptor
> This works great because users who query for a partition can look at its storage descriptor and get the schema.
> {code}
> hive> create external table foo_test (name string, age int) partitioned by (part_dt string);
> hive> describe extended foo_test;
> OK
> name	string	
> age	int	
> part_dt	string	
> 	 	 
> Detailed Table Information	Table(tableName:foo_test, dbName:travis_test, owner:travis, createTime:1334256062, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/warehouse/travis_test.db/foo_test, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256062}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)	
> Time taken: 0.082 seconds
> hive> alter table foo_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00';
> hive> describe extended foo_test partition (part_dt = '20120331T000000Z');
> OK
> name	string	
> age	int	
> part_dt	string	
> 	 	 
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test, tableName:foo_test, createTime:1334256131, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:age, type:int, comment:null), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334256131})	
> {code}
> CURRENT BEHAVIOR WITH HIVE-2941 PATCHED IN
> Now let's examine what happens when creating a table when the serde reports the schema. Notice the following:
> * The table storage descriptor contains an empty list of columns. However, the table schema is available from the serde reflecting on the serialization class.
> * The partition storage descriptor does contain a single "part_dt" column that was copied from the table partition keys. The actual data columns are not present.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") stored as inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.08 seconds
> hive> describe extended person_test;
> OK
> name	struct<first_name:string,last_name:string>	from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
> part_dt	string	
> 	 	 
> Detailed Table Information	Table(tableName:person_test, dbName:travis_test, owner:travis, createTime:1334256942, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://foo.com/warehouse/travis_test.db/person_test, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334256942}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)	
> Time taken: 0.147 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00'; 
> OK
> Time taken: 0.149 seconds
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');
> OK
> part_dt	string	
> 	 	 
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test, tableName:person_test, createTime:1334257029, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257029})	
> Time taken: 0.106 seconds
> hive> 
> {code}
> PROPOSED BEHAVIOR
> I believe the correct thing to do is copy the full table schema (serde-reported columns + partition keys) into the partition storage descriptor. Notice the following:
> * Table storage descriptor does not contain any columns, because they are reported by the serde.
> * Partition storage descriptor now contains both the serde-reported schema, and full table schema.
> {code}
> hive> create external table travis_test.person_test partitioned by (part_dt string) row format serde "com.twitter.elephantbird.hive.serde.ThriftSerDe" with serdeproperties ("serialization.class"="com.twitter.elephantbird.examples.thrift.Person") stored as inputformat "com.twitter.elephantbird.mapred.input.HiveMultiInputFormat" outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
> OK
> Time taken: 0.076 seconds
> hive> describe extended person_test;                                                                                                                                     OK                                                                                                                                                                       name    struct<first_name:string,last_name:string>      from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
> part_dt	string	
> 	 	 
> Detailed Table Information	Table(tableName:person_test, dbName:travis_test, owner:travis, createTime:1334257489, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[], location:hdfs://foo.com/warehouse/travis_test.db/person_test, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), partitionKeys:[FieldSchema(name:part_dt, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1334257489}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)	
> Time taken: 0.155 seconds
> hive> alter table person_test add partition (part_dt = '20120331T000000Z') location 'hdfs://foo.com/foo/2012/03/31/00';
> OK                                                                                                                                                                       Time taken: 0.296 seconds                                        
> hive> describe extended person_test partition (part_dt = '20120331T000000Z');                                                                                            OK                                                                                                                                                                       name    struct<first_name:string,last_name:string>      from deserializer
> id	int	from deserializer
> email	string	from deserializer
> phones	array<struct<number:string,type:struct<value:int>>>	from deserializer
> part_dt	string	
> 	 	 
> Detailed Partition Information	Partition(values:[20120331T000000Z], dbName:travis_test, tableName:person_test, createTime:1334257504, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:struct<first_name:string,last_name:string>, comment:from deserializer), FieldSchema(name:id, type:int, comment:from deserializer), FieldSchema(name:email, type:string, comment:from deserializer), FieldSchema(name:phones, type:array<struct<number:string,type:struct<value:int>>>, comment:from deserializer), FieldSchema(name:part_dt, type:string, comment:null)], location:hdfs://foo.com/foo/2012/03/31/00, inputFormat:com.twitter.elephantbird.mapred.input.HiveMultiInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:com.twitter.elephantbird.hive.serde.ThriftSerDe, parameters:{serialization.class=com.twitter.elephantbird.examples.thrift.Person, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, primaryRegionName:, secondaryRegions:[]), parameters:{transient_lastDdlTime=1334257504})	
> Time taken: 0.133 seconds
> hive> 
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira