You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Ryan Blue <bl...@apache.org> on 2014/12/01 21:13:57 UTC

Re: Review Request 28372: HIVE-8950: Add support in ParquetHiveSerde to create table schema from a parquet file

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28372/#review63424
-----------------------------------------------------------



common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<https://reviews.apache.org/r/28372/#comment105680>

    What does this change do?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ColInfoFromParquetFile.java
<https://reviews.apache.org/r/28372/#comment105672>

    I don't think ColInfoFromParquetFile is a great name for this class. What you've implemented here is a schema converter, like HiveSchemaConverter but from MessageType to TypeInfo rather than the other way around. I think this should either go in HiveSchemaConverter or a new class, like ParquetToHiveSchemaConverter.
    
    This update would also help clarify the methods exposed. I think that the primary method this should expose is:
      StructType convert(GroupType parquetSchema)
      
    File reading should be done elsewhere if it is necessary.
    
    Also, I think this should return a StructType instead of a Pair with Strings. That would be a lot cleaner and would avoid the custom type string building immediately followed by parsing those type strings.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java
<https://reviews.apache.org/r/28372/#comment105675>

    Could you please undo the import moves? I like to avoid non-functional changes.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java
<https://reviews.apache.org/r/28372/#comment105676>

    Is it possible to avoid setting a file property? There are lots of cases where a file in the dataset would be removed so this is a brittle method of configuring the table. Ideally, we would check for the LIST_COLUMN_TYPES property and if we don't find it, conver the schema of the first file that we find.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java
<https://reviews.apache.org/r/28372/#comment105678>

    This non-functional change is fine with me because it fixes the style in a method you're editing. But if you add a newline after this conditional, then you should also add one after line 149.


- Ryan Blue


On Nov. 26, 2014, 5:08 p.m., Ashish Singh wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28372/
> -----------------------------------------------------------
> 
> (Updated Nov. 26, 2014, 5:08 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8950
>     https://issues.apache.org/jira/browse/HIVE-8950
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-8950: Add support in ParquetHiveSerde to create table schema from a parquet file
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java fafd78e63e9b41c9fdb0e017b567dc719d151784 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ColInfoFromParquetFile.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 4effe736fcf9d3715f03eed9885c299a7aa040dd 
>   ql/src/test/queries/clientpositive/parquet_array_of_multi_field_struct_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_optional_elements_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_required_elements_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_single_field_struct_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_structs_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_unannotated_groups_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_unannotated_primitives_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_avro_array_of_primitives_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_avro_array_of_single_field_struct_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_decimal_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_thrift_array_of_primitives_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_thrift_array_of_single_field_struct_gen_schema.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_multi_field_struct_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_optional_elements_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_required_elements_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_single_field_struct_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_structs_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_unannotated_groups_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_unannotated_primitives_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_avro_array_of_primitives_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_avro_array_of_single_field_struct_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_decimal_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_thrift_array_of_primitives_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_thrift_array_of_single_field_struct_gen_schema.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/28372/diff/
> 
> 
> Testing
> -------
> 
> Tested by adding appropriate qTests.
> 
> 
> Thanks,
> 
> Ashish Singh
> 
>


Re: Review Request 28372: HIVE-8950: Add support in ParquetHiveSerde to create table schema from a parquet file

Posted by Ashish Singh <as...@cloudera.com>.

> On Dec. 1, 2014, 8:13 p.m., Ryan Blue wrote:
> >

Ryan, thanks for the review. Made appropriate changes based on your comments. Kindly take a look.


- Ashish


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28372/#review63424
-----------------------------------------------------------


On Dec. 7, 2014, 1:40 a.m., Ashish Singh wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28372/
> -----------------------------------------------------------
> 
> (Updated Dec. 7, 2014, 1:40 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8950
>     https://issues.apache.org/jira/browse/HIVE-8950
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-8950: Add support in ParquetHiveSerde to create table schema from a parquet file
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java fafd78e63e9b41c9fdb0e017b567dc719d151784 
>   metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java 2db2658fbc57fba01c892c9213baef6c498e659b 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ParquetSchemaReader.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ParquetToHiveSchemaConverter.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 4effe736fcf9d3715f03eed9885c299a7aa040dd 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java cd3d349e8bd8785d6cadaf9ed8fa7598f223774a 
>   ql/src/test/queries/clientpositive/parquet_array_of_multi_field_struct_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_optional_elements_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_required_elements_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_single_field_struct_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_structs_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_structs_gen_schema_ext.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_unannotated_groups_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_unannotated_primitives_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_avro_array_of_primitives_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_avro_array_of_single_field_struct_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_decimal_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_thrift_array_of_primitives_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_thrift_array_of_single_field_struct_gen_schema.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_multi_field_struct_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_optional_elements_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_required_elements_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_single_field_struct_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_structs_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_structs_gen_schema_ext.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_unannotated_groups_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_unannotated_primitives_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_avro_array_of_primitives_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_avro_array_of_single_field_struct_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_decimal_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_thrift_array_of_primitives_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_thrift_array_of_single_field_struct_gen_schema.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/28372/diff/
> 
> 
> Testing
> -------
> 
> Tested by adding appropriate qTests.
> 
> 
> Thanks,
> 
> Ashish Singh
> 
>


Re: Review Request 28372: HIVE-8950: Add support in ParquetHiveSerde to create table schema from a parquet file

Posted by Ashish Singh <as...@cloudera.com>.

> On Dec. 1, 2014, 8:13 p.m., Ryan Blue wrote:
> > common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 762
> > <https://reviews.apache.org/r/28372/diff/2/?file=777120#file777120line762>
> >
> >     What does this change do?

Serdes in this list must have one or more columns in the create table statement. This change removes this check for ParquetHiveSerDe.


> On Dec. 1, 2014, 8:13 p.m., Ryan Blue wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ColInfoFromParquetFile.java, line 36
> > <https://reviews.apache.org/r/28372/diff/2/?file=777121#file777121line36>
> >
> >     I don't think ColInfoFromParquetFile is a great name for this class. What you've implemented here is a schema converter, like HiveSchemaConverter but from MessageType to TypeInfo rather than the other way around. I think this should either go in HiveSchemaConverter or a new class, like ParquetToHiveSchemaConverter.
> >     
> >     This update would also help clarify the methods exposed. I think that the primary method this should expose is:
> >       StructType convert(GroupType parquetSchema)
> >       
> >     File reading should be done elsewhere if it is necessary.
> >     
> >     Also, I think this should return a StructType instead of a Pair with Strings. That would be a lot cleaner and would avoid the custom type string building immediately followed by parsing those type strings.

Made appropriate changes.


> On Dec. 1, 2014, 8:13 p.m., Ryan Blue wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java, line 107
> > <https://reviews.apache.org/r/28372/diff/2/?file=777122#file777122line107>
> >
> >     Is it possible to avoid setting a file property? There are lots of cases where a file in the dataset would be removed so this is a brittle method of configuring the table. Ideally, we would check for the LIST_COLUMN_TYPES property and if we don't find it, conver the schema of the first file that we find.

Below are the different scenarios possible and how they will be handled.

1. Managed table with parquet.file specified:
1.a. While creating the table ParquetHiveSerDe will use that file to create hive schema. 
1.b. If at a later point of time, say while select data from table, file specified by parquet.file is deleted, then ParquetHiveSerDe will look for a file in table dir and will try to create hive schema from that file's parquet metadata.
1.c. If parquet.file is deleted and table dir is empty, there is no way hive schema can be created. So, an informative exception will be thrown, "Either provide schema for table or point to parquet file using parquet.file in tblproperties or make sure that table has atleast one parquet file with required metadata". Note that if a table reaches this state, it can be fixed either by altering table and pointing parquet.file to a valid file or by copying a parquet file to table dir. This is valid for all cases.

2. Managed table without parquet.file specified:
2.a. While creating the table ParquetHiveSerDe will use any file in table dir to create hive schema. If table dir is empty, it will throw an informative exception.

3. External table with parquet.file specified: Same as (1).
4. External table without parquet.file specified and empty table dir: Same as (1.c)
5. External table without parquet.file specified and a parquet file present in table dir: Same as (2)


- Ashish


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28372/#review63424
-----------------------------------------------------------


On Nov. 27, 2014, 1:08 a.m., Ashish Singh wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28372/
> -----------------------------------------------------------
> 
> (Updated Nov. 27, 2014, 1:08 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8950
>     https://issues.apache.org/jira/browse/HIVE-8950
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-8950: Add support in ParquetHiveSerde to create table schema from a parquet file
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java fafd78e63e9b41c9fdb0e017b567dc719d151784 
>   metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java 2db2658fbc57fba01c892c9213baef6c498e659b 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ParquetSchemaReader.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ParquetToHiveSchemaConverter.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 4effe736fcf9d3715f03eed9885c299a7aa040dd 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java cd3d349e8bd8785d6cadaf9ed8fa7598f223774a 
>   ql/src/test/queries/clientpositive/parquet_array_of_multi_field_struct_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_optional_elements_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_required_elements_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_single_field_struct_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_structs_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_structs_gen_schema_ext.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_unannotated_groups_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_unannotated_primitives_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_avro_array_of_primitives_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_avro_array_of_single_field_struct_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_decimal_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_thrift_array_of_primitives_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_thrift_array_of_single_field_struct_gen_schema.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_multi_field_struct_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_optional_elements_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_required_elements_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_single_field_struct_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_structs_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_structs_gen_schema_ext.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_unannotated_groups_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_unannotated_primitives_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_avro_array_of_primitives_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_avro_array_of_single_field_struct_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_decimal_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_thrift_array_of_primitives_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_thrift_array_of_single_field_struct_gen_schema.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/28372/diff/
> 
> 
> Testing
> -------
> 
> Tested by adding appropriate qTests.
> 
> 
> Thanks,
> 
> Ashish Singh
> 
>