Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2019/09/17 00:37:47 UTC

[GitHub] [incubator-hudi] umehrot2 commented on issue #770: remove com.databricks:spark-avro to build spark avro schema by itself

umehrot2 commented on issue #770: remove com.databricks:spark-avro to build spark avro schema by itself
URL: https://github.com/apache/incubator-hudi/pull/770#issuecomment-532010337
 
 
   I pulled in this PR and ran tests with `Decimal` types. It seems these changes are not sufficient to support `Decimal` types.
   
   The tables in Hive end up being created with `Binary` type for `Decimal` type columns, making them un-queryable.
   
   ```
   hive> describe my_table;
   _hoodie_commit_time 	string              	                    
   _hoodie_commit_seqno	string              	                    
   _hoodie_record_key  	string              	                    
   _hoodie_partition_path	string              	                    
   _hoodie_file_name   	string 
   ....
   wholesale_cost   	binary  // Should have been Decimal(7,2)   	                    
   list_price       	binary  // Should have been Decimal(7,2)	                    
   sales_price      	binary  // Should have been Decimal(7,2) 	                    
   discount_amt 	binary  // Should have been Decimal(7,2)
   ```
   
   After digging further into this issue, I was able to narrow it down to this line, where the Parquet footer is read to get the schema, which is written as `parquet.avro.schema`:
   
   https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L437
   
   What happens here is that the schema conversion from `parquet.avro.schema` to Parquet's schema, i.e. `MessageType`, loses the context of Avro's `Decimal` `LogicalType`.
   
   The following blob for `Decimal` in `parquet.avro.schema`:
   
   ```
   {
       "name" : "wholesale_cost",
       "type" : [ {
         "type" : "fixed",
         "name" : "wholesale_cost",
         "size" : 4,
         "logicalType" : "decimal",
         "precision" : 7,
         "scale" : 2
       }, "null" ]
     }
   ```
   It ends up as follows upon conversion to `MessageType`:
   
   ```
   {
       "name" : "wholesale_cost",
       "type" : [ "null", {
         "type" : "fixed",
         "name" : "wholesale_cost",
         "namespace" : "",
         "size" : 4
       } ],
       "default" : null
     }
   ```
   Thus any context of this field being `Decimal` is lost. When this Parquet schema is later converted to a Hive schema to generate the DDL for creating the table, the field is treated as a `Fixed Length Byte Array`.
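
To make the loss concrete, here is a minimal sketch (not part of this PR) that compares the two field definitions quoted above and reports which attributes of the non-null branch disappear. The helper names `non_null_branch` and `dropped_attributes` are hypothetical and exist only for this illustration.

```python
import json

# Field definition as stored in `parquet.avro.schema` (copied from the comment).
BEFORE = json.loads("""
{
  "name": "wholesale_cost",
  "type": [
    {"type": "fixed", "name": "wholesale_cost", "size": 4,
     "logicalType": "decimal", "precision": 7, "scale": 2},
    "null"
  ]
}
""")

# The same field after the conversion (copied from the comment).
AFTER = json.loads("""
{
  "name": "wholesale_cost",
  "type": [
    "null",
    {"type": "fixed", "name": "wholesale_cost", "namespace": "", "size": 4}
  ],
  "default": null
}
""")


def non_null_branch(field):
    """Return the single non-null branch of a nullable union type."""
    branches = [b for b in field["type"] if b != "null"]
    assert len(branches) == 1, "expected a two-branch nullable union"
    return branches[0]


def dropped_attributes(before, after):
    """Attributes present on the original type but missing after conversion."""
    return set(non_null_branch(before)) - set(non_null_branch(after))


print(sorted(dropped_attributes(BEFORE, AFTER)))
# prints ['logicalType', 'precision', 'scale']
```

The three dropped attributes are exactly the ones that identify the field as a decimal; only the physical `fixed` type and its size remain.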
   
   The following line, which checks whether `OriginalType` is `Decimal`, has no effect, because `OriginalType` ends up as `Null` for `Decimal` fields:
   https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/util/SchemaUtil.java#L179
   
   Ultimately, the field is converted to `Binary` because it is treated as a `Fixed Length Byte Array`:
   https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/util/SchemaUtil.java#L218
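
The effect of the missing annotation can be modelled with a simplified sketch. This is an assumption-laden illustration of the decision described above, not the actual `SchemaUtil` code: the function `hive_type`, its parameters, and the exact spelling of the returned type strings are all hypothetical.

```python
from typing import Optional


def hive_type(primitive: str,
              original_type: Optional[str] = None,
              precision: Optional[int] = None,
              scale: Optional[int] = None) -> str:
    """Hypothetical, simplified mapping from a Parquet field to a Hive type."""
    # The decimal branch is only reachable when the annotation survived.
    if original_type == "DECIMAL":
        return f"DECIMAL({precision},{scale})"
    # Without an annotation, only the physical type is left to go on.
    if primitive in ("BINARY", "FIXED_LEN_BYTE_ARRAY"):
        return "binary"
    raise ValueError(f"unhandled primitive type in this sketch: {primitive}")


# What currently happens: the annotation is gone, so the check never fires.
print(hive_type("FIXED_LEN_BYTE_ARRAY"))
# What should happen for wholesale_cost if the annotation were preserved.
print(hive_type("FIXED_LEN_BYTE_ARRAY", "DECIMAL", 7, 2))
```

In this model the fix has to happen upstream: as long as the conversion drops the decimal annotation, no change to the Hive type mapping alone can recover `Decimal(7,2)`.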
   
   Create Table command generated by Hudi:
   ```
   19/09/16 23:53:35 INFO HoodieHiveClient: Creating table with CREATE EXTERNAL TABLE  IF NOT EXISTS xxxx( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string,  ... `wholesale_cost` binary, `list_price` binary, `sales_price` binary, `discount_amt` binary...) PARTITIONED BY (sold_date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'xxxxx'
   ```
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services