Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2019/09/17 00:37:47 UTC
[GitHub] [incubator-hudi] umehrot2 commented on issue #770: remove com.databricks:spark-avro to build spark avro schema by itself
URL: https://github.com/apache/incubator-hudi/pull/770#issuecomment-532010337
I pulled in this PR and ran tests with `Decimal` types. It seems these changes are not sufficient to support `Decimal` types: the tables end up being created in Hive with `binary` type for `Decimal` columns, making them unqueryable.
```
hive> describe my_table;
_hoodie_commit_time string
_hoodie_commit_seqno string
_hoodie_record_key string
_hoodie_partition_path string
_hoodie_file_name string
....
wholesale_cost binary // Should have been Decimal(7,2)
list_price binary // Should have been Decimal(7,2)
sales_price binary // Should have been Decimal(7,2)
discount_amt binary // Should have been Decimal(7,2)
```
Digging further into this issue, I was able to narrow it down to this line, where the Parquet footer is read to get the schema that was written as `parquet.avro.schema`:
https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java#L437
What is happening here is that the conversion from the Avro schema stored in `parquet.avro.schema` to Parquet's own schema representation (`MessageType`) loses the Avro `LogicalType` (`decimal`) annotation.
The following blob for a `Decimal` field in `parquet.avro.schema`:
```
{
  "name" : "wholesale_cost",
  "type" : [ {
    "type" : "fixed",
    "name" : "wholesale_cost",
    "size" : 4,
    "logicalType" : "decimal",
    "precision" : 7,
    "scale" : 2
  }, "null" ]
}
```
It ends up as the following after the conversion to `MessageType`:
```
{
  "name" : "wholesale_cost",
  "type" : [ "null", {
    "type" : "fixed",
    "name" : "wholesale_cost",
    "namespace" : "",
    "size" : 4
  } ],
  "default" : null
}
```
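To make the failure mode concrete, here is a plain-Python sketch (this is not the real parquet-avro converter, just an illustration of its observed behavior): Parquet's `MessageType` has nowhere to carry Avro logical-type metadata on a `fixed` primitive, so the `logicalType`, `precision`, and `scale` keys never survive the round trip.

```python
import copy
import json

# Illustrative stand-in for the Avro -> MessageType -> Avro round-trip.
# The function name and mechanics are hypothetical; the effect matches
# what is shown in the two schema blobs above.
def round_trip_via_message_type(avro_field):
    result = copy.deepcopy(avro_field)
    for branch in result["type"]:
        if isinstance(branch, dict):
            # MessageType keeps only the raw FIXED primitive and its size;
            # the decimal annotation is simply dropped on the way back.
            for lost_key in ("logicalType", "precision", "scale"):
                branch.pop(lost_key, None)
    return result

field = {
    "name": "wholesale_cost",
    "type": [
        {"type": "fixed", "name": "wholesale_cost", "size": 4,
         "logicalType": "decimal", "precision": 7, "scale": 2},
        "null",
    ],
}

converted = round_trip_via_message_type(field)
print(json.dumps(converted, indent=2))
```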
Thus any context of this field being a `Decimal` is lost. When this Parquet schema is later converted to a Hive schema to generate the DDL for creating the table, the field is treated as a `Fixed Length Byte Array`.
The following line, which checks whether the `OriginalType` is `DECIMAL`, has no effect, because `OriginalType` ends up as `null` for these `Decimal` fields:
https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/util/SchemaUtil.java#L179
Ultimately the field is converted to `binary` because it falls through to the `Fixed Length Byte Array` case:
https://github.com/apache/incubator-hudi/blob/master/hudi-hive/src/main/java/org/apache/hudi/hive/util/SchemaUtil.java#L218
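A simplified, hypothetical sketch of that type decision (the real `SchemaUtil` code inspects Parquet `PrimitiveType`/`OriginalType` objects, not strings):

```python
# Hypothetical helper mimicking the Hive type decision for a
# FIXED_LEN_BYTE_ARRAY column in SchemaUtil (names are illustrative).
def fixed_len_byte_array_to_hive(original_type, precision=None, scale=None):
    if original_type == "DECIMAL":
        # The intended path -- never taken in this bug, because the
        # decimal annotation was already lost in the schema conversion.
        return "decimal({},{})".format(precision, scale)
    # Fallback: with no OriginalType, the column becomes binary.
    return "binary"

print(fixed_len_byte_array_to_hive(None))             # what happens today
print(fixed_len_byte_array_to_hive("DECIMAL", 7, 2))  # what should happen
```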
The `CREATE TABLE` command generated by Hudi:
```
19/09/16 23:53:35 INFO HoodieHiveClient: Creating table with CREATE EXTERNAL TABLE IF NOT EXISTS xxxx( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, ... `wholesale_cost` binary, `list_price` binary, `sales_price` binary, `discount_amt` binary...) PARTITIONED BY (sold_date string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'xxxxx'
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services