Posted to commits@hudi.apache.org by "pengzhiwei (Jira)" <ji...@apache.org> on 2020/11/24 04:25:00 UTC
[jira] [Created] (HUDI-1415) Incorrect query result for hudi table when using spark sql
pengzhiwei created HUDI-1415:
--------------------------------
Summary: Incorrect query result for hudi table when using spark sql
Key: HUDI-1415
URL: https://issues.apache.org/jira/browse/HUDI-1415
Project: Apache Hudi
Issue Type: Bug
Components: Spark Integration
Reporter: pengzhiwei
Fix For: 0.6.1
Currently, Hudi can sync table metadata to the Hive metastore using HiveSyncTool. The table definition synced to Hive looks like this:
{code:java}
CREATE EXTERNAL TABLE `tbl_price_insert0`(
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
`_hoodie_record_key` string,
`_hoodie_partition_path` string,
`_hoodie_file_name` string,
`id` int,
`name` string,
`version` int,
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'file:/tmp/hudi/tbl_price_insert0'
TBLPROPERTIES (
'last_commit_time_sync'='20201124105009',
'transient_lastDdlTime'='1606186222')
{code}
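When we query this table using Spark SQL, Spark SQL treats it as a Hive table and converts it to a Parquet LogicalRelation in HiveStrategies#RelationConversions. The converted relation reads the Parquet files directly instead of going through HoodieParquetInputFormat, which may lead to an incorrect query result.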
In order to query a Hudi table correctly in Spark SQL, more table properties and SerDe properties must be added to the Hive metastore, like the following:
{code:java}
CREATE EXTERNAL TABLE `tbl_price_cow0`(
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
`_hoodie_record_key` string,
`_hoodie_partition_path` string,
`_hoodie_file_name` string,
`id` int,
`name` string,
`version` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='/tmp/hudi/tbl_price_cow0')
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'file:/tmp/hudi/tbl_price_cow0'
TBLPROPERTIES (
'last_commit_time_sync'='20201124120532',
'spark.sql.sources.provider'='hudi',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}',
'transient_lastDdlTime'='1606190729')
{code}
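As a rough illustration of the extra table properties shown above, the following Python sketch builds the `spark.sql.sources.*` entries from a Spark schema JSON string. The function name `spark_table_properties` and the 4000-character split length are assumptions made for this example (the length mirrors what is believed to be Spark's default schema string threshold); this is not HiveSyncTool's actual implementation.

```python
import json

# Assumption: long schema strings are split into chunks of this many
# characters, one chunk per 'spark.sql.sources.schema.part.N' property.
SCHEMA_PART_MAX_LEN = 4000


def spark_table_properties(schema_json, provider="hudi",
                           max_len=SCHEMA_PART_MAX_LEN):
    """Return the Spark data source table properties for a schema JSON string.

    Hypothetical helper for illustration only.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # Fail early if the schema is not valid JSON.
    json.loads(schema_json)

    # Split the schema string into consecutive chunks of at most max_len chars.
    parts = [schema_json[i:i + max_len]
             for i in range(0, len(schema_json), max_len)]

    props = {
        "spark.sql.sources.provider": provider,
        "spark.sql.sources.schema.numParts": str(len(parts)),
    }
    for index, part in enumerate(parts):
        props["spark.sql.sources.schema.part.%d" % index] = part
    return props


if __name__ == "__main__":
    # Example schema using column names from the table definition above.
    schema = json.dumps(
        {
            "type": "struct",
            "fields": [
                {"name": "id", "type": "integer",
                 "nullable": False, "metadata": {}},
                {"name": "name", "type": "string",
                 "nullable": True, "metadata": {}},
                {"name": "version", "type": "integer",
                 "nullable": False, "metadata": {}},
            ],
        },
        separators=(",", ":"),
    )
    for key, value in spark_table_properties(schema).items():
        print("'%s'='%s'" % (key, value))
```

Concatenating the `part.N` values in index order yields the original schema string, which is how a reader of these properties would be expected to reassemble it.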
--
This message was sent by Atlassian Jira
(v8.3.4#803005)