You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by "pengzhiwei (Jira)" <ji...@apache.org> on 2021/02/19 03:09:00 UTC

[jira] [Updated] (HUDI-1415) Read Hoodie Table As a Spark DataSource Table

     [ https://issues.apache.org/jira/browse/HUDI-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

pengzhiwei updated HUDI-1415:
-----------------------------
    Summary: Read Hoodie Table As a Spark DataSource Table   (was: Incorrect query result for hudi hive table when using spark sql)

> Read Hoodie Table As a Spark DataSource Table 
> ----------------------------------------------
>
>                 Key: HUDI-1415
>                 URL: https://issues.apache.org/jira/browse/HUDI-1415
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: pengzhiwei
>            Assignee: pengzhiwei
>            Priority: Major
>              Labels: pull-request-available, user-support-issues
>             Fix For: 0.8.0
>
>
> If we update a hudi table twice more, we will get an incorrect query count by spark sql.
>  
> Currently hudi can sync the meta data to hive meta store using HiveSyncTool. The table description  synced to hive  just like this:
> {code:java}
> CREATE EXTERNAL TABLE `tbl_price_insert0`(
>   `_hoodie_commit_time` string, 
>   `_hoodie_commit_seqno` string, 
>   `_hoodie_record_key` string, 
>   `_hoodie_partition_path` string, 
>   `_hoodie_file_name` string, 
>   `id` int, 
>   `name` string, 
>   `price` double,
>   `version` int, 
>   `dt` string)
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>   'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'file:/tmp/hudi/tbl_price_insert0'
> TBLPROPERTIES (
>   'last_commit_time_sync'='20201124105009', 
>   'transient_lastDdlTime'='1606186222')
> {code}
> When we query this table using spark sql, it trait it as a Hive Table, not a spark data source table and convert it to parquet LogicalRelation in HiveStrategies#RelationConversions. As a result, spark sql read the hudi table just like a parquet data source.  This lead to an incorrect query result.
> Inorder to query hudi table correctly in spark sql, more table properties and serde properties must be added to the hive meta,just like the follow:
> {code:java}
> CREATE EXTERNAL TABLE `tbl_price_cow0`(
>   `_hoodie_commit_time` string, 
>   `_hoodie_commit_seqno` string, 
>   `_hoodie_record_key` string, 
>   `_hoodie_partition_path` string, 
>   `_hoodie_file_name` string, 
>   `id` int, 
>   `name` string, 
>   `price` double,
>   `version` int)
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> WITH SERDEPROPERTIES ( 
>   'path'='/tmp/hudi/tbl_price_cow0') 
> STORED AS INPUTFORMAT 
>   'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
> OUTPUTFORMAT 
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'file:/tmp/hudi/tbl_price_cow0'
> TBLPROPERTIES (
>   'last_commit_time_sync'='20201124120532', 
>   'spark.sql.sources.provider'='hudi', 
>   'spark.sql.sources.schema.numParts'='1', 
>   'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"id","type":"integer","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"price","type":"double","nullable":false,"metadata":{}},{"name":"version","type":"integer","nullable":false,"metadata":{}}]}', 
>   'transient_lastDdlTime'='1606190729')
> {code}
> These are the missing table properties:
> {code:java}
> spark.sql.sources.provider= 'hudi'
> spark.sql.sources.schema.numParts = 'xx'
> spark.sql.sources.schema.part.{num} ='xx'
> spark.sql.sources.schema.numPartCols = 'xx'
> spark.sql.sources.schema.partCol.{num} = 'xx'{code}
> and serde property:
> {code:java}
> 'path'='/path/to/hudi'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)