Posted to commits@hudi.apache.org by "HunterXHunter (Jira)" <ji...@apache.org> on 2023/01/20 17:44:00 UTC

[jira] [Assigned] (HUDI-5584) When the table to be synchronized already exists in hive, need to update serde/table properties

     [ https://issues.apache.org/jira/browse/HUDI-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

HunterXHunter reassigned HUDI-5584:
-----------------------------------

    Assignee: HunterXHunter

> When the table to be synchronized already exists in hive, need to update serde/table properties
> -----------------------------------------------------------------------------------------------
>
>                 Key: HUDI-5584
>                 URL: https://issues.apache.org/jira/browse/HUDI-5584
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: HunterXHunter
>            Assignee: HunterXHunter
>            Priority: Major
>              Labels: pull-request-available
>
> When we set hoodie.datasource.hive_sync.table.strategy='ro', we expect only one table to be synchronized to Hive, without the _ro suffix.
> However, the table has sometimes already been created in Hive beforehand,
> for example:
> {code:java}
> create table hive.test.HUDI_5584 (
>   id int,
>   ts int)
> using hudi
> tblproperties (
>   type = 'mor',
>   primaryKey = 'id',
>   preCombineField = 'ts',
>   hoodie.datasource.hive_sync.enable = 'true',
>   hoodie.datasource.hive_sync.table.strategy = 'ro'
> ) location '/tmp/HUDI_5584'  {code}
> Running show create table on it then returns:
> {code:java}
> CREATE EXTERNAL TABLE `hudi_5584`(
>   `_hoodie_commit_time` string,
>   `_hoodie_commit_seqno` string,
>   `_hoodie_record_key` string,
>   `_hoodie_partition_path` string,
>   `_hoodie_file_name` string,
>   `id` int,
>   `ts` int)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES (
>   'path'='file:///tmp/HUDI_5584')
> STORED AS INPUTFORMAT
>   'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'file:/tmp/HUDI_5584'
> TBLPROPERTIES (
>   'hoodie.datasource.hive_sync.enable'='true',
>   'hoodie.datasource.hive_sync.table.strategy'='ro',
>   'preCombineField'='ts',
>   'primaryKey'='id',
>   'spark.sql.create.version'='3.3.1',
>   'spark.sql.sources.provider'='hudi',
>   'spark.sql.sources.schema.numParts'='1',
>   'spark.sql.sources.schema.part.0'='xx',
>   'transient_lastDdlTime'='1674108302',
>   'type'='mor') {code}
> *The table looks like a realtime table.*
>  
> When we finish writing data and synchronize the ro table, the SERDEPROPERTIES and OUTPUTFORMAT are not modified because the table already exists.
> As a result, the table type does not match what we expect.
>  
>  
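The check that the report says is missing can be sketched roughly as follows. This is only an illustration, not Hudi's sync code: `TableDescriptor` and `pending_updates` are hypothetical names, and the read-optimized input format class and the `hoodie.query.as.ro.table` serde property are assumptions about what an ro table should carry. The realtime input format, output format, and path are taken from the show create table output above.

```python
from dataclasses import dataclass, field

# Taken from the SHOW CREATE TABLE output in the report.
REALTIME_INPUT_FORMAT = "org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat"
OUTPUT_FORMAT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
# Assumption: input format expected for a read-optimized (ro) table.
READ_OPTIMIZED_INPUT_FORMAT = "org.apache.hudi.hadoop.HoodieParquetInputFormat"


@dataclass
class TableDescriptor:
    """Hypothetical stand-in for what the metastore reports about a table."""
    input_format: str
    output_format: str
    serde_properties: dict = field(default_factory=dict)


def pending_updates(existing: TableDescriptor, expected: TableDescriptor) -> dict:
    """Return only the settings of `existing` that differ from `expected`."""
    updates = {}
    if existing.input_format != expected.input_format:
        updates["input_format"] = expected.input_format
    if existing.output_format != expected.output_format:
        updates["output_format"] = expected.output_format
    # Serde properties: report keys that are missing or carry a different value.
    changed = {
        key: value
        for key, value in expected.serde_properties.items()
        if existing.serde_properties.get(key) != value
    }
    if changed:
        updates["serde_properties"] = changed
    return updates


# The pre-created table from the report (it looks like a realtime table).
existing = TableDescriptor(
    REALTIME_INPUT_FORMAT,
    OUTPUT_FORMAT,
    {"path": "file:///tmp/HUDI_5584"},
)
# What the 'ro' strategy would be expected to produce (assumed values).
expected = TableDescriptor(
    READ_OPTIMIZED_INPUT_FORMAT,
    OUTPUT_FORMAT,
    {"path": "file:///tmp/HUDI_5584", "hoodie.query.as.ro.table": "true"},
)

print(pending_updates(existing, expected))
```

A non-empty result means the existing table needs to be altered rather than left untouched; an empty result means the sync can skip the update.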



--
This message was sent by Atlassian Jira
(v8.20.10#820010)