Posted to commits@hudi.apache.org by "StarBoy1005 (Jira)" <ji...@apache.org> on 2023/04/18 05:08:00 UTC

[jira] [Commented] (HUDI-4459) Corrupt parquet file created when syncing huge table with 4000+ fields,using hudi cow table with bulk_insert type

    [ https://issues.apache.org/jira/browse/HUDI-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713390#comment-17713390 ] 

StarBoy1005 commented on HUDI-4459:
-----------------------------------

Hi! I have run into a problem. I am using Flink 1.14.5 and Hudi 0.13.0 to read a CSV file from HDFS and sink it into a Hudi COW table. In both streaming and batch mode, if I use bulk_insert the job never finishes; the instant always stays in the inflight state.
This is my COW table DDL (a sketch of the CSV source and insert statements follows it):

create table web_returns_cow (
   rid bigint PRIMARY KEY NOT ENFORCED,
   wr_returned_date_sk bigint,
   wr_returned_time_sk bigint,
   wr_item_sk bigint,
   wr_refunded_customer_sk bigint,
   wr_refunded_cdemo_sk bigint,
   wr_refunded_hdemo_sk bigint,
   wr_refunded_addr_sk bigint,
   wr_returning_customer_sk bigint,
   wr_returning_cdemo_sk bigint,
   wr_returning_hdemo_sk bigint,
   wr_returning_addr_sk bigint,
   wr_web_page_sk bigint,
   wr_reason_sk bigint,
   wr_order_number bigint,
   wr_return_quantity int,
   wr_return_amt float,
   wr_return_tax float,
   wr_return_amt_inc_tax float,
   wr_fee float,
   wr_return_ship_cost float,
   wr_refunded_cash float,
   wr_reversed_charge float,
   wr_account_credit float,
   wr_net_loss float
)
PARTITIONED BY (`wr_returned_date_sk`)
WITH (
  'connector'='hudi',
  'path'='/tmp/data_gen/web_returns_cow',
  'table.type'='COPY_ON_WRITE',
  'read.start-commit'='earliest',
  'read.streaming.enabled'='false',
  'changelog.enabled'='true',
  'write.precombine'='false',
  'write.precombine.field'='no_precombine',
  'write.operation'='insert',
  'read.tasks'='5',
  'write.tasks'='10',
  'index.type'='BUCKET',
  'metadata.enabled'='false',
  'hoodie.bucket.index.hash.field'='rid',
  'hoodie.bucket.index.num.buckets'='10',
  'index.global.enabled'='false'
);
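
For reference, here is a minimal sketch of the kind of pipeline described above: a filesystem/CSV source feeding this Hudi sink, plus the bulk_insert variant expressed as a per-statement option. The source table name (web_returns_src), the HDFS path, and the CSV delimiter are illustrative assumptions, not values taken from the original job.

CREATE TABLE web_returns_src (
   rid bigint,
   wr_returned_date_sk bigint,
   wr_returned_time_sk bigint,
   wr_item_sk bigint,
   wr_refunded_customer_sk bigint,
   wr_refunded_cdemo_sk bigint,
   wr_refunded_hdemo_sk bigint,
   wr_refunded_addr_sk bigint,
   wr_returning_customer_sk bigint,
   wr_returning_cdemo_sk bigint,
   wr_returning_hdemo_sk bigint,
   wr_returning_addr_sk bigint,
   wr_web_page_sk bigint,
   wr_reason_sk bigint,
   wr_order_number bigint,
   wr_return_quantity int,
   wr_return_amt float,
   wr_return_tax float,
   wr_return_amt_inc_tax float,
   wr_fee float,
   wr_return_ship_cost float,
   wr_refunded_cash float,
   wr_reversed_charge float,
   wr_account_credit float,
   wr_net_loss float
) WITH (
  -- assumed location and delimiter of the CSV file in HDFS
  'connector'='filesystem',
  'path'='hdfs:///tmp/data_gen/web_returns_csv/',
  'format'='csv',
  'csv.field-delimiter'='|'
);

-- Plain insert (the mode that completes normally, per the comment above):
INSERT INTO web_returns_cow SELECT * FROM web_returns_src;

-- bulk_insert variant that reportedly leaves the instant inflight; the hint
-- overrides 'write.operation' for this statement only (dynamic table options
-- may need to be enabled via table.dynamic-table-options.enabled), or set
-- 'write.operation'='bulk_insert' directly in the sink DDL above.
INSERT INTO web_returns_cow /*+ OPTIONS('write.operation'='bulk_insert') */
SELECT * FROM web_returns_src;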

> Corrupt parquet file created when syncing huge table with 4000+ fields,using hudi cow table with bulk_insert type
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-4459
>                 URL: https://issues.apache.org/jira/browse/HUDI-4459
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Leo zhang
>            Assignee: Rajesh Mahindra
>            Priority: Major
>         Attachments: statements.sql, table.ddl
>
>
> I am trying to sync a huge table with 4000+ fields into Hudi, using a COW table with the bulk_insert operation type.
> The job finishes without any exception, but when I try to read data from the table I get an empty result. The parquet file is corrupted and can't be read correctly.
> I traced the problem and found it is caused by the SortOperator. After a record is serialized in the sorter, all the fields get disordered and are deserialized into a single field. The wrong record is then written into the parquet file, which makes the file unreadable.
> Here are a few steps to reproduce the bug in the Flink sql-client:
> 1. Execute the table DDL (provided in the table.ddl file in the attachments).
> 2. Execute the insert statement (provided in the statements.sql file in the attachments).
> 3. Execute a select statement to query the Hudi table (provided in the statements.sql file in the attachments).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)