Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/25 10:22:21 UTC

[GitHub] [hudi] TengHuo opened a new issue, #6208: [SUPPORT] Hudi append only pipeline failed due to parquet FileNotFoundException

TengHuo opened a new issue, #6208:
URL: https://github.com/apache/hudi/issues/6208

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   The Flink append-only pipeline fails due to a FileNotFoundException: a parquet file can't be found after a checkpoint is triggered.
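
   For context, this is an append-only (insert) write into a COPY_ON_WRITE table, partitioned by a date column derived from an event timestamp. The sketch below shows roughly what such a job looks like; the table name, schema and option values are illustrative, not our exact job definition.

   ```java
   // Rough sketch of an append-only Hudi pipeline (illustrative only):
   // a COPY_ON_WRITE table written with the 'insert' operation, partitioned
   // by a date column computed from an event timestamp.
   import org.apache.flink.table.api.EnvironmentSettings;
   import org.apache.flink.table.api.TableEnvironment;

   public class AppendOnlyPipelineSketch {
     public static void main(String[] args) {
       TableEnvironment tEnv =
           TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

       // Hudi sink: COW table + insert (append-only) write operation.
       tEnv.executeSql(
           "CREATE TABLE hudi_cow (\n"
               + "  id STRING,\n"
               + "  payload STRING,\n"
               + "  `date` STRING\n"
               + ") PARTITIONED BY (`date`) WITH (\n"
               + "  'connector' = 'hudi',\n"
               + "  'path' = 'hdfs:///.../hudi_cow',\n"
               + "  'table.type' = 'COPY_ON_WRITE',\n"
               + "  'write.operation' = 'insert'\n"
               + ")");

       // The source table and the INSERT INTO statement are omitted here;
       // checkpointing is enabled on the job, and the failure shows up when a
       // checkpoint triggers the data flush.
     }
   }
   ```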
   
   **To Reproduce**
   
   Steps to reproduce the behavior: very hard to reproduce; details will be added later.
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   The Flink append-only pipeline runs properly.
   
   **Environment Description**
   
   * Hudi version : 0.11.1
   
   * Spark version : N/A
   
   * Hive version : N/A
   
   * Hadoop version : 2.10
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   I have checked the HDFS audit log; it shows the missing file was created, but deleted just a few seconds later:
   
   ```log
   2022-07-19 17:16:22,961 INFO FSNamesystem.audit: allowed=true	ugi=xxxxxx (auth:PROXY) via flink (auth:SIMPLE)	ip=/xxx.xxx	cmd=create	src=/xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet	dst=null	perm=xxxxxx:xxxxxx:rw-rw-r--	proto=rpc	callerContext=clientIp:xx.xx.xx.242
   2022-07-19 17:16:24,824 INFO FSNamesystem.audit: allowed=true	ugi=xxxxxx (auth:PROXY) via flink (auth:SIMPLE)	ip=/xxx.xxx	cmd=delete	src=/xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet	dst=null	perm=null	proto=rpc	callerContext=clientIp:xx.xx.xx.173
   ```
   
   **Stacktrace**
   
   ```log
   2022-07-19 17:16:24,465 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 6713 for job 20c9582448454df74af4e9424245709d (412626 bytes in 2183 ms).
   2022-07-19 17:16:25,805 INFO  org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Commit instant [20220719171325240] success!
   2022-07-19 17:16:25,919 INFO  org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Create instant [20220719171625805] for table [wt_test_hudi] with type [COPY_ON_WRITE]
   2022-07-19 17:16:25,919 INFO  org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Executor executes action [commits the instant 20220719171325240] success!
   2022-07-19 17:16:26,734 INFO  org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Executor executes action [sync hive metadata for instant 20220719171625805] success!
   2022-07-19 17:19:21,805 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 6714 (type=CHECKPOINT) @ 1658222361802 for job 20c9582448454df74af4e9424245709d.
   2022-07-19 17:19:21,810 INFO  org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Executor executes action [taking checkpoint 6714] success!
   2022-07-19 17:19:22,169 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: TableSourceScan(table=[[...]], fields=[...]) -> Calc(select=[..., DATE_FORMAT(TO_TIMESTAMP_LTZ((timestamp / 1000000000), 0), _UTF-16LE'yyyy-MM-dd') AS date]) -> hoodie_append_write -> Sink: dummy (10/128) (4dacfeb990f80a0fe30c810a062e2e3b) switched from RUNNING to FAILED on container_e1004_1656494149297_195917_01_000062 @ xxxxxxx(dataPort=39371).
   java.lang.Exception: Could not perform checkpoint 6714 for operator Source: TableSourceScan(table=[[...]], fields=[...]) -> Calc(select=[..., DATE_FORMAT(TO_TIMESTAMP_LTZ((timestamp / 1000000000), 0), _UTF-16LE'yyyy-MM-dd') AS date]) -> hoodie_append_write -> Sink: dummy (10/128)#0.
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:1006)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$7(StreamTask.java:958)
   	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
   	at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
   	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:344)
   	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:330)
   	at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:684)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:639)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:650)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
   	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:782)
   	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 6714 for operator Source: TableSourceScan(table=[[...]], fields=[...]) -> Calc(select=[..., DATE_FORMAT(TO_TIMESTAMP_LTZ((timestamp / 1000000000), 0), _UTF-16LE'yyyy-MM-dd') AS date]) -> hoodie_append_write -> Sink: dummy (10/128)#0. Failure reason: Checkpoint was declined.
   	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:264)
   	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:169)
   	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:371)
   	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:706)
   	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:627)
   	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:590)
   	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:312)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:1092)
   	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1076)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:994)
   	... 13 more
   Caused by: org.apache.hudi.exception.HoodieException: Error collect the write status for task [9]
   	at org.apache.hudi.sink.bulk.BulkInsertWriterHelper.getWriteStatuses(BulkInsertWriterHelper.java:184)
   	at org.apache.hudi.sink.append.AppendWriteFunction.flushData(AppendWriteFunction.java:123)
   	at org.apache.hudi.sink.append.AppendWriteFunction.snapshotState(AppendWriteFunction.java:78)
   	at org.apache.hudi.sink.common.AbstractStreamWriteFunction.snapshotState(AbstractStreamWriteFunction.java:157)
   	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
   	at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
   	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:89)
   	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:218)
   	at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:169)
   	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:371)
   	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:706)
   	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:627)
   	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:590)
   	at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:312)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:1092)
   	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1076)
   	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:994)
   	... 13 more
   Caused by: java.io.FileNotFoundException: File does not exist: /xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet (inode 6318419053) Holder DFSClient_NONMAPREDUCE_1440496717_78 does not have any open files.
   	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2782)
   	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:521)
   	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:161)
   	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2663)
   	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:889)
   	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:520)
   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
   	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1087)
   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1109)
   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1030)
   	at javax.security.auth.Subject.doAs(Subject.java:422)
   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2038)
   	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3039)
   
   	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
   	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
   	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
   	at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1115)
   	at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1944)
   	at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1750)
   	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:751)
   Caused by: org.apache.hadoop.ipc.RemoteException: File does not exist: /xxxxxxxx/hudi_cow/date=2022-07-19/d78f9de0-afde-4086-a01b-5c9029697b53-0_9-128-0_20220719171325240.parquet (inode 6318419053) Holder DFSClient_NONMAPREDUCE_1440496717_78 does not have any open files.
   	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2782)
   	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:521)
   	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:161)
   	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2663)
   	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:889)
   	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:520)
   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:529)
   	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1087)
   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1109)
   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1030)
   	at javax.security.auth.Subject.doAs(Subject.java:422)
   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2038)
   	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3039)
   
   	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1590)
   	at org.apache.hadoop.ipc.Client.call(Client.java:1521)
   	at org.apache.hadoop.ipc.Client.call(Client.java:1418)
   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:251)
   	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:130)
   	at com.sun.proxy.$Proxy27.addBlock(Unknown Source) ~[?:?]
   	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:472)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:449)
   	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:175)
   	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:167)
   	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:105)
   	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:375)
   	at com.sun.proxy.$Proxy28.addBlock(Unknown Source) ~[?:?]
   	at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1112)
   	at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1944)
   	at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1750)
   	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:751)
   2022-07-19 17:19:22,200 WARN  org.apache.hudi.sink.StreamWriteOperatorCoordinator          [] - Reset the event for task [9]
   ```
   
   




[GitHub] [hudi] TengHuo commented on issue #6208: [SUPPORT] Hudi append only pipeline failed due to parquet FileNotFoundException

Posted by GitBox <gi...@apache.org>.
TengHuo commented on issue #6208:
URL: https://github.com/apache/hudi/issues/6208#issuecomment-1198815645

   Yeah, sure, the patch fixed my pipeline; let me close this issue. I will append more details about the root cause here later.
   Thanks!




[GitHub] [hudi] TengHuo commented on issue #6208: [SUPPORT] Hudi append only pipeline failed due to parquet FileNotFoundException

Posted by GitBox <gi...@apache.org>.
TengHuo commented on issue #6208:
URL: https://github.com/apache/hudi/issues/6208#issuecomment-1194950322

   Thanks Bruce, yeah, we have tried the PR you mentioned: we ran a pipeline last night with that PR patched, and everything went very well. Really appreciate your PR.
   
   But I still don't fully understand how the file was deleted before the fix, i.e. which part of the code deleted it. Could you explain in more detail? Thanks!




[GitHub] [hudi] TengHuo closed issue #6208: [SUPPORT] Hudi append only pipeline failed due to parquet FileNotFoundException

Posted by GitBox <gi...@apache.org>.
TengHuo closed issue #6208: [SUPPORT] Hudi append only pipeline failed due to parquet FileNotFoundException
URL: https://github.com/apache/hudi/issues/6208




[GitHub] [hudi] yihua commented on issue #6208: [SUPPORT] Hudi append only pipeline failed due to parquet FileNotFoundException

Posted by GitBox <gi...@apache.org>.
yihua commented on issue #6208:
URL: https://github.com/apache/hudi/issues/6208#issuecomment-1198774372

   @TengHuo if the problem is fully resolved, feel free to close this issue.  The discussion can still happen here regarding the root cause and the merged fix.




[GitHub] [hudi] BruceKellan commented on issue #6208: [SUPPORT] Hudi append only pipeline failed due to parquet FileNotFoundException

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on issue #6208:
URL: https://github.com/apache/hudi/issues/6208#issuecomment-1194943778

   There are several PRs that fix this problem.
   Can you try this branch directly?
   PR: [#6182](https://github.com/apache/hudi/pull/6182).
   It has the relevant PRs merged. @TengHuo




[GitHub] [hudi] TengHuo commented on issue #6208: [SUPPORT] Hudi append only pipeline failed due to parquet FileNotFoundException

Posted by GitBox <gi...@apache.org>.
TengHuo commented on issue #6208:
URL: https://github.com/apache/hudi/issues/6208#issuecomment-1193865858

   We found a commit that fixes an issue in `AppendWriteFunction`:
   
   Issue: https://issues.apache.org/jira/browse/HUDI-4332
   PR: https://github.com/apache/hudi/pull/5988
   
   As I understand it, this patch changed when `currentInstant` is updated in `AppendWriteFunction`. Before the fix, `currentInstant` was set in the method `initWriterHelper()` when handling a new record. With the patch, `currentInstant` is set in the method `flushData(boolean endInput)` when a new checkpoint is triggered.
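
   To make sure I read the patch correctly, here is a condensed sketch of the difference as I understand it (stub methods only, not the real Hudi code):

   ```java
   // Condensed sketch -- NOT the actual Hudi source. It only stubs out the timing
   // difference described above so the two behaviours are easy to compare.
   public class AppendWriteSketch {

     private String currentInstant;   // instant time that the data files are named with
     private Object writerHelper;     // stands in for BulkInsertWriterHelper

     // Before the patch: the instant is resolved lazily in initWriterHelper(),
     // i.e. when the FIRST record after a flush arrives, and then reused.
     void processElementBeforeFix(Object record) {
       if (writerHelper == null) {
         currentInstant = latestPendingInstant();          // resolved at record time
         writerHelper = newWriterHelper(currentInstant);
       }
       write(record);
     }

     // After the patch: flushData(), called from snapshotState() when a checkpoint
     // is triggered, re-resolves the instant, so the flushed data is associated
     // with the instant that is actually pending at checkpoint time.
     void flushDataAfterFix(boolean endInput) {
       currentInstant = latestPendingInstant();            // resolved at checkpoint time
       // ... collect write statuses, send the commit event to the coordinator ...
       writerHelper = null;                                // next interval starts fresh
     }

     // Stubs so the sketch is self-contained.
     private String latestPendingInstant() { return "20220719171325240"; }
     private Object newWriterHelper(String instant) { return new Object(); }
     private void write(Object record) { }
   }
   ```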
   
   May I know how this patch fixes the issue? Since there is barely any information in the ticket, could you share more detail? Really appreciate it, @BruceKellan.
   
   cc @fengjian428 

