Posted to commits@hudi.apache.org by "Xiaoqiao He (Jira)" <ji...@apache.org> on 2022/03/10 04:52:00 UTC

[jira] [Created] (HUDI-3599) Not atomicity commit could cause streaming read loss data

Xiaoqiao He created HUDI-3599:
---------------------------------

             Summary: Not atomicity commit could cause streaming read loss data
                 Key: HUDI-3599
                 URL: https://issues.apache.org/jira/browse/HUDI-3599
             Project: Apache Hudi
          Issue Type: Bug
          Components: core
            Reporter: Xiaoqiao He


The current `commit` implementation has the call hierarchy shown below, in which `transitionState` writes the deltacommit file to complete the commit. However, writing a file is not an atomic operation on HDFS, for instance.
{code:java}
HoodieActiveTimeline.transitionState(HoodieInstant, HoodieInstant, Option<byte[]>, boolean)  (org.apache.hudi.common.table.timeline)
 HoodieActiveTimeline.transitionState(HoodieInstant, HoodieInstant, Option<byte[]>)  (org.apache.hudi.common.table.timeline)
  HoodieActiveTimeline.saveAsComplete(HoodieInstant, Option<byte[]>)  (org.apache.hudi.common.table.timeline)
   BaseHoodieWriteClient.commit(HoodieTable, String, String, HoodieCommitMetadata, List<HoodieWriteStat>)  (org.apache.hudi.client)
    BaseHoodieWriteClient.commitStats(String, List<HoodieWriteStat>, Option<Map<String, String>>, String, Map<String, List<String>>)  (org.apache.hudi.client)
     HoodieFlinkWriteClient.commit(String, List<WriteStatus>, Option<Map<String, String>>, String, Map<String, List<String>>)  (org.apache.hudi.client)
     HoodieJavaWriteClient.commit(String, List<WriteStatus>, Option<Map<String, String>>, String, Map<String, List<String>>)  (org.apache.hudi.client)
{code}
As org.apache.hudi.common.table.timeline.HoodieActiveTimeline#createImmutableFileInPath shows below, completing the write takes three steps: A. create the file, B. write the data, C. close the file handle. Suppose `StreamReadMonitoring` traverses this deltacommit file between steps A and B, while its content is still empty: it will then read nothing in that iteration of the loop. IMO this could cause the stream read to lose some committed data.
{code:java}
  private void createImmutableFileInPath(Path fullPath, Option<byte[]> content) {
    FSDataOutputStream fsout = null;
    try {
      // Step A: create the file (overwrite = false)
      fsout = metaClient.getFs().create(fullPath, false);
      if (content.isPresent()) {
        // Step B: write the data
        fsout.write(content.get());
      }
    } catch (IOException e) {
      throw new HoodieIOException("Failed to create file " + fullPath, e);
    } finally {
      try {
        if (null != fsout) {
          // Step C: close the file handle
          fsout.close();
        }
      } catch (IOException e) {
        throw new HoodieIOException("Failed to close file " + fullPath, e);
      }
    }
  }
{code}
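The window between steps A and B can be illustrated with a minimal, single-threaded sketch. It is not Hudi code: it uses the local filesystem through `java.nio.file` as a stand-in for HDFS, and the class and method names (`EmptyWindowSketch`, `sizeSeenBetweenCreateAndWrite`) are made up for illustration. It only shows that a file already exists under its final name before any bytes have been written to it.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class EmptyWindowSketch {

  // Hypothetical helper: returns the size a reader would observe
  // if it looked at the file after step A but before step B.
  static long sizeSeenBetweenCreateAndWrite(Path file, byte[] content) throws IOException {
    long observed;
    // Step A: create the file under its final name (fails if it already exists).
    try (OutputStream out = Files.newOutputStream(
        file, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
      // A reader listing the directory now finds the file, but it has no content yet.
      observed = Files.size(file);
      // Step B: write the data.
      out.write(content);
    } // Step C: try-with-resources closes the handle.
    return observed;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("timeline");
    Path commit = dir.resolve("demo.deltacommit"); // file name is illustrative only
    byte[] content = "{\"demo\":true}".getBytes(StandardCharsets.UTF_8);
    long observed = sizeSeenBetweenCreateAndWrite(commit, content);
    System.out.println("observed=" + observed + " final=" + Files.size(commit));
  }
}
```

In a real deployment the reader runs in another process, so whether it actually hits this window depends on timing.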
To avoid this corner case, I think we should rely on the `rename` operation to complete the commit rather than on the create-write-close flow. Please correct me if I missed something.
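
A minimal sketch of the rename-based idea, again on the local filesystem with `java.nio.file` rather than the Hadoop `FileSystem` API. The helper `writeThenRename` and the `.tmp` suffix are hypothetical, and the sketch assumes that readers ignore files with that suffix and that the underlying filesystem supports an atomic rename (object stores, for example, may not).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class AtomicCommitSketch {

  // Hypothetical helper, not Hudi code: publish `content` at `finalPath`
  // by writing a temporary file first and renaming it afterwards.
  static void writeThenRename(Path finalPath, byte[] content) throws IOException {
    // Keep the temp file in the same directory so the rename stays on one filesystem.
    Path tmp = finalPath.resolveSibling(finalPath.getFileName() + ".tmp");
    // Create, write and close all happen under the temporary name.
    Files.write(tmp, content, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
    // Single publish step. ATOMIC_MOVE throws AtomicMoveNotSupportedException
    // when the filesystem cannot perform the move atomically.
    // Note: behaviour when finalPath already exists is implementation specific.
    Files.move(tmp, finalPath, StandardCopyOption.ATOMIC_MOVE);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("timeline");
    Path commit = dir.resolve("demo.deltacommit"); // file name is illustrative only
    writeThenRename(commit, "{\"demo\":true}".getBytes(StandardCharsets.UTF_8));
    System.out.println(Files.size(commit) + " bytes visible at " + commit.getFileName());
  }
}
```

With this flow a reader either does not see the final file at all or sees it with its full content, which is the property the create-write-close flow lacks.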


