Posted to dev@gobblin.apache.org by "Zihan Li (Jira)" <ji...@apache.org> on 2021/04/02 18:20:00 UTC

[jira] [Resolved] (GOBBLIN-1343) Fix the data loss issue caused by the cache expiration in PartitionerDataWriter

     [ https://issues.apache.org/jira/browse/GOBBLIN-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zihan Li resolved GOBBLIN-1343.
-------------------------------
    Resolution: Fixed

> Fix the data loss issue caused by the cache expiration in PartitionerDataWriter
> -------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-1343
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1343
>             Project: Apache Gobblin
>          Issue Type: Task
>            Reporter: Zihan Li
>            Priority: Major
>
> Problem statement:
> Previously, we maintained a cache in PartitionedDataWriter to avoid accumulating writers in memory in long-running jobs. But when a writer expires from the cache, we only close it without flushing/committing, which can cause data loss when HDFS is experiencing slowness.
>  
> Potential solution:
>  # In the removal logic, we can make sure the writer has been committed correctly, i.e. force it to commit before closing. But the issue is that we still remove the writer from the cache, so the next flush message will be handled and returned without commit being called on the right writer, and the watermark will move without the data being published to HDFS.
>  # We measure the time taken by the write operation, and if it takes too long, we force the writer back into the cache so that the next flush message will be picked up by the writer.
> Here we use the second solution.
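The chosen fix can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual Gobblin implementation: the class name, threshold, and plain HashMap stand-in for the writer cache are all hypothetical. The idea is that after each write we time the operation, and if it was slow we re-insert the writer into the cache so its expiry is refreshed and the next flush message still reaches it, rather than the writer being expired and closed uncommitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of solution 2: time each write and, when a write is
// slow (e.g. due to HDFS slowness), put the writer back into the cache so
// that it is not expired before the next flush message can commit it.
public class SlowWriteGuard {
    // Hypothetical threshold; a real job would take this from configuration.
    static final long SLOW_WRITE_THRESHOLD_MS = 100;

    // Simplified stand-in for the per-partition writer cache.
    final Map<String, Object> writerCache = new HashMap<>();

    /**
     * Runs the write and, if it took too long, re-adds the writer to the
     * cache (refreshing its expiry so the next flush still finds it).
     * Returns true if the writer was re-cached.
     */
    boolean writeAndMaybeRecache(String partition, Object writer, Runnable writeOp) {
        long start = System.nanoTime();
        writeOp.run();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (elapsedMs >= SLOW_WRITE_THRESHOLD_MS) {
            // Slow write: keep the writer around so the next flush commits it,
            // instead of letting cache expiration close it uncommitted.
            writerCache.put(partition, writer);
            return true;
        }
        return false;
    }
}
```

In the real writer the cache is an expiring cache rather than a plain map, so re-inserting the entry resets its expiration clock; the sketch only shows the timing decision.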



--
This message was sent by Atlassian Jira
(v8.3.4#803005)