Posted to dev@flink.apache.org by "Zhilong Hong (Jira)" <ji...@apache.org> on 2021/08/17 07:00:03 UTC

[jira] [Created] (FLINK-23833) Cache of ShuffleDescriptors should be individually cleaned up

Zhilong Hong created FLINK-23833:
------------------------------------

             Summary: Cache of ShuffleDescriptors should be individually cleaned up
                 Key: FLINK-23833
                 URL: https://issues.apache.org/jira/browse/FLINK-23833
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.14.0
            Reporter: Zhilong Hong
             Fix For: 1.14.0


In FLINK-23005, we introduced a cache of the compressed serialized value of ShuffleDescriptors to improve deployment performance. To make sure the cache does not live too long and become a burden for GC, it is cleaned up when the partition is released or reset for a new execution. In that implementation, the cache of the entire IntermediateResult is cleaned up at once, because at the time a partition was released only when the entire IntermediateResult was released.

However, since FLINK-22017, a BLOCKING result partition can be consumed individually. This also means that a result partition no longer needs to wait for the other result partitions and can be released individually. After this change, the following scenario may occur: when one result partition is finished and released, the cache of the whole IntermediateResult stored as a blob is deleted, while other result partitions belonging to the same IntermediateResult have only just been deployed to the TaskExecutors. When those TaskExecutors then try to download the TaskDeploymentDescriptor (TDD) from the blob, they find that the blob has been deleted and get stuck.
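
To make the scenario concrete, here is a toy model of the failure. It is only a sketch: the class ResultWideCacheDemo, its methods and the string result id are hypothetical names made up for illustration and are not Flink classes. It assumes the cached blob is keyed by the IntermediateResult as a whole, so releasing any single partition removes the blob that the other partitions' consumers still need.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of the failure; none of these names are Flink classes. */
final class ResultWideCacheDemo {
    // Hypothetical blob store: one cached blob per IntermediateResult id.
    private final Map<String, byte[]> blobsByResult = new HashMap<>();

    /** Caches the serialized descriptors for a whole IntermediateResult. */
    void cacheForResult(String resultId, byte[] serializedDescriptors) {
        blobsByResult.put(resultId, serializedDescriptors);
    }

    /**
     * Result-wide cleanup: releasing ANY partition drops the blob of the
     * whole result. The partition index is deliberately ignored, which is
     * the behaviour this issue describes as problematic.
     */
    void releasePartition(String resultId, int partitionIndex) {
        blobsByResult.remove(resultId);
    }

    /** Stand-in for a TaskExecutor fetching the offloaded descriptors. */
    Optional<byte[]> download(String resultId) {
        return Optional.ofNullable(blobsByResult.get(resultId));
    }
}
```

In this model, a consumer of partition 1 that calls download after partition 0 has been released gets an empty result, which corresponds to the TaskExecutor finding the blob deleted.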

This bug only happens for jobs with POINTWISE BLOCKING edges, and only when {{blob.offload.minsize}} is set to an extremely small value, since the ShuffleDescriptors of POINTWISE BLOCKING edges are usually small and would otherwise not be offloaded to the blob. To solve this issue, we just need to clean up the cache of ShuffleDescriptors individually.
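
A minimal sketch of what "cleaning up individually" could look like, assuming the cache is keyed by partition index instead of by the whole IntermediateResult. The class PerPartitionDescriptorCache and all of its methods are hypothetical and only illustrate the idea; the actual fix in Flink may be structured differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical per-partition cache; not the actual Flink implementation. */
final class PerPartitionDescriptorCache {
    // One cached, serialized ShuffleDescriptor entry per partition index.
    private final Map<Integer, byte[]> blobsByPartition = new HashMap<>();

    /** Stores the serialized descriptors for a single partition. */
    void put(int partitionIndex, byte[] serializedDescriptors) {
        blobsByPartition.put(partitionIndex, serializedDescriptors);
    }

    /** Returns the cached entry for one partition, if it still exists. */
    Optional<byte[]> get(int partitionIndex) {
        return Optional.ofNullable(blobsByPartition.get(partitionIndex));
    }

    /** Releases only the given partition; other entries stay available. */
    void releasePartition(int partitionIndex) {
        blobsByPartition.remove(partitionIndex);
    }

    /** Drops everything, e.g. when the whole result is released or reset. */
    void releaseAll() {
        blobsByPartition.clear();
    }

    /** Number of partitions that currently have a cached entry. */
    int size() {
        return blobsByPartition.size();
    }
}
```

With this layout, releasing partition 0 leaves the entry for partition 1 untouched, so a task that was just deployed against partition 1 can still fetch its descriptors.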


