Posted to issues@flink.apache.org by "zhijiang (Jira)" <ji...@apache.org> on 2020/01/06 06:54:00 UTC

[jira] [Commented] (FLINK-14163) Execution#producedPartitions is possibly not assigned when used

    [ https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008580#comment-17008580 ] 

zhijiang commented on FLINK-14163:
----------------------------------

Thanks for reporting this potential issue [~zhuzh]!

After double checking the related code, I can confirm that this issue exists only for DefaultScheduler. When ShuffleMaster#registerPartitionWithProducer was first introduced, we already accounted for its async behavior in the scheduling process of the (now legacy) scheduler.

The problems in the three usages mentioned above are mainly caused by the deployment process in the new DefaultScheduler not waiting for the partition registration future to complete. If the new scheduler also takes the async behavior into account during deployment, as the legacy scheduler did, I think we can resolve all the existing concerns.
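
As a rough illustration of taking the async behavior into account during deployment, the sketch below chains deployment on the registration future instead of deploying right away. It is plain Java on top of `CompletableFuture`; `DeploySketch` and `deployWhenRegistered` are hypothetical names used only for illustration and are not Flink's actual classes or methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for an execution; NOT Flink's actual Execution class.
class DeploySketch {
    private final List<String> log = new ArrayList<>();
    private volatile String producedPartition; // assigned only once registration completes

    // Deploy only after the (possibly pending) registration future has completed.
    CompletableFuture<Void> deployWhenRegistered(CompletableFuture<String> registration) {
        return registration.thenAccept(descriptor -> {
            producedPartition = descriptor;
            log.add("deployed with " + producedPartition);
        });
    }

    List<String> log() {
        return log;
    }

    public static void main(String[] args) {
        DeploySketch execution = new DeploySketch();
        // A pending future, as a custom shuffle master might return.
        CompletableFuture<String> pending = new CompletableFuture<>();
        CompletableFuture<Void> deployed = execution.deployWhenRegistered(pending);
        System.out.println(execution.log()); // nothing has been deployed yet

        pending.complete("partition-1");
        deployed.join();
        System.out.println(execution.log()); // deployment ran after registration
    }
}
```

With an already completed future (the NettyShuffleMaster case described in the issue), the callback runs immediately, so the same code path covers both the sync and the async situation.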

I also feel that the current public method Execution#getPartitionIds carries a potential risk in practice: the returned collection might be empty if the registration future has not completed yet, and the caller is not aware of it.
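
To make the risk concrete, here is a minimal sketch of the pattern, assuming a field that is only filled in by a future callback. `PartitionIdsSketch` and `getPartitionIdsFuture` are hypothetical names for illustration; only the shape of the problem is taken from the discussion above, not Flink's real implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in; it only mimics the shape of the problem, not Flink's code.
class PartitionIdsSketch {
    private volatile List<String> partitionIds = Collections.emptyList();
    private final CompletableFuture<List<String>> registration;

    PartitionIdsSketch(CompletableFuture<List<String>> registration) {
        this.registration = registration;
        // The field is assigned only when the registration future completes.
        registration.thenAccept(ids -> partitionIds = ids);
    }

    // Risky accessor: silently returns an empty list while registration is pending.
    List<String> getPartitionIds() {
        return partitionIds;
    }

    // Possible alternative: expose the future so callers must handle the pending state.
    CompletableFuture<List<String>> getPartitionIdsFuture() {
        return registration;
    }

    public static void main(String[] args) {
        CompletableFuture<List<String>> pending = new CompletableFuture<>();
        PartitionIdsSketch execution = new PartitionIdsSketch(pending);
        System.out.println(execution.getPartitionIds().isEmpty());     // true: looks like "no partitions"
        System.out.println(execution.getPartitionIdsFuture().isDone()); // false: the pending state is visible

        pending.complete(Arrays.asList("partition-1", "partition-2"));
        System.out.println(execution.getPartitionIds().size());         // 2
    }
}
```

The empty list before completion is indistinguishable from a task that genuinely produces no partitions, which is why returning a future (or failing fast) may be the safer contract.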

From the shuffle perspective, it is indeed meaningful to provide an async registerPartitionWithProducer in the long term, since it is flexible enough to satisfy different scenarios. But judging from the existing implementation and possible future implementations such as a YARN shuffle service, I guess the sync way can also satisfy the requirements. So if the async way brings more trouble for the scheduler and is not easy for other components to adapt to, it also makes sense to me to change registerPartitionWithProducer to a sync interface instead. We can keep things simple.

Do you have any thoughts or input, [~azagrebin]?

> Execution#producedPartitions is possibly not assigned when used
> ---------------------------------------------------------------
>
>                 Key: FLINK-14163
>                 URL: https://issues.apache.org/jira/browse/FLINK-14163
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Currently {{Execution#producedPartitions}} is assigned after the partitions have been registered with the shuffle master in {{Execution#registerProducedPartitions(...)}}.
> The partition registration is an async interface ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so {{Execution#producedPartitions}} is possibly[1] not set when used.
> Usages include:
> 1. deploying this task, so that the task may be deployed without its result partitions assigned, and the job would hang. (DefaultScheduler issue only, since legacy scheduler handled this case)
> 2. generating input descriptors for downstream tasks: 
> 3. retrieving {{ResultPartitionID}} for partition releasing: 
> [1] If a user uses Flink's default shuffle master {{NettyShuffleMaster}}, it is not problematic at the moment, since it returns a completed future on registration, which makes the process synchronous. However, if users implement their own shuffle service in which {{ShuffleMaster#registerPartitionWithProducer}} returns a pending future, it can be a problem. This is possible because the customizable shuffle service has been open to users since 1.9 (via config "shuffle-service-factory.class").
> To avoid these issues, we may either 
> 1. fix all the usages of {{Execution#producedPartitions}} to account for the async assignment, or 
> 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync interface


