Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2019/06/23 11:45:55 UTC

[GitHub] [flink] gaoyunhaii opened a new pull request #8841: [FLINK-12765][coordinator] Bookkeeping of available resources of allocated slots in SlotPool

URL: https://github.com/apache/flink/pull/8841
 
 
   ## What is the purpose of the change
   
   This PR introduces bookkeeping logic for shared slots and co-located slots, as part of introducing fine-grained resource management to Flink. In the current design, a task always requests resources according to its own needs, and the returned slot may be larger than the requested resource. The surplus leaves room for slot sharing and co-location groups.
   
   For slot sharing, if the slot's resource is enough for all the slot requests, they are fulfilled directly; otherwise the over-allocated requests retry and apply for a new slot. In addition, when checking already-resolved allocated slots, the remaining resource is used for matching instead of the total resource.
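
The slot-sharing behaviour described above can be illustrated with a minimal sketch. `SharedSlotSketch` and its methods are hypothetical names, not classes from this PR; resources are reduced to a single integer, and serving requests in arrival order is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not Flink's actual SlotPool or MultiTaskSlot code.
class SharedSlotSketch {
    private final int total; // total resource of the underlying slot (abstract units)
    private int reserved;    // resource already granted to fulfilled requests

    SharedSlotSketch(int total) {
        this.total = total;
    }

    int remaining() {
        return total - reserved;
    }

    // Matching is done against the remaining resource, not the total.
    boolean matches(int requested) {
        return requested <= remaining();
    }

    // Called once the underlying slot is resolved. Requests that still fit are
    // fulfilled in arrival order (an assumption of this sketch); the others are
    // returned so that the caller can retry them against a new slot.
    List<Integer> resolve(List<Integer> pendingRequests) {
        List<Integer> toRetry = new ArrayList<>();
        for (int request : pendingRequests) {
            if (matches(request)) {
                reserved += request;
            } else {
                toRetry.add(request);
            }
        }
        return toRetry;
    }
}
```

With a slot of 1000 units and pending requests of 400, 500 and 300, the first two are fulfilled and the third is handed back for retry; a later request is then matched against the 100 units that remain.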
   
   For a co-location group, if the resource of the allocated slot is not enough for all the co-located tasks, the allocation fails without retry. More concretely, if the requests already exceed the allocated resource when the slot is offered by the RM, all of them fail directly without retry. If the requests do not exceed the allocated resource when the slot is offered by the RM, they are marked as successful; however, if a subsequent co-located request finds that not enough resource is left, that new request fails without retry. Since all the co-located tasks belong to the same region, all of them will eventually fail. This implementation avoids postponing the requests until all requests of the co-location group have been seen, so it introduces no drawbacks for requests without resource requirements.
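
The two co-location cases can be sketched as a small state machine. `CoLocationSlotSketch`, `Outcome`, and the method names are hypothetical and only mirror the description above; they are not the PR's API, and resources are again a single integer.

```java
// Hypothetical sketch; not Flink's actual co-location handling.
class CoLocationSlotSketch {
    enum Outcome { PENDING, FULFILLED, FAILED_NO_RETRY }

    private int requested;  // sum of the resources requested so far
    private int total = -1; // -1 until the RM has offered the slot

    // A co-located task asks for resources on this slot.
    Outcome request(int amount) {
        if (total < 0) {
            // Slot not offered yet: only record the demand.
            requested += amount;
            return Outcome.PENDING;
        }
        if (requested + amount > total) {
            // Slot already resolved and too small: fail without retry.
            return Outcome.FAILED_NO_RETRY;
        }
        requested += amount;
        return Outcome.FULFILLED;
    }

    // The RM offers the slot; the result applies to every pending request.
    Outcome offerSlot(int totalResource) {
        total = totalResource;
        return requested > total ? Outcome.FAILED_NO_RETRY : Outcome.FULFILLED;
    }
}
```

Because the check happens both when the slot is offered and on each later request, no request has to wait until the whole co-location group has been seen.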
   
   ## Brief change log
   
   1. Introduce statistics of the requested resource in the hierarchical structure of MultiTaskSlot/SingleTaskSlot to keep track of the already requested resources.
   2. Modify the `SlotSelectionStrategy` interface to also pass the remaining resource of the underlying slot. Strategy implementations should use the remaining resource instead of the total resource.
   3. Add resource checking logic for when the underlying slot is resolved. Over-allocated requests are marked as failed. The failure is retryable iff some requests are fulfilled by the underlying slot.
   4. Add retry logic for over-allocated requests in `SchedulerImpl`, applied if the exception is marked as retryable.
   5. For co-located requests, add a check of whether the remaining resource can fulfill the requests if the underlying slot is already resolved.
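
As a rough illustration of item 1, the requested resource can be aggregated over a tree of slots. The class names below carry a `Sketch` suffix to mark them as hypothetical stand-ins for MultiTaskSlot/SingleTaskSlot; the sketch recomputes the sum on demand, whereas the actual change may maintain the statistics incrementally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of hierarchical bookkeeping; not the PR's actual classes.
abstract class TaskSlotNodeSketch {
    abstract int requestedResource();
}

// Leaf: a single task with its own resource request.
class SingleTaskSlotSketch extends TaskSlotNodeSketch {
    private final int requested;

    SingleTaskSlotSketch(int requested) {
        this.requested = requested;
    }

    @Override
    int requestedResource() {
        return requested;
    }
}

// Inner node: its requested resource is the sum over all of its children.
class MultiTaskSlotSketch extends TaskSlotNodeSketch {
    private final List<TaskSlotNodeSketch> children = new ArrayList<>();

    void addChild(TaskSlotNodeSketch child) {
        children.add(child);
    }

    void releaseChild(TaskSlotNodeSketch child) {
        children.remove(child);
    }

    @Override
    int requestedResource() {
        int sum = 0;
        for (TaskSlotNodeSketch child : children) {
            sum += child.requestedResource();
        }
        return sum;
    }
}
```

Comparing the root's `requestedResource()` with the total resource of the underlying slot gives the remaining resource that a selection strategy would match against.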
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     - Added tests that validate the calculation of the requested resource for the hierarchical structure of MultiTaskSlot/SingleTaskSlot.
     - Added tests that validate the routine to fail the over-allocated requests when the underlying slot is resolved.
     - Added tests that validate the retry logic after failing due to over-allocation.
     - Added tests that validate the failure of the co-located requests if the slot's resource is not enough for all the co-located tasks.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): **no**
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no**
     - The serializers: **no**
     - The runtime per-record code paths (performance sensitive): **no**
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: **yes**
     - The S3 file system connector: **no**
   
   ## Documentation
   
     - Does this pull request introduce a new feature? **no**
     - If yes, how is the feature documented? [Doc](https://docs.google.com/document/d/1UR3vYsLOPXMGVyXHXYg3b5yNZcvSTvot4wela8knVAY/edit?usp=sharing)
   
