Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2019/11/06 07:44:09 UTC

[GitHub] [flink] hwanju opened a new pull request #10099: [FLINK-14589] [Runtime/Coordination] Redundant slot requests with the same AllocationID lead…

URL: https://github.com/apache/flink/pull/10099
 
 
   …s to inconsistent slot table
   
   ## What is the purpose of the change
   
   When a slot request is redundantly made with the same AllocationID for a
   slot index other than the already allocated one, the slot table becomes
   inconsistent: two slot indices are allocated, but the AllocationID is
   assigned only to the latest slot index. This can lead to slot leakage.
   This patch prevents such a redundant slot request from leaving the slot
   allocation state inconsistent by rejecting the request.
   
   ## Brief change log
   
     - Let `TaskSlotTable.allocateSlot` reject a slot allocation request if the requested allocation ID is already occupied by any slot.
     - Added a unit test to `TaskSlotTableTest`.
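   
   To illustrate the idea behind the guard, here is a minimal, self-contained sketch. It is a toy model and not Flink's actual `TaskSlotTable`: the class `SlotTableSketch`, its fields, and the `String` allocation IDs are all hypothetical names chosen for illustration.
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   /** Toy slot table (hypothetical; NOT Flink's TaskSlotTable). */
   class SlotTableSketch {
       // slot index -> allocation ID occupying that slot (null if the slot is free)
       private final String[] slots;
       // allocation ID -> slot index it is bound to
       private final Map<String, Integer> slotByAllocationId = new HashMap<>();
   
       SlotTableSketch(int numberOfSlots) {
           this.slots = new String[numberOfSlots];
       }
   
       /** Returns true if the slot was allocated, false if the request was rejected. */
       boolean allocateSlot(int index, String allocationId) {
           // Guard: reject the request if this allocation ID already occupies ANY slot.
           if (slotByAllocationId.containsKey(allocationId)) {
               return false;
           }
           // The requested slot is already taken by a different allocation.
           if (slots[index] != null) {
               return false;
           }
           slots[index] = allocationId;
           slotByAllocationId.put(allocationId, index);
           return true;
       }
   
       /** Frees the slot bound to the given allocation ID, if any. */
       boolean freeSlot(String allocationId) {
           Integer index = slotByAllocationId.remove(allocationId);
           if (index == null) {
               return false;
           }
           slots[index] = null;
           return true;
       }
   
       /** Number of slot indices currently marked as allocated. */
       int allocatedSlotCount() {
           int count = 0;
           for (String slot : slots) {
               if (slot != null) {
                   count++;
               }
           }
           return count;
       }
   
       public static void main(String[] args) {
           SlotTableSketch table = new SlotTableSketch(2);
           System.out.println(table.allocateSlot(0, "a1")); // first request for "a1"
           System.out.println(table.allocateSlot(1, "a1")); // redundant request, other index
           System.out.println(table.allocatedSlotCount());
       }
   }
   ```
   
   In this model, dropping the first check would let the second request mark slot 1 as allocated and repoint `"a1"` to it, so a later `freeSlot("a1")` would release only slot 1 and leave slot 0 occupied with no allocation ID leading back to it.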
   
   ## Verifying this change
    - Existing tests should pass.
    - The newly added `testRedundantSlotAllocation` passes with the fix.
    - Manually verified the change by running a constantly failing app (throwing an exception in a UDF to trigger fail-over, and swallowing interrupts on the source to make cancellation get stuck) with parallelism 64, 8 slots per taskmanager, and a 600ms heartbeat timeout (100ms heartbeat interval), to trigger both continuous fail-over and heartbeat timeouts. Without the fix, this stress test hits the bug and then keeps getting slot allocation failure exceptions. With the fix, it survives for days without slot allocation failures (with the log `Allocation ID {} is already allocated in {}.` printed, indicating that it exercises the bug path).
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable 
   
