Posted to reviews@helix.apache.org by GitBox <gi...@apache.org> on 2020/11/06 19:07:23 UTC

[GitHub] [helix] jiajunwang opened a new issue #1512: A possible race condition causes ERROR task and block the job.

jiajunwang opened a new issue #1512:
URL: https://github.com/apache/helix/issues/1512


   ### Describe the bug
   This issue is believed to be triggered by PR https://github.com/apache/helix/commit/f11396e5feebe20552d259e553342c17a8573a8e
   The theory is that the PR logic follows the design and is correct on its own, but it exposes a latent problem:
   
   1. We agreed that if scheduling a state transition task fails, the partition should be put in the ERROR state.
   2. When we close a participant, we shut down the executor first. Because the callback handler is still alive at that point, there is a race in which the handler tries to execute a state transition. Since the thread pool has already been shut down, scheduling fails and the partition ends up in the ERROR state.
   3. In most cases this is harmless: the participant shuts down immediately after the thread pool does, and with no thread pool there is no side effect on the real application logic.
   4. Unfortunately, the Task Framework (TF) may observe the ERROR state during the race. When it does, I find that it stops processing the job because of this ERROR task, even though the participant has been shut down and there is no live instance.
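
The sequence in steps 1 and 2 can be reproduced in isolation with a plain `ExecutorService`. The following is only a minimal sketch, not Helix code: `ShutdownRaceSketch`, `PartitionState`, `shutdownExecutor` and `onStateTransitionMessage` are hypothetical names invented for illustration, and the race is made deterministic by calling the two methods in the problematic order.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical stand-in for a participant; none of these names are Helix APIs.
class ShutdownRaceSketch {
    enum PartitionState { OFFLINE, ONLINE, ERROR }

    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private volatile PartitionState state = PartitionState.OFFLINE;

    // Step 2: the executor is shut down first while the handler is still alive.
    void shutdownExecutor() {
        executor.shutdown();
    }

    // Invoked by the still-alive callback handler when a transition message arrives.
    void onStateTransitionMessage() {
        try {
            executor.execute(() -> state = PartitionState.ONLINE);
        } catch (RejectedExecutionException e) {
            // Step 1: scheduling failed, so the partition is marked ERROR.
            state = PartitionState.ERROR;
        }
    }

    PartitionState state() {
        return state;
    }

    // Runs the two calls in the order that the race produces.
    static PartitionState demo() {
        ShutdownRaceSketch participant = new ShutdownRaceSketch();
        participant.shutdownExecutor();
        participant.onStateTransitionMessage();
        return participant.state();
    }
}
```

In this sketch the ERROR state is a pure artifact of shutdown ordering: no application logic ran, yet an observer that reads the state afterwards still sees ERROR.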
   
   I noticed this because TestTaskRebalancerFailover became unstable due to this race condition.
   Two potential fixes:
   1. Change the TF logic so that it ignores an ERROR partition on an offline node.
   2. Or fix the participant shutdown process.
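
As a rough illustration of option 1, the check could amount to discarding ERROR partitions whose instance is no longer live before deciding whether the job is blocked. This is a sketch under assumptions: `ErrorTaskFilterSketch` and `errorsOnLiveInstances` are hypothetical names, and the map and set inputs stand in for whatever the TF actually tracks internally.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not part of the Helix Task Framework.
class ErrorTaskFilterSketch {
    // Keeps only the ERROR partitions that are assigned to a live instance.
    // Key: partition id, value: name of the instance reporting ERROR.
    static Map<Integer, String> errorsOnLiveInstances(
            Map<Integer, String> errorPartitionToInstance,
            Set<String> liveInstances) {
        Map<Integer, String> result = new HashMap<>();
        for (Map.Entry<Integer, String> entry : errorPartitionToInstance.entrySet()) {
            if (liveInstances.contains(entry.getValue())) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

With a filter like this, an ERROR reported by a participant that has already gone offline would no longer count against the job, while ERRORs on live instances would still be handled as before.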
   
   ### To Reproduce
   Run TestTaskRebalancerFailover several times; it usually gets stuck on the 2nd or 3rd try.
   
   ### Expected behavior
   The job should finish.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@helix.apache.org
For additional commands, e-mail: reviews-help@helix.apache.org