Posted to dev@reef.apache.org by Jason Jeong <cu...@gmail.com> on 2015/09/22 10:58:58 UTC

Group Communication timeout

Hi,

I have a question regarding Group Communication, in both Java and .NET.
I faced a situation where Group Communication (Java) fails to detect a task
failure in the communication topology, waiting forever for a message that
should've been sent by the failed task if it were alive. The job ends with
a timeout.

While examining the code, I noticed that both OperatorTopologyStructImpl (Java)
and OperatorTopology (.NET) wait for incoming messages at some point, using
BlockingQueue.take() and BlockingCollection.Take() respectively. If a node
fails while a receiver is blocked in that call, there is no way to escape
from it and the job hangs. Is this simply because the current code doesn't
consider such failure cases? Or am I missing something?
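
To make the question concrete, here is a minimal sketch of the kind of escape
path I have in mind. It is not REEF code: the class TimedReceive, the method
receiveOrGiveUp, and the senderFailed flag are hypothetical names of mine, and
I am only assuming that something (e.g. a failure handler) could set such a
flag. The only real API used is BlockingQueue.poll(timeout, unit) from
java.util.concurrent, which returns null when the timeout elapses instead of
blocking indefinitely like take().

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class TimedReceive {

  /**
   * Waits for a message, waking up every pollMillis to check a failure flag.
   * Returns the message, or null if the sender was marked as failed and the
   * queue was still empty. Hypothetical helper, not part of REEF.
   */
  static <T> T receiveOrGiveUp(final BlockingQueue<T> queue,
                               final AtomicBoolean senderFailed,
                               final long pollMillis) throws InterruptedException {
    while (true) {
      // poll() returns null on timeout, unlike take(), which blocks forever.
      final T msg = queue.poll(pollMillis, TimeUnit.MILLISECONDS);
      if (msg != null) {
        return msg;
      }
      // No message yet: stop waiting if the sender is known to have failed.
      if (senderFailed.get()) {
        return null;
      }
    }
  }

  public static void main(final String[] args) throws InterruptedException {
    final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    final AtomicBoolean senderFailed = new AtomicBoolean(false);

    // Normal case: a message is already queued, so it is returned.
    queue.put("hello");
    System.out.println(receiveOrGiveUp(queue, senderFailed, 50));

    // Failure case: the queue is empty and the sender is marked as failed,
    // so the call returns null after one poll interval instead of hanging.
    senderFailed.set(true);
    System.out.println(receiveOrGiveUp(queue, senderFailed, 50));
  }
}
```

Messages that arrived before the failure are still delivered, because the
queue is checked before the flag. Whether a flag like this could be wired to
the existing failure notifications is exactly what I am unsure about.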

I'd greatly appreciate advice from anyone who has participated in writing
the Group Communication code.

Thanks,
Jason