You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Michael Ho (JIRA)" <ji...@apache.org> on 2018/06/23 01:30:00 UTC

[jira] [Commented] (IMPALA-6818) Rethink data-stream sender/receiver startup sequencing

    [ https://issues.apache.org/jira/browse/IMPALA-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520935#comment-16520935 ] 

Michael Ho commented on IMPALA-6818:
------------------------------------

[~dhecht] and I discussed about an alternative which may allow us to get rid of this timeout option.

When a sender issues a {{TransmitData()}} RPC before the receiver is created, the data stream manager should just reply to the sender with an error message that it's not found. The sender will then keep pinging the destination fragment periodically until the receiver shows up or until the sender's fragment is cancelled. There are some races which may still occur:

- The receiver is closed before the sender manages to send the first row batch to the receiver. In which case, the sender will keep pinging the destination fragment until the sender fragment is cancelled. The assumption is that the receiver will only be closed prematurely before receiving all the data from the sender when (1) the query is cancelled or (2) the ancestor executors of the exchange node has a limit and closes the exchange node once the limit is reached. In case (2), it's expected that the report status thread of the sender's fragment will notice that all results have been returned and cancel the sender fragment (I believe this may not yet be the case until IMPALA-5119 is fixed. [~dhecht], please correct me if I misunderstood anything).

> Rethink data-stream sender/receiver startup sequencing
> ------------------------------------------------------
>
>                 Key: IMPALA-6818
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6818
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Distributed Exec
>            Reporter: Dan Hecht
>            Assignee: Michael Ho
>            Priority: Major
>
> IMPALA-1599 introduced parallel fragment startup, which is good for startup latency. However, it meant that data-stream senders can start before receivers, and there is a timeout to handle the case when the receiver never shows up:
> {code:java}
> Sender timed out waiting for receiver fragment instance{code}
> We see this timeout fairly regularly (e.g. when a host has a spike in load and does not process the exec rpc for a while). Let's rethink how this works to see if we can make it robust but being careful to not sacrifice startup time too much.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org