You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by "Sorabh Hamirwasia (JIRA)" <ji...@apache.org> on 2017/08/14 20:58:00 UTC

[jira] [Created] (DRILL-5722) Query with root and non-root fragments can potentially hang if Control Connection fails due to network issue.

Sorabh Hamirwasia created DRILL-5722:
----------------------------------------

Summary: Query with root and non-root fragments can potentially hang if Control Connection fails due to network issue.
Key: DRILL-5722
URL: https://issues.apache.org/jira/browse/DRILL-5722
Project: Apache Drill
Issue Type: Bug
Reporter: Sorabh Hamirwasia

Recently I found an issue (Thanks to [~knguyen] for creating the test scenario) related to Fragment Status reporting and would like some feedback on it.

When a client submits a query to Foreman, then it is planned by Foreman and later fragments are scheduled to root and non-root nodes. Foreman creates a DriilbitStatusListener and FragmentStatusListener to know about the health of Drillbit node and a fragment respectively. The way root and non-root fragments are setup by Foreman are different:
Root fragments are setup without any communication over control channel (since it is executed locally on Foreman)
Non-root fragments are setup by sending control message (REQ_INITIALIZE_FRAGMENTS_VALUE) over wire. If there is failure in sending any such control message (like due to network hiccup's) during query setup then the query is failed and client is notified.

Each fragment is executed on it's node with the help Fragment Executor which has an instance for FragmentStatusReporter. FragmentStatusReporter helps to update the status of a fragment to Foreman node over a control tunnel or connection using RPC message (REQ_FRAGMENT_STATUS) both for root and non-root fragments.

Based on above when root fragment is submitted for setup then it is done locally without any RPC communication whereas when status for that fragment is reported by fragment executor that happens over control connection by sending a RPC message. But for non-root fragment setup and status update both happens using RPC message over control connection.

Issue 2:
For complex query where we have root and non-root fragments, if the fragment setup was done fine and later during the query execution the control connection is lost. But as part of status update, fragments will try to create a new Control connection to foreman but let say they fails to do so and eventually got completed. Then in this case also Foreman will think that the fragment is still running and Query is not completed and can hang the query.

I don't see any timeout logic on the Foreman side for a query in execution which might be because we don't know how long a query will take for execution. Nor did I see any message from Foreman which keep's check on Fragment status in case if it didn't received any update from other fragments for a time interval. If not may be we should add the second option or look for other alternatives as well. It would be helpful to get some feedback on both the issues and proposed solutions.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)