Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/01/07 18:05:00 UTC
[jira] [Commented] (IMPALA-7931) test_shutdown_executor fails with timeout waiting for query target state

    [ https://issues.apache.org/jira/browse/IMPALA-7931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736116#comment-16736116 ] 

ASF subversion and git services commented on IMPALA-7931:
---------------------------------------------------------

Commit a91b24cb7962200f330c4887f38f4704a52f7c7e in impala's branch refs/heads/master from Tim Armstrong
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=a91b24c ]

IMPALA-7931: fix executor shutdown races

There were two races:
* queries were terminated when the statestore detected an impalad
  as failed, even if the query had already finished executing on
  that impalad.
* NUM_FRAGMENTS_IN_FLIGHT was used to detect the backend being
  idle, but it was decremented before the final status report
  was sent.

The fixes are:
* keep track of the backends that triggered the potential cancellation,
  and only proceed with the cancellation if the coordinator has fragments
  still executing on the backend.
* add a new metric that keeps track of the number of executing queries,
  which isn't decremented until the final status report is sent.
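
The two fixes above can be illustrated with a small Python model. This is
only a sketch and not the Impala backend code, which is C++: the class
BackendState, its attribute names, and the helper backends_to_cancel_for
are hypothetical names chosen for illustration. It shows a counter that
is decremented only after the final status report, and a cancellation
check that ignores failed backends on which the query has nothing left
executing.

```python
import threading


class BackendState:
    """Toy model of an executor's bookkeeping (all names are hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.fragments_in_flight = 0     # drops as soon as fragments finish
        self.queries_executing = 0       # drops only after the final report
        self.total_queries_executed = 0  # monotonically increasing total

    def start_query(self, num_fragments):
        with self._lock:
            self.fragments_in_flight += num_fragments
            self.queries_executing += 1

    def finish_fragments(self, num_fragments):
        with self._lock:
            self.fragments_in_flight -= num_fragments

    def final_report_sent(self):
        with self._lock:
            self.queries_executing -= 1
            self.total_queries_executed += 1

    def is_idle(self):
        # Using fragments_in_flight here would report "idle" too early,
        # before the final status report has gone out.
        with self._lock:
            return self.queries_executing == 0


def backends_to_cancel_for(failed_backends, executing_backends):
    """Return the failed backends on which the query still has fragments
    executing. An empty result means the query should not be cancelled."""
    return sorted(set(failed_backends) & set(executing_backends))


state = BackendState()
state.start_query(num_fragments=2)
state.finish_fragments(2)
# Fragments are done, but the final status report is still pending.
pending_idle = state.is_idle()
state.final_report_sent()
final_idle = state.is_idle()
```

In this model, a shutdown that waits on is_idle() cannot proceed between
finish_fragments() and final_report_sent(), which is the window the second
race exploited.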

Also do some cleanup/improvements in this code:
* use proper error codes for some errors
* more overloads for Status::Expected()
* also add a metric for the total number of queries executed on the
  backend

Testing:
Add a new version of test_shutdown_executor with delays that
trigger both races. This test runs only in exhaustive mode to avoid
adding ~20s to core build time.

Ran exhaustive tests.

Looped test_restart_services overnight.

Change-Id: I7c1a80304cb6695d228aca8314e2231727ab1998
Reviewed-on: http://gerrit.cloudera.org:8080/12082
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> test_shutdown_executor fails with timeout waiting for query target state
> ------------------------------------------------------------------------
>
>                 Key: IMPALA-7931
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7931
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.2.0
>            Reporter: Lars Volker
>            Assignee: Tim Armstrong
>            Priority: Critical
>              Labels: broken-build
>         Attachments: impala-7931-impalad-logs.tar.gz
>
>
> On a recent S3 test run, test_shutdown_executor hit a timeout waiting for a query to reach state FINISHED. Instead, the query stayed at state 5 (EXCEPTION).
> {noformat}
> 12:51:11 __________________ TestShutdownCommand.test_shutdown_executor __________________
> 12:51:11 custom_cluster/test_restart_services.py:209: in test_shutdown_executor
> 12:51:11     assert self.__fetch_and_get_num_backends(QUERY, before_shutdown_handle) == 3
> 12:51:11 custom_cluster/test_restart_services.py:356: in __fetch_and_get_num_backends
> 12:51:11     self.client.QUERY_STATES['FINISHED'], timeout=20)
> 12:51:11 common/impala_service.py:267: in wait_for_query_state
> 12:51:11     target_state, query_state)
> 12:51:11 E   AssertionError: Did not reach query state in time target=4 actual=5
> {noformat}
> From the logs I can see that the query fails because one of the executors becomes unreachable:
> {noformat}
> I1204 12:31:39.954125  5609 impala-server.cc:1792] Query a34c3a84775e5599:b2b25eb900000000: Failed due to unreachable impalad(s): jenkins-worker:22001
> {noformat}
> The query was {{select count(*) from functional_parquet.alltypes where sleep(1) = bool_col}}.
> It seems that the query took longer than expected and was still running when the executor shut down.
> I can reproduce the failure by adding a sleep to the test:
> {noformat}
> diff --git a/tests/custom_cluster/test_restart_services.py b/tests/custom_cluster/test_restart_services.py
> index e441cbc..32bc8a1 100644
> --- a/tests/custom_cluster/test_restart_services.py
> +++ b/tests/custom_cluster/test_restart_services.py
> @@ -206,7 +206,7 @@ class TestShutdownCommand(CustomClusterTestSuite, HS2TestSuite):
>      after_shutdown_handle = self.__exec_and_wait_until_running(QUERY)
>  
>      # Finish executing the first query before the backend exits.
> -    assert self.__fetch_and_get_num_backends(QUERY, before_shutdown_handle) == 3
> +    assert self.__fetch_and_get_num_backends(QUERY, before_shutdown_handle, delay=5) == 3
>  
>      # Wait for the impalad to exit, then start it back up and run another query, which
>      # should be scheduled on it again.
> @@ -349,11 +349,14 @@ class TestShutdownCommand(CustomClusterTestSuite, HS2TestSuite):
>                  self.client.QUERY_STATES['RUNNING'], timeout=20)
>      return handle
>  
> -  def __fetch_and_get_num_backends(self, query, handle):
> +  def __fetch_and_get_num_backends(self, query, handle, delay=0):
>      """Fetch the results of 'query' from the beeswax handle 'handle', close the
>      query and return the number of backends obtained from the profile."""
>      self.impalad_test_service.wait_for_query_state(self.client, handle,
>                  self.client.QUERY_STATES['FINISHED'], timeout=20)
> +    if delay > 0:
> +      LOG.info("sleeping for {0}".format(delay))
> +      time.sleep(delay)
>      self.client.fetch(query, handle)
>      profile = self.client.get_runtime_profile(handle)
>      self.client.close_query(handle)
> {noformat}
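
The assertion in the traceback above comes from a poll-until-timeout
wait. A minimal sketch of that pattern follows, assuming the state is
read through a caller-supplied function; wait_for_state and its
parameters are hypothetical and this is not the code in
common/impala_service.py. The state numbers 4 (FINISHED) and 5
(EXCEPTION) are taken from the report.

```python
import time


def wait_for_state(get_state, target_state, timeout_s, interval_s=0.05):
    """Poll get_state() until it returns target_state or timeout_s elapses.

    Raises AssertionError naming the last observed state on timeout."""
    deadline = time.monotonic() + timeout_s
    actual = get_state()
    while actual != target_state and time.monotonic() < deadline:
        time.sleep(interval_s)
        actual = get_state()
    assert actual == target_state, (
        "Did not reach query state in time target={0} actual={1}".format(
            target_state, actual))
    return actual


# A query that moves from state 3 to FINISHED (4) is picked up by polling.
states = iter([3, 3, 4])
reached = wait_for_state(lambda: next(states), 4, timeout_s=1, interval_s=0)

# A query stuck in EXCEPTION (5) times out with the message from the report.
message = None
try:
    wait_for_state(lambda: 5, 4, timeout_s=0.01, interval_s=0.001)
except AssertionError as e:
    message = str(e)
```

A query in EXCEPTION will not go on to reach FINISHED, so a helper like
this could also fail fast on that state rather than waiting out the full
timeout; whether that is desirable depends on the test.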


