Posted to commits@impala.apache.org by tm...@apache.org on 2019/02/13 03:39:00 UTC
[impala] 04/04: IMPALA-8183: fix test_reportexecstatus_retry flakiness

This is an automated email from the ASF dual-hosted git repository.

tmarshall pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 9492d451d5d5a82bfc6f4c93c3a0c6e6d0cc4981
Author: Thomas Tauber-Marshall <tm...@cloudera.com>
AuthorDate: Tue Feb 12 22:47:52 2019 +0000

    IMPALA-8183: fix test_reportexecstatus_retry flakiness
    
    The test is designed to cause ReportExecStatus() RPCs to fail by
    backing up the control service queue. Previously, after a failed
    ReportExecStatus() we would wait 'report_status_retry_interval_ms'
    between retries, which was 100ms by default and wasn't touched by the
    test. That 100ms was right on the edge of being enough time for the
    coordinator to keep up with processing the reports, so that some would
    fail but most would succeed. It was always possible that we could hit
    IMPALA-2990 in this setup, but it was unlikely.
    
    Now, with IMPALA-4555, 'report_status_retry_interval_ms' was removed
    and we instead wait 'status_report_interval_ms' between retries. By
    default, this is 5000ms, so it should give the coordinator even more
    time and make these issues less likely. However, the test sets
    'status_report_interval_ms' to 10ms, which isn't nearly enough time
    for the coordinator to do its processing, so many of the
    ReportExecStatus() RPCs fail and we hit IMPALA-2990 fairly often.
    
    The solution is to set 'status_report_interval_ms' to 100ms in the
    test, which achieves roughly the same retry frequency as before. The
    same change is made to a similar test, test_reportexecstatus_timeout.
    
    Testing:
    - Ran test_reportexecstatus_retry in a loop 400 times without seeing a
      failure. Previously the failure reproduced for me about once per 50
      runs.
    - Manually verified that both tests are still hitting the error paths
      that they are supposed to be testing.
    
    Change-Id: I7027a6e099c543705e5845ee0e5268f1f9a3fb05
    Reviewed-on: http://gerrit.cloudera.org:8080/12461
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
 tests/custom_cluster/test_rpc_timeout.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
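
The retry-interval reasoning in the commit message can be illustrated with a
rough back-of-the-envelope model. This is only a sketch: the helper
offered_load() and every number in it are hypothetical, chosen for
illustration and not measured from Impala, and it treats the control service
as a single thread with a fixed per-report cost, which simplifies how the
real service queue behaves.

```python
def offered_load(num_reporters, service_time_ms, report_interval_ms):
    """Fraction of one service thread's time needed to keep up with reports.

    Hypothetical helper, not part of Impala. A value above 1.0 means reports
    arrive faster than a single thread can process them, so the queue backs
    up and more RPCs get rejected.
    """
    if report_interval_ms <= 0:
        raise ValueError("report_interval_ms must be positive")
    return num_reporters * service_time_ms / float(report_interval_ms)


# Assumed values for illustration only; they are not taken from the commit.
NUM_REPORTERS = 3        # assumed number of senders of status reports
SERVICE_TIME_MS = 30.0   # assumed coordinator cost to process one report

# Compare the interval the test used to set, the new one, and the default.
for interval_ms in (10, 100, 5000):
    load = offered_load(NUM_REPORTERS, SERVICE_TIME_MS, interval_ms)
    state = "overloaded" if load > 1.0 else "keeping up"
    print("interval=%dms load=%.2f (%s)" % (interval_ms, load, state))
```

Whatever the real per-report cost is, dropping the interval from 100ms to
10ms multiplies the offered load by ten in this model, which is consistent
with the test moving from "right on the edge" to failing often.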

diff --git a/tests/custom_cluster/test_rpc_timeout.py b/tests/custom_cluster/test_rpc_timeout.py
index d007ef4..e1a959c 100644
--- a/tests/custom_cluster/test_rpc_timeout.py
+++ b/tests/custom_cluster/test_rpc_timeout.py
@@ -128,7 +128,7 @@ class TestRPCTimeout(CustomClusterTestSuite):
 
   # Inject jitter into the RPC handler of ReportExecStatus() to trigger RPC timeout.
   @pytest.mark.execute_serially
-  @CustomClusterTestSuite.with_args("--status_report_interval_ms=10"
+  @CustomClusterTestSuite.with_args("--status_report_interval_ms=100"
       " --backend_client_rpc_timeout_ms=1000")
   def test_reportexecstatus_timeout(self, vector):
     query_options = {'debug_action': 'REPORT_EXEC_STATUS_DELAY:JITTER@1500@0.5'}
@@ -137,7 +137,7 @@ class TestRPCTimeout(CustomClusterTestSuite):
   # Use a small service queue memory limit and a single service thread to exercise
   # the retry paths in the ReportExecStatus() RPC
   @pytest.mark.execute_serially
-  @CustomClusterTestSuite.with_args("--status_report_interval_ms=10"
+  @CustomClusterTestSuite.with_args("--status_report_interval_ms=100"
       " --control_service_queue_mem_limit=1 --control_service_num_svc_threads=1")
   def test_reportexecstatus_retry(self, vector):
     self.execute_query_verify_metrics(self.TEST_QUERY, None, 10)