Posted to issues-all@impala.apache.org by "Joe McDonnell (JIRA)" <ji...@apache.org> on 2019/03/20 20:21:00 UTC

[jira] [Commented] (IMPALA-8322) S3 tests encounter "timed out waiting for receiver fragment instance"

    [ https://issues.apache.org/jira/browse/IMPALA-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16797515#comment-16797515 ] 

Joe McDonnell commented on IMPALA-8322:
---------------------------------------

Added run_tests_swimlane.json.gz, a swimlane view of pytest execution during the failure scenario. (This can be viewed in Chrome by going to chrome://tracing and loading the file.) For this run, the following tests were running concurrently when the failure occurred:
 # query_test/test_decimal_fuzz.py::TestDecimalFuzz::()::test_decimal_ops[exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 5000, 'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0}]
 # query_test/test_cancellation.py::TestCancellationParallel::()::test_cancel_select[protocol: beeswax | table_format: avro/snap/block | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | query_type: SELECT | wait_action: 0:GETNEXT:WAIT | cancel_delay: 0.01 | cpu_limit_s: 100000 | query: select * from lineitem order by l_orderkey | fail_rpc_action: COORD_CANCEL_QUERY_FINSTANCES_RPC:FAIL | join_before_close: True | buffer_pool_limit: 0]
 # query_test/test_kudu.py::TestDropDb::()::test_drop_non_empty_db
 # query_test/test_exprs.py::TestExprs::()::test_exprs[protocol: beeswax | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: text/none | enable_expr_rewrites: 0]
 # query_test/test_chars.py::TestStringQueries::()::test_chars_tmp_tables[protocol: beeswax | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': True, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: text/none]
 # query_test/test_aggregation.py::TestDistinctAggregation::()::test_multiple_distinct[protocol: beeswax | exec_option: {'disable_codegen': True, 'shuffle_distinct_exprs': False} | table_format: text/none]
 # query_test/test_mt_dop.py::TestMtDop::()::test_compute_stats[mt_dop: 8 | protocol: beeswax | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: parquet/none]
 # metadata/test_last_ddl_time_update.py::TestLastDdlTimeUpdate::()::test_kudu[protocol: beeswax | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: text/none]

Of these, #3 and #8 hit the error in this run. Some of the others (like #6 and #7) have hit issues in previous runs, and some tests that have hit issues are not listed here.

> S3 tests encounter "timed out waiting for receiver fragment instance"
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-8322
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8322
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.3.0
>            Reporter: Joe McDonnell
>            Priority: Blocker
>              Labels: broken-build
>         Attachments: run_tests_swimlane.json.gz
>
>
> This has been seen multiple times when running s3 tests:
> {noformat}
> query_test/test_join_queries.py:57: in test_basic_joins
>     self.run_test_case('QueryTest/joins', new_vector)
> common/impala_test_suite.py:472: in run_test_case
>     result = self.__execute_query(target_impalad_client, query, user=user)
> common/impala_test_suite.py:699: in __execute_query
>     return impalad_client.execute(query, user=user)
> common/impala_connection.py:174: in execute
>     return self.__beeswax_client.execute(sql_stmt, user=user)
> beeswax/impala_beeswax.py:183: in execute
>     handle = self.__execute_query(query_string.strip(), user=user)
> beeswax/impala_beeswax.py:360: in __execute_query
>     self.wait_for_finished(handle)
> beeswax/impala_beeswax.py:381: in wait_for_finished
>     raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> E    Query aborted:Sender 127.0.0.1 timed out waiting for receiver fragment instance: 6c40d992bb87af2f:0ce96e5d00000007, dest node: 4{noformat}
> This is related to IMPALA-6818. On a bad run, there are various time outs in the impalad logs:
> {noformat}
> I0316 10:47:16.359313 20175 krpc-data-stream-mgr.cc:354] Sender 127.0.0.1 timed out waiting for receiver fragment instance: ef4a5dc32a6565bd:a8720b8500000007, dest node: 5
> I0316 10:47:16.359345 20175 rpcz_store.cc:265] Call impala.DataStreamService.TransmitData from 127.0.0.1:40030 (request call id 14881) took 120182ms. Request Metrics: {}
> I0316 10:47:16.359380 20175 krpc-data-stream-mgr.cc:354] Sender 127.0.0.1 timed out waiting for receiver fragment instance: d148d83e11a4603d:54dc35f700000004, dest node: 3
> I0316 10:47:16.359395 20175 rpcz_store.cc:265] Call impala.DataStreamService.TransmitData from 127.0.0.1:40030 (request call id 14880) took 123097ms. Request Metrics: {}
> ... various messages ...
> I0316 10:47:56.364990 20154 kudu-util.h:108] Cancel() RPC failed: Timed out: CancelQueryFInstances RPC to 127.0.0.1:27000 timed out after 10.000s (SENT)
> ... various messages ...
> W0316 10:48:15.056421 20150 rpcz_store.cc:251] Call impala.ControlService.CancelQueryFInstances from 127.0.0.1:40912 (request call id 202) took 48695ms (client timeout 10000).
> W0316 10:48:15.056473 20150 rpcz_store.cc:255] Trace:
> 0316 10:47:26.361265 (+ 0us) impala-service-pool.cc:165] Inserting onto call queue
> 0316 10:47:26.361285 (+ 20us) impala-service-pool.cc:245] Handling call
> 0316 10:48:15.056398 (+48695113us) inbound_call.cc:162] Queueing success response
> Metrics: {}
> I0316 10:48:15.057087 20139 connection.cc:584] Got response to call id 202 after client already timed out or cancelled{noformat}
> So far, this has only happened on S3. The system load at the time is not higher than normal; if anything, it is lower.
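
For triaging future occurrences, the slow-RPC lines in the excerpt above are regular enough to extract mechanically from the impalad logs. A rough sketch (the regex is derived only from the log lines quoted above, not from any Impala log-format specification, so it may need adjusting for other rpcz_store.cc message variants):

```python
import re

# Matches the rpcz_store.cc "Call ... took Nms" lines seen in the excerpt,
# covering both the plain and "(client timeout N)" variants.
SLOW_RPC = re.compile(
    r"Call (?P<method>\S+) from (?P<peer>\S+) "
    r"\(request call id (?P<call_id>\d+)\) took (?P<ms>\d+)ms"
)

def slow_rpcs(log_lines, threshold_ms=10000):
    """Yield (method, peer, call_id, duration_ms) for calls over the threshold."""
    for line in log_lines:
        m = SLOW_RPC.search(line)
        if m and int(m.group("ms")) >= threshold_ms:
            yield (m.group("method"), m.group("peer"),
                   int(m.group("call_id")), int(m.group("ms")))
```

Feeding the log excerpt above through slow_rpcs() surfaces the 120s TransmitData call and the 48s CancelQueryFInstances call without scanning the surrounding noise by hand.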



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
