Posted to issues@impala.apache.org by "Sailesh Mukil (JIRA)" <ji...@apache.org> on 2018/03/14 18:07:00 UTC

[jira] [Created] (IMPALA-6662) Make stress test resilient to hangs due to client crashes

Sailesh Mukil created IMPALA-6662:
-------------------------------------

             Summary: Make stress test resilient to hangs due to client crashes
                 Key: IMPALA-6662
                 URL: https://issues.apache.org/jira/browse/IMPALA-6662
             Project: IMPALA
          Issue Type: Task
          Components: Infrastructure
            Reporter: Sailesh Mukil
            Assignee: Sailesh Mukil


The concurrent_select.py process starts multiple subprocesses (called query runners) to run the queries. It also starts two threads: the query producer thread and the query consumer thread. The query producer thread adds queries to a query queue, and the query consumer thread pulls queries off the queue and feeds them to the query runners.
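
To make the flow above concrete, here is a minimal sketch of the producer / consumer / runner arrangement. It is not the code in concurrent_select.py: the runners are threads here rather than subprocesses so that the example stays self-contained, and run_pipeline, runner_queue and the DONE sentinel are hypothetical names used only for illustration.

```python
import queue
import threading


def run_pipeline(queries, num_runners=2):
    """Hypothetical sketch: producer -> query queue -> consumer -> runners."""
    query_queue = queue.Queue()    # filled by the producer thread
    runner_queue = queue.Queue()   # fed by the consumer thread, read by runners
    results = []
    results_lock = threading.Lock()
    DONE = object()                # sentinel marking the end of the work

    def producer():
        for q in queries:
            query_queue.put(q)
        query_queue.put(DONE)

    def consumer():
        while True:
            q = query_queue.get()
            if q is DONE:
                # Tell every runner to stop.
                for _ in range(num_runners):
                    runner_queue.put(DONE)
                return
            runner_queue.put(q)

    def runner():
        # Assumption: stands in for a query runner subprocess.
        while True:
            q = runner_queue.get()
            if q is DONE:
                return
            with results_lock:
                results.append("ran: " + q)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    threads += [threading.Thread(target=runner) for _ in range(num_runners)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

Calling run_pipeline(["q1", "q2", "q3"]) runs each query once on whichever runner picks it up; the order in which runners finish is not deterministic, which is why the results are sorted before being returned.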

The query runner, once it gets queries, does the following:

{code:java}
# Pseudocode. Real code: https://github.com/apache/impala/blob/d49f629c447ea59ad73ceeb0547fde4d41c651d1/tests/stress/concurrent_select.py#L583-L595

with _submit_query_lock:
    increment(num_queries_started)
run_query()    # One runner crashes here.
increment(num_queries_finished)

{code}

One of the runners crashes inside run_query() and therefore never increments num_queries_finished.

Another thread, which is supposed to check for memory leaks (but actually doesn't), periodically acquires '_submit_query_lock' and waits for the number of running queries to reach 0 before releasing the lock:
https://github.com/apache/impala/blob/d49f629c447ea59ad73ceeb0547fde4d41c651d1/tests/stress/concurrent_select.py#L449-L511

However, in the above case, the number of running queries never reaches 0, because the crashed query runner exited without incrementing 'num_queries_finished'. The poll_mem_usage() function therefore holds the lock indefinitely, so no new queries are submitted and the stress test never completes.
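
One possible way to avoid the indefinite wait is sketched below. It is only a sketch under stated assumptions, not the fix for this issue: wait_until_idle and its parameters are hypothetical names, and it assumes the number of running queries can be computed as num_queries_started minus num_queries_finished and that the caller can tell whether every runner process is still alive.

```python
import time


def wait_until_idle(num_started, num_finished, runners_alive,
                    timeout_secs=60.0, poll_secs=0.1):
    """Wait for the number of running queries to reach 0, but never forever.

    num_started, num_finished: zero-argument callables returning the counters.
    runners_alive: zero-argument callable returning True while every runner
        process is still alive (hypothetical hook, not part of the real code).
    Returns True once no queries are running, and False if a runner has died
    or the timeout expires first.
    """
    deadline = time.monotonic() + timeout_secs
    while num_started() - num_finished() > 0:
        if not runners_alive():
            # A dead runner will never increment the finished counter.
            return False
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_secs)
    return True
```

With a helper like this, the polling thread could release '_submit_query_lock' and report a failure when False is returned, instead of holding the lock while waiting for a count that can no longer reach 0.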



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)