Posted to issues@impala.apache.org by "Lars Volker (JIRA)" <ji...@apache.org> on 2017/04/09 10:36:41 UTC
[jira] [Resolved] (IMPALA-3794) test_breakpad.py is flaky

     [ https://issues.apache.org/jira/browse/IMPALA-3794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Volker resolved IMPALA-3794.
---------------------------------
    Resolution: Fixed

I'm marking this one as fixed, since the test should no longer be flaky with this fix. Follow-up work is tracked in IMPALA-5187.

IMPALA-3794: Workaround for Breakpad ID conflicts

During process startup, Breakpad randomly picks the ID of the minidump
file that it would write in case of a crash. The random generator is
seeded with the current system time at one-second granularity, so if two
impalads start up within the same second, their minidump names can
conflict. The one-second delay between starting impalads in
start-impala-cluster.py is not sufficient:

I0407 22:34:52.018563 28473 minidump.cc:245] Setting minidump size limit to 20971520.
I0407 22:34:52.997046 28749 minidump.cc:245] Setting minidump size limit to 20971520.
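
The effect can be illustrated with a minimal Python sketch. It is a
simplified model and not Breakpad's actual code: minidump_id is a
hypothetical stand-in for an ID generator whose seed is the start time
truncated to whole seconds. The two start times are taken from the log
lines above, expressed as seconds since midnight.

```python
import random


def minidump_id(start_time_seconds):
    """Hypothetical ID generator seeded with a whole-second timestamp."""
    rng = random.Random(start_time_seconds)
    return "%032x" % rng.getrandbits(128)


# Start times of the two impalads from the log lines (seconds since midnight).
first_start = 22 * 3600 + 34 * 60 + 52.018563
second_start = 22 * 3600 + 34 * 60 + 52.997046

# Truncating to whole seconds gives both processes the same seed.
id_a = minidump_id(int(first_start))
id_b = minidump_id(int(second_start))

# A process started in the following second gets a different seed.
id_c = minidump_id(int(second_start) + 1)

print(id_a == id_b)  # same seed, same ID
print(id_a == id_c)  # different seed
```

In this model both processes, started almost a full second apart, still
end up with the same ID because their start times fall into the same
wall-clock second.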

When a signal is sent to all of them, one process can overwrite the
minidump of another. This is an upstream issue and is tracked in
Breakpad-681. I confirmed this suspicion by temporarily giving each
running impalad instance its own output folder, after which I was
unable to reproduce the issue. However, fixing the underlying issue is
a cleaner solution than changing the folder locations for minidumps in
Impala.

Until this is fixed upstream, we can make sure that we see at least one
minidump for the group of impalads in the test cluster. This is not a
product defect, since we don't support running multiple impalads on a
single host, let alone starting them all at once.
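
A relaxed check of this kind could look roughly like the following
Python sketch. It is not the code from the actual change: the helper
names, the "impalad" subdirectory and the assumption that minidumps
carry a .dmp suffix are all made up for illustration.

```python
import glob
import os
import tempfile


def count_minidumps(minidump_base_dir):
    """Count *.dmp files anywhere below minidump_base_dir (assumed suffix)."""
    pattern = os.path.join(minidump_base_dir, "**", "*.dmp")
    return len(glob.glob(pattern, recursive=True))


def check_at_least_one_minidump(minidump_base_dir):
    """Relaxed check: pass as long as at least one minidump was written."""
    found = count_minidumps(minidump_base_dir)
    if found < 1:
        raise AssertionError("no minidump found in %s" % minidump_base_dir)
    return found


def demo():
    # Simulate three impalads of which two collided on the same file name,
    # so only two files remain in the shared directory.
    with tempfile.TemporaryDirectory() as base_dir:
        dump_dir = os.path.join(base_dir, "impalad")  # hypothetical layout
        os.makedirs(dump_dir)
        for name in ("aaaa.dmp", "bbbb.dmp"):
            with open(os.path.join(dump_dir, name), "wb") as f:
                f.write(b"MDMP")
        return check_at_least_one_minidump(base_dir)
```

Such a check no longer fails when one minidump overwrites another, at
the cost of not noticing a single impalad that writes no minidump at
all.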

To test this I ran the following loop for about an hour on my dev
machine without hitting the issue:

while [ $? -eq 0 ]; do impala-py.test \
  tests/custom_cluster/test_breakpad.py --exploration_strategy=exhaustive \
  -k test_minidump_relative_path -x -s; done

Change-Id: I4ae589f6eb5cbbfb860943214edc0e6415eeb862
Reviewed-on: http://gerrit.cloudera.org:8080/6588
Reviewed-by: Lars Volker <lv...@cloudera.com>
Tested-by: Impala Public Jenkins

> test_breakpad.py is flaky
> -------------------------
>
>                 Key: IMPALA-3794
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3794
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 2.6.0, Impala 2.8.0, Impala 2.9.0
>            Reporter: Lars Volker
>            Assignee: Lars Volker
>            Priority: Critical
>              Labels: breakpad, broken-build, flaky
>
> Running the following command usually fails within an hour: {{while [ $? -eq 0 ]; do ./run-tests.py ./custom_cluster/test_breakpad.py --exploration_strategy=exhaustive -k test_minidump_relative_path -x -s; done}}. The problem is that only two of the three impalad processes write a minidump, while the third one does not. However, in that case there is no Breakpad-related error message in its log file.
> Creating the {{minidump_base_dir}} beforehand seems to alleviate the problem, suggesting there might be a race condition somewhere in the call to {{boost::filesystem::create_directories}}.
> This problem only seems to arise in the specific scenario of multiple impalad processes sharing the same {{minidump_base_dir}}, which also has to be non-existent. It does not seem possible for it to occur outside of tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)