Posted to issues@impala.apache.org by "Joe McDonnell (Jira)" <ji...@apache.org> on 2020/01/03 18:43:00 UTC
[jira] [Resolved] (IMPALA-9241) Minicluster service status ambiguous when pids wrap around

     [ https://issues.apache.org/jira/browse/IMPALA-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell resolved IMPALA-9241.
-----------------------------------
    Fix Version/s: Impala 3.4.0
       Resolution: Fixed

> Minicluster service status ambiguous when pids wrap around
> ----------------------------------------------------------
>
>                 Key: IMPALA-9241
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9241
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 3.4.0
>            Reporter: Joe McDonnell
>            Assignee: Joe McDonnell
>            Priority: Critical
>              Labels: broken-build
>             Fix For: Impala 3.4.0
>
>
> In a recent test run, a large number of tests failed because they could not contact the Kudu master. The failures produced stack traces like:
>  
> {noformat}
> query_test/test_acid.py:26: in <module>
>     from tests.common.skip import (SkipIfHive2, SkipIfCatalogV2, SkipIfS3, SkipIfABFS,
> common/skip.py:108: in <module>
>     class SkipIfKudu:
> common/skip.py:112: in SkipIfKudu
>     get_kudu_master_flag("--use_hybrid_clock") == "false",
> common/kudu_test_suite.py:59: in get_kudu_master_flag
>     varz = get_kudu_master_webpage("varz")
> common/kudu_test_suite.py:55: in get_kudu_master_webpage
>     return requests.get(url).text
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/api.py:69: in get
>     return request('get', url, params=params, **kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/api.py:50: in request
>     response = session.request(method=method, url=url, **kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/sessions.py:465: in request
>     resp = self.send(prep, **send_kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/sessions.py:573: in send
>     r = adapter.send(request, **kwargs)
> /home/user/Impala/infra/python/env/lib/python2.7/site-packages/requests/adapters.py:415: in send
>     raise ConnectionError(err, request=request)
> E   ConnectionError: ('Connection aborted.', error(111, 'Connection refused')){noformat}
> I checked the logs/cluster/cdh6-node-1/kudu/master directory. There was a log file from the minicluster startup for dataload, but none from the restart of the minicluster at the end of dataload (https://github.com/apache/impala/blob/fc4a91cf8c87966a910106dded7e7eb8d215270a/testdata/bin/create-load-data.sh#L717).
>  
> Interestingly, two of the tablet servers did start up as part of that restart, so the failure was not universal. In fact, quite a few tests ran fine.
> My theory is that this could be due to stale PIDs. When starting up one of the services covered by testdata/cluster/admin, it calls the status function to see if the service is already running (https://github.com/apache/impala/blob/master/testdata/cluster/admin#L423):
>  
> {noformat}
> if "$SCRIPT" status &>/dev/null; then
>   RUNNING=true
> else
>   RUNNING=false
> fi{noformat}
> If it is already RUNNING, it skips the startup. The status call is common across the different services and reads the PID from a file and checks to see if that PID is still running:
>  
>  
> {noformat}
> function status {
>   local PID=$(read_pid)
>   if [[ -z $PID ]]; then
>     echo Not started
>     return 1
>   fi
>   if pid_exists $PID; then
>     echo Running
>   else
>     echo Not Running
>     return 1
>   fi
> }{noformat}
> However, the pid file is not deleted when the service shuts down. If an unrelated process happens to be running with that pid when we try to start up, the status check reports the service as already running and the startup is skipped. Silently!
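The ambiguity can be reproduced in isolation. The sketch below is not the code from testdata/cluster/admin: `read_pid` and `pid_exists` are hypothetical stand-ins (the real helpers may be implemented differently), and the pid file is a temporary file. It writes the pid of an unrelated `sleep` process into the pid file and shows that a status check of this shape cannot tell it apart from the service.

```shell
# Hypothetical stand-ins for the helpers used by the status function;
# the real read_pid and pid_exists may be implemented differently.
PID_FILE=$(mktemp)

read_pid() {
  # Print the recorded pid, or nothing if the file is empty or missing.
  if [ -s "$PID_FILE" ]; then
    cat "$PID_FILE"
  fi
}

pid_exists() {
  # kill -0 sends no signal; it only tests whether the pid can be signalled.
  kill -0 "$1" 2>/dev/null
}

status() {
  PID=$(read_pid)
  if [ -z "$PID" ]; then
    echo "Not started"
    return 1
  fi
  if pid_exists "$PID"; then
    echo "Running"
  else
    echo "Not Running"
    return 1
  fi
}

# Simulate a stale pid file: the recorded pid belongs to an unrelated process.
sleep 60 &
UNRELATED_PID=$!
echo "$UNRELATED_PID" > "$PID_FILE"

# The check reports "Running" although no service was ever started.
STALE_RESULT=$(status) || true

# Once the unrelated process is gone, the same pid file reads as not running.
kill "$UNRELATED_PID"
wait "$UNRELATED_PID" 2>/dev/null || true
GONE_RESULT=$(status) || true

echo "with unrelated process alive: $STALE_RESULT"
echo "after it exits: $GONE_RESULT"
rm -f "$PID_FILE"
```

In the real failure the unrelated process would be whatever received the recycled pid after the pid counter wrapped around, rather than a process started on purpose.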
> This could apply to kudu, hdfs, kms, yarn, etc. It does not apply to hbase, hive, sentry, ranger, as those do not use this framework.
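One possible mitigation, sketched here under stated assumptions and not taken from the actual fix for this issue, is to remove the pid file whenever a service is stopped, so that a later status check sees "Not started" regardless of which process holds the old pid. The names `stop_service` and `PID_FILE` are hypothetical, and a `sleep` process stands in for the service.

```shell
# Hypothetical sketch: PID_FILE and stop_service are illustrative names,
# not the ones used by testdata/cluster/admin.
PID_FILE=$(mktemp)

pid_exists() {
  kill -0 "$1" 2>/dev/null
}

status() {
  if [ ! -s "$PID_FILE" ]; then
    echo "Not started"
    return 1
  fi
  if pid_exists "$(cat "$PID_FILE")"; then
    echo "Running"
  else
    echo "Not Running"
    return 1
  fi
}

stop_service() {
  if [ -s "$PID_FILE" ]; then
    pid=$(cat "$PID_FILE")
    if pid_exists "$pid"; then
      kill "$pid"
      # wait only reaps children of this shell; ignore failure otherwise.
      wait "$pid" 2>/dev/null || true
    fi
  fi
  # Always remove the pid file so a recycled pid cannot be mistaken
  # for this service on the next startup.
  rm -f "$PID_FILE"
}

# A sleep process stands in for the service.
sleep 60 &
echo "$!" > "$PID_FILE"

BEFORE_STOP=$(status) || true
stop_service
AFTER_STOP=$(status) || true

echo "before stop: $BEFORE_STOP"
echo "after stop: $AFTER_STOP"
```

This only covers clean shutdowns. A service that crashes would still leave its pid file behind, so checking that the recorded pid belongs to the expected program would be a complementary safeguard.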
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)