Posted to issues@aurora.apache.org by "Bill Farner (JIRA)" <ji...@apache.org> on 2014/05/06 21:54:27 UTC

[jira] [Resolved] (AURORA-204) unavailable username causes hung executor

     [ https://issues.apache.org/jira/browse/AURORA-204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bill Farner resolved AURORA-204.
--------------------------------

    Resolution: Fixed

> unavailable username causes hung executor
> -----------------------------------------
>
>                 Key: AURORA-204
>                 URL: https://issues.apache.org/jira/browse/AURORA-204
>             Project: Aurora
>          Issue Type: Bug
>          Components: Executor
>            Reporter: brian wickman
>            Assignee: brian wickman
>            Priority: Critical
>             Fix For: 0.5.0
>
>
> Reported by [~vinodkone]
> It looks like a bunch of tasks for job "balexandrescu/devel/skyfall" are stuck in STARTING.
> Digging into one of the instances, it seems the user doesn't exist anymore. The strange thing is that the executor never exited or aborted.
> {noformat}
> I0209 01:05:07.244604 38799 slave.cpp:1757] Handling status update TASK_STARTING (UUID: 735bf1b2-64b6-4cda-a8f0-826820e82480) for task 1391907903048-balexandrescu-devel-skyfall-0-f1771e87-10a4-409a-8855-db29ffe4bc22 of framework 201108030032-0000000011-0000 from executor(1)@10.35.5.131:46234
> I0209 01:05:07.245381 38799 status_update_manager.cpp:314] Received status update TASK_STARTING (UUID: 735bf1b2-64b6-4cda-a8f0-826820e82480) for task 1391907903048-balexandrescu-devel-skyfall-0-f1771e87-10a4-409a-8855-db29ffe4bc22 of framework 201108030032-0000000011-0000
> I0209 01:05:07.246099 38799 status_update_manager.hpp:342] Checkpointing UPDATE for status update TASK_STARTING (UUID: 735bf1b2-64b6-4cda-a8f0-826820e82480) for task 1391907903048-balexandrescu-devel-skyfall-0-f1771e87-10a4-409a-8855-db29ffe4bc22 of framework 201108030032-0000000011-0000
> I0209 01:05:07.733875 38799 status_update_manager.cpp:367] Forwarding status update TASK_STARTING (UUID: 735bf1b2-64b6-4cda-a8f0-826820e82480) for task 1391907903048-balexandrescu-devel-skyfall-0-f1771e87-10a4-409a-8855-db29ffe4bc22 of framework 201108030032-0000000011-0000 to master@10.35.9.132:5050
> I0209 01:05:07.735437 38799 slave.cpp:1882] Sending acknowledgement for status update TASK_STARTING (UUID: 735bf1b2-64b6-4cda-a8f0-826820e82480) for task 1391907903048-balexandrescu-devel-skyfall-0-f1771e87-10a4-409a-8855-db29ffe4bc22 of framework 201108030032-0000000011-0000 to executor(1)@10.35.5.131:46234
> I0209 01:05:07.743939 38799 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 735bf1b2-64b6-4cda-a8f0-826820e82480) for task 1391907903048-balexandrescu-devel-skyfall-0-f1771e87-10a4-409a-8855-db29ffe4bc22 of framework 201108030032-0000000011-0000
> I0209 01:05:07.795338 38799 status_update_manager.hpp:342] Checkpointing ACK for status update TASK_STARTING (UUID: 735bf1b2-64b6-4cda-a8f0-826820e82480) for task 1391907903048-balexandrescu-devel-skyfall-0-f1771e87-10a4-409a-8855-db29ffe4bc22 of framework 201108030032-0000000011-0000
> I0209 01:10:00.185257 38784 slave.cpp:1005] Asked to kill task 1391907903048-balexandrescu-devel-skyfall-0-f1771e87-10a4-409a-8855-db29ffe4bc22 of framework 201108030032-0000000011-0000
> $ ps aux | grep 44286
> vinod 4511 0.0 0.0 61248 780 pts/0 S+ 07:55 0:00 grep 44286
> root 44286 0.0 0.0 544024 35716 ? Ssl 01:05 0:20 python2.6 ./thermos_executor
> $ cat /var/lib/mesos/slaves/201309131923-1829643018-5050-22101-3/frameworks/201108030032-0000000011-0000/executors/thermos-1391907903048-balexandrescu-devel-skyfall-0-f1771e87-10a4-409a-8855-db29ffe4bc22/runs/a149e178-ff35-4544-b2ef-620e2ee3f586/stderr
> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
> Writing log files to disk in executor_logs
> twitter.common.app debug: Initializing: twitter.common_internal.app.modules.chickadee_handler (Chickadee exception handler.)
> Traceback (most recent call last):
> File "/var/lib/mesos/slaves/201309131923-1829643018-5050-22101-3/frameworks/201108030032-0000000011-0000/executors/thermos-1391907903048-balexandrescu-devel-skyfall-0-f1771e87-10a4-409a-8855-db29ffe4bc22/runs/a149e178-ff35-4544-b2ef-620e2ee3f586/thermos_executor/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
> File "/var/lib/mesos/slaves/201309131923-1829643018-5050-22101-3/frameworks/201108030032-0000000011-0000/executors/thermos-1391907903048-balexandrescu-devel-skyfall-0-f1771e87-10a4-409a-8855-db29ffe4bc22/runs/a149e178-ff35-4544-b2ef-620e2ee3f586/thermos_executor/twitter/common/concurrent/deferred.py", line 43, in run
> File "/root/.pex/install/apache.aurora.executor-0.5.0_DEV1391543893-py2.6.egg.56d773b76d0356d74134a338f0c58eb205d540ba/apache.aurora.executor-0.5.0_DEV1391543893-py2.6.egg/apache/aurora/executor/thermos_executor.py", line 280, in <lambda>
> defer(lambda: self._run(driver, assigned_task, mesos_task))
> File "/root/.pex/install/apache.aurora.executor-0.5.0_DEV1391543893-py2.6.egg.56d773b76d0356d74134a338f0c58eb205d540ba/apache.aurora.executor-0.5.0_DEV1391543893-py2.6.egg/apache/aurora/executor/thermos_executor.py", line 113, in _run
> if not self._initialize_sandbox(driver, assigned_task):
> File "/root/.pex/install/apache.aurora.executor-0.5.0_DEV1391543893-py2.6.egg.56d773b76d0356d74134a338f0c58eb205d540ba/apache.aurora.executor-0.5.0_DEV1391543893-py2.6.egg/apache/aurora/executor/thermos_executor.py", line 143, in _initialize_sandbox
> daemon=True, propagate=True)
> File "/var/lib/mesos/slaves/201309131923-1829643018-5050-22101-3/frameworks/201108030032-0000000011-0000/executors/thermos-1391907903048-balexandrescu-devel-skyfall-0-f1771e87-10a4-409a-8855-db29ffe4bc22/runs/a149e178-ff35-4544-b2ef-620e2ee3f586/thermos_executor/twitter/common/concurrent/deadline.py", line 68, in deadline
> KeyError: 'getpwnam(): name not found: balexandrescu'
> ERROR] Asked to kill task with incomplete sandbox - aborting runner start
> {noformat}
> N.B. [~wickman] This is caused by an untrapped KeyError raised by getpwnam. We should fix the original issue, but should also probably wrap some of these critical operations in a try/finally so the executor cannot hang when one of them fails.
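
The note above suggests two things: trapping the KeyError from getpwnam, and guarding critical startup steps with try/finally. A minimal sketch of that pattern follows. It is not the actual fix committed for AURORA-204; SandboxError, resolve_user, run_task, and the report_failure/shutdown callbacks are hypothetical names used only for illustration. The only real API relied on is the standard library's pwd.getpwnam, which raises KeyError when the user name is not found.

```python
import pwd


class SandboxError(Exception):
    """Hypothetical error type for failures during sandbox setup."""


def resolve_user(username, lookup=pwd.getpwnam):
    """Return (uid, gid) for username, or raise SandboxError if it is unknown."""
    try:
        entry = lookup(username)
    except KeyError as e:
        # pwd.getpwnam raises KeyError when the name is not in the password database.
        raise SandboxError('Unknown user %r: %s' % (username, e))
    return entry.pw_uid, entry.pw_gid


def run_task(username, report_failure, shutdown, lookup=pwd.getpwnam):
    """Hypothetical startup path: report failure and always shut down if startup fails."""
    started = False
    try:
        resolve_user(username, lookup)
        # ... the remaining sandbox and runner setup would go here ...
        started = True
    except SandboxError as e:
        # Surface the failure instead of leaving the task stuck in STARTING.
        report_failure(str(e))
    finally:
        # Runs on every exit path, including unexpected exceptions.
        if not started:
            shutdown()
    return started
```

With this shape, a missing user produces an explicit failure report followed by a shutdown, and the finally clause still triggers the shutdown if some other, unanticipated exception escapes the setup steps.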



--
This message was sent by Atlassian JIRA
(v6.2#6252)