Posted to user@mesos.apache.org by Mohit Jaggi <mo...@uber.com> on 2017/10/27 18:19:53 UTC
orphan executor
Folks,
Often I see orphaned executors in my cluster. These are cases where
the framework was informed of the task loss, and so has forgotten about
the task as expected, but the (Docker) container is still around. AFAIK,
the Mesos agent is the only entity that has knowledge of these containers.
How do I ensure that they get cleaned up by the agent?
Mohit.
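To make the problem concrete, here is a minimal sketch of how such orphans
might be detected from the agent host, by comparing Docker's view of running
containers with the agent's /containers endpoint. The agent address, the
response fields, and the "mesos-" container-name prefix are assumptions and
may differ across Mesos versions:

import json
import subprocess
import urllib2

AGENT = "http://localhost:5051"  # assumed agent address

# Container IDs the agent still tracks (GET /containers on the agent).
known = set(c["container_id"]
            for c in json.load(urllib2.urlopen(AGENT + "/containers")))

# Docker containers whose names suggest they were launched by Mesos.
names = subprocess.check_output(
    ["docker", "ps", "--format", "{{.Names}}"]).splitlines()

for name in names:
    # The "mesos-" prefix is an assumption about the Docker containerizer's
    # naming scheme; adjust to whatever `docker ps` shows on your hosts.
    if name.startswith("mesos-") and not any(cid in name for cid in known):
        print("possible orphan: %s" % name)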
Re: orphan executor
Posted by Benjamin Mahler <bm...@apache.org>.
I filed one: https://issues.apache.org/jira/browse/MESOS-8167
It's a pretty significant effort, and hasn't been requested a lot, so it's
unlikely to be worked on for some time.
On Tue, Oct 31, 2017 at 8:18 PM, Mohit Jaggi <mo...@uber.com> wrote:
> :-)
> Is there a Jira ticket to track this? Any idea when this will be worked on?
>
> On Tue, Oct 31, 2017 at 5:22 PM, Benjamin Mahler <bm...@apache.org>
> wrote:
>
>> The question was posed merely to point out that there is no notion of the
>> executor "running away" currently, due to the answer I provided: there
>> isn't a complete lifecycle API for the executor. (This includes
>> healthiness, state updates, reconciliation, ability for scheduler to shut
>> it down, etc).
>>
>> On Tue, Oct 31, 2017 at 4:27 PM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> Good question.
>>> - I don't know what the interaction between mesos agent and executor is.
>>> Is there a health check?
>>> - There is a reconciliation between Mesos and Frameworks: will Mesos
>>> include the "orphan" executor in the list there, so framework can find
>>> runaways and kill them(using Mesos provided API)?
>>>
>>> On Tue, Oct 31, 2017 at 3:49 PM, Benjamin Mahler <bm...@apache.org>
>>> wrote:
>>>
>>>> What defines a runaway executor?
>>>>
>>>> Mesos does not know that this particular executor should self-terminate
>>>> within some reasonable time after its task terminates. In this case the
>>>> framework (Aurora) knows this expected behavior of Thermos and can clean up
>>>> ones that get stuck after the task terminates. However, we currently don't
>>>> provide a great executor lifecycle API to enable schedulers to do this
>>>> (it's long overdue).
>>>>
>>>> On Tue, Oct 31, 2017 at 2:47 PM, Mohit Jaggi <mo...@uber.com>
>>>> wrote:
>>>>
>>>>> I was asking if this can happen automatically.
>>>>>
>>>>> On Tue, Oct 31, 2017 at 2:41 PM, Benjamin Mahler <bm...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> You can kill it manually by SIGKILLing the executor process.
>>>>>> Using the agent API, you can launch a nested container session and
>>>>>> kill the executor. +jie,gilbert, is there a CLI command for 'exec'ing into
>>>>>> the container?
>>>>>>
>>>>>> On Tue, Oct 31, 2017 at 12:47 PM, Mohit Jaggi <mo...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes. There is a fix available now in Aurora/Thermos to try and exit
>>>>>>> in such scenarios. But I am curious to know if Mesos agent has the
>>>>>>> functionality to reap runaway executors.
>>>>>>>
>>>>>>> On Tue, Oct 31, 2017 at 12:08 PM, Benjamin Mahler <
>>>>>>> bmahler@apache.org> wrote:
>>>>>>>
>>>>>>>> Is my understanding correct that the Thermos transitions the task
>>>>>>>> to TASK_FAILED, but Thermos gets stuck and can't terminate itself? The
>>>>>>>> typical workflow for thermos, as a 1:1 task:executor approach, is that the
>>>>>>>> executor terminates itself after the task is terminal.
>>>>>>>>
>>>>>>>> The full logs of the agent during this window would help, it looks
>>>>>>>> like an agent termination is involved here as well?
>>>>>>>>
>>>>>>>> On Fri, Oct 27, 2017 at 3:09 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Here are some relevant logs. Aurora scheduler logs shows the task
>>>>>>>>> going from:
>>>>>>>>> INIT
>>>>>>>>> ->PENDING
>>>>>>>>> ->ASSIGNED
>>>>>>>>> ->STARTING
>>>>>>>>> ->RUNNING for a long time
>>>>>>>>> ->FAILED due to health check error, OSError: Resource temporarily
>>>>>>>>> unavailable (I think this is referring to running out of PID space, see
>>>>>>>>> thermos logs below)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --- mesos agent ---
>>>>>>>>>
>>>>>>>>> I1005 22:56:47.902153 127818 fetcher.cpp:285] Fetching directly into the sandbox directory
>>>>>>>>> I1005 22:56:47.902170 127818 fetcher.cpp:222] Fetching URI '/usr/bin/XXXXX'
>>>>>>>>> I1005 22:56:47.913270 127818 fetcher.cpp:207] Copied resource '/usr/bin/xxxxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>>>>>>>>> I1005 22:56:47.913331 127818 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>>>>>>>>> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>>>>>>>>> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>>>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>>> I1005 22:58:15.677225 7 exec.cpp:162] Version: 1.1.0
>>>>>>>>> I1005 22:58:15.680867 14 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>>> I1006 01:13:52.950552 39 exec.cpp:487] Agent exited, but framework has checkpointing enabled. Waiting 365days to reconnect with agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --- thermos (Aurora) ----
>>>>>>>>>
>>>>>>>>> I1023 19:03:05.765677 52364 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-d3c1c4d9-4d74-433a-b26a-8a88bb7687b8/runs/982e7236-fccd-40bc-a2a5-d8a1901cf0bf/fxxx'
>>>>>>>>> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>>>>>>>>> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>>>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>>> I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
>>>>>>>>> I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
>>>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>> File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>>>>>>>>> self.__real_run(*args, **kw)
>>>>>>>>> File "apache/thermos/monitoring/resource.py", line 243, in run
>>>>>>>>> File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
>>>>>>>>> thread.start()
>>>>>>>>> File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>>> _start_new_thread(self.__bootstrap, ())
>>>>>>>>> thread.error: can't start new thread
>>>>>>>>> ERROR] Failed to stop health checkers:
>>>>>>>>> ERROR] Traceback (most recent call last):
>>>>>>>>> File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
>>>>>>>>> propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
>>>>>>>>> File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>>>>>>>>> return deadline(*args, daemon=True, propagate=True, **kw)
>>>>>>>>> File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>>>>>>>>> AnonymousThread().start()
>>>>>>>>> File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>>> _start_new_thread(self.__bootstrap, ())
>>>>>>>>> error: can't start new thread
>>>>>>>>>
>>>>>>>>> ERROR] Failed to stop runner:
>>>>>>>>> ERROR] Traceback (most recent call last):
>>>>>>>>> File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
>>>>>>>>> propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
>>>>>>>>> File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>>>>>>>>> return deadline(*args, daemon=True, propagate=True, **kw)
>>>>>>>>> File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>>>>>>>>> AnonymousThread().start()
>>>>>>>>> File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>>> _start_new_thread(self.__bootstrap, ())
>>>>>>>>> error: can't start new thread
>>>>>>>>>
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>> File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>>>>>>>>> self.__real_run(*args, **kw)
>>>>>>>>> File "apache/aurora/executor/status_manager.py", line 62, in run
>>>>>>>>> File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
>>>>>>>>> File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
>>>>>>>>> deferred.start()
>>>>>>>>> File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>>> _start_new_thread(self.__bootstrap, ())
>>>>>>>>> thread.error: can't start new thread
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Oct 27, 2017 at 2:25 PM, Vinod Kone <vi...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Can you share the agent and executor logs of an example orphaned
>>>>>>>>>> executor? That would help us diagnose the issue.
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi <
>>>>>>>>>> mohit.jaggi@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Folks,
>>>>>>>>>>> Often I see some orphaned executors in my cluster. These are
>>>>>>>>>>> cases where the framework was informed of task loss, so has forgotten about
>>>>>>>>>>> them as expected, but the container(docker) is still around. AFAIK, Mesos
>>>>>>>>>>> agent is the only entity that has knowledge of these containers. How do I
>>>>>>>>>>> ensure that they get cleaned up by the agent?
>>>>>>>>>>>
>>>>>>>>>>> Mohit.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
Re: orphan executor
Posted by Mohit Jaggi <mo...@uber.com>.
:-)
Is there a Jira ticket to track this? Any idea when this will be worked on?
Re: orphan executor
Posted by Benjamin Mahler <bm...@apache.org>.
The question was posed merely to point out that there is currently no
notion of the executor "running away", due to the answer I provided: there
isn't a complete lifecycle API for the executor. (This includes health,
state updates, reconciliation, the ability for the scheduler to shut
it down, etc.)
Re: orphan executor
Posted by Mohit Jaggi <mo...@uber.com>.
Good question.
- I don't know what the interaction between the Mesos agent and the
executor is. Is there a health check?
- There is a reconciliation between Mesos and frameworks: will Mesos
include the "orphan" executor in that list, so the framework can find
runaways and kill them (using a Mesos-provided API)?
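For reference, explicit reconciliation can be requested through the v1
scheduler HTTP API, though it covers tasks rather than executors. A minimal
sketch, assuming an already-subscribed framework; the master address and
stream ID are placeholders, the framework ID is taken from the logs above,
and an empty task list asks for implicit reconciliation of all tasks:

import json
import urllib2

MASTER = "http://localhost:5050"  # assumed master address

call = {
    "framework_id": {"value": "20160112-010512-421372426-5050-73504-0000"},
    "type": "RECONCILE",
    "reconcile": {"tasks": []},  # empty list = implicit (full) reconciliation
}

req = urllib2.Request(
    MASTER + "/api/v1/scheduler",
    json.dumps(call),
    {
        "Content-Type": "application/json",
        # Required for non-SUBSCRIBE calls; returned on the SUBSCRIBE response.
        "Mesos-Stream-Id": "<stream-id-from-subscribe>",
    })
urllib2.urlopen(req)  # task status updates arrive on the SUBSCRIBE stream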
Re: orphan executor
Posted by Benjamin Mahler <bm...@apache.org>.
What defines a runaway executor?
Mesos does not know that this particular executor should self-terminate
within some reasonable time after its task terminates. In this case the
framework (Aurora) knows this expected behavior of Thermos and can clean up
ones that get stuck after the task terminates. However, we currently don't
provide a great executor lifecycle API to enable schedulers to do this
(it's long overdue).
Re: orphan executor
Posted by Mohit Jaggi <mo...@uber.com>.
I was asking if this can happen automatically.
Re: orphan executor
Posted by Benjamin Mahler <bm...@apache.org>.
You can kill it manually by SIGKILLing the executor process.
Using the agent API, you can launch a nested container session and kill the
executor. +jie,gilbert, is there a CLI command for 'exec'ing into the
container?
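For the manual route, here is a minimal host-side sketch. The
'thermos_executor' command-line pattern is an assumption; substitute
whatever identifies your executors, and verify the matches before
sending SIGKILL:
import os
import signal
PATTERN = 'thermos_executor'  # hypothetical matcher; adjust as needed
# Needs to run as root on the agent host.
for pid in os.listdir('/proc'):
    if not pid.isdigit():
        continue
    try:
        with open('/proc/%s/cmdline' % pid) as f:
            cmdline = f.read().replace('\0', ' ')
    except (IOError, OSError):
        continue  # the process exited while we were scanning
    if PATTERN in cmdline:
        print('SIGKILLing %s: %s' % (pid, cmdline))
        os.kill(int(pid), signal.SIGKILL)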
On Tue, Oct 31, 2017 at 12:47 PM, Mohit Jaggi <mo...@uber.com> wrote:
> Yes. There is a fix available now in Aurora/Thermos to try and exit in
> such scenarios. But I am curious to know whether the Mesos agent has the
> functionality to reap runaway executors.
>
> On Tue, Oct 31, 2017 at 12:08 PM, Benjamin Mahler <bm...@apache.org>
> wrote:
>
>> Is my understanding correct that Thermos transitions the task to
>> TASK_FAILED, but then gets stuck and can't terminate itself? The typical
>> workflow for Thermos, with its 1:1 task:executor model, is that the
>> executor terminates itself once the task is terminal.
>>
>> The full logs of the agent during this window would help; it looks like
>> an agent termination is involved here as well?
>>
>> On Fri, Oct 27, 2017 at 3:09 PM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> Here are some relevant logs. The Aurora scheduler logs show the task
>>> going from:
>>> INIT
>>> ->PENDING
>>> ->ASSIGNED
>>> ->STARTING
>>> ->RUNNING for a long time
>>> ->FAILED due to health check error, OSError: Resource temporarily
>>> unavailable (I think this is referring to running out of PID space, see
>>> thermos logs below)
>>>
>>>
>>> --- mesos agent ---
>>>
>>> I1005 22:56:47.902153 127818 fetcher.cpp:285] Fetching directly into the sandbox directory
>>> I1005 22:56:47.902170 127818 fetcher.cpp:222] Fetching URI '/usr/bin/XXXXX'
>>> I1005 22:56:47.913270 127818 fetcher.cpp:207] Copied resource '/usr/bin/xxxxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>>> I1005 22:56:47.913331 127818 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>>> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>>> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>>> Writing log files to disk in /mnt/mesos/sandbox
>>> I1005 22:58:15.677225 7 exec.cpp:162] Version: 1.1.0
>>> I1005 22:58:15.680867 14 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>> Writing log files to disk in /mnt/mesos/sandbox
>>> I1006 01:13:52.950552 39 exec.cpp:487] Agent exited, but framework has checkpointing enabled. Waiting 365days to reconnect with agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>>
>>>
>>>
>>> --- thermos (Aurora) ----
>>>
>>> 1 I1023 19:03:05.765677 52364 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-d3c1c4d9-4d74-433a-b26a-8a88bb7687b8/runs/982e7236-fccd-40bc-a2a5-d8a1901cf0bf/fxxx'
>>> 22 WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>>> 23 twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>>> 24 Writing log files to disk in /mnt/mesos/sandbox
>>> 25 I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
>>> 26 I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
>>> 27 Writing log files to disk in /mnt/mesos/sandbox
>>> 28 Traceback (most recent call last):
>>> 29 File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>>> 30 self.__real_run(*args, **kw)
>>> 31 File "apache/thermos/monitoring/resource.py", line 243, in run
>>> 32 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
>>> 33 thread.start()
>>> 34 File "/usr/lib/python2.7/threading.py", line 745, in start
>>> 35 _start_new_thread(self.__bootstrap, ())
>>> 36 thread.error: can't start new thread
>>> 37 ERROR] Failed to stop health checkers:
>>> 38 ERROR] Traceback (most recent call last):
>>> 39 File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
>>> 40 propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
>>> 41 File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>>> 42 return deadline(*args, daemon=True, propagate=True, **kw)
>>> 43 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>>> 44 AnonymousThread().start()
>>> 45 File "/usr/lib/python2.7/threading.py", line 745, in start
>>> 46 _start_new_thread(self.__bootstrap, ())
>>> 47 error: can't start new thread
>>> 48
>>> 49 ERROR] Failed to stop runner:
>>> 50 ERROR] Traceback (most recent call last):
>>> 51 File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
>>> 52 propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
>>> 53 File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>>> 54 return deadline(*args, daemon=True, propagate=True, **kw)
>>> 55 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>>> 56 AnonymousThread().start()
>>> 57 File "/usr/lib/python2.7/threading.py", line 745, in start
>>> 58 _start_new_thread(self.__bootstrap, ())
>>> 59 error: can't start new thread
>>> 60
>>> 61 Traceback (most recent call last):
>>> 62 File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>>> 63 self.__real_run(*args, **kw)
>>> 64 File "apache/aurora/executor/status_manager.py", line 62, in run
>>> 65 File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
>>> 66 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
>>> 67 deferred.start()
>>> 68 File "/usr/lib/python2.7/threading.py", line 745, in start
>>> 69 _start_new_thread(self.__bootstrap, ())
>>> 70 thread.error: can't start new thread
>>>
>>>
>>>
>>>
>>> On Fri, Oct 27, 2017 at 2:25 PM, Vinod Kone <vi...@apache.org>
>>> wrote:
>>>
>>>> Can you share the agent and executor logs of an example orphaned
>>>> executor? That would help us diagnose the issue.
>>>>
>>>> On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi <mo...@uber.com>
>>>> wrote:
>>>>
>>>>> Folks,
>>>>> Often I see some orphaned executors in my cluster. These are cases
>>>>> where the framework was informed of task loss, so has forgotten about them
>>>>> as expected, but the container (Docker) is still around. AFAIK, Mesos agent
>>>>> is the only entity that has knowledge of these containers. How do I ensure
>>>>> that they get cleaned up by the agent?
>>>>>
>>>>> Mohit.
>>>>>
>>>>
>>>>
>>>
>>
>
Re: orphan executor
Posted by Mohit Jaggi <mo...@uber.com>.
Yes. There is a fix available now in Aurora/Thermos to try and exit in such
scenarios. But I am curious to know whether the Mesos agent has the
functionality to reap runaway executors.
On Tue, Oct 31, 2017 at 12:08 PM, Benjamin Mahler <bm...@apache.org>
wrote:
> Is my understanding correct that Thermos transitions the task to
> TASK_FAILED, but then gets stuck and can't terminate itself? The typical
> workflow for Thermos, with its 1:1 task:executor model, is that the
> executor terminates itself once the task is terminal.
>
> The full logs of the agent during this window would help; it looks like an
> agent termination is involved here as well?
>
> On Fri, Oct 27, 2017 at 3:09 PM, Mohit Jaggi <mo...@uber.com> wrote:
>
>> Here are some relevant logs. The Aurora scheduler logs show the task
>> going from:
>> INIT
>> ->PENDING
>> ->ASSIGNED
>> ->STARTING
>> ->RUNNING for a long time
>> ->FAILED due to health check error, OSError: Resource temporarily
>> unavailable (I think this is referring to running out of PID space, see
>> thermos logs below)
>>
>>
>> --- mesos agent ---
>>
>> I1005 22:56:47.902153 127818 fetcher.cpp:285] Fetching directly into the sandbox directory
>> I1005 22:56:47.902170 127818 fetcher.cpp:222] Fetching URI '/usr/bin/XXXXX'
>> I1005 22:56:47.913270 127818 fetcher.cpp:207] Copied resource '/usr/bin/xxxxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>> I1005 22:56:47.913331 127818 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>> Writing log files to disk in /mnt/mesos/sandbox
>> I1005 22:58:15.677225 7 exec.cpp:162] Version: 1.1.0
>> I1005 22:58:15.680867 14 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>> Writing log files to disk in /mnt/mesos/sandbox
>> I1006 01:13:52.950552 39 exec.cpp:487] Agent exited, but framework has checkpointing enabled. Waiting 365days to reconnect with agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>
>>
>>
>> --- thermos (Aurora) ----
>>
>> 1 I1023 19:03:05.765677 52364 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-d3c1c4d9-4d74-433a-b26a-8a88bb7687b8/runs/982e7236-fccd-40bc-a2a5-d8a1901cf0bf/fxxx'
>> 22 WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>> 23 twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>> 24 Writing log files to disk in /mnt/mesos/sandbox
>> 25 I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
>> 26 I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
>> 27 Writing log files to disk in /mnt/mesos/sandbox
>> 28 Traceback (most recent call last):
>> 29 File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>> 30 self.__real_run(*args, **kw)
>> 31 File "apache/thermos/monitoring/resource.py", line 243, in run
>> 32 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
>> 33 thread.start()
>> 34 File "/usr/lib/python2.7/threading.py", line 745, in start
>> 35 _start_new_thread(self.__bootstrap, ())
>> 36 thread.error: can't start new thread
>> 37 ERROR] Failed to stop health checkers:
>> 38 ERROR] Traceback (most recent call last):
>> 39 File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
>> 40 propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
>> 41 File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>> 42 return deadline(*args, daemon=True, propagate=True, **kw)
>> 43 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>> 44 AnonymousThread().start()
>> 45 File "/usr/lib/python2.7/threading.py", line 745, in start
>> 46 _start_new_thread(self.__bootstrap, ())
>> 47 error: can't start new thread
>> 48
>> 49 ERROR] Failed to stop runner:
>> 50 ERROR] Traceback (most recent call last):
>> 51 File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
>> 52 propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
>> 53 File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>> 54 return deadline(*args, daemon=True, propagate=True, **kw)
>> 55 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>> 56 AnonymousThread().start()
>> 57 File "/usr/lib/python2.7/threading.py", line 745, in start
>> 58 _start_new_thread(self.__bootstrap, ())
>> 59 error: can't start new thread
>> 60
>> 61 Traceback (most recent call last):
>> 62 File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>> 63 self.__real_run(*args, **kw)
>> 64 File "apache/aurora/executor/status_manager.py", line 62, in run
>> 65 File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
>> 66 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
>> 67 deferred.start()
>> 68 File "/usr/lib/python2.7/threading.py", line 745, in start
>> 69 _start_new_thread(self.__bootstrap, ())
>> 70 thread.error: can't start new thread
>>
>>
>>
>>
>> On Fri, Oct 27, 2017 at 2:25 PM, Vinod Kone <vi...@apache.org> wrote:
>>
>>> Can you share the agent and executor logs of an example orphaned
>>> executor? That would help us diagnose the issue.
>>>
>>> On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi <mo...@uber.com>
>>> wrote:
>>>
>>>> Folks,
>>>> Often I see some orphaned executors in my cluster. These are cases
>>>> where the framework was informed of task loss, so has forgotten about them
>>>> as expected, but the container (Docker) is still around. AFAIK, Mesos agent
>>>> is the only entity that has knowledge of these containers. How do I ensure
>>>> that they get cleaned up by the agent?
>>>>
>>>> Mohit.
>>>>
>>>
>>>
>>
>
Re: orphan executor
Posted by Benjamin Mahler <bm...@apache.org>.
Is my understanding correct that Thermos transitions the task to
TASK_FAILED, but then gets stuck and can't terminate itself? The typical
workflow for Thermos, with its 1:1 task:executor model, is that the
executor terminates itself once the task is terminal.
The full logs of the agent during this window would help; it looks like an
agent termination is involved here as well?
On Fri, Oct 27, 2017 at 3:09 PM, Mohit Jaggi <mo...@uber.com> wrote:
> Here are some relevant logs. The Aurora scheduler logs show the task
> going from:
> INIT
> ->PENDING
> ->ASSIGNED
> ->STARTING
> ->RUNNING for a long time
> ->FAILED due to health check error, OSError: Resource temporarily
> unavailable (I think this is referring to running out of PID space, see
> thermos logs below)
>
>
> --- mesos agent ---
>
> I1005 22:56:47.902153 127818 fetcher.cpp:285] Fetching directly into the sandbox directory
> I1005 22:56:47.902170 127818 fetcher.cpp:222] Fetching URI '/usr/bin/XXXXX'
> I1005 22:56:47.913270 127818 fetcher.cpp:207] Copied resource '/usr/bin/xxxxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
> I1005 22:56:47.913331 127818 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
> Writing log files to disk in /mnt/mesos/sandbox
> I1005 22:58:15.677225 7 exec.cpp:162] Version: 1.1.0
> I1005 22:58:15.680867 14 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
> Writing log files to disk in /mnt/mesos/sandbox
> I1006 01:13:52.950552 39 exec.cpp:487] Agent exited, but framework has checkpointing enabled. Waiting 365days to reconnect with agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>
>
>
> --- thermos (Aurora) ----
>
> 1 I1023 19:03:05.765677 52364 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-d3c1c4d9-4d74-433a-b26a-8a88bb7687b8/runs/982e7236-fccd-40bc-a2a5-d8a1901cf0bf/fxxx'
> 22 WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
> 23 twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
> 24 Writing log files to disk in /mnt/mesos/sandbox
> 25 I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
> 26 I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
> 27 Writing log files to disk in /mnt/mesos/sandbox
> 28 Traceback (most recent call last):
> 29 File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
> 30 self.__real_run(*args, **kw)
> 31 File "apache/thermos/monitoring/resource.py", line 243, in run
> 32 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
> 33 thread.start()
> 34 File "/usr/lib/python2.7/threading.py", line 745, in start
> 35 _start_new_thread(self.__bootstrap, ())
> 36 thread.error: can't start new thread
> 37 ERROR] Failed to stop health checkers:
> 38 ERROR] Traceback (most recent call last):
> 39 File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
> 40 propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
> 41 File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
> 42 return deadline(*args, daemon=True, propagate=True, **kw)
> 43 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
> 44 AnonymousThread().start()
> 45 File "/usr/lib/python2.7/threading.py", line 745, in start
> 46 _start_new_thread(self.__bootstrap, ())
> 47 error: can't start new thread
> 48
> 49 ERROR] Failed to stop runner:
> 50 ERROR] Traceback (most recent call last):
> 51 File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
> 52 propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
> 53 File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
> 54 return deadline(*args, daemon=True, propagate=True, **kw)
> 55 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
> 56 AnonymousThread().start()
> 57 File "/usr/lib/python2.7/threading.py", line 745, in start
> 58 _start_new_thread(self.__bootstrap, ())
> 59 error: can't start new thread
> 60
> 61 Traceback (most recent call last):
> 62 File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
> 63 self.__real_run(*args, **kw)
> 64 File "apache/aurora/executor/status_manager.py", line 62, in run
> 65 File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
> 66 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
> 67 deferred.start()
> 68 File "/usr/lib/python2.7/threading.py", line 745, in start
> 69 _start_new_thread(self.__bootstrap, ())
> 70 thread.error: can't start new thread
>
>
>
>
> On Fri, Oct 27, 2017 at 2:25 PM, Vinod Kone <vi...@apache.org> wrote:
>
>> Can you share the agent and executor logs of an example orphaned
>> executor? That would help us diagnose the issue.
>>
>> On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> Folks,
>>> Often I see some orphaned executors in my cluster. These are cases where
>>> the framework was informed of task loss, so has forgotten about them as
>>> expected, but the container (Docker) is still around. AFAIK, Mesos agent is
>>> the only entity that has knowledge of these containers. How do I ensure
>>> that they get cleaned up by the agent?
>>>
>>> Mohit.
>>>
>>
>>
>
Re: orphan executor
Posted by Mohit Jaggi <mo...@uber.com>.
Here are some relevant logs. The Aurora scheduler logs show the task going
from:
INIT
->PENDING
->ASSIGNED
->STARTING
->RUNNING for a long time
->FAILED due to health check error, OSError: Resource temporarily
unavailable (I think this is referring to running out of PID space, see
thermos logs below)
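As an aside, that "can't start new thread" failure is easy to reproduce
once a process is near its thread/PID limit. A tiny sketch (hypothetical,
not from this incident) that starts daemon threads until creation fails
with the same error seen in the thermos logs below:
import threading
import time
threads = []
try:
    while True:
        t = threading.Thread(target=time.sleep, args=(3600,))
        t.setDaemon(True)  # Python 2 spelling; use daemon=True on Python 3
        t.start()
        threads.append(t)
except Exception as e:
    # Python 2 raises thread.error, Python 3 RuntimeError; both carry
    # the message "can't start new thread".
    print('thread creation failed after %d threads: %s' % (len(threads), e))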
--- mesos agent ---
I1005 22:56:47.902153 127818 fetcher.cpp:285] Fetching directly into
the sandbox directory
I1005 22:56:47.902170 127818 fetcher.cpp:222] Fetching URI '/usr/bin/XXXXX'
I1005 22:56:47.913270 127818 fetcher.cpp:207] Copied resource
'/usr/bin/xxxxx' to
'/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
I1005 22:56:47.913331 127818 fetcher.cpp:582] Fetched '/usr/bin/xxx'
to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
WARNING: Your kernel does not support swap limit capabilities, memory
limited without swap.
twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
Writing log files to disk in /mnt/mesos/sandbox
I1005 22:58:15.677225 7 exec.cpp:162] Version: 1.1.0
I1005 22:58:15.680867 14 exec.cpp:237] Executor registered on agent
b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
Writing log files to disk in /mnt/mesos/sandbox
I1006 01:13:52.950552 39 exec.cpp:487] Agent exited, but framework
has checkpointing enabled. Waiting 365days to reconnect with agent
b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
--- thermos (Aurora) ----
1 I1023 19:03:05.765677 52364 fetcher.cpp:582] Fetched '/usr/bin/xxx'
to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-d3c1c4d9-4d74-433a-b26a-8a88bb7687b8/runs/982e7236-fccd-40bc-a2a5-d8a1901cf0bf/fxxx'
22 WARNING: Your kernel does not support swap limit capabilities,
memory limited without swap.
23 twitter.common.app debug: Initializing: twitter.common.log
(Logging subsystem.)
24 Writing log files to disk in /mnt/mesos/sandbox
25 I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
26 I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on
agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
27 Writing log files to disk in /mnt/mesos/sandbox
28 Traceback (most recent call last):
29 File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
line 126, in _excepting_run
30 self.__real_run(*args, **kw)
31 File "apache/thermos/monitoring/resource.py", line 243, in run
32 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py",
line 79, in wait
33 thread.start()
34 File "/usr/lib/python2.7/threading.py", line 745, in start
35 _start_new_thread(self.__bootstrap, ())
36 thread.error: can't start new thread
37 ERROR] Failed to stop health checkers:
38 ERROR] Traceback (most recent call last):
39 File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
40 propagate_deadline(self._chained_checker.stop,
timeout=self.STOP_TIMEOUT)
41 File "apache/aurora/executor/aurora_executor.py", line 35, in
propagate_deadline
42 return deadline(*args, daemon=True, propagate=True, **kw)
43 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py",
line 61, in deadline
44 AnonymousThread().start()
45 File "/usr/lib/python2.7/threading.py", line 745, in start
46 _start_new_thread(self.__bootstrap, ())
47 error: can't start new thread
48
49 ERROR] Failed to stop runner:
50 ERROR] Traceback (most recent call last):
51 File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
52 propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
53 File "apache/aurora/executor/aurora_executor.py", line 35, in
propagate_deadline
54 return deadline(*args, daemon=True, propagate=True, **kw)
55 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py",
line 61, in deadline
56 AnonymousThread().start()
57 File "/usr/lib/python2.7/threading.py", line 745, in start
58 _start_new_thread(self.__bootstrap, ())
59 error: can't start new thread
60
61 Traceback (most recent call last):
62 File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
line 126, in _excepting_run
63 self.__real_run(*args, **kw)
64 File "apache/aurora/executor/status_manager.py", line 62, in run
65 File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
66 File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py",
line 56, in defer
67 deferred.start()
68 File "/usr/lib/python2.7/threading.py", line 745, in start
69 _start_new_thread(self.__bootstrap, ())
70 thread.error: can't start new thread
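One way to confirm the exhaustion theory on a live container is to compare
the pids cgroup's current count with its limit. A sketch assuming cgroup v1
mounted at /sys/fs/cgroup; 'docker/CONTAINER_ID' is a placeholder for
whatever cgroup path your containerizer assigns:
def pids_usage(cgroup):
    base = '/sys/fs/cgroup/pids/' + cgroup
    with open(base + '/pids.current') as f:
        current = int(f.read())
    with open(base + '/pids.max') as f:
        raw = f.read().strip()
    return current, (None if raw == 'max' else int(raw))
current, limit = pids_usage('docker/CONTAINER_ID')
print('pids: %s of %s' % (current, limit if limit else 'unlimited'))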
On Fri, Oct 27, 2017 at 2:25 PM, Vinod Kone <vi...@apache.org> wrote:
> Can you share the agent and executor logs of an example orphaned executor?
> That would help us diagnose the issue.
>
> On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi <mo...@uber.com> wrote:
>
>> Folks,
>> Often I see some orphaned executors in my cluster. These are cases where
>> the framework was informed of task loss, so has forgotten about them as
>> expected, but the container (Docker) is still around. AFAIK, Mesos agent is
>> the only entity that has knowledge of these containers. How do I ensure
>> that they get cleaned up by the agent?
>>
>> Mohit.
>>
>
>
Re: orphan executor
Posted by Vinod Kone <vi...@apache.org>.
Can you share the agent and executor logs of an example orphaned executor?
That would help us diagnose the issue.
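For the executor side, the sandbox stdout/stderr under the agent's work
directory is usually what we need. A sketch assuming the default
/var/lib/mesos work_dir that appears in the paths elsewhere in this thread
(Mesos keeps a 'latest' symlink pointing at each executor's most recent
run):
import glob
pattern = ('/var/lib/mesos/slaves/*/frameworks/*/executors/*/'
           'runs/latest/std*')
for path in sorted(glob.glob(pattern)):
    print(path)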
On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi <mo...@uber.com> wrote:
> Folks,
> Often I see some orphaned executors in my cluster. These are cases where
> the framework was informed of task loss, so has forgotten about them as
> expected, but the container (Docker) is still around. AFAIK, Mesos agent is
> the only entity that has knowledge of these containers. How do I ensure
> that they get cleaned up by the agent?
>
> Mohit.
>