Posted to user@mesos.apache.org by Benjamin Mahler <bm...@apache.org> on 2017/11/01 00:22:18 UTC
Re: orphan executor
The question was posed merely to point out that there is no notion of the
executor "running away" currently, due to the answer I provided: there
isn't a complete lifecycle API for the executor. (This includes
healthiness, state updates, reconciliation, the ability for the scheduler to
shut it down, etc.)
On Tue, Oct 31, 2017 at 4:27 PM, Mohit Jaggi <mo...@uber.com> wrote:
> Good question.
> - I don't know what the interaction between mesos agent and executor is.
> Is there a health check?
> - There is a reconciliation between Mesos and frameworks: will Mesos
> include the "orphan" executor in the list there, so the framework can find
> runaways and kill them (using a Mesos-provided API)?
>
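(For context on the reconciliation question above: explicit reconciliation goes through the v1 scheduler API as a RECONCILE call. A minimal sketch of the request body follows; the framework, task, and agent IDs are made up for illustration, and the exact field names should be checked against the scheduler API docs.)

```python
import json

def reconcile_request(framework_id, tasks):
    """Build a v1 scheduler API RECONCILE call body.

    `tasks` is a list of (task_id, agent_id) pairs; an empty list asks the
    master for implicit reconciliation of all tasks it knows about. Note
    this reconciles *tasks*, not executors, which is the gap discussed here.
    """
    return {
        "framework_id": {"value": framework_id},
        "type": "RECONCILE",
        "reconcile": {
            "tasks": [
                {"task_id": {"value": t}, "agent_id": {"value": a}}
                for t, a in tasks
            ]
        },
    }

# Hypothetical IDs; a framework would POST this JSON to
# http://<master>:5050/api/v1/scheduler on its subscribed connection.
body = reconcile_request(
    "20160112-010512-421372426-5050-73504-0000",
    [("thermos-xxx-2", "b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540")],
)
print(json.dumps(body))
```

Because reconciliation only covers tasks, an executor whose tasks are all terminal simply drops out of this picture, which is why a stuck Thermos is invisible to the scheduler.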
> On Tue, Oct 31, 2017 at 3:49 PM, Benjamin Mahler <bm...@apache.org>
> wrote:
>
>> What defines a runaway executor?
>>
>> Mesos does not know that this particular executor should self-terminate
>> within some reasonable time after its task terminates. In this case the
>> framework (Aurora) knows this expected behavior of Thermos and can clean up
>> ones that get stuck after the task terminates. However, we currently don't
>> provide a great executor lifecycle API to enable schedulers to do this
>> (it's long overdue).
>>
>> On Tue, Oct 31, 2017 at 2:47 PM, Mohit Jaggi <mo...@uber.com>
>> wrote:
>>
>>> I was asking if this can happen automatically.
>>>
>>> On Tue, Oct 31, 2017 at 2:41 PM, Benjamin Mahler <bm...@apache.org>
>>> wrote:
>>>
>>>> You can kill it manually by SIGKILLing the executor process.
>>>> Using the agent API, you can launch a nested container session and kill
>>>> the executor. +jie,gilbert, is there a CLI command for 'exec'ing into the
>>>> container?
>>>>
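(To make the agent-API route mentioned above concrete: the v1 agent API can launch a debug session inside the executor's container and signal the executor from there. The sketch below only builds the call body; the parent container ID is taken from the sandbox path in the logs purely as an example, and the `thermos_executor` process name is an assumption.)

```python
import json
import uuid

def kill_via_nested_session(parent_container_id, pattern="thermos_executor"):
    """Sketch of a v1 agent API LAUNCH_NESTED_CONTAINER_SESSION body that
    runs a shell inside the executor's container and SIGKILLs the executor.

    POST to http://<agent>:5051/api/v1. `parent_container_id` is the
    executor's run ID (the directory under .../runs/ in the sandbox path);
    the nested container gets a fresh ID of its own.
    """
    return {
        "type": "LAUNCH_NESTED_CONTAINER_SESSION",
        "launch_nested_container_session": {
            "container_id": {
                "parent": {"value": parent_container_id},
                "value": str(uuid.uuid4()),
            },
            "command": {
                "shell": True,
                # Match the executor by name; a hypothetical pattern.
                "value": "pkill -KILL -f %s" % pattern,
            },
        },
    }

body = kill_via_nested_session("bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95")
print(json.dumps(body))
```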
>>>> On Tue, Oct 31, 2017 at 12:47 PM, Mohit Jaggi <mo...@uber.com>
>>>> wrote:
>>>>
>>>>> Yes. There is a fix available now in Aurora/Thermos to try to exit in
>>>>> such scenarios. But I am curious to know whether the Mesos agent has the
>>>>> functionality to reap runaway executors.
>>>>>
>>>>> On Tue, Oct 31, 2017 at 12:08 PM, Benjamin Mahler <bm...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Is my understanding correct that Thermos transitions the task to
>>>>>> TASK_FAILED, but then gets stuck and can't terminate itself? The typical
>>>>>> workflow for Thermos, as a 1:1 task:executor approach, is that the executor
>>>>>> terminates itself after the task is terminal.
>>>>>>
>>>>>> The full logs of the agent during this window would help; it looks
>>>>>> like an agent termination is involved here as well?
>>>>>>
>>>>>> On Fri, Oct 27, 2017 at 3:09 PM, Mohit Jaggi <mo...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Here are some relevant logs. The Aurora scheduler log shows the task
>>>>>>> going through:
>>>>>>> INIT
>>>>>>> ->PENDING
>>>>>>> ->ASSIGNED
>>>>>>> ->STARTING
>>>>>>> ->RUNNING for a long time
>>>>>>> ->FAILED due to health check error, OSError: Resource temporarily
>>>>>>> unavailable (I think this is referring to running out of PID space, see
>>>>>>> thermos logs below)
>>>>>>>
>>>>>>>
>>>>>>> --- mesos agent ---
>>>>>>>
>>>>>>> I1005 22:56:47.902153 127818 fetcher.cpp:285] Fetching directly into the sandbox directory
>>>>>>> I1005 22:56:47.902170 127818 fetcher.cpp:222] Fetching URI '/usr/bin/XXXXX'
>>>>>>> I1005 22:56:47.913270 127818 fetcher.cpp:207] Copied resource '/usr/bin/xxxxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>>>>>>> I1005 22:56:47.913331 127818 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-caa0744d-fffd-446e-9f97-05bd84a32b54/runs/bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95/xxx'
>>>>>>> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>>>>>>> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>> I1005 22:58:15.677225 7 exec.cpp:162] Version: 1.1.0
>>>>>>> I1005 22:58:15.680867 14 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>> I1006 01:13:52.950552 39 exec.cpp:487] Agent exited, but framework has checkpointing enabled. Waiting 365days to reconnect with agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S1540
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --- thermos (Aurora) ----
>>>>>>>
>>>>>>> I1023 19:03:05.765677 52364 fetcher.cpp:582] Fetched '/usr/bin/xxx' to '/var/lib/mesos/slaves/b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295/frameworks/20160112-010512-421372426-5050-73504-0000/executors/thermos-xxx-2-d3c1c4d9-4d74-433a-b26a-8a88bb7687b8/runs/982e7236-fccd-40bc-a2a5-d8a1901cf0bf/fxxx'
>>>>>>> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>>>>>>> twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>> I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
>>>>>>> I1023 19:04:32.264870 42 exec.cpp:237] Executor registered on agent b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
>>>>>>> Writing log files to disk in /mnt/mesos/sandbox
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>>>>>>>     self.__real_run(*args, **kw)
>>>>>>>   File "apache/thermos/monitoring/resource.py", line 243, in run
>>>>>>>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py", line 79, in wait
>>>>>>>     thread.start()
>>>>>>>   File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>     _start_new_thread(self.__bootstrap, ())
>>>>>>> thread.error: can't start new thread
>>>>>>> ERROR] Failed to stop health checkers:
>>>>>>> ERROR] Traceback (most recent call last):
>>>>>>>   File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
>>>>>>>     propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
>>>>>>>   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>>>>>>>     return deadline(*args, daemon=True, propagate=True, **kw)
>>>>>>>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>>>>>>>     AnonymousThread().start()
>>>>>>>   File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>     _start_new_thread(self.__bootstrap, ())
>>>>>>> error: can't start new thread
>>>>>>>
>>>>>>> ERROR] Failed to stop runner:
>>>>>>> ERROR] Traceback (most recent call last):
>>>>>>>   File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
>>>>>>>     propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
>>>>>>>   File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline
>>>>>>>     return deadline(*args, daemon=True, propagate=True, **kw)
>>>>>>>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py", line 61, in deadline
>>>>>>>     AnonymousThread().start()
>>>>>>>   File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>     _start_new_thread(self.__bootstrap, ())
>>>>>>> error: can't start new thread
>>>>>>>
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>>>>>>>     self.__real_run(*args, **kw)
>>>>>>>   File "apache/aurora/executor/status_manager.py", line 62, in run
>>>>>>>   File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
>>>>>>>   File "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py", line 56, in defer
>>>>>>>     deferred.start()
>>>>>>>   File "/usr/lib/python2.7/threading.py", line 745, in start
>>>>>>>     _start_new_thread(self.__bootstrap, ())
>>>>>>> thread.error: can't start new thread
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 27, 2017 at 2:25 PM, Vinod Kone <vi...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Can you share the agent and executor logs of an example orphaned
>>>>>>>> executor? That would help us diagnose the issue.
>>>>>>>>
>>>>>>>> On Fri, Oct 27, 2017 at 8:19 PM, Mohit Jaggi <mo...@uber.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Folks,
>>>>>>>>> Often I see some orphaned executors in my cluster. These are cases
>>>>>>>>> where the framework was informed of task loss, so it has forgotten about
>>>>>>>>> them as expected, but the container (Docker) is still around. AFAIK, the
>>>>>>>>> Mesos agent is the only entity that has knowledge of these containers. How
>>>>>>>>> do I ensure that they get cleaned up by the agent?
>>>>>>>>>
>>>>>>>>> Mohit.
>>>>>>>>>
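(One way an out-of-band cleanup job could approximate the reaping asked about above: compare the Docker containers actually running on a host against the containers the agent still knows about, and remove the leftovers. The sketch below shows only the pure comparison logic; the `mesos-<agentID>.<containerID>` naming scheme and the idea of sourcing known IDs from the agent's /containers endpoint are stated as assumptions, and the sample names are made up.)

```python
def find_orphans(docker_names, agent_container_ids):
    """Return Docker container names that look Mesos-managed but are
    unknown to the agent.

    `docker_names` would come from `docker ps --format '{{.Names}}'`;
    `agent_container_ids` from the agent's container listing.
    """
    known = set(agent_container_ids)
    orphans = []
    for name in docker_names:
        if not name.startswith("mesos-"):
            continue  # not managed by the Mesos Docker containerizer
        # Assumed naming scheme: mesos-<agentID>.<containerID>
        container_id = name.rsplit(".", 1)[-1]
        if container_id not in known:
            orphans.append(name)
    return orphans

running = [
    "mesos-S1540.bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95",  # stuck executor
    "mesos-S1540.0aa0e1d2-1111-2222-3333-444455556666",  # healthy executor
    "unrelated-service",                                  # not Mesos-managed
]
print(find_orphans(running, ["0aa0e1d2-1111-2222-3333-444455556666"]))
# → ['mesos-S1540.bb904e1d-4c32-4d7a-b1b6-9b3f78ddfe95']
```

A cron job could feed real `docker ps` output through this and `docker rm -f` whatever comes back, though as the thread notes, the in-band fix is really a proper executor lifecycle API.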
Re: orphan executor
Posted by Benjamin Mahler <bm...@apache.org>.
I filed one: https://issues.apache.org/jira/browse/MESOS-8167
It's a pretty significant effort, and hasn't been requested a lot, so it's
unlikely to be worked on for some time.
On Tue, Oct 31, 2017 at 8:18 PM, Mohit Jaggi <mo...@uber.com> wrote:
> :-)
> Is there a Jira ticket to track this? Any idea when this will be worked on?
Re: orphan executor
Posted by Mohit Jaggi <mo...@uber.com>.
:-)
Is there a Jira ticket to track this? Any idea when this will be worked on?
On Tue, Oct 31, 2017 at 5:22 PM, Benjamin Mahler <bm...@apache.org> wrote:
> The question was posed merely to point out that there is no notion of the
> executor "running away" currently, due to the answer I provided: there
> isn't a complete lifecycle API for the executor. (This includes
> healthiness, state updates, reconciliation, ability for scheduler to shut
> it down, etc).