You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@mesos.apache.org by Ajay V <aj...@gmail.com> on 2018/01/10 01:33:14 UTC

Mesos rare TASK_LOST scenario v 0.21.0

Hello,

I'm trying to debug a TASK_LOST thats generated on the agent that I see on
rare occasions.

Following is a log that I'm trying to understand. This is happening after
the driver.sendStatusUpdate() has been called with a task state of
TASK_FINISHED from a java executor. It looks to me like the container is
already exited before the TASK_FINISHED  is processed. Is there a timing
issue here in this version of mesos that is causing this? The effect of
this problem is that, even though the work of the executor is complete and
the executor calls the sendStatusUpdate with a TASK_FINISHED, the task is
marked as LOST and the actual update of TASK_FINISHED is ignored.

I0108 10:16:51.388300 37272 containerizer.cpp:1117] Executor for container
'bb0e5f2d-4bdb-479c-b829-4741993c4109' has exited

I0108 10:16:51.388741 37272 containerizer.cpp:946] Destroying container
'bb0e5f2d-4bdb-479c-b829-4741993c4109'

W0108 10:16:52.159241 37260 posix.hpp:192] No resource usage for unknown
container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'

W0108 10:16:52.803463 37255 containerizer.cpp:888] Skipping resource
statistic for container bb0e5f2d-4bdb-479c-b829-4741993c4109 because:
Failed to get usage: No process found at 28952

I0108 10:16:52.899657 37278 slave.cpp:2898] Executor
'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
20171208-050805-140555025-5050-3470-0000 exited with status 0

I0108 10:16:52.901736 37278 slave.cpp:2215] Handling status update
TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
ff631ad1-cfab-493e-be18-961581abcf3d of framework
20171208-050805-140555025-5050-3470-0000 from @0.0.0.0:0

I0108 10:16:52.901978 37278 slave.cpp:4305] Terminating task
ff631ad1-cfab-493e-be18-961581abcf3d

W0108 10:16:52.902793 37274 containerizer.cpp:852] Ignoring update for
unknown container: bb0e5f2d-4bdb-479c-b829-4741993c4109

I0108 10:16:52.903230 37274 status_update_manager.cpp:317] Received status
update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
ff631ad1-cfab-493e-be18-961581abcf3d of framework
20171208-050805-140555025-5050-3470-0000

I0108 10:16:52.904119 37274 status_update_manager.cpp:371] Forwarding
update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
ff631ad1-cfab-493e-be18-961581abcf3d of framework
20171208-050805-140555025-5050-3470-0000 to the slave

I0108 10:16:52.905725 37282 slave.cpp:2458] Forwarding the update TASK_LOST
(UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
ff631ad1-cfab-493e-be18-961581abcf3d of framework
20171208-050805-140555025-5050-3470-0000 to master@17.179.96.8:5050

I0108 10:16:52.906025 37282 slave.cpp:2385] Status update manager
successfully handled status update TASK_LOST (UUID:
f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
ff631ad1-cfab-493e-be18-961581abcf3d of framework
20171208-050805-140555025-5050-3470-0000

I0108 10:16:52.956588 37280 status_update_manager.cpp:389] Received status
update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for
task ff631ad1-cfab-493e-be18-961581abcf3d of framework
20171208-050805-140555025-5050-3470-0000

I0108 10:16:52.956841 37280 status_update_manager.cpp:525] Cleaning up
status update stream for task ff631ad1-cfab-493e-be18-961581abcf3d of
framework 20171208-050805-140555025-5050-3470-0000

I0108 10:16:52.957608 37268 slave.cpp:1800] Status update manager
successfully handled status update acknowledgement (UUID:
f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
ff631ad1-cfab-493e-be18-961581abcf3d of framework
20171208-050805-140555025-5050-3470-0000

I0108 10:16:52.958693 37268 slave.cpp:4344] Completing task
ff631ad1-cfab-493e-be18-961581abcf3d

I0108 10:16:52.960364 37268 slave.cpp:3007] Cleaning up executor
'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
20171208-050805-140555025-5050-3470-0000

Regards,
Ajay

Re: Mesos rare TASK_LOST scenario v 0.21.0

Posted by Vinod Kone <vi...@apache.org>.
The command executor was probably fixed somewhere between 0.21 and 1.3. The
only reason I mentioned 1.3+ is because any releases before that are out of
support period. If you can repro the issue with 1.3+ and paste the logs
here or in a JIRA, we can help debug it for you.

On Wed, Jan 10, 2018 at 9:47 AM, Ajay V <aj...@gmail.com> wrote:

> Thanks for getting back Vinod. So, does that mean that even for v1.2,
> these race conditions (where the command executor doesn't stay long enough
> ) existed and that 1.3 versions fixes them ?. Reason for asking is because
> I did try an upgrade to v1.2 and still found very similar issues.
>
> Regards,
> Ajay
>
> On Tue, Jan 9, 2018 at 6:48 PM, Vinod Kone <vi...@apache.org> wrote:
>
>> 0.21 is really old and not supported. I highly recommend you upgrade to
>> 1.3+.
>>
>> Regarding what you are seeing, we definitely had issues in the past where
>> the command executor didn't stay up long enough to guarantee that
>> TASK_FINISHED was delivered to the agent; so races like above were possible.
>>
>> On Tue, Jan 9, 2018 at 5:33 PM, Ajay V <aj...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I'm trying to debug a TASK_LOST thats generated on the agent that I see
>>> on rare occasions.
>>>
>>> Following is a log that I'm trying to understand. This is happening
>>> after the driver.sendStatusUpdate() has been called with a task state of
>>> TASK_FINISHED from a java executor. It looks to me like the container is
>>> already exited before the TASK_FINISHED  is processed. Is there a timing
>>> issue here in this version of mesos that is causing this? The effect of
>>> this problem is that, even though the work of the executor is complete and
>>> the executor calls the sendStatusUpdate with a TASK_FINISHED, the task is
>>> marked as LOST and the actual update of TASK_FINISHED is ignored.
>>>
>>> I0108 10:16:51.388300 37272 containerizer.cpp:1117] Executor for
>>> container 'bb0e5f2d-4bdb-479c-b829-4741993c4109' has exited
>>>
>>> I0108 10:16:51.388741 37272 containerizer.cpp:946] Destroying container
>>> 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>>
>>> W0108 10:16:52.159241 37260 posix.hpp:192] No resource usage for unknown
>>> container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>>
>>> W0108 10:16:52.803463 37255 containerizer.cpp:888] Skipping resource
>>> statistic for container bb0e5f2d-4bdb-479c-b829-4741993c4109 because:
>>> Failed to get usage: No process found at 28952
>>>
>>> I0108 10:16:52.899657 37278 slave.cpp:2898] Executor
>>> 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
>>> 20171208-050805-140555025-5050-3470-0000 exited with status 0
>>>
>>> I0108 10:16:52.901736 37278 slave.cpp:2215] Handling status update
>>> TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000 from @0.0.0.0:0
>>>
>>> I0108 10:16:52.901978 37278 slave.cpp:4305] Terminating task
>>> ff631ad1-cfab-493e-be18-961581abcf3d
>>>
>>> W0108 10:16:52.902793 37274 containerizer.cpp:852] Ignoring update for
>>> unknown container: bb0e5f2d-4bdb-479c-b829-4741993c4109
>>>
>>> I0108 10:16:52.903230 37274 status_update_manager.cpp:317] Received
>>> status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5)
>>> for task ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.904119 37274 status_update_manager.cpp:371] Forwarding
>>> update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000 to the slave
>>>
>>> I0108 10:16:52.905725 37282 slave.cpp:2458] Forwarding the update
>>> TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000 to master@17.179.96.8:5050
>>>
>>> I0108 10:16:52.906025 37282 slave.cpp:2385] Status update manager
>>> successfully handled status update TASK_LOST (UUID:
>>> f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.956588 37280 status_update_manager.cpp:389] Received
>>> status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5)
>>> for task ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.956841 37280 status_update_manager.cpp:525] Cleaning up
>>> status update stream for task ff631ad1-cfab-493e-be18-961581abcf3d of
>>> framework 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.957608 37268 slave.cpp:1800] Status update manager
>>> successfully handled status update acknowledgement (UUID:
>>> f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>>> 20171208-050805-140555025-5050-3470-0000
>>>
>>> I0108 10:16:52.958693 37268 slave.cpp:4344] Completing task
>>> ff631ad1-cfab-493e-be18-961581abcf3d
>>>
>>> I0108 10:16:52.960364 37268 slave.cpp:3007] Cleaning up executor
>>> 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
>>> 20171208-050805-140555025-5050-3470-0000
>>>
>>> Regards,
>>> Ajay
>>>
>>
>>
>

Re: Mesos rare TASK_LOST scenario v 0.21.0

Posted by Ajay V <aj...@gmail.com>.
Thanks for getting back Vinod. So, does that mean that even for v1.2, these
race conditions (where the command executor doesn't stay long enough )
existed and that 1.3 versions fixes them ?. Reason for asking is because I
did try an upgrade to v1.2 and still found very similar issues.

Regards,
Ajay

On Tue, Jan 9, 2018 at 6:48 PM, Vinod Kone <vi...@apache.org> wrote:

> 0.21 is really old and not supported. I highly recommend you upgrade to
> 1.3+.
>
> Regarding what you are seeing, we definitely had issues in the past where
> the command executor didn't stay up long enough to guarantee that
> TASK_FINISHED was delivered to the agent; so races like above were possible.
>
> On Tue, Jan 9, 2018 at 5:33 PM, Ajay V <aj...@gmail.com> wrote:
>
>> Hello,
>>
>> I'm trying to debug a TASK_LOST thats generated on the agent that I see
>> on rare occasions.
>>
>> Following is a log that I'm trying to understand. This is happening after
>> the driver.sendStatusUpdate() has been called with a task state of
>> TASK_FINISHED from a java executor. It looks to me like the container is
>> already exited before the TASK_FINISHED  is processed. Is there a timing
>> issue here in this version of mesos that is causing this? The effect of
>> this problem is that, even though the work of the executor is complete and
>> the executor calls the sendStatusUpdate with a TASK_FINISHED, the task is
>> marked as LOST and the actual update of TASK_FINISHED is ignored.
>>
>> I0108 10:16:51.388300 37272 containerizer.cpp:1117] Executor for
>> container 'bb0e5f2d-4bdb-479c-b829-4741993c4109' has exited
>>
>> I0108 10:16:51.388741 37272 containerizer.cpp:946] Destroying container
>> 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>
>> W0108 10:16:52.159241 37260 posix.hpp:192] No resource usage for unknown
>> container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>>
>> W0108 10:16:52.803463 37255 containerizer.cpp:888] Skipping resource
>> statistic for container bb0e5f2d-4bdb-479c-b829-4741993c4109 because:
>> Failed to get usage: No process found at 28952
>>
>> I0108 10:16:52.899657 37278 slave.cpp:2898] Executor
>> 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
>> 20171208-050805-140555025-5050-3470-0000 exited with status 0
>>
>> I0108 10:16:52.901736 37278 slave.cpp:2215] Handling status update
>> TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470-0000 from @0.0.0.0:0
>>
>> I0108 10:16:52.901978 37278 slave.cpp:4305] Terminating task
>> ff631ad1-cfab-493e-be18-961581abcf3d
>>
>> W0108 10:16:52.902793 37274 containerizer.cpp:852] Ignoring update for
>> unknown container: bb0e5f2d-4bdb-479c-b829-4741993c4109
>>
>> I0108 10:16:52.903230 37274 status_update_manager.cpp:317] Received
>> status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for
>> task ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470-0000
>>
>> I0108 10:16:52.904119 37274 status_update_manager.cpp:371] Forwarding
>> update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470-0000 to the slave
>>
>> I0108 10:16:52.905725 37282 slave.cpp:2458] Forwarding the update
>> TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470-0000 to master@17.179.96.8:5050
>>
>> I0108 10:16:52.906025 37282 slave.cpp:2385] Status update manager
>> successfully handled status update TASK_LOST (UUID:
>> f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470-0000
>>
>> I0108 10:16:52.956588 37280 status_update_manager.cpp:389] Received
>> status update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5)
>> for task ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470-0000
>>
>> I0108 10:16:52.956841 37280 status_update_manager.cpp:525] Cleaning up
>> status update stream for task ff631ad1-cfab-493e-be18-961581abcf3d of
>> framework 20171208-050805-140555025-5050-3470-0000
>>
>> I0108 10:16:52.957608 37268 slave.cpp:1800] Status update manager
>> successfully handled status update acknowledgement (UUID:
>> f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
>> ff631ad1-cfab-493e-be18-961581abcf3d of framework
>> 20171208-050805-140555025-5050-3470-0000
>>
>> I0108 10:16:52.958693 37268 slave.cpp:4344] Completing task
>> ff631ad1-cfab-493e-be18-961581abcf3d
>>
>> I0108 10:16:52.960364 37268 slave.cpp:3007] Cleaning up executor
>> 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
>> 20171208-050805-140555025-5050-3470-0000
>>
>> Regards,
>> Ajay
>>
>
>

Re: Mesos rare TASK_LOST scenario v 0.21.0

Posted by Vinod Kone <vi...@apache.org>.
0.21 is really old and not supported. I highly recommend you upgrade to
1.3+.

Regarding what you are seeing, we definitely had issues in the past where
the command executor didn't stay up long enough to guarantee that
TASK_FINISHED was delivered to the agent; so races like above were possible.

On Tue, Jan 9, 2018 at 5:33 PM, Ajay V <aj...@gmail.com> wrote:

> Hello,
>
> I'm trying to debug a TASK_LOST thats generated on the agent that I see on
> rare occasions.
>
> Following is a log that I'm trying to understand. This is happening after
> the driver.sendStatusUpdate() has been called with a task state of
> TASK_FINISHED from a java executor. It looks to me like the container is
> already exited before the TASK_FINISHED  is processed. Is there a timing
> issue here in this version of mesos that is causing this? The effect of
> this problem is that, even though the work of the executor is complete and
> the executor calls the sendStatusUpdate with a TASK_FINISHED, the task is
> marked as LOST and the actual update of TASK_FINISHED is ignored.
>
> I0108 10:16:51.388300 37272 containerizer.cpp:1117] Executor for container
> 'bb0e5f2d-4bdb-479c-b829-4741993c4109' has exited
>
> I0108 10:16:51.388741 37272 containerizer.cpp:946] Destroying container
> 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>
> W0108 10:16:52.159241 37260 posix.hpp:192] No resource usage for unknown
> container 'bb0e5f2d-4bdb-479c-b829-4741993c4109'
>
> W0108 10:16:52.803463 37255 containerizer.cpp:888] Skipping resource
> statistic for container bb0e5f2d-4bdb-479c-b829-4741993c4109 because:
> Failed to get usage: No process found at 28952
>
> I0108 10:16:52.899657 37278 slave.cpp:2898] Executor
> 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
> 20171208-050805-140555025-5050-3470-0000 exited with status 0
>
> I0108 10:16:52.901736 37278 slave.cpp:2215] Handling status update
> TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
> ff631ad1-cfab-493e-be18-961581abcf3d of framework
> 20171208-050805-140555025-5050-3470-0000 from @0.0.0.0:0
>
> I0108 10:16:52.901978 37278 slave.cpp:4305] Terminating task
> ff631ad1-cfab-493e-be18-961581abcf3d
>
> W0108 10:16:52.902793 37274 containerizer.cpp:852] Ignoring update for
> unknown container: bb0e5f2d-4bdb-479c-b829-4741993c4109
>
> I0108 10:16:52.903230 37274 status_update_manager.cpp:317] Received status
> update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
> ff631ad1-cfab-493e-be18-961581abcf3d of framework
> 20171208-050805-140555025-5050-3470-0000
>
> I0108 10:16:52.904119 37274 status_update_manager.cpp:371] Forwarding
> update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
> ff631ad1-cfab-493e-be18-961581abcf3d of framework
> 20171208-050805-140555025-5050-3470-0000 to the slave
>
> I0108 10:16:52.905725 37282 slave.cpp:2458] Forwarding the update
> TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task
> ff631ad1-cfab-493e-be18-961581abcf3d of framework
> 20171208-050805-140555025-5050-3470-0000 to master@17.179.96.8:5050
>
> I0108 10:16:52.906025 37282 slave.cpp:2385] Status update manager
> successfully handled status update TASK_LOST (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5)
> for task ff631ad1-cfab-493e-be18-961581abcf3d of framework
> 20171208-050805-140555025-5050-3470-0000
>
> I0108 10:16:52.956588 37280 status_update_manager.cpp:389] Received status
> update acknowledgement (UUID: f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for
> task ff631ad1-cfab-493e-be18-961581abcf3d of framework
> 20171208-050805-140555025-5050-3470-0000
>
> I0108 10:16:52.956841 37280 status_update_manager.cpp:525] Cleaning up
> status update stream for task ff631ad1-cfab-493e-be18-961581abcf3d of
> framework 20171208-050805-140555025-5050-3470-0000
>
> I0108 10:16:52.957608 37268 slave.cpp:1800] Status update manager
> successfully handled status update acknowledgement (UUID:
> f2bf0aa2-d465-4ced-8cea-06bc1d3f38c5) for task ff631ad1-cfab-493e-be18-961581abcf3d
> of framework 20171208-050805-140555025-5050-3470-0000
>
> I0108 10:16:52.958693 37268 slave.cpp:4344] Completing task
> ff631ad1-cfab-493e-be18-961581abcf3d
>
> I0108 10:16:52.960364 37268 slave.cpp:3007] Cleaning up executor
> 'ff631ad1-cfab-493e-be18-961581abcf3d' of framework
> 20171208-050805-140555025-5050-3470-0000
>
> Regards,
> Ajay
>