Posted to dev@mesos.apache.org by Varun Gupta <va...@uber.com> on 2018/04/09 19:10:54 UTC

Re: Question on status update retry in agent

Hi,

We are running into an issue with the slave status update manager. Below is
the behavior I am seeing.

Our use case: we run a stateful container (a Cassandra process). The
executor polls the JMX port at a 60-second interval to get the Cassandra
state and sends it to the agent -> master -> framework.

*A RUNNING Cassandra process translates to TASK_RUNNING.*
*A CRASHED or DRAINED Cassandra process translates to TASK_FAILED.*
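
For context, the executor-side loop is roughly the following (a simplified
C++ sketch; CassandraMode, pollJmx and the send callback are stand-ins for
our executor and the driver's status update call, not the real code):

#include <chrono>
#include <functional>
#include <thread>

// Hypothetical Cassandra process modes as reported over JMX.
enum class CassandraMode { RUNNING, CRASHED, DRAINED };

// Task states used purely for illustration.
enum class TaskState { TASK_RUNNING, TASK_FAILED };

// Translate the polled Cassandra mode into a task state.
TaskState toTaskState(CassandraMode mode)
{
  return mode == CassandraMode::RUNNING ? TaskState::TASK_RUNNING
                                        : TaskState::TASK_FAILED;
}

// Poll JMX every 60 seconds and hand each translated state to `send`,
// which stands in for the executor driver's status update call.
void pollLoop(std::function<CassandraMode()> pollJmx,
              std::function<void(TaskState)> send)
{
  while (true) {
    send(toTaskState(pollJmx()));
    std::this_thread::sleep_for(std::chrono::seconds(60));
  }
}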

At some point the slave has multiple TASK_RUNNING status updates in the
stream, followed by a TASK_FAILED, because acknowledgements are still
pending. We use explicit acknowledgements, and I see the Mesos master
receive all of the TASK_RUNNING updates and then the TASK_FAILED; the
framework likewise receives all TASK_RUNNING updates followed by
TASK_FAILED. After receiving TASK_FAILED, the framework restarts a
different executor on the same machine using the old persistent volume.

The issue is that *the slave still holds a reference to the old executor*
(presumably because it did not receive an acknowledgement, even though the
master and scheduler have processed the status updates), so it continues
to retry TASK_RUNNING indefinitely. The old executor process is no longer
running, while the new executor process is running and continues to work
as-is. This makes me believe there is a bug in the slave status update
manager.

I read the slave status update manager code: recover
<https://github.com/apache/mesos/blob/master/src/slave/task_status_update_manager.cpp#L203>
has a constraint
<https://github.com/apache/mesos/blob/master/src/slave/task_status_update_manager.cpp#L239>
to ignore status updates from the stream if the last executor run has
completed.

I think a similar constraint should apply to status update
<https://github.com/apache/mesos/blob/master/src/slave/task_status_update_manager.cpp#L318>
and acknowledge
<https://github.com/apache/mesos/blob/master/src/slave/task_status_update_manager.cpp#L760>,
roughly as sketched below.
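
Conceptually, the guard I have in mind would look something like this
(stand-in types and names, not the actual Mesos code):

#include <set>
#include <string>

// Stand-in for the real Mesos ContainerID type; illustration only.
using ContainerID = std::string;

struct StatusUpdate { ContainerID containerId; /* uuid, state, ... */ };

// Executor runs the agent already knows are complete (recover() keeps an
// equivalent notion and skips their streams on restart).
std::set<ContainerID> completedExecutorRuns;

// Proposed guard: drop incoming updates (and, symmetrically, acks) that
// refer to an executor run that has already completed.
bool shouldIgnore(const StatusUpdate& update)
{
  return completedExecutorRuns.count(update.containerId) > 0;
}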


Thanks,
Varun

On Fri, Mar 16, 2018 at 7:47 PM, Benjamin Mahler <bm...@apache.org> wrote:

> (1) Assuming you're referring to the scheduler's acknowledgement of a
> status update, the agent will not forward TS2 until TS1 has been
> acknowledged. So, TS2 will not be acknowledged before TS1 is acknowledged.
> FWICT, we'll ignore any violation of this ordering and log a warning.
>
> (2) To reverse the question, why would it make sense to ignore them?
> Assuming you're looking to reduce the number of round trips needed for
> schedulers to see the terminal update, I would point you to:
> https://issues.apache.org/jira/browse/MESOS-6941
>
> (3) When the agent sees an executor terminate, it will transition all
> non-terminal tasks assigned to that executor to TASK_GONE (partition aware
> framework) or TASK_LOST (non partition aware framework) or TASK_FAILED (if
> the container OOMed). There may be other cases, it looks a bit convoluted
> to me.
>
> On Thu, Mar 15, 2018 at 10:35 AM, Zhitao Li <zh...@gmail.com> wrote:
>
> > Hi,
> >
> > While designing the correct behavior with one of our framework, we
> > encounters some questions about behavior of status update:
> >
> > The executor continuously polls the workload probe to get current mode of
> > workload (a Cassandra server), and send various status update states
> > (STARTING, RUNNING, FAILED, etc).
> >
> > Executor polls every 30 seconds, and sends status update. Here, we are
> > seeing congestion on task update acknowledgements somewhere (still
> > unknown).
> >
> > There are three scenarios that we want to understand.
> >
> >    1. Agent queue has task update TS1, TS2 & TS3 (in this order) waiting
> on
> >    acknowledgement. Suppose if TS2 receives an acknowledgement, then what
> > will
> >    happen to TS1 update in the queue.
> >
> >
> >    1. Agent queue has task update TS1, TS2, TS3 & TASK_FAILED. Here, TS1,
> >    TS2, TS3 are non-terminial updates. Once the agent has received a
> > terminal
> >    status update, does it makes sense to ignore non-terminal updates in
> the
> >    queue?
> >
> >
> >    1. As per Executor Driver code comment
> >    <https://github.com/apache/mesos/blob/master/src/java/src/
> > org/apache/mesos/ExecutorDriver.java#L86>,
> >    if the executor is terminated, does agent send TASK_LOST? If so, does
> it
> >    send once or for each unacknowledged status update?
> >
> >
> > I'll study the code in status update manager and agent separately but
> some
> > official answer will definitely help.
> >
> > Many thanks!
> >
> > --
> > Cheers,
> >
> > Zhitao Li
> >
>

Re: Question on status update retry in agent

Posted by Benjamin Mahler <bm...@apache.org>.
I'm not following what the bug is.

The code you pointed to is called from here:
https://github.com/apache/mesos/blob/1.4.0/src/slave/status_update_manager.cpp#L762-L776

That is where we ignore duplicates and also ensure that the ack matches
the latest update we've sent. So from the code you pointed to, I don't see
an issue yet.
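
In rough terms, the ack path behaves like this (a standalone C++ model for
illustration, not the actual agent code):

#include <deque>
#include <set>
#include <string>

using UUID = std::string;

struct StatusUpdate { UUID uuid; /* state, task id, ... */ };

struct Stream
{
  std::deque<StatusUpdate> pending;  // updates not yet acknowledged
  std::set<UUID> acknowledged;       // acks already processed
};

enum class AckResult { OK, DUPLICATE, MISMATCH };

// Reject duplicate acks, and reject acks that don't match the update at
// the head of the queue (i.e. the latest one we forwarded); only a
// matching ack advances the stream.
AckResult acknowledge(Stream& stream, const UUID& ackUuid)
{
  if (stream.acknowledged.count(ackUuid) > 0) {
    return AckResult::DUPLICATE;
  }
  if (stream.pending.empty() || stream.pending.front().uuid != ackUuid) {
    return AckResult::MISMATCH;
  }
  stream.acknowledged.insert(ackUuid);
  stream.pending.pop_front();
  return AckResult::OK;
}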

Which update does the duplicate ack for 67f548b4-96cb-4b57-8720-2c8a4ba347e8
refer to?
The logs you provided don't show when the status updates arrive for these;
can you include that?

On Mon, Apr 16, 2018 at 2:08 PM, Varun Gupta <va...@uber.com> wrote:

> We use explicit ack from Scheduler.
>
> Here, is a snippet of the logs. Please see logs for Status Update UUID:
> a918f5ed-a604-415a-ad62-5a34fb6334ef
>
> W0416 00:41:25.843505 124530 status_update_manager.cpp:761] Duplicate
> status update acknowledgment (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8)
> for update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for
> task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190
>
> E0416 00:41:25.843559 124542 slave.cpp:2951] Failed to handle status update
> acknowledgement (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8) for task
> node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190: Duplicate acknowledgement
>
> W0416 00:41:28.416702 124539 status_update_manager.cpp:478] Resending
> status update TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e)
> for
> task node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190
>
> I0416 00:41:28.416910 124530 slave.cpp:4051] Forwarding the update
> TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for task
> node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190 to master@10.160.41.62:5050
>
> W0416 00:41:28.425237 124537 status_update_manager.cpp:478] Resending
> status update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef)
> for
> task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190
>
> I0416 00:41:28.425361 124525 slave.cpp:4051] Forwarding the update
> TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for task
> node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190 to master@10.160.41.62:5050
>
> I0416 00:41:36.723466 124518 status_update_manager.cpp:395] Received status
> update acknowledgement (UUID: b99b15d6-d9af-4395-9c46-709f91db90e1) for
> task node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190
>
> W0416 00:41:36.723588 124518 status_update_manager.cpp:761] Duplicate
> status update acknowledgment (UUID: b99b15d6-d9af-4395-9c46-709f91db90e1)
> for update TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for
> task node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190
>
> E0416 00:41:36.723650 124518 slave.cpp:2951] Failed to handle status update
> acknowledgement (UUID: b99b15d6-d9af-4395-9c46-709f91db90e1) for task
> node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190: Duplicate acknowledgement
>
> I0416 00:41:36.730357 124527 status_update_manager.cpp:395] Received status
> update acknowledgement (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8) for
> task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190
>
> W0416 00:41:36.730417 124527 status_update_manager.cpp:761] Duplicate
> status update acknowledgment (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8)
> for update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for
> task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190
>
> E0416 00:41:36.730465 124518 slave.cpp:2951] Failed to handle status update
> acknowledgement (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8) for task
> node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190: Duplicate acknowledgement
>
> I0416 00:41:43.769918 124532 http.cpp:277] HTTP GET for
> /slave(1)/state.json from 10.65.5.13:42600 with
> User-Agent='filebundle-agent'
>
> W0416 00:41:48.417913 124539 status_update_manager.cpp:478] Resending
> status update TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e)
> for
> task node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190
>
> I0416 00:41:48.418166 124538 slave.cpp:4051] Forwarding the update
> TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for task
> node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190 to master@10.160.41.62:5050
>
> W0416 00:41:48.425532 124539 status_update_manager.cpp:478] Resending
> status update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef)
> for
> task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
> 5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190
>
>
> I discussed with @zhitao. Here, is a plausible explanation of the bug.
> Executor sends, updates to agent every 60 seconds, and agents maintains
> them in pending
> <https://github.com/apache/mesos/blob/master/src/slave/task_
> status_update_manager.hpp#L173>
> queue. Now when the ack comes, they can be in any order for the status
> update but _handle section pops
> <https://github.com/apache/mesos/blob/master/src/slave/task_
> status_update_manager.cpp#L888>
> last update from the queue without making sure, ack was for that status
> update.
>
>
>
>
> On Tue, Apr 10, 2018 at 6:43 PM, Benjamin Mahler <bm...@apache.org>
> wrote:
>
> > Do you have logs? Which acknowledgements did the agent receive? Which
> > TASK_RUNNING in the sequence was it re-sending?
> >
> > On Tue, Apr 10, 2018 at 6:41 PM, Benjamin Mahler <bm...@apache.org>
> > wrote:
> >
> > > > Issue is that, *old executor reference is hold by slave* (assuming it
> > > did not receive acknowledgement, whereas master and scheduler have
> > > processed the status updates), so it  continues to retry TASK_RUNNING
> > > infinitely.
> > >
> > > The agent only retries so long as it does not get an acknowledgement,
> is
> > > the scheduler acknowledging the duplicates updates or ignoring them?
> > >
> > > On Mon, Apr 9, 2018 at 12:10 PM, Varun Gupta <va...@uber.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> We are running into an issue with slave status update manager. Below
> is
> > >> the
> > >> behavior I am seeing.
> > >>
> > >> Our use case is, we run Stateful container (Cassandra process), here
> > >> Executor polls JMX port at 60 second interval to get Cassandra State
> and
> > >> sends the state to agent -> master -> framework.
> > >>
> > >> *RUNNING Cassandra Process translates to TASK_RUNNING.*
> > >> *CRASHED or DRAINED Cassandra Process translates to TASK_FAILED.*
> > >>
> > >> At some point slave has multiple TASK_RUNNING status updates in stream
> > and
> > >> then followed by TASK_FAILED if acknowledgements are pending. We use
> > >> explicit acknowledgements, and I see Mesos Master receives, all
> > >> TASK_RUNNING and then TASK_FAILED as well as Framework also receives
> all
> > >> TASK_RUNNING updates followed up TASK_FAILED. After receiving
> > TASK_FAILED,
> > >> Framework restarts different executor on same machine using old
> > persistent
> > >> volume.
> > >>
> > >> Issue is that, *old executor reference is hold by slave* (assuming it
> > did
> > >> not receive acknowledgement, whereas master and scheduler have
> processed
> > >> the status updates), so it  continues to retry TASK_RUNNING
> infinitely.
> > >> Here, old executor process is not running. As well as new executor
> > process
> > >> is running, and continues to work as-is. This makes be believe, some
> bug
> > >> with slave status update manager.
> > >>
> > >> I read slave status update manager code, recover
> > >> <https://github.com/apache/mesos/blob/master/src/slave/task_
> > >> status_update_manager.cpp#L203>
> > >>  has a constraint
> > >> <https://github.com/apache/mesos/blob/master/src/slave/task_
> > >> status_update_manager.cpp#L239>
> > >>  to ignore status updates from stream if the last executor run is
> > >> completed.
> > >>
> > >> I think, similar constraint should be applicable for status update
> > >> <https://github.com/apache/mesos/blob/master/src/slave/task_
> > >> status_update_manager.cpp#L318>
> > >>  and acknowledge
> > >> <https://github.com/apache/mesos/blob/master/src/slave/task_
> > >> status_update_manager.cpp#L760>
> > >> .
> > >>
> > >>
> > >> Thanks,
> > >> Varun
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Fri, Mar 16, 2018 at 7:47 PM, Benjamin Mahler <bm...@apache.org>
> > >> wrote:
> > >>
> > >> > (1) Assuming you're referring to the scheduler's acknowledgement of
> a
> > >> > status update, the agent will not forward TS2 until TS1 has been
> > >> > acknowledged. So, TS2 will not be acknowledged before TS1 is
> > >> acknowledged.
> > >> > FWICT, we'll ignore any violation of this ordering and log a
> warning.
> > >> >
> > >> > (2) To reverse the question, why would it make sense to ignore them?
> > >> > Assuming you're looking to reduce the number of round trips needed
> for
> > >> > schedulers to see the terminal update, I would point you to:
> > >> > https://issues.apache.org/jira/browse/MESOS-6941
> > >> >
> > >> > (3) When the agent sees an executor terminate, it will transition
> all
> > >> > non-terminal tasks assigned to that executor to TASK_GONE (partition
> > >> aware
> > >> > framework) or TASK_LOST (non partition aware framework) or
> TASK_FAILED
> > >> (if
> > >> > the container OOMed). There may be other cases, it looks a bit
> > >> convoluted
> > >> > to me.
> > >> >
> > >> > On Thu, Mar 15, 2018 at 10:35 AM, Zhitao Li <zh...@gmail.com>
> > >> wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > While designing the correct behavior with one of our framework, we
> > >> > > encounters some questions about behavior of status update:
> > >> > >
> > >> > > The executor continuously polls the workload probe to get current
> > >> mode of
> > >> > > workload (a Cassandra server), and send various status update
> states
> > >> > > (STARTING, RUNNING, FAILED, etc).
> > >> > >
> > >> > > Executor polls every 30 seconds, and sends status update. Here, we
> > are
> > >> > > seeing congestion on task update acknowledgements somewhere (still
> > >> > > unknown).
> > >> > >
> > >> > > There are three scenarios that we want to understand.
> > >> > >
> > >> > >    1. Agent queue has task update TS1, TS2 & TS3 (in this order)
> > >> waiting
> > >> > on
> > >> > >    acknowledgement. Suppose if TS2 receives an acknowledgement,
> then
> > >> what
> > >> > > will
> > >> > >    happen to TS1 update in the queue.
> > >> > >
> > >> > >
> > >> > >    1. Agent queue has task update TS1, TS2, TS3 & TASK_FAILED.
> Here,
> > >> TS1,
> > >> > >    TS2, TS3 are non-terminial updates. Once the agent has
> received a
> > >> > > terminal
> > >> > >    status update, does it makes sense to ignore non-terminal
> updates
> > >> in
> > >> > the
> > >> > >    queue?
> > >> > >
> > >> > >
> > >> > >    1. As per Executor Driver code comment
> > >> > >    <https://github.com/apache/mesos/blob/master/src/java/src/
> > >> > > org/apache/mesos/ExecutorDriver.java#L86>,
> > >> > >    if the executor is terminated, does agent send TASK_LOST? If
> so,
> > >> does
> > >> > it
> > >> > >    send once or for each unacknowledged status update?
> > >> > >
> > >> > >
> > >> > > I'll study the code in status update manager and agent separately
> > but
> > >> > some
> > >> > > official answer will definitely help.
> > >> > >
> > >> > > Many thanks!
> > >> > >
> > >> > > --
> > >> > > Cheers,
> > >> > >
> > >> > > Zhitao Li
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Question on status update retry in agent

Posted by Varun Gupta <va...@uber.com>.
We use explicit acks from the scheduler.

Here is a snippet of the logs. Please see the entries for status update UUID:
a918f5ed-a604-415a-ad62-5a34fb6334ef

W0416 00:41:25.843505 124530 status_update_manager.cpp:761] Duplicate
status update acknowledgment (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8)
for update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for
task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

E0416 00:41:25.843559 124542 slave.cpp:2951] Failed to handle status update
acknowledgement (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8) for task
node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190: Duplicate acknowledgement

W0416 00:41:28.416702 124539 status_update_manager.cpp:478] Resending
status update TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for
task node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

I0416 00:41:28.416910 124530 slave.cpp:4051] Forwarding the update
TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for task
node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190 to master@10.160.41.62:5050

W0416 00:41:28.425237 124537 status_update_manager.cpp:478] Resending
status update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for
task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

I0416 00:41:28.425361 124525 slave.cpp:4051] Forwarding the update
TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for task
node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190 to master@10.160.41.62:5050

I0416 00:41:36.723466 124518 status_update_manager.cpp:395] Received status
update acknowledgement (UUID: b99b15d6-d9af-4395-9c46-709f91db90e1) for
task node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

W0416 00:41:36.723588 124518 status_update_manager.cpp:761] Duplicate
status update acknowledgment (UUID: b99b15d6-d9af-4395-9c46-709f91db90e1)
for update TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for
task node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

E0416 00:41:36.723650 124518 slave.cpp:2951] Failed to handle status update
acknowledgement (UUID: b99b15d6-d9af-4395-9c46-709f91db90e1) for task
node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190: Duplicate acknowledgement

I0416 00:41:36.730357 124527 status_update_manager.cpp:395] Received status
update acknowledgement (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8) for
task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

W0416 00:41:36.730417 124527 status_update_manager.cpp:761] Duplicate
status update acknowledgment (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8)
for update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for
task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

E0416 00:41:36.730465 124518 slave.cpp:2951] Failed to handle status update
acknowledgement (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8) for task
node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190: Duplicate acknowledgement

I0416 00:41:43.769918 124532 http.cpp:277] HTTP GET for
/slave(1)/state.json from 10.65.5.13:42600 with
User-Agent='filebundle-agent'

W0416 00:41:48.417913 124539 status_update_manager.cpp:478] Resending
status update TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for
task node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

I0416 00:41:48.418166 124538 slave.cpp:4051] Forwarding the update
TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for task
node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190 to master@10.160.41.62:5050

W0416 00:41:48.425532 124539 status_update_manager.cpp:478] Resending
status update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for
task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190


I discussed this with @zhitao. Here is a plausible explanation of the bug:
the executor sends updates to the agent every 60 seconds, and the agent
maintains them in the pending
<https://github.com/apache/mesos/blob/master/src/slave/task_status_update_manager.hpp#L173>
queue. The acks can come back in any order relative to the status updates,
but the _handle section pops
<https://github.com/apache/mesos/blob/master/src/slave/task_status_update_manager.cpp#L888>
an update from the queue without making sure the ack was for that status
update. A sketch of what we suspect is below.
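
In other words (a standalone model of the suspected behavior, not the real
code; checkedHandle is what we would expect instead):

#include <deque>
#include <string>

using UUID = std::string;

struct StatusUpdate { UUID uuid; };

// Suspected behavior: when an ack arrives, the handler pops whatever is
// at the front of the pending queue without comparing the ack UUID
// against that update.
void suspectedHandle(std::deque<StatusUpdate>& pending, const UUID& ackUuid)
{
  (void) ackUuid;  // never checked against pending.front().uuid
  if (!pending.empty()) {
    pending.pop_front();
  }
}

// What we would expect instead: only pop when the ack matches the update
// at the head of the queue; otherwise leave the queue alone.
bool checkedHandle(std::deque<StatusUpdate>& pending, const UUID& ackUuid)
{
  if (!pending.empty() && pending.front().uuid == ackUuid) {
    pending.pop_front();
    return true;
  }
  return false;
}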




On Tue, Apr 10, 2018 at 6:43 PM, Benjamin Mahler <bm...@apache.org> wrote:

> Do you have logs? Which acknowledgements did the agent receive? Which
> TASK_RUNNING in the sequence was it re-sending?
>
> On Tue, Apr 10, 2018 at 6:41 PM, Benjamin Mahler <bm...@apache.org>
> wrote:
>
> > > Issue is that, *old executor reference is hold by slave* (assuming it
> > did not receive acknowledgement, whereas master and scheduler have
> > processed the status updates), so it  continues to retry TASK_RUNNING
> > infinitely.
> >
> > The agent only retries so long as it does not get an acknowledgement, is
> > the scheduler acknowledging the duplicates updates or ignoring them?
> >
> > On Mon, Apr 9, 2018 at 12:10 PM, Varun Gupta <va...@uber.com> wrote:
> >
> >> Hi,
> >>
> >> We are running into an issue with slave status update manager. Below is
> >> the
> >> behavior I am seeing.
> >>
> >> Our use case is, we run Stateful container (Cassandra process), here
> >> Executor polls JMX port at 60 second interval to get Cassandra State and
> >> sends the state to agent -> master -> framework.
> >>
> >> *RUNNING Cassandra Process translates to TASK_RUNNING.*
> >> *CRASHED or DRAINED Cassandra Process translates to TASK_FAILED.*
> >>
> >> At some point slave has multiple TASK_RUNNING status updates in stream
> and
> >> then followed by TASK_FAILED if acknowledgements are pending. We use
> >> explicit acknowledgements, and I see Mesos Master receives, all
> >> TASK_RUNNING and then TASK_FAILED as well as Framework also receives all
> >> TASK_RUNNING updates followed up TASK_FAILED. After receiving
> TASK_FAILED,
> >> Framework restarts different executor on same machine using old
> persistent
> >> volume.
> >>
> >> Issue is that, *old executor reference is hold by slave* (assuming it
> did
> >> not receive acknowledgement, whereas master and scheduler have processed
> >> the status updates), so it  continues to retry TASK_RUNNING infinitely.
> >> Here, old executor process is not running. As well as new executor
> process
> >> is running, and continues to work as-is. This makes be believe, some bug
> >> with slave status update manager.
> >>
> >> I read slave status update manager code, recover
> >> <https://github.com/apache/mesos/blob/master/src/slave/task_
> >> status_update_manager.cpp#L203>
> >>  has a constraint
> >> <https://github.com/apache/mesos/blob/master/src/slave/task_
> >> status_update_manager.cpp#L239>
> >>  to ignore status updates from stream if the last executor run is
> >> completed.
> >>
> >> I think, similar constraint should be applicable for status update
> >> <https://github.com/apache/mesos/blob/master/src/slave/task_
> >> status_update_manager.cpp#L318>
> >>  and acknowledge
> >> <https://github.com/apache/mesos/blob/master/src/slave/task_
> >> status_update_manager.cpp#L760>
> >> .
> >>
> >>
> >> Thanks,
> >> Varun
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Mar 16, 2018 at 7:47 PM, Benjamin Mahler <bm...@apache.org>
> >> wrote:
> >>
> >> > (1) Assuming you're referring to the scheduler's acknowledgement of a
> >> > status update, the agent will not forward TS2 until TS1 has been
> >> > acknowledged. So, TS2 will not be acknowledged before TS1 is
> >> acknowledged.
> >> > FWICT, we'll ignore any violation of this ordering and log a warning.
> >> >
> >> > (2) To reverse the question, why would it make sense to ignore them?
> >> > Assuming you're looking to reduce the number of round trips needed for
> >> > schedulers to see the terminal update, I would point you to:
> >> > https://issues.apache.org/jira/browse/MESOS-6941
> >> >
> >> > (3) When the agent sees an executor terminate, it will transition all
> >> > non-terminal tasks assigned to that executor to TASK_GONE (partition
> >> aware
> >> > framework) or TASK_LOST (non partition aware framework) or TASK_FAILED
> >> (if
> >> > the container OOMed). There may be other cases, it looks a bit
> >> convoluted
> >> > to me.
> >> >
> >> > On Thu, Mar 15, 2018 at 10:35 AM, Zhitao Li <zh...@gmail.com>
> >> wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > While designing the correct behavior with one of our framework, we
> >> > > encounters some questions about behavior of status update:
> >> > >
> >> > > The executor continuously polls the workload probe to get current
> >> mode of
> >> > > workload (a Cassandra server), and send various status update states
> >> > > (STARTING, RUNNING, FAILED, etc).
> >> > >
> >> > > Executor polls every 30 seconds, and sends status update. Here, we
> are
> >> > > seeing congestion on task update acknowledgements somewhere (still
> >> > > unknown).
> >> > >
> >> > > There are three scenarios that we want to understand.
> >> > >
> >> > >    1. Agent queue has task update TS1, TS2 & TS3 (in this order)
> >> waiting
> >> > on
> >> > >    acknowledgement. Suppose if TS2 receives an acknowledgement, then
> >> what
> >> > > will
> >> > >    happen to TS1 update in the queue.
> >> > >
> >> > >
> >> > >    1. Agent queue has task update TS1, TS2, TS3 & TASK_FAILED. Here,
> >> TS1,
> >> > >    TS2, TS3 are non-terminial updates. Once the agent has received a
> >> > > terminal
> >> > >    status update, does it makes sense to ignore non-terminal updates
> >> in
> >> > the
> >> > >    queue?
> >> > >
> >> > >
> >> > >    1. As per Executor Driver code comment
> >> > >    <https://github.com/apache/mesos/blob/master/src/java/src/
> >> > > org/apache/mesos/ExecutorDriver.java#L86>,
> >> > >    if the executor is terminated, does agent send TASK_LOST? If so,
> >> does
> >> > it
> >> > >    send once or for each unacknowledged status update?
> >> > >
> >> > >
> >> > > I'll study the code in status update manager and agent separately
> but
> >> > some
> >> > > official answer will definitely help.
> >> > >
> >> > > Many thanks!
> >> > >
> >> > > --
> >> > > Cheers,
> >> > >
> >> > > Zhitao Li
> >> > >
> >> >
> >>
> >
> >
>

Re: Question on status update retry in agent

Posted by Benjamin Mahler <bm...@apache.org>.
Do you have logs? Which acknowledgements did the agent receive? Which
TASK_RUNNING in the sequence was it re-sending?

On Tue, Apr 10, 2018 at 6:41 PM, Benjamin Mahler <bm...@apache.org> wrote:

> > Issue is that, *old executor reference is hold by slave* (assuming it
> did not receive acknowledgement, whereas master and scheduler have
> processed the status updates), so it  continues to retry TASK_RUNNING
> infinitely.
>
> The agent only retries so long as it does not get an acknowledgement, is
> the scheduler acknowledging the duplicates updates or ignoring them?
>
> On Mon, Apr 9, 2018 at 12:10 PM, Varun Gupta <va...@uber.com> wrote:
>
>> Hi,
>>
>> We are running into an issue with slave status update manager. Below is
>> the
>> behavior I am seeing.
>>
>> Our use case is, we run Stateful container (Cassandra process), here
>> Executor polls JMX port at 60 second interval to get Cassandra State and
>> sends the state to agent -> master -> framework.
>>
>> *RUNNING Cassandra Process translates to TASK_RUNNING.*
>> *CRASHED or DRAINED Cassandra Process translates to TASK_FAILED.*
>>
>> At some point slave has multiple TASK_RUNNING status updates in stream and
>> then followed by TASK_FAILED if acknowledgements are pending. We use
>> explicit acknowledgements, and I see Mesos Master receives, all
>> TASK_RUNNING and then TASK_FAILED as well as Framework also receives all
>> TASK_RUNNING updates followed up TASK_FAILED. After receiving TASK_FAILED,
>> Framework restarts different executor on same machine using old persistent
>> volume.
>>
>> Issue is that, *old executor reference is hold by slave* (assuming it did
>> not receive acknowledgement, whereas master and scheduler have processed
>> the status updates), so it  continues to retry TASK_RUNNING infinitely.
>> Here, old executor process is not running. As well as new executor process
>> is running, and continues to work as-is. This makes be believe, some bug
>> with slave status update manager.
>>
>> I read slave status update manager code, recover
>> <https://github.com/apache/mesos/blob/master/src/slave/task_
>> status_update_manager.cpp#L203>
>>  has a constraint
>> <https://github.com/apache/mesos/blob/master/src/slave/task_
>> status_update_manager.cpp#L239>
>>  to ignore status updates from stream if the last executor run is
>> completed.
>>
>> I think, similar constraint should be applicable for status update
>> <https://github.com/apache/mesos/blob/master/src/slave/task_
>> status_update_manager.cpp#L318>
>>  and acknowledge
>> <https://github.com/apache/mesos/blob/master/src/slave/task_
>> status_update_manager.cpp#L760>
>> .
>>
>>
>> Thanks,
>> Varun
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Mar 16, 2018 at 7:47 PM, Benjamin Mahler <bm...@apache.org>
>> wrote:
>>
>> > (1) Assuming you're referring to the scheduler's acknowledgement of a
>> > status update, the agent will not forward TS2 until TS1 has been
>> > acknowledged. So, TS2 will not be acknowledged before TS1 is
>> acknowledged.
>> > FWICT, we'll ignore any violation of this ordering and log a warning.
>> >
>> > (2) To reverse the question, why would it make sense to ignore them?
>> > Assuming you're looking to reduce the number of round trips needed for
>> > schedulers to see the terminal update, I would point you to:
>> > https://issues.apache.org/jira/browse/MESOS-6941
>> >
>> > (3) When the agent sees an executor terminate, it will transition all
>> > non-terminal tasks assigned to that executor to TASK_GONE (partition
>> aware
>> > framework) or TASK_LOST (non partition aware framework) or TASK_FAILED
>> (if
>> > the container OOMed). There may be other cases, it looks a bit
>> convoluted
>> > to me.
>> >
>> > On Thu, Mar 15, 2018 at 10:35 AM, Zhitao Li <zh...@gmail.com>
>> wrote:
>> >
>> > > Hi,
>> > >
>> > > While designing the correct behavior with one of our framework, we
>> > > encounters some questions about behavior of status update:
>> > >
>> > > The executor continuously polls the workload probe to get current
>> mode of
>> > > workload (a Cassandra server), and send various status update states
>> > > (STARTING, RUNNING, FAILED, etc).
>> > >
>> > > Executor polls every 30 seconds, and sends status update. Here, we are
>> > > seeing congestion on task update acknowledgements somewhere (still
>> > > unknown).
>> > >
>> > > There are three scenarios that we want to understand.
>> > >
>> > >    1. Agent queue has task update TS1, TS2 & TS3 (in this order)
>> waiting
>> > on
>> > >    acknowledgement. Suppose if TS2 receives an acknowledgement, then
>> what
>> > > will
>> > >    happen to TS1 update in the queue.
>> > >
>> > >
>> > >    1. Agent queue has task update TS1, TS2, TS3 & TASK_FAILED. Here,
>> TS1,
>> > >    TS2, TS3 are non-terminial updates. Once the agent has received a
>> > > terminal
>> > >    status update, does it makes sense to ignore non-terminal updates
>> in
>> > the
>> > >    queue?
>> > >
>> > >
>> > >    1. As per Executor Driver code comment
>> > >    <https://github.com/apache/mesos/blob/master/src/java/src/
>> > > org/apache/mesos/ExecutorDriver.java#L86>,
>> > >    if the executor is terminated, does agent send TASK_LOST? If so,
>> does
>> > it
>> > >    send once or for each unacknowledged status update?
>> > >
>> > >
>> > > I'll study the code in status update manager and agent separately but
>> > some
>> > > official answer will definitely help.
>> > >
>> > > Many thanks!
>> > >
>> > > --
>> > > Cheers,
>> > >
>> > > Zhitao Li
>> > >
>> >
>>
>
>

Re: Question on status update retry in agent

Posted by Benjamin Mahler <bm...@apache.org>.
> Issue is that, *old executor reference is hold by slave* (assuming it did not
receive acknowledgement, whereas master and scheduler have processed the
status updates), so it  continues to retry TASK_RUNNING infinitely.

The agent only retries so long as it does not get an acknowledgement. Is
the scheduler acknowledging the duplicate updates or ignoring them?
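
For reference, the retry rule is roughly the following. This is a
standalone sketch, not the agent code, and the intervals are placeholders
rather than the agent's actual configuration:

#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Keep resending the oldest unacknowledged update, backing off between
// attempts, and stop only once the acknowledgement arrives.
void retryUntilAcked(std::function<void()> resend,
                     std::function<bool()> acked)
{
  std::chrono::seconds delay{10};
  while (!acked()) {
    resend();
    std::this_thread::sleep_for(delay);
    delay = std::min(delay * 2, std::chrono::seconds{600});
  }
}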

On Mon, Apr 9, 2018 at 12:10 PM, Varun Gupta <va...@uber.com> wrote:

> Hi,
>
> We are running into an issue with slave status update manager. Below is the
> behavior I am seeing.
>
> Our use case is, we run Stateful container (Cassandra process), here
> Executor polls JMX port at 60 second interval to get Cassandra State and
> sends the state to agent -> master -> framework.
>
> *RUNNING Cassandra Process translates to TASK_RUNNING.*
> *CRASHED or DRAINED Cassandra Process translates to TASK_FAILED.*
>
> At some point slave has multiple TASK_RUNNING status updates in stream and
> then followed by TASK_FAILED if acknowledgements are pending. We use
> explicit acknowledgements, and I see Mesos Master receives, all
> TASK_RUNNING and then TASK_FAILED as well as Framework also receives all
> TASK_RUNNING updates followed up TASK_FAILED. After receiving TASK_FAILED,
> Framework restarts different executor on same machine using old persistent
> volume.
>
> Issue is that, *old executor reference is hold by slave* (assuming it did
> not receive acknowledgement, whereas master and scheduler have processed
> the status updates), so it  continues to retry TASK_RUNNING infinitely.
> Here, old executor process is not running. As well as new executor process
> is running, and continues to work as-is. This makes be believe, some bug
> with slave status update manager.
>
> I read slave status update manager code, recover
> <https://github.com/apache/mesos/blob/master/src/slave/
> task_status_update_manager.cpp#L203>
>  has a constraint
> <https://github.com/apache/mesos/blob/master/src/slave/
> task_status_update_manager.cpp#L239>
>  to ignore status updates from stream if the last executor run is
> completed.
>
> I think, similar constraint should be applicable for status update
> <https://github.com/apache/mesos/blob/master/src/slave/
> task_status_update_manager.cpp#L318>
>  and acknowledge
> <https://github.com/apache/mesos/blob/master/src/slave/
> task_status_update_manager.cpp#L760>
> .
>
>
> Thanks,
> Varun
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Fri, Mar 16, 2018 at 7:47 PM, Benjamin Mahler <bm...@apache.org>
> wrote:
>
> > (1) Assuming you're referring to the scheduler's acknowledgement of a
> > status update, the agent will not forward TS2 until TS1 has been
> > acknowledged. So, TS2 will not be acknowledged before TS1 is
> acknowledged.
> > FWICT, we'll ignore any violation of this ordering and log a warning.
> >
> > (2) To reverse the question, why would it make sense to ignore them?
> > Assuming you're looking to reduce the number of round trips needed for
> > schedulers to see the terminal update, I would point you to:
> > https://issues.apache.org/jira/browse/MESOS-6941
> >
> > (3) When the agent sees an executor terminate, it will transition all
> > non-terminal tasks assigned to that executor to TASK_GONE (partition
> aware
> > framework) or TASK_LOST (non partition aware framework) or TASK_FAILED
> (if
> > the container OOMed). There may be other cases, it looks a bit convoluted
> > to me.
> >
> > On Thu, Mar 15, 2018 at 10:35 AM, Zhitao Li <zh...@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > While designing the correct behavior with one of our framework, we
> > > encounters some questions about behavior of status update:
> > >
> > > The executor continuously polls the workload probe to get current mode
> of
> > > workload (a Cassandra server), and send various status update states
> > > (STARTING, RUNNING, FAILED, etc).
> > >
> > > Executor polls every 30 seconds, and sends status update. Here, we are
> > > seeing congestion on task update acknowledgements somewhere (still
> > > unknown).
> > >
> > > There are three scenarios that we want to understand.
> > >
> > >    1. Agent queue has task update TS1, TS2 & TS3 (in this order)
> waiting
> > on
> > >    acknowledgement. Suppose if TS2 receives an acknowledgement, then
> what
> > > will
> > >    happen to TS1 update in the queue.
> > >
> > >
> > >    1. Agent queue has task update TS1, TS2, TS3 & TASK_FAILED. Here,
> TS1,
> > >    TS2, TS3 are non-terminial updates. Once the agent has received a
> > > terminal
> > >    status update, does it makes sense to ignore non-terminal updates in
> > the
> > >    queue?
> > >
> > >
> > >    1. As per Executor Driver code comment
> > >    <https://github.com/apache/mesos/blob/master/src/java/src/
> > > org/apache/mesos/ExecutorDriver.java#L86>,
> > >    if the executor is terminated, does agent send TASK_LOST? If so,
> does
> > it
> > >    send once or for each unacknowledged status update?
> > >
> > >
> > > I'll study the code in status update manager and agent separately but
> > some
> > > official answer will definitely help.
> > >
> > > Many thanks!
> > >
> > > --
> > > Cheers,
> > >
> > > Zhitao Li
> > >
> >
>