Posted to user@mesos.apache.org by Benjamin Mahler <be...@gmail.com> on 2014/12/01 19:28:46 UTC

Re: Task Checkpointing with Mesos, Marathon and Docker containers

> I would like to be able to shut down a mesos-slave for maintenance without
altering the current tasks.

What are you trying to do? If your maintenance operation does not affect
the tasks, why do you need to stop the slave in the first place?

On Wed, Nov 26, 2014 at 1:36 AM, Geoffroy Jabouley <
geoffroy.jabouley@gmail.com> wrote:

> Hello all
>
> thanks for your answers.
>
> Is there a way of configuring this 75s timeout for slave reconnection?
>
> I think my problem is that, because the task status is lost:
> - the Marathon framework detects the loss and starts another instance
> - the mesos-slave, when restarting, detects the lost task and starts a new one
>
> ==> 2 tasks on the Mesos cluster, 2 running Docker containers, 1 app instance
> in Marathon
>
>
> So a solution would be to extend the 75s timeout. I thought that my
> command lines for starting the cluster were fine, but they seem incomplete...
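>
> If I understand correctly (this is an assumption on my part), this 75s
> window comes from the master's health checks: a ping interval multiplied by
> an allowed number of missed pings. In later Mesos releases these appear to
> be exposed as master flags, so extending the window would look roughly like:
>
> /usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050
> --quorum=1 --work_dir=/var/lib/mesos --slave_ping_timeout=30secs
> --max_slave_ping_timeouts=10
>
> In 0.19.x these values look hard-coded, so this sketch may not apply as-is.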
>
> I would like to be able to shut down a mesos-slave for maintenance without
> altering the current tasks.
>
> 2014-11-25 18:30 GMT+01:00 Connor Doyle <co...@mesosphere.io>:
>
>> Hi Geoffroy,
>>
>> For the Marathon instances, in all released versions of Marathon you must
>> supply the --checkpoint flag to turn on task checkpointing for the
>> framework.  We've changed the default to true starting with the next
>> release.
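>>
>> For example, a minimal invocation with checkpointing turned on (hosts and
>> paths here are placeholders, not a prescription):
>>
>> java -cp /usr/local/bin/marathon mesosphere.marathon.Main
>> --master zk://...:2181/mesos --zk zk://...:2181/marathon --checkpoint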
>>
>> There is a bug in Mesos where the FrameworkInfo does not get updated when
>> a framework re-registers.  This means that if you shut down Marathon and
>> restart it with --checkpoint, the Mesos master will ignore the new setting,
>> because Marathon re-registers under the same FrameworkId (which it picks up
>> from ZK).  For reference, here is the design doc to address that:
>> https://cwiki.apache.org/confluence/display/MESOS/Design+doc%3A+Updating+Framework+Info
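>>
>> One way to check which setting the master actually holds for the framework
>> is its state endpoint; treat the field name here as from memory rather than
>> verified:
>>
>> curl -s http://...:5050/master/state.json | python -m json.tool | grep checkpoint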
>>
>> Fortunately, there is an easy workaround.
>>
>> 1) Shut down Marathon (tasks keep running)
>> 2) Restart the leading Mesos master (tasks keep running)
>> 3) Start Marathon with --checkpoint enabled
>>
>> This works by clearing the Mesos master's in-memory state.  It is rebuilt
>> as the slave nodes and frameworks re-register.
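>>
>> Concretely, assuming Mesosphere packages where the services are named
>> marathon and mesos-master (the service names are an assumption on my part),
>> the three steps above would look roughly like:
>>
>> sudo service marathon stop            # 1) tasks keep running
>> sudo service mesos-master restart     # 2) on the leading master only
>> sudo service marathon start           # 3) after adding --checkpoint to its startup flags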
>>
>> Please report back if this doesn't solve the issue for you.
>> --
>> Connor
>>
>>
>> > On Nov 25, 2014, at 07:43, Geoffroy Jabouley <
>> geoffroy.jabouley@gmail.com> wrote:
>> >
>> > Hello
>> >
>> > I am currently trying to activate checkpointing for my Mesos cluster.
>> >
>> > Starting from an application running in a Docker container on the
>> cluster, launched from Marathon, my use cases are the following:
>> >
>> > UC1: kill the Marathon service, then restart it after 2 minutes.
>> > Expected: the Mesos task is still active, the Docker container is
>> running. When the Marathon service restarts, it gets back its tasks.
>> >
>> > Result: OK
>> >
>> >
>> > UC2: kill the mesos-slave service, then restart it after 2 minutes.
>> > Expected: the Mesos task remains active, the Docker container is
>> running. When the mesos-slave service restarts, it gets back its tasks.
>> Marathon does not show an error.
>> >
>> > Result: the task gets status LOST when the slave is killed. The Docker
>> container is still running.  Marathon detects that the application went down
>> and spawns a new one on another available mesos-slave. When the slave
>> restarts, it kills the previously running container and starts a new one. So
>> I end up with 2 applications on my cluster, one spawned by Marathon and
>> another orphaned one.
>> >
>> >
>> > Is this behavior normal? Can you please explain what I am doing wrong?
>> >
>> >
>> -----------------------------------------------------------------------------------------------------------
>> >
>> > Here is the configuration I have so far:
>> > Mesos 0.19.1 (not dockerized)
>> > Marathon 0.6.1 (not dockerized)
>> > Docker 1.3 + Deimos 0.4.2
>> >
>> > Mesos master is started:
>> > /usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050
>> --log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=...
>> --quorum=1 --work_dir=/var/lib/mesos
>> >
>> > Mesos slave is started:
>> > /usr/local/sbin/mesos-slave --master=zk://...:2181/mesos
>> --log_dir=/var/log/mesos --checkpoint=true
>> --containerizer_path=/usr/local/bin/deimos
>> --executor_registration_timeout=5mins --hostname=... --ip=...
>> --isolation=external --recover=reconnect --recovery_timeout=120mins
>> --strict=true
>> >
>> > Marathon is started:
>> > java -Xmx512m -Djava.library.path=/usr/local/lib
>> -Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp
>> /usr/local/bin/marathon mesosphere.marathon.Main --zk
>> zk://...:2181/marathon --master zk://...:2181/mesos --local_port_min 30000
>> --hostname ... --event_subscriber http_callback --http_port 8080
>> --task_launch_timeout 300000 --local_port_max 40000 --ha --checkpoint
>> >
>> >
>> >
>> >
>>
>>
>