
Posted to user@mesos.apache.org by Ankur Chauhan <an...@malloc64.com> on 2015/07/25 21:30:30 UTC

Questions about framework development - (HA and reconciling state)

Hi all,


I am working on an integration between Apache Flink (http://flink.apache.org) and Mesos, similar to the way the current hadoop-mesos integration works, using the Java Mesos client.
My current idea is that the scheduler will also run a JobManager process (similar to the JobTracker), which will start a number of TaskManager (similar to the TaskTracker) tasks using a custom executor.

I would like some feedback and information on the following questions:

0. How do I go about the issue of HA at the scheduler level?
    I was thinking of using ZooKeeper-based leader election, maintaining the ZooKeeper connection myself. Is there a better way to do this (something that does not require a self-managed ZooKeeper connection)?

1. How do I deal with restarts and reconciling the tasks?
    If the scheduler restarts (it currently keeps an in-memory map of running tasks), how do I go about rediscovering tasks and reconciling state?
    I was thinking of using DiscoveryInfo, but I can't find any reference on how to "query" Mesos for tasks matching the service discovery information. Any suggestions on how to do this?

3. How does one go about testing frameworks? Any suggestions / pointers?

My work-in-progress version is at https://github.com/ankurcha/flink/tree/flink-mesos/flink-mesos

Any help would be much appreciated.


Thanks!
Ankur

Re: Questions about framework development - (HA and reconciling state)

Posted by Adam Bordelon <ad...@mesosphere.io>.

> 0. How do I go about the issue of HA at the scheduler level?
One alternative to having to do your own leader election is to use a
meta-framework like Marathon or Aurora to automatically restart your
scheduler. There will be a short downtime during the failover, but as soon
as the new scheduler comes back up it can recover state, reregister, and
reconcile. Then you only ever need one running instance, which is always
the leader.
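
For the restarted scheduler to pick up where the old one left off, it has to reregister with the same framework ID, and the framework should have registered with a nonzero failover_timeout in its FrameworkInfo so the master keeps its tasks running during the gap. Below is a minimal sketch of the "remember the ID" part only. The class name FrameworkIdStore and the use of a local file are assumptions made for illustration; since Marathon or Aurora may restart the scheduler on a different host, a real deployment would more likely keep the ID in ZooKeeper or other replicated storage.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical helper (not part of Mesos): remembers the framework ID across restarts. */
final class FrameworkIdStore {
    private final Path file;

    FrameworkIdStore(Path file) {
        this.file = file;
    }

    /** Returns the ID saved by a previous run, or empty on the very first launch. */
    Optional<String> load() {
        if (!Files.exists(file)) {
            return Optional.empty();
        }
        try {
            String id = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            return id.isEmpty() ? Optional.empty() : Optional.of(id);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Intended to be called once the master has assigned an ID (e.g. from registered()). */
    void save(String frameworkId) {
        try {
            Files.write(file, frameworkId.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On startup the scheduler would call load(); if an ID is present, it sets that ID on the FrameworkInfo it registers with, otherwise it registers without one and calls save() once the master hands back a new ID.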

> 1. How do I deal with restarts and reconciling the tasks?
I strongly recommend you read
http://mesos.apache.org/documentation/latest/reconciliation/
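
As a rough illustration of the explicit reconciliation loop, here is a sketch of the bookkeeping only, with no Mesos dependency. It assumes the scheduler has persisted the IDs of its non-terminal tasks somewhere that survives a restart (an in-memory map alone will not); the class name TaskReconciler and the plain String task IDs are made up for this example. With the Java bindings, each round would pass the remaining tasks to SchedulerDriver.reconcileTasks(...), and statusUpdate() would feed the answers back in. If no task list survived the restart, calling reconcileTasks with an empty collection asks the master for the latest state of every task it knows for the framework.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical bookkeeping for explicit task reconciliation after a scheduler restart. */
final class TaskReconciler {
    private final Set<String> remaining = new HashSet<>();
    private final long maxBackoffMs;
    private long backoffMs;

    TaskReconciler(Collection<String> nonTerminalTaskIds, long initialBackoffMs, long maxBackoffMs) {
        this.remaining.addAll(nonTerminalTaskIds);
        this.backoffMs = initialBackoffMs;
        this.maxBackoffMs = maxBackoffMs;
    }

    /** Task IDs that still need an answer; these go into the next reconcile request. */
    Set<String> remaining() {
        return new HashSet<>(remaining);
    }

    /** Call from statusUpdate(): the master has reported the current state of this task. */
    void onStatusUpdate(String taskId) {
        remaining.remove(taskId);
    }

    /** True once every known task has been answered for. */
    boolean isDone() {
        return remaining.isEmpty();
    }

    /** Delay before the next reconcile request; doubles on each call, capped at maxBackoffMs. */
    long nextBackoffMs() {
        long current = backoffMs;
        backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
        return current;
    }
}
```

The scheduler would loop: send a reconcile request for remaining(), wait nextBackoffMs(), and repeat until isDone() returns true.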

> 3. How does one go about testing frameworks? Any suggestions / pointers?
- Unit tests within your framework code, mocking necessary Mesos
Master/Slave components.
- Health checks on all your tasks, and a `/health` endpoint on your
scheduler, to ease integration testing.
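
A `/health` endpoint on the scheduler can be quite small. The sketch below uses only the HTTP server that ships with the JDK; the class name HealthEndpoint and the BooleanSupplier used to decide health are assumptions for illustration (in a real scheduler the supplier might check that the driver is registered with the master).

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: serves GET /health with 200 when healthy and 503 otherwise. */
final class HealthEndpoint {
    private HealthEndpoint() {}

    /** Starts the server on the given port (0 picks a free port) and returns it. */
    static HttpServer start(int port, BooleanSupplier healthy) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
            server.createContext("/health", exchange -> {
                boolean ok = healthy.getAsBoolean();
                byte[] body = (ok ? "ok" : "unhealthy").getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(ok ? 200 : 503, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An integration test can then start the scheduler, poll this endpoint until it returns 200, and only afterwards submit work; calling stop(0) on the returned server shuts it down.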


Re: Questions about framework development - (HA and reconciling state)

Posted by Jeff Schroeder <je...@computer.org>.

Not sure how much more difficult it would be, but Apache Aurora uses the
native Mesos replicated log construct for data persistence (where you store
data in memory). It requires one manual setup step to deploy the framework,
but it seems worth it for what you get out of it. I recently tested it out
and was impressed with how bulletproof it is. Here is what I did.

I ran a semi chaos monkey test with Aurora + aurproxy with an nginx load
balancer. At random intervals of under 200 seconds, it restarted one of the
5 Aurora schedulers in a loop. Meanwhile, with clients hitting the webapp at
~50-60 rps, I was cycling an Aurora job update between 5 and 15 instances in
a loop to see how the clients handled scheduler failover and instances being
killed.

I never had a single issue from the schedulers and saw only a single 502
error after about 2 million requests, which can be mitigated with a bit more
tuning.


-- 
Text by Jeff, typos by iPhone