Posted to user@zookeeper.apache.org by Vikram Kone <vi...@gmail.com> on 2015/08/18 20:53:51 UTC

How to make any service highly available using Zookeeper

Hi,
I'm a newbie to Zookeeper, so pardon any naive question I ask here.
I have a Cassandra cluster running on Linux VMs and a Spark job
scheduler service running on one of the nodes. Since Cassandra has a
peer-to-peer architecture, there is no concept of a leader.
I want to provide high availability for this job scheduler service using
Zookeeper. I can't make any code changes to the job scheduler service since
it's a 3rd party app.
I'm thinking of copying the application folder to all the servers in the
cluster and using Zookeeper to start an instance of the service on the
leader/master node by executing /opt/job-scheduler/bin/start.sh on leader
election.
Is this something easy to do with Zookeeper?
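Roughly, what I have in mind is something like the sketch below, using the
kazoo client's Election recipe (the ensemble hosts, the znode path and the
identifier are just placeholders; I haven't tried this):

from kazoo.client import KazooClient
import subprocess

# Placeholder ensemble address.
zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

def run_scheduler():
    # Called only on the node that wins the election; staying inside this
    # function keeps the leadership.
    subprocess.call(["/opt/job-scheduler/bin/start.sh"])

# "/job-scheduler/election" is an arbitrary znode path.
election = zk.Election("/job-scheduler/election", identifier="this-node")
election.run(run_scheduler)  # blocks until elected, then runs the function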

Please point me to any documentation or tutorial on how to run a bash script
on the leader node in a Zookeeper ensemble after a node is elected
leader by the quorum.

Thanks

Re: How to make any service highly available using Zookeeper

Posted by Lars Albertsson <la...@gmail.com>.
Hi again Vikram,

I am convinced that there are suitable off-the-shelf solutions in the
HA service niche, e.g. something similar to HAProxy or ELB. I am
not an expert in that area, however, so I cannot recommend anything in
particular.

Among the batch ecosystem components, I see two reasonable options. You
will need a watchdog that respawns the scheduler on failure. It should
be masterless, or you will have issues around who watches the
watchdog.

Your first option would be to introduce Mesos and use either Marathon
or Aurora, which can ensure that exactly N instances of a service are
running.
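
With Marathon, for instance, you would submit an app definition with
"instances": 1 and let Marathon restart the service whenever it dies. A rough
sketch of such a submission (the Marathon URL and the resource numbers are
made up):

import requests

MARATHON = "http://marathon.example.com:8080"  # hypothetical endpoint

app = {
    "id": "/job-scheduler",
    "cmd": "/opt/job-scheduler/bin/start.sh",
    "cpus": 1.0,
    "mem": 2048,
    "instances": 1,  # Marathon keeps exactly one copy running
}

# POST the app definition to Marathon's REST API.
resp = requests.post(MARATHON + "/v2/apps", json=app)
resp.raise_for_status()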

If you don't want to bring in Mesos as a dependency, you can roll your
own watchdog. I believe that it is simpler to base it on regular
health checks driven by cron, as opposed to long-running processes
that both monitor each other and your scheduler. If you put the same
script in crontab on redundant machines, and use a Zookeeper-based
lease (http://kazoo.readthedocs.org/en/latest/api/recipe/lease.html),
you effectively have an HA cron service. The script can then perform a
health check and respawn your scheduler if necessary.
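
A minimal sketch of such a cron-driven check, assuming kazoo's
NonBlockingLease (the ensemble hosts, the lease path and duration, and the
health check/respawn commands are placeholders you would adapt):

#!/usr/bin/env python
# Runs from cron, e.g. every minute, on each redundant machine.
import datetime
import subprocess
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # placeholder ensemble
zk.start()

# Only one machine at a time holds this lease; pick a duration comfortably
# longer than the cron interval.
lease = zk.NonBlockingLease(
    "/job-scheduler/lease",
    datetime.timedelta(minutes=3),
    identifier="watchdog on this host",
)

if lease:
    # We hold the lease: run your health check (script name is hypothetical)
    # and respawn the scheduler if it is unhealthy.
    healthy = subprocess.call(["/opt/job-scheduler/bin/healthcheck.sh"]) == 0
    if not healthy:
        subprocess.call(["/opt/job-scheduler/bin/start.sh"])

zk.stop()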

With both of these options, you risk a split-brain scenario if the
current scheduler stops responding to health checks but still thinks it
is alive, e.g. if you run on bare metal and there are network issues. A
straightforward mitigation is to run your scheduler in a VM or other
container and bring the whole container down on failover.
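
One simple way to get that fencing, sketched below under the assumption
that this guard runs inside the scheduler's VM/container and that halting
the box is acceptable: watch the Zookeeper session and take the whole
container down when the session is lost.

from kazoo.client import KazooClient, KazooState
import subprocess

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # placeholder ensemble

def on_state_change(state):
    if state == KazooState.LOST:
        # We can no longer prove that we are the live instance; halt the
        # whole VM/container so a stale scheduler cannot keep running jobs.
        subprocess.Popen(["poweroff"])

zk.add_listener(on_state_change)
zk.start()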

From your mail, I get the impression that you are considering running
the scheduler on one of the Cassandra nodes, or on the Zookeeper
leader node. I suggest avoiding that for stability reasons and
instead running the scheduler on a dedicated node. A centralised job
scheduler is a bottleneck, and its resource consumption will rise over
time. Cassandra works best in balanced and symmetric deployments, and
Zookeeper does not scale horizontally and is sensitive to overload, so
both are best left alone.

I hope the information is useful.

Regards,

Lars Albertsson






On Tue, Aug 18, 2015 at 8:53 PM, Vikram Kone <vi...@gmail.com> wrote:
> Hi,
> I'm a newbie to Zookeeper, so pardon any naive question I ask here.
> I have a Cassandra cluster running on Linux VMs and a Spark job
> scheduler service running on one of the nodes. Since Cassandra has a
> peer-to-peer architecture, there is no concept of a leader.
> I want to provide high availability for this job scheduler service using
> Zookeeper. I can't make any code changes to the job scheduler service since
> it's a 3rd party app.
> I'm thinking of copying the application folder to all the servers in the
> cluster and using Zookeeper to start an instance of the service on the
> leader/master node by executing /opt/job-scheduler/bin/start.sh on leader
> election.
> Is this something easy to do with Zookeeper?
>
> Please point me to any documentation or tutorial on how to run a bash script
> on the leader node in a Zookeeper ensemble after a node is elected
> leader by the quorum.
>
> Thanks

Re: How to make any service highly available using Zookeeper

Posted by Martin Grotzke <ma...@googlemail.com>.
Hi Vikram,

we built something for running/scheduling jobs on replicated (HA) worker nodes,
and there we're using Cassandra lightweight transactions for job locking:
https://github.com/Galeria-Kaufhof/ha-jobs.

I don't know the Spark job scheduler, but if it's possible to embed it
and control/start/monitor it via some Java/Scala API, you could build a
simple wrapper app that uses ha-jobs to decide on which node the Spark
job scheduler should run.
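
ha-jobs itself is Scala, but the underlying idea is just a Cassandra
lightweight-transaction lock with a TTL. A rough Python sketch of that idea,
independent of the actual ha-jobs API (keyspace, table, host and TTL below
are made up):

from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("my_keyspace")  # placeholders

# Assumed table: CREATE TABLE locks (name text PRIMARY KEY, owner text)
result = session.execute(
    "INSERT INTO locks (name, owner) VALUES (%s, %s) IF NOT EXISTS USING TTL 60",
    ("job-scheduler", "this-node"),
)

if result.was_applied:
    # This node won the lock for the next 60 seconds: make sure the job
    # scheduler is running here, and keep renewing the lock while it is.
    print("active scheduler node")
else:
    print("another node holds the lock")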

Cheers,
Martin
On 18.08.2015 at 20:54, "Vikram Kone" <vi...@gmail.com> wrote:

> Hi,
> I'm a newbie to Zookeeper, so pardon any naive question I ask here.
> I have a Cassandra cluster running on Linux VMs and a Spark job
> scheduler service running on one of the nodes. Since Cassandra has a
> peer-to-peer architecture, there is no concept of a leader.
> I want to provide high availability for this job scheduler service using
> Zookeeper. I can't make any code changes to the job scheduler service since
> it's a 3rd party app.
> I'm thinking of copying the application folder to all the servers in the
> cluster and using Zookeeper to start an instance of the service on the
> leader/master node by executing /opt/job-scheduler/bin/start.sh on leader
> election.
> Is this something easy to do with Zookeeper?
>
> Please point me to any documentation or tutorial on how to run a bash script
> on the leader node in a Zookeeper ensemble after a node is elected
> leader by the quorum.
>
> Thanks
>