Posted to user@spark.apache.org by Matan Shukry <ma...@gmail.com> on 2014/02/23 01:12:26 UTC

Spark High Availability

Lately I have been experimenting with Hadoop and Spark.

I noticed Spark can leverage ZooKeeper in order to run
multiple standby ("secondary") masters.

I was wondering, however, how one should implement the client
in such a situation?

That is, what should the Spark master URL be for a Spark client application?

Let's say, for example, I have 10 nodes, and 3 of them (1/3/5) are masters.
I don't want to put in just one master's URL, since that master may be
brought down.

So, which master URL do I use? Or rather, how do I use one URL
that will change when a new master is chosen?

Note:
I know I can simply keep a list of masters, use try/catch to see which one
fails, and try the others - I was hoping for something "better" in terms of
performance, and something more dynamic as well.

Yours, Jones.

Re: Spark High Availability

Posted by Aaron Davidson <il...@gmail.com>.
Yup, all you need to do is provide valid host addresses that will route to
a Master. So you could, for instance, use host names in the URL, such as
spark://spark-master1.my.com:port1,spark-master2.my.com:port2,
spark-master3.my.com:port3

and then just change the DNS entries to keep them pointing at the
current masters. Or use static IPs, HAProxy, etc.
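
For illustration only, here is a minimal Scala sketch of that pattern. The
spark-masterN.my.com names, the SPARK_MASTER_URL variable, and port 7077
(the default standalone master port) are assumptions for the example, not
part of the answer above; the point is just that the client only ever sees
stable names, so replacing a master host means repointing DNS (or HAProxy,
or a static IP) rather than recompiling the application:

import org.apache.spark.{SparkConf, SparkContext}

object HaClientViaDns {
  def main(args: Array[String]): Unit = {
    // Hypothetical stable names: each spark-masterN.my.com is a DNS record
    // (or HAProxy frontend / static IP) that is kept pointing at a node
    // currently running a Master, so this code never has to change when a
    // master host is replaced.
    // SPARK_MASTER_URL is an illustrative variable name, used here only to
    // avoid hard-coding the URL in the binary at all.
    val masterUrl = sys.env.getOrElse(
      "SPARK_MASTER_URL",
      "spark://spark-master1.my.com:7077,spark-master2.my.com:7077," +
        "spark-master3.my.com:7077")

    val conf = new SparkConf().setAppName("ha-client").setMaster(masterUrl)
    val sc   = new SparkContext(conf)

    // Trivial job just to confirm the connection works.
    println(sc.parallelize(1 to 100).reduce(_ + _))
    sc.stop()
  }
}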

Re: Spark High Availability

Posted by Matan Shukry <ma...@gmail.com>.
Is there any way to make this URL more dynamic, so that a case such as the
one you described, where I would need to add a new node, wouldn't require
recompilation? For example, by using a DNS record, HAProxy, or some other
software?

Re: Spark High Availability

Posted by Aaron Davidson <il...@gmail.com>.
The current way of solving this problem is to list all three masters in
your master URL, e.g.:
spark://host1:port1,host2:port2,host3:port3

This will try all three in parallel and use whichever one is currently the
master. This should work as long as you don't have to introduce a new node
as a backup master (due to one of the others failing permanently) -- in
that case, you'd have to update the master URL to include the new node, in
case it is elected leader, for all *newly created* clients/workers. Old
clients are indifferent to the comings and goings of masters, as any new
master will reconnect to all old clients and workers.
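
As a concrete (and purely hypothetical) example of that multi-master URL,
assuming masters on host1/host2/host3 and the default standalone port 7077,
a client application could be wired up roughly like this:

import org.apache.spark.{SparkConf, SparkContext}

object HaClient {
  def main(args: Array[String]): Unit = {
    // List every master in one comma-separated spark:// URL (hosts and
    // port are illustrative); the client contacts all of them and follows
    // whichever one ZooKeeper has elected as the current leader.
    val conf = new SparkConf()
      .setAppName("ha-client")
      .setMaster("spark://host1:7077,host2:7077,host3:7077")

    val sc = new SparkContext(conf)

    // Trivial job just to confirm the connection works.
    println(sc.parallelize(1 to 100).reduce(_ + _))
    sc.stop()
  }
}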


