Posted to user@cassandra.apache.org by A J <s5...@gmail.com> on 2011/02/15 21:45:16 UTC

Coordinator node

From my reading it seems like the node that the client connects to becomes
the coordinator node. Questions:

1. Is it true that the write first happens on the coordinator node, and that
the coordinator node then propagates it to the right primary node and the
replicas? In other words, if I have a 2G write, would the 2G be transferred
to the coordinator node first, or is the coordinator just a witness that
waits for the transfer to happen directly between the client and the
required nodes?

2. How do you load-balance between the different nodes so that all have an
equal chance of becoming the coordinator node? Does the client need some
sort of round-robin DNS balancer? If so, what happens when some of the
nodes drop off? How do you inform the DNS balancer?
Or do I need a proper load balancer in front that looks at the traffic on
each node and selects a coordinator node accordingly? Which is more
prevalent?

Thanks.

Re: Coordinator node

Posted by A J <s5...@gmail.com>.
Hi,
Are there any blogs/writeups anyone is aware of that talk about using the
primary replica as the coordinator node (rather than a random coordinator
node) in production scenarios?

Thank you.



Re: Coordinator node

Posted by A J <s5...@gmail.com>.
Thanks for the confirmation. Interesting alternatives to avoid a random
coordinator.
Are there any blogs/writeups of this (the primary node as coordinator)
being used in production scenarios? I googled but could not find anything
relevant.


Re: Coordinator node

Posted by Oleg Anastasyev <ol...@gmail.com>.
A J <s5alye <at> gmail.com> writes:

> But does the written column's traffic 'flow' through the coordinator
> node? For a 2G column write, will I see 2G of network traffic on the
> coordinator node, or just a few bytes of traffic from it reading the
> key and talking to the nodes/client etc.?

Yes, if you talk to a random (AKA coordinator) node first, all 2G of
traffic will flow to it first and then be forwarded to the natural nodes
(those owning replicas of the row to be written).
If you want to avoid the extra traffic, you should determine the natural
nodes of the row and send your write directly to one of them (i.e. one of
the natural nodes becomes the coordinator). This natural coordinator node
will accept the write locally and submit the write to the other replicas
in parallel.
If your client is written in Java this can be implemented relatively
easily. Look at TokenMetadata.ringIterator().

If you have no requirement to use the thrift interface of cassandra, it
could be more efficient to write using the StorageProxy interface. The
latter plays the local coordinator role itself: it talks directly to all
replicas, so the 2G will be passed directly from your client to all row
replicas.
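
To make the first approach concrete, here is a rough client-side sketch
that uses only the public Thrift API (describe_ring) rather than internals
like TokenMetadata. It assumes 0.7-era Thrift bindings and the
RandomPartitioner; the class and method names are my own, and generated
signatures may differ between versions:

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.TokenRange;

public class NaturalEndpointPicker
{
    // RandomPartitioner token: MD5 of the key, as a non-negative BigInteger.
    static BigInteger tokenFor(byte[] key) throws Exception
    {
        byte[] hash = MessageDigest.getInstance("MD5").digest(key);
        return new BigInteger(hash).abs();
    }

    // Ask any node for the ring layout, then find the range owning the
    // token. The endpoints of that range are the natural nodes for the key.
    static List<String> naturalEndpoints(Cassandra.Client anyNode,
                                         String keyspace,
                                         byte[] key) throws Exception
    {
        BigInteger token = tokenFor(key);
        for (TokenRange range : anyNode.describe_ring(keyspace))
        {
            BigInteger start = new BigInteger(range.getStart_token());
            BigInteger end = new BigInteger(range.getEnd_token());
            // Ranges are (start, end]; the ring wraps at the minimum token.
            boolean wraps = start.compareTo(end) >= 0;
            boolean inRange = wraps
                ? token.compareTo(start) > 0 || token.compareTo(end) <= 0
                : token.compareTo(start) > 0 && token.compareTo(end) <= 0;
            if (inRange)
                return range.getEndpoints();
        }
        throw new IllegalStateException("token not covered by any range");
    }
}

The client can then open a connection to one of the returned endpoints and
send the write there, so the 2G only travels client-to-replica.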





Re: Coordinator node

Posted by A J <s5...@gmail.com>.
Makes sense! Thanks.
Just a quick follow-up:
Now I understand the write is not made on the coordinator (unless it is
part of the replica set for that key). But does the written column's
traffic 'flow' through the coordinator node? For a 2G column write, will I
see 2G of network traffic on the coordinator node, or just a few bytes of
traffic from it reading the key and talking to the nodes/client etc.?

This will be a factor for us, so I need to know exactly.



Re: Coordinator node

Posted by Matthew Dennis <md...@datastax.com>.
It doesn't write anything to the coordinator node; it just forwards the
write to the nodes in the replica set for that row key.

The write goes to some node (the coordinator, i.e. whatever node you
connected to).
The coordinator looks at the key and determines which nodes are
responsible for it.
In parallel, it forwards the request to those nodes (if the coordinator is
itself in the replica set for that key, it writes locally in parallel with
the forwarded writes).
The coordinator waits until it has the appropriate number of responses
from the nodes in the replica set for the key to meet your consistency
level (possibly including itself).
The coordinator then determines the correct value to send to the client
based on the responses it receives, and sends it.
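
To illustrate "the appropriate number of responses": the count the
coordinator blocks for follows from the consistency level and the
replication factor. A minimal sketch (my own helper, not Cassandra's
actual code, and covering only the common levels):

enum ConsistencyLevel { ONE, QUORUM, ALL }

// Number of replica acks the coordinator must collect before replying
// to the client (sketch; real Cassandra handles more levels than these).
static int blockFor(ConsistencyLevel cl, int replicationFactor)
{
    switch (cl)
    {
        case ONE:    return 1;
        case QUORUM: return replicationFactor / 2 + 1; // a majority
        case ALL:    return replicationFactor;
        default:     throw new AssertionError(cl);
    }
}

// e.g. with RF = 3: ONE -> 1, QUORUM -> 2, ALL -> 3.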


Re: Coordinator node

Posted by A J <s5...@gmail.com>.
Thanks.
1. That is somewhat disappointing. I wish the redundancy of the write on
the coordinator node could have been avoided somehow.
Does the write on the coordinator node (in case it is not part of the N
replica nodes for that key) get deleted before the response to the write
is returned to the client?


Re: Coordinator node

Posted by Matthew Dennis <md...@datastax.com>.
1. Yes, the coordinator node propagates requests to the correct nodes.

2. Most (all?) higher-level clients (pycassa, hector, etc.) load balance
for you.  In general your client and/or the caller of the client needs to
catch exceptions and retry.  If you're using RRDNS and some of the nodes
are temporarily down, you wouldn't bother to update DNS; your client would
just route to some other node that is up after noticing the first node is
down.

In general you don't want a load balancer in front of the nodes, as the
load balancer itself becomes a SPOF as well as a performance bottleneck
(not to mention the extra cost and complexity).  By far the most common
setup is to have the clients load balance for you, coupled with retry
logic in your application.
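
For a sense of what that client-side balancing plus retry logic looks
like, here is a minimal hand-rolled sketch (hector, pycassa and friends do
this, and more, for you; the class and its names are hypothetical):

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Round-robin over a fixed host list, failing over to the next host when
// a node is down. A real client would also blacklist dead hosts for a
// while and periodically retry them.
public class RoundRobinExecutor
{
    public interface Operation<T> { T execute(String host) throws Exception; }

    private final List<String> hosts;
    private final AtomicInteger cursor = new AtomicInteger();

    public RoundRobinExecutor(List<String> hosts) { this.hosts = hosts; }

    public <T> T execute(Operation<T> op) throws Exception
    {
        Exception last = null;
        // Try each host at most once, starting at the round-robin cursor.
        for (int attempt = 0; attempt < hosts.size(); attempt++)
        {
            String host = hosts.get(
                Math.floorMod(cursor.getAndIncrement(), hosts.size()));
            try
            {
                return op.execute(host);
            }
            catch (Exception e)
            {
                last = e; // node down or timed out; try the next host
            }
        }
        throw last; // every host failed
    }
}

// usage (clientFor is a hypothetical connection factory):
//   new RoundRobinExecutor(hosts).execute(host -> clientFor(host).insert(...));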


Re: Coordinator node

Posted by Attila Babo <ba...@gmail.com>.
There is a single point of failure for sure, as there is a single proxy in
front, but that pays off because the load is even across the nodes.
Another plus is that when a machine is out of the cluster for maintenance,
the proxy handles it automatically. Originally I started this as an
experiment; we have a large number of long-running clients, and using a
proxy was an easy way to reduce configuration.


Re: Coordinator node

Posted by Matthew Dennis <md...@datastax.com>.
Do you have a single HAProxy node in front of the cluster, or an HAProxy
node on each machine that is a client of Cassandra, pointing at all the
nodes in the cluster?

The former has a SPOF and bottleneck (the HAProxy instance); the latter
does not (and is somewhat common, especially for setups like Apache+PHP).


Re: Coordinator node

Posted by Attila Babo <ba...@gmail.com>.
We are using haproxy in TCP mode for round-robin with great success. It's
a bit unorthodox but has some real added value, like logging.

Here is the relevant config for haproxy:

#####

global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    maxconn 4096
    user haproxy
    group haproxy
    daemon

defaults
    log global
    # Plain TCP proxying; haproxy does not inspect the Thrift protocol.
    mode tcp
    maxconn 2000
    # Timeouts in milliseconds: connect, client idle, server idle.
    contimeout 5000
    clitimeout 50000
    srvtimeout 50000

# Listen on the Thrift port and round-robin across the Cassandra nodes.
# "check" enables active health checks; "observe layer4" also takes
# errors on live client connections into account.
listen cassandra 0.0.0.0:9160
    balance roundrobin
    server db1 ec2-XXX.compute-1.amazonaws.com:9160 check observe layer4
    server db2 ec2-YYY.compute-1.amazonaws.com:9160 check observe layer4
    server db3 ec2-ZZZ.compute-1.amazonaws.com:9160 check observe layer4