Posted to user@cassandra.apache.org by Vasileios Vlachos <va...@gmail.com> on 2014/05/30 00:22:22 UTC

Multi-DC Environment Question

Hello All,

We have plans to add a second DC to our live Cassandra environment. 
Currently RF=3 and we read and write at QUORUM. After adding DC2 we are 
going to be reading and writing at LOCAL_QUORUM.
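
Just to make this concrete, the following is roughly what I mean (a rough, untested sketch with the DataStax Python driver; the keyspace/table names, the 'DC1'/'DC2' names and the addresses are placeholders, not our real ones):

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(['10.2.1.102'])
session = cluster.connect()

# keep RF=3 in each DC once DC2 is part of the cluster
session.execute("""
    ALTER KEYSPACE myks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}
""")

# then write (and read) at LOCAL_QUORUM instead of QUORUM
stmt = SimpleStatement(
    "INSERT INTO myks.mytable (id, value) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM)
session.execute(stmt, (42, 'some value'))

The table itself is assumed to already exist; the point is only the NetworkTopologyStrategy replication and the LOCAL_QUORUM consistency level.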

If my understanding is correct, when a client sends a write request, if 
the consistency level is satisfied on DC1 (that is, RF/2 + 1 replicas 
acknowledge), success is returned to the client and DC2 will eventually 
get the data as well. The assumption behind this is that the client 
always connects to DC1 for reads and writes, and that there is a 
site-to-site VPN between DC1 and DC2. Therefore, DC1 will almost always 
return success before DC2 (actually I don't know if it is possible for 
DC2 to be more up-to-date than DC1 with this setup...).
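
By "the client always connects to DC1" I mean something like pinning the driver to the local DC, along these lines (again just a sketch; the DC name is whatever the snitch reports, and the addresses are placeholders):

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy

# only ever pick coordinators in DC1, with no remote-DC fallback in the driver
cluster = Cluster(
    ['10.2.1.102', '10.2.1.103'],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='DC1',
                                                  used_hosts_per_remote_dc=0))
session = cluster.connect('myks')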

Now imagine DC1 loses connectivity and the client fails over to DC2. 
Everything should work fine after that, with the only difference that 
DC2 will now be handling the requests directly from the client. After 
some time, say after max_hint_window_in_ms, DC1 comes back up. My 
question is how do I bring DC1 up to speed with DC2, which is now more 
up-to-date? Will that require a nodetool repair on the DC1 nodes? Also, 
what is the answer when the outage is < max_hint_window_in_ms instead?

Thanks in advance!

Vasilis

-- 
Kind Regards,

Vasileios Vlachos


Re: Multi-DC Environment Question

Posted by Vasileios Vlachos <va...@gmail.com>.
Hello again,

Back to this after a while...

As far as I can tell, whenever DC2 is unavailable, one node in DC1 acts as
the coordinator. When DC2 is available again, that node sends the hints to
only one node in DC2, which then forwards the replicas to the other nodes
in its local DC (DC2). This keeps cross-DC bandwidth usage efficient. I was
watching "system.hints" on all nodes during this test and this is the
conclusion I came to.
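
For anyone who wants to reproduce the test, what I was doing boils down to something like this (sketch only, Python driver; it assumes that in 1.2 system.hints is keyed by the host ID of the node each hint is destined for, and that host IDs can be mapped back to addresses via system.peers):

from cassandra.cluster import Cluster
from cassandra.policies import WhiteListRoundRobinPolicy

node = '10.2.1.102'  # the coordinator I want to inspect
cluster = Cluster([node],
                  load_balancing_policy=WhiteListRoundRobinPolicy([node]))
session = cluster.connect('system')

# map host IDs back to addresses so the output is readable
host_by_id = {row.host_id: row.peer
              for row in session.execute("SELECT peer, host_id FROM peers")}

# system.hints is keyed by the host ID of the target replica
for row in session.execute("SELECT target_id FROM hints"):
    print("hint stored for %s" % host_by_id.get(row.target_id, row.target_id))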

Two things:
1. If the above is correct, does the same apply when performing
anti-entropy repair (without specifying a particular DC)? I'm just hoping
the answer to this is going to be YES, otherwise the VPN is not going to be
very happy in our case and we would prefer not to saturate it whenever
running nodetool repair. I suppose we could have a traffic limiter on the
firewalls as a worst-case scenario, but I would appreciate your input if
you know more about this.

2. As I described earlier, in order to test this I was watching the
"system.hints" CF to monitor any hints. I want to add a Nagios check for
this purpose, so I was looking into the JMX console. I noticed that when a
node stores hints, the "MemtableColumnsCount" attribute of "MBean
org.apache.cassandra.db:type=ColumnFamilies,keyspace=system,columnfamily=hints"
goes up (although I would expect it to be MemtableRowCount or something?).
This attribute retains its value until the other node becomes available and
ready to receive the hints. I was looking for another attribute somewhere
to monitor the active hints. I checked:

"MBean
org.apache.cassandra.metrics:type=ColumnFamily,keyspace=system,scope=hints,name=PendingTasks",

"MBean org.apache.cassandra.metrics:type=Storage,name=TotalHints",
"MBean
org.apache.cassandra.metrics:type=Storage,name=TotalHintsInProgress",
"MBean
org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=HintedHandoff,name=ActiveTasks"
and even
"MBean
org.apache.cassandra.metrics:type=HintedHandOffManager,name=Hints_not_stored-/
10.2.1.100" (this one will never go back to zero).

None of them increased while hints were being sent (or at least I did not
catch it, perhaps because it happened too fast?). Does anyone know what all
these attributes represent? It looks like there are more specific hint
attributes on a per-CF basis, but I was looking for a more generic one to
begin with. Any help would be much appreciated.
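
In the meantime, my fallback for the Nagios check is to skip JMX and just count the rows sitting in system.hints on each node, roughly like this (sketch only; the thresholds are made up, and I am assuming that connecting to 127.0.0.1 with a whitelist policy returns that node's own system keyspace):

#!/usr/bin/env python
# Rough Nagios plugin sketch: alert on the number of hints stored on this node.
import sys
from cassandra.cluster import Cluster
from cassandra.policies import WhiteListRoundRobinPolicy

WARN, CRIT = 100, 1000  # made-up thresholds

cluster = Cluster(['127.0.0.1'],
                  load_balancing_policy=WhiteListRoundRobinPolicy(['127.0.0.1']))
session = cluster.connect('system')
hints = session.execute("SELECT COUNT(*) FROM hints")[0].count

if hints >= CRIT:
    print("CRITICAL - %d hints stored" % hints)
    sys.exit(2)
elif hints >= WARN:
    print("WARNING - %d hints stored" % hints)
    sys.exit(1)
else:
    print("OK - %d hints stored" % hints)
    sys.exit(0)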

Thanks in advance,

Vasilis


On Wed, Jun 4, 2014 at 1:42 PM, Vasileios Vlachos <
vasileiosvlachos@gmail.com> wrote:

> Hello Matt,
>
> nodetool status:
>
> Datacenter: MAN
> ===============
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> -- Address Load Owns (effective) Host ID Token Rack
> UN 10.2.1.103 89.34 KB 99.2% b7f8bc93-bf39-475c-a251-8fbe2c7f7239
> -9211685935328163899 RAC1
> UN 10.2.1.102 86.32 KB 0.7% 1f8937e1-9ecb-4e59-896e-6d6ac42dc16d
> -3511707179720619260 RAC1
> Datacenter: DER
> ===============
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> -- Address Load Owns (effective) Host ID Token Rack
> UN 10.2.1.101 75.43 KB 0.2% e71c7ee7-d852-4819-81c0-e993ca87dd5c
> -1277931707251349874 RAC1
> UN 10.2.1.100 104.53 KB 99.8% 7333b664-ce2d-40cf-986f-d4b4d4023726
> -9204412570946850701 RAC1
>
> I do not know why the cluster is not balanced at the moment, but it holds
> almost no data. I will populate it soon and see how that goes. The output
> of 'nodetool ring' just lists all the tokens assigned to each individual
> node, and as you can imagine it would be pointless to paste it here. I just
> did 'nodetool ring | awk ... | unique | wc -l' and it works out to be 1024
> as expected (4 nodes x 256 tokens each).
>
> Still have not got the answers to the other questions though...
>
> Thanks,
>
> Vasilis
>
>
> On Wed, Jun 4, 2014 at 12:28 AM, Matthew Allen <ma...@gmail.com>
> wrote:
>
>> Thanks Vasileios.  I think I need to make a call as to whether to switch
>> to vnodes or stick with tokens for my Multi-DC cluster.
>>
>> Would you be able to show a nodetool ring/status from your cluster to see
>> what the token assignment looks like?
>>
>> Thanks
>>
>> Matt
>>
>>
>> On Wed, Jun 4, 2014 at 8:31 AM, Vasileios Vlachos <
>> vasileiosvlachos@gmail.com> wrote:
>>
>>>  I should have said that earlier really... I am using 1.2.16 and Vnodes
>>> are enabled.
>>>
>>> Thanks,
>>>
>>> Vasilis
>>>
>>> --
>>> Kind Regards,
>>>
>>> Vasileios Vlachos
>>>
>>>
>>
>

Re: Multi-DC Environment Question

Posted by Vasileios Vlachos <va...@gmail.com>.
Hello Matt,

nodetool status:

Datacenter: MAN
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN 10.2.1.103 89.34 KB 99.2% b7f8bc93-bf39-475c-a251-8fbe2c7f7239
-9211685935328163899 RAC1
UN 10.2.1.102 86.32 KB 0.7% 1f8937e1-9ecb-4e59-896e-6d6ac42dc16d
-3511707179720619260 RAC1
Datacenter: DER
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN 10.2.1.101 75.43 KB 0.2% e71c7ee7-d852-4819-81c0-e993ca87dd5c
-1277931707251349874 RAC1
UN 10.2.1.100 104.53 KB 99.8% 7333b664-ce2d-40cf-986f-d4b4d4023726
-9204412570946850701 RAC1

I do not know why the cluster is not balanced at the moment, but it holds
almost no data. I will populate it soon and see how that goes. The output
of 'nodetool ring' just lists all the tokens assigned to each individual
node, and as you can imagine it would be pointless to paste it here. I just
did 'nodetool ring | awk ... | unique | wc -l' and it works out to be 1024
as expected (4 nodes x 256 tokens each).

Still have not got the answers to the other questions though...

Thanks,

Vasilis


On Wed, Jun 4, 2014 at 12:28 AM, Matthew Allen <ma...@gmail.com>
wrote:

> Thanks Vasileios.  I think I need to make a call as to whether to switch
> to vnodes or stick with tokens for my Multi-DC cluster.
>
> Would you be able to show a nodetool ring/status from your cluster to see
> what the token assignment looks like?
>
> Thanks
>
> Matt
>
>
> On Wed, Jun 4, 2014 at 8:31 AM, Vasileios Vlachos <
> vasileiosvlachos@gmail.com> wrote:
>
>>  I should have said that earlier really... I am using 1.2.16 and Vnodes
>> are enabled.
>>
>> Thanks,
>>
>> Vasilis
>>
>> --
>> Kind Regards,
>>
>> Vasileios Vlachos
>>
>>
>

Re: Multi-DC Environment Question

Posted by Matthew Allen <ma...@gmail.com>.
Thanks Vasileios.  I think I need to make a call as to whether to switch to
vnodes or stick with tokens for my Multi-DC cluster.

Would you be able to show a nodetool ring/status from your cluster to see
what the token assignment looks like?

Thanks

Matt


On Wed, Jun 4, 2014 at 8:31 AM, Vasileios Vlachos <
vasileiosvlachos@gmail.com> wrote:

>  I should have said that earlier really... I am using 1.2.16 and Vnodes
> are enabled.
>
> Thanks,
>
> Vasilis
>
> --
> Kind Regards,
>
> Vasileios Vlachos
>
>

Re: Multi-DC Environment Question

Posted by Vasileios Vlachos <va...@gmail.com>.
I should have said that earlier really... I am using 1.2.16 and Vnodes 
are enabled.

Thanks,

Vasilis

-- 
Kind Regards,

Vasileios Vlachos


Re: Multi-DC Environment Question

Posted by Vasileios Vlachos <va...@gmail.com>.
Thanks for your responses!

Matt, I did a test with 4 nodes, 2 in each DC, and the answer appears to 
be yes. The tokens seem to be unique across the entire cluster, not just 
on a per-DC basis. I don't know if the number of nodes deployed is 
enough to reassure me, but this is my conclusion for now. Please 
correct me if you know I'm wrong.
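
For what it's worth, my check boils down to something like this (rough sketch, Python driver; it just pulls the tokens from system.local and system.peers and verifies that no token appears more than once across the two DCs):

from collections import defaultdict
from cassandra.cluster import Cluster

cluster = Cluster(['10.2.1.100'])
session = cluster.connect('system')

tokens_by_dc = defaultdict(set)
local = session.execute("SELECT data_center, tokens FROM local")[0]
tokens_by_dc[local.data_center].update(local.tokens)
for peer in session.execute("SELECT data_center, tokens FROM peers"):
    tokens_by_dc[peer.data_center].update(peer.tokens)

all_tokens = [t for toks in tokens_by_dc.values() for t in toks]
# if the two numbers match, no token is shared between nodes or DCs
print("total tokens: %d, distinct: %d"
      % (len(all_tokens), len(set(all_tokens))))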

Rob, this is the plan of attack I have in mind now. In the case of a 
catastrophic failure of a DC, though, the downtime is usually longer than 
that. So it's either less than the default value (when testing that the 
DR works, for example) or more (when actually using the DR as the primary 
DC). Based on that, the default seems reasonable to me.

I also found that nodetool repair can be performed on one DC only by 
specifying the --in-local-dc option. So, presumably the classic nodetool 
repair applies to the entire cluster (sounds obvious, but is that 
actually correct?).

Question 3 in my previous email remains unanswered... I cannot find out 
whether only one hint is stored on the coordinator irrespective of the 
number of replicas that are down, and also whether the hint is 100% of 
the size of the original write request.

Thanks,

Vasilis

On 03/06/14 18:52, Robert Coli wrote:
> On Fri, May 30, 2014 at 4:08 AM, Vasileios Vlachos 
> <vasileiosvlachos@gmail.com <ma...@gmail.com>> wrote:
>
>     Basically you sort of confirmed that if down_time >
>     max_hint_window_in_ms the only way to bring DC1 up-to-date is
>     anti-entropy repair.
>
>
>     Also, read repair does not help either as we assumed that
>     down_time > max_hint_window_in_ms. Please correct me if I am wrong.
>
>
> My understanding is that if you :
>
> 1) set read repair chance to 100%
> 2) read all keys in the keyspace with a client
>
> You would accomplish the same increase in consistency as you would by 
> running repair.
>
> In cases where this may matter, and your system can handle delivering 
> the hints, increasing the already-increased-from-old-default-of-1-hour 
> current default of 3 hours to 6 or more hours gives operators more 
> time to work in the case of partition or failure. Note that hints are 
> only an optimization, only repair (and read repair at 100%, I think..) 
> assert any guarantee of consistency.
>
> =Rob
>

-- 
Kind Regards,

Vasileios Vlachos


Re: Multi-DC Environment Question

Posted by Robert Coli <rc...@eventbrite.com>.
On Fri, May 30, 2014 at 4:08 AM, Vasileios Vlachos <
vasileiosvlachos@gmail.com> wrote:

> Basically you sort of confirmed that if down_time > max_hint_window_in_ms
> the only way to bring DC1 up-to-date is anti-entropy repair.
>

Also, read repair does not help either as we assumed that down_time >
> max_hint_window_in_ms. Please correct me if I am wrong.
>

My understanding is that if you:

1) set read repair chance to 100%
2) read all keys in the keyspace with a client

You would accomplish the same increase in consistency as you would by
running repair.
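
A rough sketch of what I mean (untested, DataStax Python driver; the keyspace/table/column names are made up, and the token() walk is only there because 1.2 has no native paging):

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(['10.2.1.102'])
session = cluster.connect('myks')

# 1) make every read trigger (global) read repair
session.execute("ALTER TABLE mytable WITH read_repair_chance = 1.0")

# 2) touch every key, walking the token ring in slices
query = SimpleStatement(
    "SELECT id, token(id) FROM mytable WHERE token(id) > %s LIMIT 1000",
    consistency_level=ConsistencyLevel.QUORUM)

last = -2 ** 63   # Murmur3Partitioner minimum token
touched = 0
while True:
    rows = list(session.execute(query, (last,)))
    if not rows:
        break
    touched += len(rows)
    last = rows[-1][1]   # token(id) of the last row in this slice
print("read %d rows" % touched)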

In cases where this may matter, and your system can handle delivering the
hints, increasing the already-increased-from-old-default-of-1-hour current
default of 3 hours to 6 or more hours gives operators more time to work in
the case of partition or failure. Note that hints are only an optimization;
only repair (and read repair at 100%, I think..) asserts any guarantee of
consistency.

=Rob

Re: Multi-DC Environment Question

Posted by Matthew Allen <ma...@gmail.com>.
Hi Vasilis,

With regards to Question 2.

*         | How are tokens assigned when adding a 2nd DC? Is the range
-2^63 to 2^63-1 for each DC, or is it -2^63 to 2^63-1 for the entire
cluster? (I think the latter is correct) *

Have you been able to deduce an answer to this (assuming Murmur3
Partitioner) ?

I'm facing the same problem and am looking to do the latter (-2^63 to
2^63-1 for the entire cluster), given that nodetool repair isn't multi-DC
aware (https://issues.apache.org/jira/browse/CASSANDRA-2609).

Unfortunately the documentation (
http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configGenTokens_c.html)
isn't quite clear ("Multiple data center deployments: calculate the tokens
for each data center so that the hash range is evenly divided for the nodes
in each data center.")
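
In case it helps the discussion, the way I currently read that sentence is something like the following (a sketch only; the +100 offset for the second DC is the convention I have seen suggested for avoiding token collisions, not something the page itself spells out):

# Evenly spaced Murmur3 tokens per DC, with DC2 offset by 100 so no token collides
NODES_PER_DC = 2
RANGE_MIN, RANGE_MAX = -2 ** 63, 2 ** 63 - 1
span = RANGE_MAX - RANGE_MIN + 1

dc1 = [RANGE_MIN + i * span // NODES_PER_DC for i in range(NODES_PER_DC)]
dc2 = [t + 100 for t in dc1]

print("DC1 initial_tokens: %s" % dc1)
print("DC2 initial_tokens: %s" % dc2)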

Thanks

Matt


On Fri, May 30, 2014 at 9:08 PM, Vasileios Vlachos <
vasileiosvlachos@gmail.com> wrote:

>
> Thanks for your responses, Ben thanks for the link.
>
> Basically you sort of confirmed that if down_time > max_hint_window_in_ms
> the only way to bring DC1 up-to-date is anti-entropy repair. Read
> consistency level is irrelevant to the problem I described as I am reading
> LOCAL_QUORUM. In this situation I lost whatever data -if any- had not been
> transfered across to DC2 before DC1 went down, that is understandable.
> Also, read repair does not help either as we assumed that down_time >
> max_hint_window_in_ms. Please correct me if I am wrong.
>
> I think I could better understand how that works if I knew the answers to
> the following questions:
> 1. What is the output of nodetool status when a cluster spans 2 DCs?
> Will I be able to see ALL nodes irrespective of the DC they belong to?
> 2. How are tokens assigned when adding a 2nd DC? Is the range -2^63 to
> 2^63-1 for each DC, or is it -2^63 to 2^63-1 for the entire cluster? (I
> think the latter is correct)
> 3. Does the coordinator store 1 hint irrespective of how many replicas
> happen to be down at the time, and also irrespective of DC2 being down in
> the scenario I described above? (I think the answer is in the presentation
> you sent me, but I would like someone to confirm it)
>
> Thank you in advance,
>
> Vasilis
>
>
> On Fri, May 30, 2014 at 3:13 AM, Ben Bromhead <be...@instaclustr.com> wrote:
>
>> Short answer:
>>
>> If time elapsed > max_hint_window_in_ms then hints will stop being
>> created. You will need to rely on your read consistency level, read repair
>> and anti-entropy repair operations to restore consistency.
>>
>> Long answer:
>>
>>
>> http://www.slideshare.net/jasedbrown/understanding-antientropy-in-cassandra
>>
>> Ben Bromhead
>> Instaclustr | www.instaclustr.com | @instaclustr
>> <http://twitter.com/instaclustr> | +61 415 936 359
>>
>> On 30 May 2014, at 8:40 am, Tupshin Harper <tu...@tupshin.com> wrote:
>>
>> When one node or DC is down, coordinator nodes being written through will
>> notice this fact and store hints (hinted handoff is the mechanism),  and
>> those hints are used to send the data that was not able to be replicated
>> initially.
>>
>> http://www.datastax.com/dev/blog/modern-hinted-handoff
>>
>> -Tupshin
>> On May 29, 2014 6:22 PM, "Vasileios Vlachos" <va...@gmail.com>
>> wrote:
>>
>>  Hello All,
>>
>> We have plans to add a second DC to our live Cassandra environment.
>> Currently RF=3 and we read and write at QUORUM. After adding DC2 we are
>> going to be reading and writing at LOCAL_QUORUM.
>>
>> If my understanding is correct, when a client sends a write request, if
>> the consistency level is satisfied on DC1 (that is RF/2+1), success is
>> returned to the client and DC2 will eventually get the data as well. The
>> assumption behind this is that the client always connects to DC1 for
>> reads and writes and given that there is a site-to-site VPN between DC1 and
>> DC2. Therefore, DC1 will almost always return success before DC2 (actually
>> I don't know if it is possible for DC2 to be more up-to-date than DC1 with
>> this setup...).
>>
>> Now imagine DC1 loses connectivity and the client fails over to DC2.
>> Everything should work fine after that, with the only difference that DC2
>> will now be handling the requests directly from the client. After some
>> time, say after max_hint_window_in_ms, DC1 comes back up. My question is
>> how do I bring DC1 up to speed with DC2 which is now more up-to-date? Will
>> that require a nodetool repair on DC1 nodes? Also, what is the answer
>> when the outage is < max_hint_window_in_ms instead?
>>
>> Thanks in advance!
>>
>> Vasilis
>>
>> --
>> Kind Regards,
>>
>> Vasileios Vlachos
>>
>>
>>
>

Re: Multi-DC Environment Question

Posted by Vasileios Vlachos <va...@gmail.com>.
Thanks for your responses, Ben thanks for the link.

Basically you sort of confirmed that if down_time > max_hint_window_in_ms
the only way to bring DC1 up-to-date is anti-entropy repair. Read
consistency level is irrelevant to the problem I described as I am reading
LOCAL_QUORUM. In this situation I lost whatever data (if any) had not been
transferred across to DC2 before DC1 went down; that is understandable.
Also, read repair does not help either as we assumed that down_time >
max_hint_window_in_ms. Please correct me if I am wrong.

I think I could better understand how that works if I knew the answers to
the following questions:
1. What is the output of nodetool status when a cluster spans 2 DCs? Will
I be able to see ALL nodes irrespective of the DC they belong to?
2. How are tokens assigned when adding a 2nd DC? Is the range -2^63 to
2^63-1 for each DC, or is it -2^63 to 2^63-1 for the entire cluster? (I
think the latter is correct)
3. Does the coordinator store 1 hint irrespective of how many replicas
happen to be down at the time, and also irrespective of DC2 being down in
the scenario I described above? (I think the answer is in the presentation
you sent me, but I would like someone to confirm it)

Thank you in advance,

Vasilis


On Fri, May 30, 2014 at 3:13 AM, Ben Bromhead <be...@instaclustr.com> wrote:

> Short answer:
>
> If time elapsed > max_hint_window_in_ms then hints will stop being
> created. You will need to rely on your read consistency level, read repair
> and anti-entropy repair operations to restore consistency.
>
> Long answer:
>
> http://www.slideshare.net/jasedbrown/understanding-antientropy-in-cassandra
>
> Ben Bromhead
> Instaclustr | www.instaclustr.com | @instaclustr
> <http://twitter.com/instaclustr> | +61 415 936 359
>
> On 30 May 2014, at 8:40 am, Tupshin Harper <tu...@tupshin.com> wrote:
>
> When one node or DC is down, coordinator nodes being written through will
> notice this fact and store hints (hinted handoff is the mechanism),  and
> those hints are used to send the data that was not able to be replicated
> initially.
>
> http://www.datastax.com/dev/blog/modern-hinted-handoff
>
> -Tupshin
> On May 29, 2014 6:22 PM, "Vasileios Vlachos" <va...@gmail.com>
> wrote:
>
>  Hello All,
>
> We have plans to add a second DC to our live Cassandra environment.
> Currently RF=3 and we read and write at QUORUM. After adding DC2 we are
> going to be reading and writing at LOCAL_QUORUM.
>
> If my understanding is correct, when a client sends a write request, if
> the consistency level is satisfied on DC1 (that is RF/2+1), success is
> returned to the client and DC2 will eventually get the data as well. The
> assumption behind this is that the client always connects to DC1 for
> reads and writes and given that there is a site-to-site VPN between DC1 and
> DC2. Therefore, DC1 will almost always return success before DC2 (actually
> I don't know if it is possible for DC2 to be more up-to-date than DC1 with
> this setup...).
>
> Now imagine DC1 loses connectivity and the client fails over to DC2.
> Everything should work fine after that, with the only difference that DC2
> will now be handling the requests directly from the client. After some
> time, say after max_hint_window_in_ms, DC1 comes back up. My question is
> how do I bring DC1 up to speed with DC2 which is now more up-to-date? Will
> that require a nodetool repair on DC1 nodes? Also, what is the answer
> when the outage is < max_hint_window_in_ms instead?
>
> Thanks in advance!
>
> Vasilis
>
> --
> Kind Regards,
>
> Vasileios Vlachos
>
>
>

Re: Multi-DC Environment Question

Posted by Ben Bromhead <be...@instaclustr.com>.
Short answer:

If time elapsed > max_hint_window_in_ms then hints will stop being created. You will need to rely on your read consistency level, read repair and anti-entropy repair operations to restore consistency.

Long answer:

http://www.slideshare.net/jasedbrown/understanding-antientropy-in-cassandra

Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359

On 30 May 2014, at 8:40 am, Tupshin Harper <tu...@tupshin.com> wrote:

> When one node or DC is down, coordinator nodes being written through will notice this fact and store hints (hinted handoff is the mechanism),  and those hints are used to send the data that was not able to be replicated initially.
> 
> http://www.datastax.com/dev/blog/modern-hinted-handoff
> 
> -Tupshin
> 
> On May 29, 2014 6:22 PM, "Vasileios Vlachos" <va...@gmail.com> wrote:
> Hello All,
> 
> We have plans to add a second DC to our live Cassandra environment. Currently RF=3 and we read and write at QUORUM. After adding DC2 we are going to be reading and writing at LOCAL_QUORUM.
> 
> If my understanding is correct, when a client sends a write request, if the consistency level is satisfied on DC1 (that is RF/2+1), success is returned to the client and DC2 will eventually get the data as well. The assumption behind this is that the client always connects to DC1 for reads and writes and given that there is a site-to-site VPN between DC1 and DC2. Therefore, DC1 will almost always return success before DC2 (actually I don't know if it is possible for DC2 to be more up-to-date than DC1 with this setup...).
> 
> Now imagine DC1 loses connectivity and the client fails over to DC2. Everything should work fine after that, with the only difference that DC2 will now be handling the requests directly from the client. After some time, say after max_hint_window_in_ms, DC1 comes back up. My question is how do I bring DC1 up to speed with DC2 which is now more up-to-date? Will that require a nodetool repair on DC1 nodes? Also, what is the answer when the outage is < max_hint_window_in_ms instead?
> 
> Thanks in advance!
> 
> Vasilis
> -- 
> Kind Regards,
> 
> Vasileios Vlachos


Re: Multi-DC Environment Question

Posted by Tupshin Harper <tu...@tupshin.com>.
When one node or DC is down, the coordinator nodes that writes pass through
will notice this and store hints (hinted handoff is the mechanism), and
those hints are later used to send the data that could not be replicated
initially.

http://www.datastax.com/dev/blog/modern-hinted-handoff

-Tupshin
On May 29, 2014 6:22 PM, "Vasileios Vlachos" <va...@gmail.com>
wrote:

 Hello All,

We have plans to add a second DC to our live Cassandra environment.
Currently RF=3 and we read and write at QUORUM. After adding DC2 we are
going to be reading and writing at LOCAL_QUORUM.

If my understanding is correct, when a client sends a write request, if the
consistency level is satisfied on DC1 (that is RF/2+1), success is returned
to the client and DC2 will eventually get the data as well. The assumption
behind this is that the client always connects to DC1 for reads and
writes and given that there is a site-to-site VPN between DC1 and DC2.
Therefore, DC1 will almost always return success before DC2 (actually I
don't know if it is possible for DC2 to be more up-to-date than DC1 with
this setup...).

Now imagine DC1 loses connectivity and the client fails over to DC2.
Everything should work fine after that, with the only difference that DC2
will now be handling the requests directly from the client. After some
time, say after max_hint_window_in_ms, DC1 comes back up. My question is
how do I bring DC1 up to speed with DC2 which is now more up-to-date? Will
that require a nodetool repair on DC1 nodes? Also, what is the answer when
the outage is < max_hint_window_in_ms instead?

Thanks in advance!

Vasilis

-- 
Kind Regards,

Vasileios Vlachos