Posted to user@cassandra.apache.org by Mina Naguib <mi...@bloomdigital.com> on 2011/08/10 01:24:16 UTC
Peculiar imbalance affecting 2 machines in a 6 node cluster
Hi everyone
I'm observing a very peculiar type of imbalance and I'd appreciate any help or ideas to try. This is on Cassandra 0.7.8.
The original cluster was 3 machines in the DCMTL, equally balanced at 33.33% each and each holding roughly 34G.
Then, I added to it 3 machines in the LA data center. The ring is currently as follows (IP addresses redacted for clarity):
Address  Status  State   Load      Owns    Token
                                           151236607520417094872610936636341427313
IPLA1    Up      Normal  34.57 GB  11.11%  0
IPMTL1   Up      Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
IPLA2    Up      Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
IPMTL2   Up      Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
IPLA3    Up      Normal  51.37 GB  11.11%  113427455640312821154458202477256070485
IPMTL3   Up      Normal  34.71 GB  22.22%  151236607520417094872610936636341427313
The bump in the 3 MTL nodes (22.22%) is in anticipation of 3 more machines in yet another data center, but they're not ready yet to join the cluster. Once that third DC joins all nodes will be at 11.11%. However, I don't think this is related.
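For readers following along, the Owns column can be reproduced from the tokens alone. The sketch below assumes RandomPartitioner's token space of 0 to 2**127 and that each node owns the span from the previous token up to its own, wrapping around at the end of the ring. The function name `ownership` is made up for this illustration and is not part of any Cassandra tool.

```python
# Sketch: derive the "Owns" percentages from the tokens in the ring output.
RING_SIZE = 2 ** 127  # assumed RandomPartitioner token space

ring = [
    ("IPLA1", 0),
    ("IPMTL1", 37809151880104273718152734159085356828),
    ("IPLA2", 56713727820156410577229101238628035242),
    ("IPMTL2", 94522879700260684295381835397713392071),
    ("IPLA3", 113427455640312821154458202477256070485),
    ("IPMTL3", 151236607520417094872610936636341427313),
]


def ownership(ring):
    """Return {node: percent of the ring between the previous token and its own}."""
    ordered = sorted(ring, key=lambda entry: entry[1])
    owns = {}
    for i, (node, token) in enumerate(ordered):
        prev_token = ordered[i - 1][1]  # for i == 0 this wraps to the last node
        gap = (token - prev_token) % RING_SIZE
        owns[node] = round(100 * gap / RING_SIZE, 2)
    return owns


owns = ownership(ring)
for node, pct in owns.items():
    print(f"{node:7s} {pct:.2f}%")
```

This gives 11.11% for each LA node and 22.22% for each MTL node, matching the ring output. Note that this figure ignores data centers entirely, so it says nothing about how NetworkTopologyStrategy spreads replicas inside DCLA.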
The problem I'm currently observing is visible in the LA machines, specifically IPLA2 and IPLA3. IPLA2 has 50% of the expected volume, and IPLA3 has 150% of the expected volume.
Putting their load side by side shows the peculiar ratio of 2:1:3 between the 3 LA nodes:
34.57 17.55 51.37
(the same 2:1:3 ratio is reflected in our internal tools trending reads/second and writes/second)
I've tried several iterations of compactions/cleanups to no avail. In terms of config this is the main keyspace:
Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
Options: [DCMTL:2, DCLA:2]
And this is the cassandra-topology.properties file (IPs again redacted for clarity):
IPMTL1:DCMTL:RAC1
IPMTL2:DCMTL:RAC1
IPMTL3:DCMTL:RAC1
IPLA1:DCLA:RAC1
IPLA2:DCLA:RAC1
IPLA3:DCLA::RAC1
IPLON1:DCLON:RAC1
IPLON2:DCLON:RAC1
IPLON3:DCLON:RAC1
# default for unknown nodes
default=DCBAD:RACBAD
One thing that did occur to me while reading the source code for the NetworkTopologyStrategy's calculateNaturalEndpoints is that it prefers placing data on different racks. Since all my machines are defined as in the same rack, I believe that the 2-pass approach would still yield balanced placement.
However, just to test, I modified the topology file live to specify that IPLA1, IPLA2 and IPLA3 are in 3 different racks, and sure enough I immediately saw the reads/second and writes/second equalize to the expected fair volume (I quickly reverted that change).
So, it seems somehow related to rack awareness, but I've been racking my brain and I can't figure out how/why, or why the three MTL machines are not affected the same way.
If the solution is to specify them in different racks and run repair on everything, I'm okay with that - but I hate doing that without first understanding *why* the current behavior is the way it is.
Any ideas would be hugely appreciated.
Thank you.
Re: Peculiar imbalance affecting 2 machines in a 6 node cluster
Posted by aaron morton <aa...@thelastpickle.com>.
Cool.
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
Re: Peculiar imbalance affecting 2 machines in a 6 node cluster
Posted by Mina Naguib <mi...@bloomdigital.com>.
Hi Aaron
Thank you very much for the reply and the pointers to the previous list discussions. The second one was particularly telling.
I'm happy to say that the problem is fixed, and it's so trivial it's quite embarrassing - but I'll state it here for the sake of the archives.
There was an extra colon in the topology file in the line defining IPLA3. It's just as visible in my prod config as it is in the example in my original post ;-)
I'm guessing the parser splits <dc, rack> tuples on ":", so it probably parsed the IPLA3 entry as "DCLA", ":RAC1" (a rack name different from the "RAC1" of the others). NTS then did its thing, distributing evenly between racks, so IPLA3 got more of the data and IPLA2 got less.
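The effect can be illustrated with a small model. This is a simplified sketch of two-pass, rack-aware placement inside a single data center, not the Cassandra source: pass 1 walks the ring and takes only nodes on racks not yet used, and pass 2 tops up from any rack. The function names and the ":RAC1" rack label are assumptions for illustration, and the three LA nodes are taken to hold equal-sized primary ranges.

```python
# Simplified model of rack-aware replica placement within one data center.
# NOT the Cassandra implementation, only an illustration of the two-pass idea.

def replicas_for_range(ring, racks, start, rf):
    """Pick rf replicas walking clockwise from index `start` in `ring`.

    Pass 1 takes only nodes whose rack is not yet represented.
    Pass 2 fills any remaining slots with the next nodes, whatever their rack.
    """
    walk = [ring[(start + i) % len(ring)] for i in range(len(ring))]
    chosen, seen_racks = [], set()
    for node in walk:  # pass 1: one replica per distinct rack
        if len(chosen) == rf:
            break
        if racks[node] not in seen_racks:
            chosen.append(node)
            seen_racks.add(racks[node])
    for node in walk:  # pass 2: top up from racks already used
        if len(chosen) == rf:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen


def ranges_per_node(ring, racks, rf=2):
    """Count how many of the equal-sized primary ranges each node stores."""
    counts = {node: 0 for node in ring}
    for start in range(len(ring)):
        for node in replicas_for_range(ring, racks, start, rf):
            counts[node] += 1
    return counts


la_ring = ["IPLA1", "IPLA2", "IPLA3"]  # token order within DCLA

same_rack = {"IPLA1": "RAC1", "IPLA2": "RAC1", "IPLA3": "RAC1"}
# Hypothetical outcome of the malformed line: IPLA3 ends up under another rack name.
odd_rack = {"IPLA1": "RAC1", "IPLA2": "RAC1", "IPLA3": ":RAC1"}

balanced = ranges_per_node(la_ring, same_rack)
skewed = ranges_per_node(la_ring, odd_rack)
print(balanced)  # {'IPLA1': 2, 'IPLA2': 2, 'IPLA3': 2}
print(skewed)    # {'IPLA1': 2, 'IPLA2': 1, 'IPLA3': 3}
```

Under this model, the lone node on the odd rack is chosen as the "other rack" replica for every range, which yields the same 2:1:3 split seen in the LA loads. Putting all three nodes on distinct racks gives 2:2:2 again, which would be consistent with the quick three-rack test.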
I've fixed it, and the reads/s and writes/s immediately equalized. I'm now doing a round of repairs/compactions/cleanups to equalize the data load as well.
Unfortunately, it's not easy in Cassandra 0.7.8 to actually see the parsed topology state (unlike 0.8's nice ring output, which shows the DC and rack), so I'm ashamed to say it took much longer than it should've to troubleshoot.
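Since the parsed state is hard to inspect, one option is to lint the file before deploying it. The following is a minimal sketch, not a Cassandra tool: `check_topology` is a hypothetical helper that assumes each entry is a key, then the first "=" or ":" as separator, then exactly two non-empty fields joined by a single ":". It would misread IPv6 addresses used as keys.

```python
import re


def check_topology(text):
    """Return a list of (line_number, line, problem) for suspicious entries."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        # Assumption: key and value are separated by the first '=' or ':'.
        parts = re.split(r"[=:]", line, maxsplit=1)
        if len(parts) != 2:
            problems.append((number, line, "no key/value separator"))
            continue
        fields = parts[1].strip().split(":")
        if len(fields) != 2 or not all(field.strip() for field in fields):
            problems.append((number, line, "expected exactly DC:RACK"))
    return problems


sample = """\
IPMTL1:DCMTL:RAC1
IPMTL2:DCMTL:RAC1
IPMTL3:DCMTL:RAC1
IPLA1:DCLA:RAC1
IPLA2:DCLA:RAC1
IPLA3:DCLA::RAC1
# default for unknown nodes
default=DCBAD:RACBAD
"""

problems = check_topology(sample)
for number, line, problem in problems:
    print(f"line {number}: {line!r}: {problem}")
```

Run against the entries from this thread, it flags only the IPLA3 line. A stricter variant could also warn when a data center contains a rack name used by a single node.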
Thanks for your help.
Re: Peculiar imbalance affecting 2 machines in a 6 node cluster
Posted by aaron morton <aa...@thelastpickle.com>.
WRT the load imbalance, checking the basics: have you run cleanup after any token moves? Is repair running? Also, nodes sometimes get a bit bloated from repair and will settle down with compaction.
Your slightly odd tokens in the MTL DC make it a little tricky to understand what's going on, but I'm trying to check whether you've followed the multi-DC token selection described here: http://wiki.apache.org/cassandra/Operations#Token_selection . For background on what can happen in a multi-DC deployment if the tokens are not right, see http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Replica-data-distributing-between-racks-td6324819.html
This is what you currently have:

DC: LA
IPLA1   Up  Normal  34.57 GB  11.11%  0
IPLA2   Up  Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
IPLA3   Up  Normal  51.37 GB  11.11%  113427455640312821154458202477256070485

DC: MTL
IPMTL1  Up  Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
IPMTL2  Up  Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
IPMTL3  Up  Normal  34.71 GB  22.22%  151236607520417094872610936636341427313

Using the bump approach you would have:

IPLA1   0
IPLA2   56713727820156410577229101238628035242
IPLA3   113427455640312821154458202477256070484

IPMTL1  1
IPMTL2  56713727820156410577229101238628035243
IPMTL3  113427455640312821154458202477256070485

Using the interleaving you would have:

IPLA1   0
IPMTL1  28356863910078205288614550619314017621
IPLA2   56713727820156410577229101238628035242
IPMTL2  85070591730234615865843651857942052863
IPLA3   113427455640312821154458202477256070484
IPMTL3  141784319550391026443072753096570088105
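The two token layouts above can be regenerated with a few lines of Python. This is a sketch under the assumption of RandomPartitioner's 0 to 2**127 token space; `even_tokens` is a made-up helper name. It multiplies a floored step, which is what reproduces the figures above exactly (flooring each token separately can differ by one in the last digit).

```python
RING_SIZE = 2 ** 127  # assumed RandomPartitioner token space


def even_tokens(node_count, offset=0):
    """Evenly spaced tokens for node_count nodes, each shifted by offset."""
    step = RING_SIZE // node_count
    return [i * step + offset for i in range(node_count)]


# "Bump" approach: each DC gets its own evenly spaced ring,
# with the second DC shifted by 1 to avoid token collisions.
la_tokens = even_tokens(3)
mtl_tokens = even_tokens(3, offset=1)

# Interleaving: space all six nodes evenly and alternate the DCs.
names = ["IPLA1", "IPMTL1", "IPLA2", "IPMTL2", "IPLA3", "IPMTL3"]
interleaved = dict(zip(names, even_tokens(6)))

for name, token in zip(["IPLA1", "IPLA2", "IPLA3"], la_tokens):
    print(name, token)
for name, token in zip(["IPMTL1", "IPMTL2", "IPMTL3"], mtl_tokens):
    print(name, token)
for name, token in interleaved.items():
    print(name, token)
```

For a third data center under the bump approach, the same helper with offset=2 would be the natural extension, though that is an extrapolation and not something stated in this thread.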
The current setup in LA gives each node in LA 33% of the LA local ring, which should be right; just checking.
If cleanup / repair / compaction is all good and you are confident the tokens are right, try poking around with nodetool getendpoints to see which nodes keys are sent to. Like you, I cannot see anything obvious in NTS that would cause load to be imbalanced if they are all in the same rack.
Cheers
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com