Posted to user@cassandra.apache.org by Brian Fleming <bi...@gmail.com> on 2013/02/10 09:56:45 UTC

Read-repair working, repair not working?

Hi,

I have a 20 node cluster running v1.0.7 split between 5 data centres, each with an RF of 2, containing a ~1TB unique dataset/~10TB of total data.

I’ve had some intermittent issues with a new data centre (3 nodes, RF=2) I brought online late last year with data consistency & availability: I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned: i.e. read-repair appeared to be occurring.  However running repairs on the nodes didn’t resolve this (I tried general ‘repair’ commands as well as targeted keyspace commands) – this didn’t alter the behaviour.
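i.e. commands along the lines of the following, with host & keyspace names as placeholders:

    nodetool -h <hostname> repair
    nodetool -h <hostname> repair MyKeyspace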

After a lot of fruitless investigation, I decided to wipe & re-install/re-populate the nodes.  The re-install & repair operations are now complete: I see the expected amount of data on the nodes, however I am still seeing the same behaviour, i.e. I only get data after one failed attempt.

When I run repair commands, I don’t see any errors in the logs.

I see the expected ‘AntiEntropySessions’ count in ‘nodetool tpstats’ during repair sessions.

I see a number of dropped ‘MUTATION’ operations : just under 5% of the total ‘MutationStage’ count.

Questions :

- Could anybody suggest anything specific to look at to see why the repair operations aren’t having the desired effect?

- Would increasing logging level to ‘DEBUG’ show read-repair activity (to confirm that this is happening, when & for what proportion of total requests)?

- Is there something obvious that I could be missing here?

Many thanks,

Brian

Re: Read-repair working, repair not working?

Posted by aaron morton <aa...@thelastpickle.com>.
> CL.ONE : this is primarily for performance reasons …
This makes reasoning about "correct" behaviour a little harder. 

If there is any way you can run some tests with R + W > N (strong consistency), I would encourage you to do so. You will then have a baseline of what works.
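A minimal sketch of such a baseline test, assuming a Hector based Java client and an existing column family (the cluster / keyspace / CF / host names below are made up, substitute your own):

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.ColumnQuery;

public class QuorumBaselineTest {
    public static void main(String[] args) {
        // Connect via a node in the problem DC (host name is a placeholder).
        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "hk-node1:9160");

        // Force QUORUM for both reads and writes so that R + W > N.
        ConfigurableConsistencyLevel quorum = new ConfigurableConsistencyLevel();
        quorum.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
        quorum.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);
        Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster, quorum);
        StringSerializer ss = StringSerializer.get();

        // Write a marker column at QUORUM ...
        Mutator<String> mutator = HFactory.createMutator(ks, ss);
        mutator.insert("test-key", "MyColumnFamily",
                HFactory.createStringColumn("marker", "written-at-quorum"));

        // ... and read it straight back at QUORUM. With R + W > N this should
        // always return the value on the first attempt, no read repair required.
        ColumnQuery<String, String, String> query = HFactory.createColumnQuery(ks, ss, ss, ss);
        query.setColumnFamily("MyColumnFamily").setKey("test-key").setName("marker");
        HColumn<String, String> col = query.execute().get();
        System.out.println(col == null ? "MISS" : "HIT: " + col.getValue());

        HFactory.shutdownCluster(cluster);
    }
}

With 5 DCs at RF 2 there are 10 replicas in total, so QUORUM is 6 of 10 and R + W = 12 > 10. If this always returns the value first time while CL.ONE reads still need two attempts, you know the data is written and the problem is confined to ONE-level reads + read repair.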

>  (say I make 100 requests : all 100 initially fail and subsequently all 100 succeed), so not sure it'll help?
The high number of inconsistencies seems to match the massive number of dropped MUTATION messages. Even if anti-entropy repair is running, if the node in HK is dropping that many messages there will be inconsistencies. 
  

It looks like the HK node is overloaded. I would check the logs for GC messages, check for VM steal in a virtualised env, check for sufficient CPU + memory resources, check for IO stress. 
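A few quick checks along those lines, assuming default log locations (adjust paths to your install):

# GC pressure: long ParNew / CMS pauses reported by GCInspector
grep -i GCInspector /var/log/cassandra/system.log | tail -20

# CPU steal in a virtualised environment (watch the 'st' column)
vmstat 5

# IO stress: high await / %util on the data and commitlog volumes
iostat -x 5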

>>> 20 node cluster running v1.0.7 split between 5 data centres,
>>> I’ve had some intermittent issues with a new data centre (3 nodes, RF=2) I
Do all DCs have the same number of nodes?

Cheers
-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 11/02/2013, at 9:13 PM, Brian Fleming <bi...@gmail.com> wrote:

> 
> Hi Aaron,
>  
> Many thanks for your reply - answers below.
>  
> Cheers,
>  
> Brian
>  
>  
> > What CL are you using for reads and writes?
> > I would first build a test case to ensure correct operation when using strong consistency. i.e. QUORUM write and read. Because you are using RF 2 per DC I assume you are not using LOCAL_QUORUM because that is 2 and you would not have any redundancy in the DC.
> CL.ONE : this is primarily for performance reasons but also because there are only three local nodes as you suggest and we need at least some resiliency.  
> In the context of this issue, I considered increasing this to CL.LOCAL_QUORUM but the behaviour suggests that none of the 3 local nodes have the data (say I make 100 requests : all 100 initially fail and subsequently all 100 succeed), so not sure it'll help?
>  
>  
> > Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes.
> This DC is remote in terms of network topology - it's in Asia (Hong Kong) while the rest of the cluster is in Europe/North America, so network latency rather than congestion could be a cause?  However I see some pretty aggressive data transfer speeds during the initial repairs & the data footprint approximately matches the nodes elsewhere in the ring, so something doesn't add up?
>  
> Here are the tpstats for one of these nodes :
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> ReadStage                         0         0       4919185         0                 0
> RequestResponseStage              0         0      16869994         0                 0
> MutationStage                     0         0      16764910         0                 0
> ReadRepairStage                   0         0           3703         0                 0
> ReplicateOnWriteStage             0         0              0         0                 0
> GossipStage                       0         0         845225         0                 0
> AntiEntropyStage                  0         0          52441         0                 0
> MigrationStage                    0         0           4362         0                 0
> MemtablePostFlusher               0         0            952         0                 0
> StreamStage                       0         0             24         0                 0
> FlushWriter                       0         0            960         0                 5
> MiscStage                         0         0           3592         0                 0
> AntiEntropySessions               4         4            121         0                 0
> InternalResponseStage             0         0              0         0                 0
> HintedHandoff                     1         2             55         0                 0
>  
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR             150597
> BINARY                       0
> READ                    781490
> MUTATION                853846
> REQUEST_RESPONSE             0
> The numbers of dropped READ_REPAIR, READ & MUTATION operations are non-negligible.  The nodes in Europe/North America have effectively zero dropped messages.  This suggests network latency is probably a significant factor?
> [the network ping from Europe to a HK node is ~250ms, so I wouldn’t have expected it to be such a problem?]
>  
> 
> > It would, but the INFO logging for the AES is pretty good. I would hold off for now.
> Ok.
>  
> > [AES session logging]
> Yes, I see the expected start/end logs, so that's another thing off the list.
>  
>  
> 
> 
> On 10 Feb 2013, at 20:12, aaron morton <aa...@thelastpickle.com> wrote:
> 
>>> I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned:
>>> 
>> What CL are you using for reads and writes?
>> 
>>> I see a number of dropped ‘MUTATION’ operations : just under 5% of the total ‘MutationStage’ count.
>>> 
>> Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes. 
>> 
>> 
>>> -          Could anybody suggest anything specific to look at to see why the repair operations aren’t having the desired effect? 
>>> 
>> I would first build a test case to ensure correct operation when using strong consistency. i.e. QUORUM write and read. Because you are using RF 2 per DC I assume you are not using LOCAL_QUORUM because that is 2 and you would not have any redundancy in the DC. 
>> 
>> 
>>> 
>>> -          Would increasing logging level to ‘DEBUG’ show read-repair activity (to confirm that this is happening, when & for what proportion of total requests)?
>> It would, but the INFO logging for the AES is pretty good. I would hold off for now. 
>> 
>>> 
>>> -          Is there something obvious that I could be missing here?
>> When a new AES session starts it logs this
>> 
>>            logger.info(String.format("[repair #%s] new session: will sync %s on range %s for %s.%s", getName(), repairedNodes(), range, tablename, Arrays.toString(cfnames)));
>> 
>> When it completes it logs this
>> 
>> logger.info(String.format("[repair #%s] session completed successfully", getName()));
>> 
>> Or this on failure 
>> 
>> logger.error(String.format("[repair #%s] session completed with the following error", getName()), exception);
>> 
>> 
>> Cheers
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>> 
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 10/02/2013, at 9:56 PM, Brian Fleming <bi...@gmail.com> wrote:
>> 
>>> 
>>> 
>>> 
>>> Hi,
>>> 
>>> 
>>> 
>>> I have a 20 node cluster running v1.0.7 split between 5 data centres, each with an RF of 2, containing a ~1TB unique dataset/~10TB of total data. 
>>> 
>>> 
>>> 
>>> I’ve had some intermittent issues with a new data centre (3 nodes, RF=2) I brought online late last year with data consistency & availability: I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned: i.e. read-repair appeared to be occurring.  However running repairs on the nodes didn’t resolve this (I tried general ‘repair’ commands as well as targeted keyspace commands) – this didn’t alter the behaviour.
>>> 
>>> 
>>> 
>>> After a lot of fruitless investigation, I decided to wipe & re-install/re-populate the nodes.  The re-install & repair operations are now complete: I see the expected amount of data on the nodes, however I am still seeing the same behaviour, i.e. I only get data after one failed attempt.
>>> 
>>> 
>>> 
>>> When I run repair commands, I don’t see any errors in the logs. 
>>> 
>>> I see the expected ‘AntiEntropySessions’ count in ‘nodetool tpstats’ during repair sessions.
>>> 
>>> I see a number of dropped ‘MUTATION’ operations : just under 5% of the total ‘MutationStage’ count.
>>> 
>>> 
>>> 
>>> Questions :
>>> 
>>> -          Could anybody suggest anything specific to look at to see why the repair operations aren’t having the desired effect? 
>>> 
>>> -          Would increasing logging level to ‘DEBUG’ show read-repair activity (to confirm that this is happening, when & for what proportion of total requests)?
>>> 
>>> -          Is there something obvious that I could be missing here?
>>> 
>>> 
>>> 
>>> Many thanks,
>>> 
>>> Brian
>>> 
>>> 
>>> 
>> 


Re: Read-repair working, repair not working?

Posted by Brian Fleming <bi...@gmail.com>.
Hi Aaron,
 
Many thanks for your reply - answers below.
 
Cheers,
 
Brian
 
 
> What CL are you using for reads and writes?
> I would first build a test case to ensure correct operation when using strong consistency. i.e. QUORUM write and read. Because you are using RF 2 per DC I assume you are not using LOCAL_QUORUM because that is 2 and you would not have any redundancy in the DC.
CL.ONE : this is primarily for performance reasons but also because there are only three local nodes as you suggest and we need at least some resiliency.  
In the context of this issue, I considered increasing this to CL.LOCAL_QUORUM but the behaviour suggests that none of the 3 local nodes have the data (say I make 100 requests : all 100 initially fail and subsequently all 100 succeed), so not sure it'll help?
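A rough sketch of a double-read probe at CL.ONE that would quantify this (Hector client assumed; keyspace / CF / column / key names and the host are placeholders) - it counts keys that miss on the first read but hit on an immediate retry, i.e. the proportion apparently being fixed by read repair:

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.ColumnQuery;

import java.util.Arrays;
import java.util.List;

public class ReadRepairProbe {
    public static void main(String[] args) {
        // Read at CL.ONE via a node in the problem DC (host name is a placeholder).
        ConfigurableConsistencyLevel one = new ConfigurableConsistencyLevel();
        one.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);
        Keyspace ks = HFactory.createKeyspace("MyKeyspace",
                HFactory.getOrCreateCluster("TestCluster", "hk-node1:9160"), one);
        StringSerializer ss = StringSerializer.get();

        // Keys that are known to exist in the cluster (placeholders).
        List<String> keys = Arrays.asList("key1", "key2", "key3");
        int missThenHit = 0;
        for (String key : keys) {
            boolean first = read(ks, ss, key);
            boolean second = read(ks, ss, key);   // immediate retry of the same key
            if (!first && second) missThenHit++;
        }
        System.out.printf("%d of %d keys missed on the first read and hit on retry%n",
                missThenHit, keys.size());
    }

    private static boolean read(Keyspace ks, StringSerializer ss, String key) {
        ColumnQuery<String, String, String> q = HFactory.createColumnQuery(ks, ss, ss, ss);
        q.setColumnFamily("MyColumnFamily").setKey(key).setName("marker");
        return q.execute().get() != null;
    }
}

Running the same loop with HConsistencyLevel.LOCAL_QUORUM instead of ONE would also show whether any of the 3 local replicas actually hold the data.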
 
 
> Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes.
This DC is remote in terms of network topology - it's in Asia (Hong Kong) while the rest of the cluster is in Europe/North America, so network latency rather than congestion could be a cause?  However I see some pretty aggressive data transfer speeds during the initial repairs & the data footprint approximately matches the nodes elsewhere in the ring, so something doesn't add up?
 
Here are the tpstats for one of these nodes :
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0       4919185         0                 0
RequestResponseStage              0         0      16869994         0                 0
MutationStage                     0         0      16764910         0                 0
ReadRepairStage                   0         0           3703         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         845225         0                 0
AntiEntropyStage                  0         0          52441         0                 0
MigrationStage                    0         0           4362         0                 0
MemtablePostFlusher               0         0            952         0                 0
StreamStage                       0         0             24         0                 0
FlushWriter                       0         0            960         0                 5
MiscStage                         0         0           3592         0                 0
AntiEntropySessions               4         4            121         0                 0
InternalResponseStage             0         0              0         0                 0
HintedHandoff                     1         2             55         0                 0
 
Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR             150597
BINARY                       0
READ                    781490
MUTATION                853846
REQUEST_RESPONSE             0
The numbers of dropped READ_REPAIR, READ & MUTATION operations are non-negligible.  The nodes in Europe/North America have effectively zero dropped messages.  This suggests network latency is probably a significant factor? 
[the network ping from Europe to a HK node is ~250ms, so I wouldn’t have expected it to be such a problem?]
 

> It would, but the INFO logging for the AES is pretty good. I would hold off for now.
Ok.
 
> [AES session logging]
Yes, I see the expected start/end logs, so that's another thing off the list.
 
 


On 10 Feb 2013, at 20:12, aaron morton <aa...@thelastpickle.com> wrote:

>> I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned:
> What CL are you using for reads and writes?
> 
>> I see a number of dropped ‘MUTATION’ operations : just under 5% of the total ‘MutationStage’ count.
> Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes. 
> 
> 
>> -          Could anybody suggest anything specific to look at to see why the repair operations aren’t having the desired effect?
> I would first build a test case to ensure correct operation when using strong consistency. i.e. QUORUM write and read. Because you are using RF 2 per DC I assume you are not using LOCAL_QUORUM because that is 2 and you would not have any redundancy in the DC. 
> 
> 
>> 
>> -          Would increasing logging level to ‘DEBUG’ show read-repair activity (to confirm that this is happening, when & for what proportion of total requests)?
> It would, but the INFO logging for the AES is pretty good. I would hold off for now. 
> 
>> 
>> -          Is there something obvious that I could be missing here?
> When a new AES session starts it logs this
> 
>            logger.info(String.format("[repair #%s] new session: will sync %s on range %s for %s.%s", getName(), repairedNodes(), range, tablename, Arrays.toString(cfnames)));
> 
> When it completes it logs this
> 
> logger.info(String.format("[repair #%s] session completed successfully", getName()));
> 
> Or this on failure 
> 
> logger.error(String.format("[repair #%s] session completed with the following error", getName()), exception);
> 
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
> 
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 10/02/2013, at 9:56 PM, Brian Fleming <bi...@gmail.com> wrote:
> 
>> 
>> 
>> 
>> Hi,
>> 
>> 
>> 
>> I have a 20 node cluster running v1.0.7 split between 5 data centres, each with an RF of 2, containing a ~1TB unique dataset/~10TB of total data. 
>> 
>> 
>> 
>> I’ve had some intermittent issues with a new data centre (3 nodes, RF=2) I brought online late last year with data consistency & availability: I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned: i.e. read-repair appeared to be occurring.  However running repairs on the nodes didn’t resolve this (I tried general ‘repair’ commands as well as targeted keyspace commands) – this didn’t alter the behaviour.
>> 
>> 
>> 
>> After a lot of fruitless investigation, I decided to wipe & re-install/re-populate the nodes.  The re-install & repair operations are now complete: I see the expected amount of data on the nodes, however I am still seeing the same behaviour, i.e. I only get data after one failed attempt.
>> 
>> 
>> 
>> When I run repair commands, I don’t see any errors in the logs. 
>> 
>> I see the expected ‘AntiEntropySessions’ count in ‘nodetool tpstats’ during repair sessions.
>> 
>> I see a number of dropped ‘MUTATION’ operations : just under 5% of the total ‘MutationStage’ count.
>> 
>> 
>> 
>> Questions :
>> 
>> -          Could anybody suggest anything specific to look at to see why the repair operations aren’t having the desired effect? 
>> 
>> -          Would increasing logging level to ‘DEBUG’ show read-repair activity (to confirm that this is happening, when & for what proportion of total requests)?
>> 
>> -          Is there something obvious that I could be missing here?
>> 
>> 
>> 
>> Many thanks,
>> 
>> Brian
> 

Re: Read-repair working, repair not working?

Posted by aaron morton <aa...@thelastpickle.com>.
> I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned:
> 
What CL are you using for reads and writes?

> I see a number of dropped ‘MUTATION’ operations : just under 5% of the total ‘MutationStage’ count.
> 
Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes. 


> -          Could anybody suggest anything specific to look at to see why the repair operations aren’t having the desired effect? 
> 
I would first build a test case to ensure correct operation when using strong consistency. i.e. QUORUM write and read. Because you are using RF 2 per DC I assume you are not using LOCAL_QUORUM because that is 2 and you would not have any redundancy in the DC. 

 
> 
> -          Would increasing logging level to ‘DEBUG’ show read-repair activity (to confirm that this is happening, when & for what proportion of total requests)?
It would, but the INFO logging for the AES is pretty good. I would hold off for now. 
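If DEBUG output is wanted later, it can be scoped to the read / repair code rather than the whole server, e.g. in conf/log4j-server.properties (the package choices below are an assumption - widen or narrow as needed):

log4j.logger.org.apache.cassandra.service=DEBUG
log4j.logger.org.apache.cassandra.db=DEBUG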

> 
> -          Is there something obvious that I could be missing here?
When a new AES session starts it logs this

            logger.info(String.format("[repair #%s] new session: will sync %s on range %s for %s.%s", getName(), repairedNodes(), range, tablename, Arrays.toString(cfnames)));

When it completes it logs this

logger.info(String.format("[repair #%s] session completed successfully", getName()));

Or this on failure 

logger.error(String.format("[repair #%s] session completed with the following error", getName()), exception);
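Rendered in system.log the message bodies look roughly like this (the session id, endpoints, range and keyspace/CF names below are invented for illustration, and the log4j prefix is omitted):

[repair #8b14fd50-...] new session: will sync /10.20.0.1, /10.20.0.2 on range (12345,67890] for MyKeyspace.[cf1, cf2]
[repair #8b14fd50-...] session completed successfully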


Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/02/2013, at 9:56 PM, Brian Fleming <bi...@gmail.com> wrote:

> 
>  
> 
> Hi,
> 
>  
> 
> I have a 20 node cluster running v1.0.7 split between 5 data centres, each with an RF of 2, containing a ~1TB unique dataset/~10TB of total data. 
> 
>  
> 
> I’ve had some intermittent issues with a new data centre (3 nodes, RF=2) I brought online late last year with data consistency & availability: I’d request data, nothing would be returned, I would then re-request the data and it would correctly be returned: i.e. read-repair appeared to be occurring.  However running repairs on the nodes didn’t resolve this (I tried general ‘repair’ commands as well as targeted keyspace commands) – this didn’t alter the behaviour.
> 
>  
> 
> After a lot of fruitless investigation, I decided to wipe & re-install/re-populate the nodes.  The re-install & repair operations are now complete: I see the expected amount of data on the nodes, however I am still seeing the same behaviour, i.e. I only get data after one failed attempt.
> 
>  
> 
> When I run repair commands, I don’t see any errors in the logs. 
> 
> I see the expected ‘AntiEntropySessions’ count in ‘nodetool tpstats’ during repair sessions.
> 
> I see a number of dropped ‘MUTATION’ operations : just under 5% of the total ‘MutationStage’ count.
> 
>  
> 
> Questions :
> 
> -          Could anybody suggest anything specific to look at to see why the repair operations aren’t having the desired effect? 
> 
> -          Would increasing logging level to ‘DEBUG’ show read-repair activity (to confirm that this is happening, when & for what proportion of total requests)?
> 
> -          Is there something obvious that I could be missing here?
> 
>  
> 
> Many thanks,
> 
> Brian
> 
>  
>