Posted to user@cassandra.apache.org by Surbhi Gupta <su...@gmail.com> on 2020/01/27 22:16:41 UTC

How to read content of hints file and apply them manually?

Hi,

We are on open-source Cassandra 3.11.
We have an issue in one of our clusters where lots of hints pile up and
are not applied within the hinted handoff window (3 hours in our case).
Load and CPU on the server also go very high.
We see a lot of messages like the ones below in system.log and debug.log. Our
read_repair_chance and dclocal_read_repair_chance are both 0.1. Any pointers
are welcome.

ERROR [ReadRepairStage:83] 2020-01-27 13:08:43,695 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:83,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.

DEBUG [ReadRepairStage:111] 2020-01-27 13:10:06,663 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(4759131696153881383, 9a21276d0af64de28eeeed5d3023b69e) (142a55e1e28de7daa2ddc34a361474a0 vs fcba30f022ef25f456914c341022963d)

Re: How to read content of hints file and apply them manually?

Posted by Erick Ramirez <fl...@gmail.com>.
There isn't a tool that I'm aware of that's readily available to do that.
Your best bet is to run a regular repair.

But really, hints are just a side-issue of a much wider problem and that is
the nodes are overloaded. Is your application getting hit with much
higher than expected traffic? The screenshots you posted show that even
read-repairs aren't getting responses from replicas. You should really
address the overload issue. Cheers!
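
Even without a tool to decode hint files, the size of the backlog per target node can be gauged from the hints directory itself. The Python sketch below is only an illustration: it assumes hint files are named `<host-id>-<timestamp>-<version>.hints` (worth verifying on your own nodes), and the function name `hint_backlog_by_host` is made up for this example.

```python
import os
import tempfile
from collections import defaultdict

UUID_LEN = 36  # length of a textual UUID, e.g. 0b0cbb2a-3f1e-4c7a-9d55-2f6f1f0e8a11


def hint_backlog_by_host(hints_dir):
    """Return {host_id: total bytes of *.hints files} for one hints directory.

    Assumption: each file name starts with the target node's host ID (a UUID).
    """
    totals = defaultdict(int)
    for name in os.listdir(hints_dir):
        if not name.endswith(".hints"):
            continue  # ignore checksum companions and unrelated files
        path = os.path.join(hints_dir, name)
        totals[name[:UUID_LEN]] += os.path.getsize(path)
    return dict(totals)


if __name__ == "__main__":
    # A temporary directory with a fake file stands in for the real
    # hints_directory (for example /var/lib/cassandra/hints).
    with tempfile.TemporaryDirectory() as demo_dir:
        fake = "0b0cbb2a-3f1e-4c7a-9d55-2f6f1f0e8a11-1580160000000-1.hints"
        with open(os.path.join(demo_dir, fake), "wb") as fh:
            fh.write(b"\0" * 2048)
        for host, size in hint_backlog_by_host(demo_dir).items():
            print(f"{host}: {size / 1024:.1f} KiB pending")
```

The host IDs can then be matched against the Host ID column of `nodetool status` to see which node the hints are waiting for.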


Re: How to read content of hints file and apply them manually?

Posted by Erick Ramirez <fl...@gmail.com>.
I would take a thread dump and work out from it which threads are the
highest CPU consumers. But in my experience, 90% of the time it's GC from high
app traffic, unless you've hit an edge-case bug. That means the cluster
doesn't have enough capacity and you need to review the cluster size.
Cheers!
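
One common way to do this is to take per-thread CPU figures from `top -H -p <cassandra pid>` and match the decimal thread IDs against the hexadecimal `nid=` values in a `jstack` dump. The Python sketch below shows only the matching step; the sample dump, the thread IDs and the function names are invented for illustration.

```python
import re

# Matches the thread name and native thread id in a HotSpot jstack line, e.g.
# "MutationStage-3" #87 daemon prio=5 os_prio=0 tid=0x... nid=0x4d2 runnable
THREAD_LINE = re.compile(r'^"(?P<name>[^"]+)".*\bnid=(?P<nid>0x[0-9a-fA-F]+)')


def threads_by_nid(jstack_text):
    """Map native thread id (as a decimal int) to thread name."""
    mapping = {}
    for line in jstack_text.splitlines():
        match = THREAD_LINE.match(line)
        if match:
            mapping[int(match.group("nid"), 16)] = match.group("name")
    return mapping


def hottest_threads(top_rows, jstack_text, limit=5):
    """top_rows: iterable of (os_thread_id, cpu_percent) taken from `top -H`."""
    names = threads_by_nid(jstack_text)
    ranked = sorted(top_rows, key=lambda row: row[1], reverse=True)[:limit]
    return [(names.get(tid, "<not in dump>"), cpu) for tid, cpu in ranked]


# Made-up sample data, not taken from the cluster discussed in this thread.
SAMPLE_DUMP = '''
"MutationStage-3" #87 daemon prio=5 os_prio=0 tid=0x00007f10a4 nid=0x4d2 runnable
"Gang worker#0 (Parallel GC Threads)" os_prio=0 tid=0x00007f10a5 nid=0x4d3 runnable
'''
SAMPLE_TOP = [(1234, 12.0), (1235, 95.5)]  # (thread id, %CPU)

for name, cpu in hottest_threads(SAMPLE_TOP, SAMPLE_DUMP):
    print(f"{cpu:5.1f}%  {name}")
```

If the top entries turn out to be GC threads rather than stage threads, that would point towards the GC explanation given above.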

Re: How to read content of hints file and apply them manually?

Posted by Surbhi Gupta <su...@gmail.com>.
The problem we face is that every time a node goes down or is under
high load or CPU, we see lots of hints pile up that don't get applied to the
other nodes. The last time this happened we noticed high pending
mutations, but when I went back and checked the history of events, we did
not see high pending mutations every time. So basically, high load and CPU
caused the high pending mutations; I feel it was not the other way around.

Using the top command, it was very clear that Cassandra is the cause of the high
CPU.

Other than top, iostat and iotop, what tools do you use to dig into high-load
and high-CPU issues?

On Tue, Jan 28, 2020 at 1:12 PM Patrick McFadin <pm...@gmail.com> wrote:

> I would definitely check the IO stats then. If you see latency going over
> 20ms, you need to solve that problem.
>
> Patrick

Re: How to read content of hints file and apply them manually?

Posted by Patrick McFadin <pm...@gmail.com>.
I would definitely check the IO stats then. If you see latency going over
20ms, you need to solve that problem.

Patrick
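
As a rough illustration of the 20ms check, the Python sketch below scans `iostat -x` style output and reports devices whose wait time exceeds a threshold. Column names differ between sysstat versions, so it looks for whichever of `await`, `r_await` or `w_await` is present. The sample text is made up and trimmed to a few columns, and `slow_devices` is a hypothetical helper name.

```python
def slow_devices(iostat_text, threshold_ms=20.0):
    """Return {device: worst await in ms} for devices above threshold_ms."""
    header = None
    slow = {}
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            header = fields  # remember column names for the rows that follow
            continue
        if header is None or len(fields) != len(header):
            continue  # not a device row (CPU summary, banner, etc.)
        row = dict(zip(header[1:], fields[1:]))
        waits = [float(row[col]) for col in ("await", "r_await", "w_await") if col in row]
        if waits and max(waits) > threshold_ms:
            slow[fields[0]] = max(waits)
    return slow


# Illustrative sample only; real `iostat -x` output has more columns.
SAMPLE_IOSTAT = """
Device            r/s     w/s  r_await  w_await  %util
sda             12.00   35.00     1.20     3.40  11.00
sdb            410.00  980.00    22.50    48.70  99.20
"""

for device, wait in slow_devices(SAMPLE_IOSTAT).items():
    print(f"{device}: await {wait} ms exceeds 20 ms")
```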

On Tue, Jan 28, 2020 at 12:01 PM Surbhi Gupta <su...@gmail.com>
wrote:

> We have also noticed a lot of pending MutationStage tasks.

Re: How to read content of hints file and apply them manually?

Posted by Surbhi Gupta <su...@gmail.com>.
We have also noticed a lot of pending MutationStage tasks.


On Tue, 28 Jan 2020 at 11:06, Richard Andersen <ri...@andersenfamily.us>
wrote:

> I am in agreement with Patrick; this is a typical symptom of saturated IO.
> Are there a high number of drops and/or pending compactions?

Re: How to read content of hints file and apply them manually?

Posted by Richard Andersen <ri...@andersenfamily.us>.
I am in agreement with Patrick; this is a typical symptom of saturated IO. Are there a high number of drops and/or pending compactions?
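
Pending tasks and dropped messages are both reported by `nodetool tpstats`, and pending compactions by `nodetool compactionstats`. The Python sketch below is a minimal parser for the tpstats part, assuming the usual two-section layout (a "Pool Name" table followed by a "Message type" table). The sample output and numbers are invented, and the exact columns may differ between Cassandra versions.

```python
def parse_tpstats(text):
    """Return (pending_by_pool, dropped_by_message_type) from tpstats output."""
    pending, dropped = {}, {}
    section = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if line.startswith("Pool Name"):
            section = "pools"
        elif line.startswith("Message type"):
            section = "messages"
        elif section == "pools" and len(fields) >= 3:
            pending[fields[0]] = int(fields[2])  # columns: name, active, pending
        elif section == "messages" and len(fields) >= 2:
            dropped[fields[0]] = int(fields[1])  # columns: type, dropped
    return pending, dropped


# Invented sample, shaped like `nodetool tpstats` output.
SAMPLE_TPSTATS = """
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0         123456         0                 0
MutationStage                    32      1540       98765432         0                 0
HintsDispatcher                   2        14           4321         0                 0

Message type           Dropped
READ                       120
MUTATION                  5312
HINT                         0
"""

pending, dropped = parse_tpstats(SAMPLE_TPSTATS)
print("pending:", {pool: n for pool, n in pending.items() if n > 0})
print("dropped:", {kind: n for kind, n in dropped.items() if n > 0})
```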

________________________________
From: Patrick McFadin <pm...@gmail.com>
Sent: Tuesday, January 28, 2020 11:25:49 AM
To: user@cassandra.apache.org <us...@cassandra.apache.org>
Subject: Re: How to read content of hints file and apply them manually?

Just to add in here. Any time I see any hints on a cluster, that's like seeing smoke. If you can't explain it, you have a fire somewhere and it's not going to get any better.

By the few messages I've seen, I would start by looking at your IO subsystem on your nodes. Do you have enough throughput to write and read at the same time? These are exactly the symptoms I see when running Cassandra on a SAN or NAS.

Patrick


Re: How to read content of hints file and apply them manually?

Posted by Patrick McFadin <pm...@gmail.com>.
Just to add in here. Any time I see any hints on a cluster, that's like
seeing smoke. If you can't explain it, you have a fire somewhere and it's
not going to get any better.

By the few messages I've seen, I would start by looking at your IO
subsystem on your nodes. Do you have enough throughput to write and read at
the same time? These are exactly the symptoms I see when running Cassandra
on a SAN or NAS.

Patrick

On Mon, Jan 27, 2020 at 8:17 PM Surbhi Gupta <su...@gmail.com>
wrote:

> We tried setting sethintedhandoffthrottlekb to 100, 1024 and 10240, but
> nothing helped.
> Our hints-related parameters are below. If a parameter is not listed,
> it is not set in our environment and should be at its
> default value.
>
> max_hint_window_in_ms: 10800000 # 3 hours
>
> hinted_handoff_enabled: true
>
> hinted_handoff_throttle_in_kb: 100
>
> max_hints_delivery_threads: 8
>
> hints_directory: /var/lib/cassandra/hints
>
> hints_flush_period_in_ms: 10000
>
> max_hints_file_size_in_mb: 128

Re: How to read content of hints file and apply them manually?

Posted by Surbhi Gupta <su...@gmail.com>.
We tried setting sethintedhandoffthrottlekb to 100, 1024 and 10240, but
nothing helped.
Our hints-related parameters are below. If a parameter is not listed,
it is not set in our environment and should be at its
default value.

max_hint_window_in_ms: 10800000 # 3 hours

hinted_handoff_enabled: true

hinted_handoff_throttle_in_kb: 100

max_hints_delivery_threads: 8

hints_directory: /var/lib/cassandra/hints

hints_flush_period_in_ms: 10000

max_hints_file_size_in_mb: 128
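
As a rough sanity check on these numbers, the arithmetic below estimates how long a hint backlog would take to replay under a given throttle. It relies on the cassandra.yaml comment that hinted_handoff_throttle_in_kb is a per-delivery-thread rate that is reduced proportionally to the number of nodes in the cluster. The cluster size (6 nodes) and backlog (1024 MB for one target node) are assumptions chosen only for illustration, not figures from this cluster.

```python
def per_thread_throttle_kb(throttle_kb, cluster_nodes):
    """Effective KB/s per delivery thread, assuming the configured throttle
    is divided by the number of other nodes in the cluster."""
    return throttle_kb / max(1, cluster_nodes - 1)


def replay_hours(backlog_mb, throttle_kb, cluster_nodes):
    """Hours needed to stream backlog_mb of hints to one target node."""
    rate_kb_per_sec = per_thread_throttle_kb(throttle_kb, cluster_nodes)
    return backlog_mb * 1024 / rate_kb_per_sec / 3600


ASSUMED_NODES = 6         # hypothetical cluster size
ASSUMED_BACKLOG_MB = 1024  # hypothetical backlog for a single target node

for throttle in (100, 1024, 10240):  # the values tried in this thread
    hours = replay_hours(ASSUMED_BACKLOG_MB, throttle, ASSUMED_NODES)
    print(f"throttle {throttle:>5} KB/s -> about {hours:.2f} h to replay")
```

Under these assumptions the throttle of 100 would need far longer than the 3-hour window, while the larger values would not. If raising the throttle still does not help, the limit is probably the receiving node's write path rather than the throttle.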

On Mon, 27 Jan 2020 at 18:34, Jeff Jirsa <jj...@gmail.com> wrote:

>
> The high cpu is probably the hints getting replayed slamming the write path
>
> Slowing it down with the hint throttle may help
>
> It’s not instant.

Re: How to read content of hints file and apply them manually?

Posted by Jeff Jirsa <jj...@gmail.com>.
The high CPU is probably the hints getting replayed, slamming the write path.

Slowing it down with the hint throttle may help.

It’s not instant.

> On Jan 27, 2020, at 6:05 PM, Erick Ramirez <fl...@gmail.com> wrote:
> 
> 
>> Increase the max_hint_window_in_ms setting in cassandra.yaml to more than 3 hours, perhaps 6 hours. If the issue still persists networking may need to be tested for bandwidth issues.
> 
> Just a note of warning about bumping up the hint window without understanding the pros and cons. Be aware that doubling it means:
> - you'll end up doubling the size of stored hints in the hints_directory
> - there'll be twice as many hints to replay when node(s) come back online
>
> There's always 2 sides to fiddling with the knobs in C*. Cheers!

Re: How to read content of hints file and apply them manually?

Posted by Erick Ramirez <fl...@gmail.com>.
>
> Increase the max_hint_window_in_ms setting in cassandra.yaml to more than
> 3 hours, perhaps 6 hours. If the issue still persists networking may need
> to be tested for bandwidth issues.
>

Just a note of warning about bumping up the hint window without
understanding the pros and cons. Be aware that doubling it means:

   - you'll end up doubling the size of stored hints in the hints_directory
   - there'll be twice as many hints to replay when node(s) come back online

There's always 2 sides to fiddling with the knobs in C*. Cheers!
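
To put rough numbers on that trade-off, the Python arithmetic below estimates how much hint data accumulates for one unavailable node over the hint window. The write rate (500 missed writes per second) and mutation size (1 KB) are hypothetical values picked for the example, not measurements from any cluster in this thread.

```python
def stored_hints_mb(missed_writes_per_sec, avg_mutation_bytes, window_hours):
    """Approximate MB of hints stored for one down node over the window."""
    total_bytes = missed_writes_per_sec * avg_mutation_bytes * window_hours * 3600
    return total_bytes / (1024 * 1024)


ASSUMED_WRITES_PER_SEC = 500  # hypothetical writes destined for the down node
ASSUMED_MUTATION_BYTES = 1024  # hypothetical average mutation size

for hours in (3, 6):
    size = stored_hints_mb(ASSUMED_WRITES_PER_SEC, ASSUMED_MUTATION_BYTES, hours)
    print(f"{hours} h window -> roughly {size:.0f} MB of hints to store and replay")
```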

Re: How to read content of hints file and apply them manually?

Posted by Deepak Vohra <dv...@yahoo.com.INVALID>.
 
Surbhi,
The hints could be accumulating for one or both of the following reasons:
- Some node is becoming unavailable very routinely, which is unlikely
- The hints are getting replayed very slowly due to network bandwidth issues, which is more likely

Increase the max_hint_window_in_ms setting in cassandra.yaml to more than 3 hours, perhaps 6 hours. If the issue still persists, networking may need to be tested for bandwidth issues.

regards,
Deepak

On Tuesday, January 28, 2020, 01:01:51 a.m. UTC, Surbhi Gupta <su...@gmail.com> wrote:

> The reason we think it might be related to hints is that if we truncate the hints, the load goes back to normal on the nodes.
> FYI, we had to run repair after truncating hints.
> Any thoughts?


Re: How to read content of hints file and apply them manually?

Posted by Surbhi Gupta <su...@gmail.com>.
The reason we think it might be related to hints is that if we truncate the
hints, the load goes back to normal on the nodes.
FYI, we had to run repair after truncating hints.
Any thoughts?


On Mon, 27 Jan 2020 at 15:27, Deepak Vohra <dv...@yahoo.com.invalid>
wrote:

>
> Hints are a stopgap measure and not a fix to the underlying issue. Run a
> full repair.

Re: How to read content of hints file and apply them manually?

Posted by Deepak Vohra <dv...@yahoo.com.INVALID>.
 
Hints are a stopgap measure and not a fix to the underlying issue. Run a full repair.

On Monday, January 27, 2020, 10:17:01 p.m. UTC, Surbhi Gupta <su...@gmail.com> wrote:

> Hi,
>
> We are on open-source Cassandra 3.11.
> We have an issue in one of our clusters where lots of hints pile up and are not applied within the hinted handoff window (3 hours in our case).
> Load and CPU on the server also go very high.
> We see a lot of messages like the ones below in system.log and debug.log. Our read_repair_chance and dclocal_read_repair_chance are both 0.1. Any pointers are welcome.
>
> ERROR [ReadRepairStage:83] 2020-01-27 13:08:43,695 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:83,5,main]
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
>
> DEBUG [ReadRepairStage:111] 2020-01-27 13:10:06,663 ReadCallback.java:242 - Digest mismatch:
> org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(4759131696153881383, 9a21276d0af64de28eeeed5d3023b69e) (142a55e1e28de7daa2ddc34a361474a0 vs fcba30f022ef25f456914c341022963d)