Posted to user@cassandra.apache.org by Fred Habash <fm...@gmail.com> on 2018/06/11 14:16:48 UTC

Read Latency Doubles After Shrinking Cluster and Never Recovers

I have hit dead ends everywhere I have turned on this issue.

We had a 15-node cluster that had been doing around 35 ms read latency for
years. At some point, we decided to shrink it to 13 nodes. Read latency rose
to near 70 ms. Shortly after, we decided this was not acceptable, so we added
the three nodes back in. Read latency dropped to near 50 ms, and it has been
hovering around that value for over 6 months now.

Repairs run regularly, load is even across cluster nodes, and the application
activity profile has not changed.

Why are we unable to get back the same read latency now that the cluster is
back to 15 nodes, the same size it was before?
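For reference, a minimal sketch (Python; nodetool assumed on PATH, keyspace and
table names are placeholders, and on 2.2 the equivalent subcommands are
cfhistograms/cfstats) of how the two numbers in question, per-table read
latency percentiles and live SSTable count, can be snapshotted on each node:

#!/usr/bin/env python3
"""Sketch: snapshot read latency percentiles and SSTable count for one table.
KEYSPACE/TABLE are placeholders, not names from this thread."""
import subprocess

KEYSPACE = "my_keyspace"
TABLE = "my_table"

def nodetool(*args):
    """Run a nodetool subcommand and return its stdout."""
    return subprocess.run(["nodetool", *args], check=True,
                          capture_output=True, text=True).stdout

if __name__ == "__main__":
    # Percentile view of read latency and SSTables touched per read
    print(nodetool("tablehistograms", KEYSPACE, TABLE))
    # Includes "SSTable count" and "Local read latency" for the table
    print(nodetool("tablestats", f"{KEYSPACE}.{TABLE}"))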

-- 

----------------------------------------
Thank you

Re: Read Latency Doubles After Shrinking Cluster and Never Recovers

Posted by Joseph Arriola <jc...@gmail.com>.
Have you already reviewed the data model? Maybe it is necessary to use
another partitioning strategy.
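Purely as an illustration (all table and column names below are hypothetical,
not taken from this thread), one common change of partitioning strategy is to
add a bucket to the partition key so that hot or oversized partitions get
spread across more nodes; with the DataStax Python driver it might look like:

"""Sketch: a bucketed partition key, created through the Python driver."""
from cassandra.cluster import Cluster  # assumption: cassandra-driver installed

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # assumed contact point
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_day_bucketed (
        source_id  text,
        day        date,
        bucket     int,        -- e.g. hash(event_id) % 8 splits a hot day
        event_time timestamp,
        payload    text,
        PRIMARY KEY ((source_id, day, bucket), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
""")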




RE: Read Latency Doubles After Shrinking Cluster and Never Recovers

Posted by Fd Habash <fm...@gmail.com>.
Yes, steady-state growth, which we expect. Even so, it still does not explain the anomaly in read latency.

Going through all the metrics available to me, the read latency metric lines up with the task metrics (active, pending, blocked).

I'll take a closer look at the task metrics to see if I can identify their type.
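(A minimal sketch of watching those pool counters with nodetool tpstats; the
column layout, Pool Name / Active / Pending / Completed / Blocked / All time
blocked, is assumed from typical 2.x output.)

"""Sketch: flag thread pools with pending or blocked tasks via nodetool tpstats."""
import subprocess

def pending_and_blocked():
    out = subprocess.run(["nodetool", "tpstats"], check=True,
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        parts = line.split()
        # Thread-pool rows: one-word pool name followed by numeric columns.
        if len(parts) >= 5 and all(p.isdigit() for p in parts[1:5]):
            name, active, pending, _completed, blocked = parts[:5]
            if int(pending) > 0 or int(blocked) > 0:
                print(f"{name}: active={active} pending={pending} blocked={blocked}")

if __name__ == "__main__":
    pending_and_blocked()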

----------------
Thank you




Re: Read Latency Doubles After Shrinking Cluster and Never Recovers

Posted by Jeff Jirsa <jj...@gmail.com>.
The live_sstable_count graph suggests you were already trending upward. Is
your data growing?
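(One low-tech way to answer that, sketched here on the assumption that nodetool
is on PATH: log each node's reported data load over time and trend it next to
the latency graphs. The file name and format are made up for the example.)

"""Sketch: append a timestamped Load sample from `nodetool info` to a CSV."""
import csv
import subprocess
import time

def current_load():
    out = subprocess.run(["nodetool", "info"], check=True,
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.strip().startswith("Load"):
            return line.split(":", 1)[1].strip()   # e.g. "512.3 GB"
    return "unknown"

with open("node_load_history.csv", "a", newline="") as f:
    csv.writer(f).writerow([int(time.time()), current_load()])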



RE: Read Latency Doubles After Shrinking Cluster and Never Recovers

Posted by Fd Habash <fm...@gmail.com>.
A picture is worth a thousand words!

The bottom graph shows cluster read latency, with the trend change from around 1/17 to 3/14 when nodes were initially removed and then added back.

The top graph shows live_ss_table_count. It was previously about 720, trended upward toward 1000, and came back to ~700. Even though the SSTable count recovered, read latency did not.




----------------
Thank you




Re: Read Latency Doubles After Shrinking Cluster and Never Recovers

Posted by Nicolas Guyomar <ni...@gmail.com>.
Really wild guess: do you monitor I/O performance, and are you positive it has
stayed the same over the past year (network becoming a little busier, hard
drive a bit slower, and so on)?
Wild guess 2: was any new 'monitoring' software (a log-shipping agent, for
instance) added to the boxes in the meantime?
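(If such baselines are missing, a small sketch like the one below can start
capturing them; psutil and the output file are assumptions for the example,
not anything mentioned in this thread.)

"""Sketch: snapshot host-level disk and network I/O counters for later comparison."""
import json
import time

import psutil  # assumption: pip install psutil

snapshot = {
    "ts": int(time.time()),
    "disk": psutil.disk_io_counters()._asdict(),  # read/write bytes and times
    "net": psutil.net_io_counters()._asdict(),    # bytes sent/received
}
with open("io_baseline.jsonl", "a") as f:
    f.write(json.dumps(snapshot) + "\n")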


Re: Read Latency Doubles After Shrinking Cluster and Never Recovers

Posted by Jeff Jirsa <jj...@gmail.com>.
No

-- 
Jeff Jirsa


> On Jun 11, 2018, at 7:49 AM, Fd Habash <fm...@gmail.com> wrote:
> 
> I will check for both.
>  
> On a different subject, I have read some user testimonies that running ‘nodetool cleanup’ requires a C* process reboot at least around 2.2.8. Is this true?

RE: Read Latency Doubles After Shrinking Cluster and Never Recovers

Posted by Fd Habash <fm...@gmail.com>.
I will check for both.

On a different subject, I have read some user reports that running ‘nodetool cleanup’ requires a C* process restart, at least around 2.2.8. Is this true?
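(For reference, cleanup is normally kicked off per keyspace as an online
operation; the sketch below just staggers it, and the keyspace list is a
placeholder, not from this thread.)

"""Sketch: run `nodetool cleanup` one keyspace at a time to stagger the load."""
import subprocess

KEYSPACES = ["my_keyspace"]  # assumption: fill in the real keyspaces

for ks in KEYSPACES:
    # Rewrites SSTables to drop token ranges this node no longer owns.
    subprocess.run(["nodetool", "cleanup", ks], check=True)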


----------------
Thank you


Re: Read Latency Doubles After Shrinking Cluster and Never Recovers

Posted by Nitan Kainth <ni...@gmail.com>.
I think it would, because Cassandra will have to process more SSTables to build the response to read queries.

Now, after cleanup, if the data volume is the same and compaction has been running, I can’t think of any more diagnostic steps. Let’s wait for other experts to comment.

Can you also check the SSTable count for each table, just to be sure they are not extraordinarily high?
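(A sketch of pulling that per-table count from nodetool tablestats output; the
"Table:" / "SSTable count:" line format is assumed from 3.x output, cfstats
being the 2.2-era equivalent, and the threshold is arbitrary.)

"""Sketch: list tables whose SSTable count looks high, from nodetool tablestats."""
import subprocess

out = subprocess.run(["nodetool", "tablestats"], check=True,
                     capture_output=True, text=True).stdout

table = None
for line in out.splitlines():
    line = line.strip()
    if line.startswith("Table:"):
        table = line.split(":", 1)[1].strip()
    elif line.startswith("SSTable count:") and table:
        count = int(line.split(":", 1)[1])
        if count > 100:  # arbitrary "worth a look" threshold
            print(f"{table}: {count} sstables")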

Sent from my iPhone


RE: Read Latency Doubles After Shrinking Cluster and Never Recovers

Posted by Fd Habash <fm...@gmail.com>.
Yes, we did, after adding the three nodes back, and we ran a full cluster repair as well.

But even if we hadn’t run cleanup, would the fact that some nodes still hold SSTables they no longer need have impacted read latency?

Thanks 
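(One quick sanity check here, sketched under the usual assumption that nodetool
is on PATH: compare the reported Load across nodes, since leftover, un-cleaned
ranges would tend to show up as some nodes carrying noticeably more data. The
column layout of `nodetool status` is assumed.)

"""Sketch: print each node's reported data load from `nodetool status`."""
import subprocess

out = subprocess.run(["nodetool", "status"], check=True,
                     capture_output=True, text=True).stdout

for line in out.splitlines():
    parts = line.split()
    # Data rows start with a status flag such as UN (Up/Normal) or UJ (Up/Joining).
    if parts and parts[0] in {"UN", "DN", "UL", "DL", "UJ", "DJ", "UM", "DM"}:
        address, load = parts[1], " ".join(parts[2:4])  # value + unit
        print(f"{address}: {load}")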

----------------
Thank you


Re: Read Latency Doubles After Shrinking Cluster and Never Recovers

Posted by Nitan Kainth <ni...@gmail.com>.
Did you run cleanup too?
