Posted to user@cassandra.apache.org by Anubhav Kale <An...@microsoft.com> on 2016/10/05 20:34:48 UTC

Nodetool rebuild question

Hello,

As part of a rebuild, I noticed that the destination node gets -tmp- files from other nodes. Are the following statements correct?


1. The files are written to disk without going through memtables.

2. Regular compactors eventually compact them to bring the SSTable count down to a reasonable number.

We have noticed that the destination node has created > 40K *Data* files in the first hour of streaming. We have not seen such a pattern before, so we are trying to understand what could have changed. (We do use vnodes; we haven't increased the number of nodes recently, but we have decommissioned a DC.)
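
A count like that can be tracked with a small script along these lines (a rough sketch only, assuming the default /var/lib/cassandra/data layout and that SSTable data components end in "-Data.db"; adjust the path for your install):

#!/usr/bin/env python3
import os
import sys
from collections import Counter

# Count SSTable Data components per keyspace/table under a Cassandra data
# directory, to watch how quickly a rebuild is accumulating SSTables.
DATA_DIR = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/cassandra/data"

counts = Counter()
for root, _dirs, files in os.walk(DATA_DIR):
    for name in files:
        if name.endswith("-Data.db"):
            # Layout is <data_dir>/<keyspace>/<table>/..., so take the first
            # two path components relative to the data directory.
            rel = os.path.relpath(root, DATA_DIR).split(os.sep)
            counts["/".join(rel[:2])] += 1

for keyspace_table, n in counts.most_common(20):
    print(f"{n:8d}  {keyspace_table}")
print(f"{sum(counts.values()):8d}  total Data.db files")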

Thanks much!

Re: Nodetool rebuild question

Posted by Jeff Jirsa <je...@crowdstrike.com>.
Read repairs (both foreground/blocking due to consistency level requirements, and background/non-blocking due to the table option/probability) will go memtable -> flush -> SSTable.
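
The foreground/blocking kind can be provoked deliberately by reading at a consistency level that consults every replica; here is a minimal sketch with the DataStax Python driver (contact point, keyspace, and table names are placeholders):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1"])           # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

# Reading at CL ALL consults every replica; a digest mismatch triggers a
# blocking read repair, and the repaired row is written through the lagging
# replica's memtable and only reaches an SSTable on flush.
stmt = SimpleStatement(
    "SELECT * FROM my_table WHERE id = %s",
    consistency_level=ConsistencyLevel.ALL,
)
row = session.execute(stmt, ["some-key"]).one()
cluster.shutdown()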

RE: Nodetool rebuild question

Posted by Anubhav Kale <An...@microsoft.com>.
Sure.

When a read repair happens, does it go via the memtable -> SSTable route, or does the source node send SSTable tmp files directly to the inconsistent replica?


Re: Nodetool rebuild question

Posted by Jeff Jirsa <je...@crowdstrike.com>.
If you set RF to 0, you can ignore my second sentence/paragraph. The third still applies.

RE: Nodetool rebuild question

Posted by Anubhav Kale <An...@microsoft.com>.
Thanks.

We always set the RF to 0 and then “removenode” all nodes in the DC that we want to decommission, so I highly doubt that is the problem. Also, the number of SSTables on a given node averages ~2000 (we have 140 nodes in one ring, and two rings overall).
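
For reference, that sequence looks roughly like the sketch below (keyspace, DC, and contact-point names are placeholders; the DC being retired is simply dropped from the replication options, then each of its nodes is removed with nodetool):

from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])  # placeholder contact point in a surviving DC
session = cluster.connect()

# Re-declare replication without the DC being retired ("dc_old" here), which
# takes its effective replication factor to 0.
session.execute("""
    ALTER KEYSPACE my_keyspace
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_live': 3}
""")
cluster.shutdown()

# Then, from a shell on any live node, for every node in dc_old:
#   nodetool removenode <host-id-from-nodetool-status>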

Re: Nodetool rebuild question

Posted by Jeff Jirsa <je...@crowdstrike.com>.
Both of your statements are true.

During your decom, you likely streamed LOTS of SSTables to the remaining nodes (especially true if you didn’t drop the replication factor to 0 for the DC you decommissioned). Since those tens of thousands of SSTables take a while to compact, if you then rebuild (or bootstrap) before compaction is done, you’ll get a LOT of extra SSTables.

This is one of the reasons that people with large clusters don’t use vnodes – if you needed to bootstrap ~100 more nodes into a cluster, you’d have to wait potentially a day or more per node to compact away the leftovers before bootstrapping the next, which is prohibitive at scale. 
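
That wait can be automated with something along these lines (a sketch only; the pending-task threshold and polling interval are arbitrary choices):

import re
import subprocess
import time

PENDING_THRESHOLD = 10   # arbitrary: how much compaction backlog to tolerate
POLL_SECONDS = 300

def pending_compactions() -> int:
    # "nodetool compactionstats" prints a "pending tasks: N" line.
    out = subprocess.run(
        ["nodetool", "compactionstats"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"pending tasks:\s*(\d+)", out)
    return int(match.group(1)) if match else 0

while pending_compactions() > PENDING_THRESHOLD:
    time.sleep(POLL_SECONDS)
print("Compaction backlog drained; safe to start the next bootstrap/rebuild.")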

- Jeff