Posted to user@cassandra.apache.org by Narendra Sharma <na...@gmail.com> on 2013/12/15 22:14:46 UTC

Cassandra 1.1.6 - Disk usage and Load displayed in ring doesn't match

We have an 8-node cluster. Replication factor is 3.

For some of the nodes, the disk usage (du -ksh .) in the data directory for a
CF doesn't match the Load reported by the nodetool ring command. When we
expanded the cluster from 4 nodes to 8 nodes (4 weeks back), everything was
okay. Over the last 2-3 weeks the disk usage has gone up. We
increased the RF from 2 to 3 two weeks ago.

I am not sure if increasing the RF is causing this issue.

For one of the nodes that I analyzed:
1. nodetool ring reported load as 575.38 GB

2. nodetool cfstats for the CF reported:
SSTable count: 28
Space used (live): 572671381955
Space used (total): 572671381955


3. 'ls -1 *Data* | wc -l' in the data folder for CF returned
46

4. 'du -ksh .' in the data folder for CF returned
720G

The above numbers indicate that there are some sstables that are obsolete
and are still occupying space on disk. What could be wrong? Will restarting
the node help? The Cassandra process has been running for the last 45 days
with no downtime. However, because the disk usage is high, we are not able
to run a full compaction.
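To make the comparison concrete, here is a small shell sketch that sums the
on-disk *-Data.db sizes and compares them against the "Space used (live)"
figure from cfstats. Note that sstable_bytes is a made-up helper, not a
Cassandra tool, and the demo directory and values are placeholders:

```shell
# sstable_bytes DIR -- hypothetical helper: sum the sizes, in bytes,
# of all *-Data.db files directly under DIR.
sstable_bytes() {
    find "$1" -maxdepth 1 -name '*-Data.db' -exec stat -c '%s' {} + 2>/dev/null |
        awk '{ sum += $1 } END { print sum + 0 }'
}

# Demo against a throwaway directory with two fake sstables.
demo=$(mktemp -d)
head -c 1000 /dev/zero > "$demo/MyCF-hd-1-Data.db"
head -c 2000 /dev/zero > "$demo/MyCF-hd-2-Data.db"

on_disk=$(sstable_bytes "$demo")
live=1500   # in real use: the "Space used (live)" value from nodetool cfstats
echo "on disk: $on_disk, live: $live, possibly obsolete: $((on_disk - live)) bytes"
# prints: on disk: 3000, live: 1500, possibly obsolete: 1500 bytes
rm -rf "$demo"
```

In real use the gap between the two totals is an upper bound on the space
held by obsolete sstables for that CF.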

Also, I can't find a reference to every sstable on disk in the
system.log file. For example, I have one data file on disk (from ls -lth):
86G Nov 20 06:14

I have system.log file with first line:
INFO [main] 2013-11-18 09:41:56,120 AbstractCassandraDaemon.java (line 101)
Logging initialized

The 86G file must be the result of some compaction. I see no reference to the
data file in system.log between 11/18 and 11/25. What could be the
reason for that? The only reference is dated 11/29, when the file was being
streamed to another node (a new node).
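For this kind of log archaeology, something like the sketch below can check
whether any system.log (current or rotated) mentions a given sstable
generation. Here find_sstable_refs is a made-up helper, and the log path and
generation name are placeholders; logs rotated away before 11/18 would have
to be restored before the file's creation could show up:

```shell
# find_sstable_refs 'LOG_GLOB' NAME -- hypothetical helper: grep current
# and rotated logs for any line mentioning an sstable generation.
find_sstable_refs() {
    # $1 is deliberately left unquoted so the glob expands here
    grep -h "$2" $1 2>/dev/null || echo "no reference to $2 found"
}

# Typical use (placeholder path and generation name):
find_sstable_refs '/var/log/cassandra/system.log*' 'MyCF-hd-1234'
```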

How can I identify the obsolete files and remove them? I am thinking about
the following. Let me know if it makes sense.
1. Restart the node and check the state.
2. Move the oldest data files to another location (another mount point).
3. Restart the node again.
4. Run repair on the node so that it can fetch the missing data from its
peers.
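A sketch of step 2, under the assumption that the node is stopped first: each
sstable generation is several component files (Data, Index, Filter,
Statistics, ...), so the whole generation has to move together, not just the
-Data.db. park_oldest is a made-up helper, and whether parking sstables aside
like this is safe should be confirmed before trying it:

```shell
# park_oldest SRC DEST N -- hypothetical helper: move the N oldest
# sstable generations (every component file sharing the generation
# prefix) from SRC to DEST. Only meaningful while Cassandra is stopped.
park_oldest() {
    ls -t "$1"/*-Data.db 2>/dev/null | tail -n "$3" | while read -r data; do
        gen=${data%-Data.db}        # e.g. .../MyCF-hd-42
        mv "$gen"-* "$2"/           # moves -Data.db, -Index.db, -Filter.db, ...
    done
}

# Typical use (placeholder paths): park the 5 oldest generations.
# park_oldest /var/lib/cassandra/data/MyKeyspace/MyCF /mnt/spare/parked 5
```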


I compared these numbers against a healthy node for the same CF:
1. nodetool ring reported load as 662.95 GB

2. nodetool cfstats for the CF reported:
SSTable count: 16
Space used (live): 670524321067
Space used (total): 670524321067

3. 'ls -1 *Data* | wc -l' in the data folder for CF returned
16

4. 'du -ksh .' in the data folder for CF returned
625G


-Naren



-- 
Narendra Sharma
Software Engineer
http://www.aeris.com
http://narendrasharma.blogspot.com/

Re: Cassandra 1.1.6 - Disk usage and Load displayed in ring doesn't match

Posted by Narendra Sharma <na...@gmail.com>.
Thanks Aaron. No tmp files, and not even a single exception in the
system.log.

If the file was last modified on Nov 20, then there must be an entry for
it in the log (either streaming completed or compacted).



Re: Cassandra 1.1.6 - Disk usage and Load displayed in ring doesn't match

Posted by Narendra Sharma <na...@gmail.com>.
Thanks Julien. We ran repair. Increasing the RF should not make sstables
obsolete. I can understand that reducing the RF, adding a new node, etc. can
result in a few obsolete sstables, which eventually go away after you run
cleanup.



Re: Cassandra 1.1.6 - Disk usage and Load displayed in ring doesn't match

Posted by Julien Campan <ju...@gmail.com>.
Hi,
When you increase the RF, you need to run repair for the keyspace on each
node (the data is not automatically streamed).
After that, you should run cleanup on each node to remove the obsolete
sstables.
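In shell form, that procedure for an 8-node cluster might look like the
sketch below. The host names and keyspace are placeholders, and the loops
only print the commands (a dry run) so they can be reviewed first; the
important part is that repair finishes on every node before cleanup starts:

```shell
# Dry run: print the repair-then-cleanup sequence after an RF increase.
# Replace the host list and keyspace with real values, then drop the
# echo to actually execute.
KS=MyKeyspace
HOSTS='cass1 cass2 cass3 cass4 cass5 cass6 cass7 cass8'

for h in $HOSTS; do
    echo "nodetool -h $h repair $KS"    # stream in the newly owed replicas
done
for h in $HOSTS; do
    echo "nodetool -h $h cleanup $KS"   # then drop the obsolete sstables
done
```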


Good luck :)

Julien Campan.

Re: Cassandra 1.1.6 - Disk usage and Load displayed in ring doesn't match

Posted by Aaron Morton <aa...@thelastpickle.com>.
-tmp- files will sit in the data dir; if there was an error creating them during compaction or flushing to disk, they will stick around until a restart.

Check the logs for errors to see if compaction was failing on something.
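A quick way to check for such leftovers (a sketch: find_tmp_files is a
made-up helper, the data path is a placeholder, and in 1.1.x the temporary
files carry "tmp" in the name, e.g. MyCF-tmp-hd-1234-Data.db):

```shell
# find_tmp_files DIR -- hypothetical helper: list leftover temporary
# sstable files anywhere under a Cassandra data directory.
find_tmp_files() {
    find "$1" -type f -name '*-tmp-*' -print | sort
}

# Typical use (placeholder path):
# find_tmp_files /var/lib/cassandra/data
```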

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com



Re: Cassandra 1.1.6 - Disk usage and Load displayed in ring doesn't match

Posted by Narendra Sharma <na...@gmail.com>.
No snapshots.

I restarted the node, and now the Load in ring is in sync with the disk
usage. Not sure what caused it to go out of sync. However, the live SSTable
count doesn't exactly match the number of data files on disk.

I am going through the Cassandra code to understand what could be the
reason for the mismatch in the sstable count, and also why there is no
reference to some of the data files in system.log.





RE: Cassandra 1.1.6 - Disk usage and Load displayed in ring doesn't match

Posted by Arindam Barua <ab...@247-inc.com>.
Do you have any snapshots on the nodes where you are seeing this issue?
Snapshots hard-link to sstables, which will prevent them from being deleted.
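Since a snapshot is a set of hard links, one way to spot sstables pinned this
way is to look for data files whose link count is greater than 1. This is a
sketch: linked_sstables is a made-up helper and the path is a placeholder:

```shell
# linked_sstables DIR -- hypothetical helper: list *-Data.db files under
# DIR with more than one hard link, i.e. likely also referenced from a
# snapshots/ directory.
linked_sstables() {
    find "$1" -name '*-Data.db' -links +1 -print | sort
}

# Typical use (placeholder path):
# linked_sstables /var/lib/cassandra/data/MyKeyspace
```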

-Arindam
