Posted to commits@cassandra.apache.org by "Chris Bannister (JIRA)" <ji...@apache.org> on 2015/07/09 12:07:05 UTC
[jira] [Commented] (CASSANDRA-9382) Snapshot file descriptors not getting purged (possible fd leak)

    [ https://issues.apache.org/jira/browse/CASSANDRA-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14620234#comment-14620234 ] 

Chris Bannister commented on CASSANDRA-9382:
--------------------------------------------

I have observed what could be the same issue on Cassandra 2.0.13. Repair jobs have been failing at random with a "could not create snapshot" error and no other errors, and the node then ran out of disk space. Restarting Cassandra freed a lot of disk space, so it looks like something is causing the process to hold on to references to the file descriptors. I have a heap dump and the lsof output from the machine at the time:

{code}
grep DEL openfiles | wc -l
34376
{code}

Most of these are paths like:

{code}
/mnt/cassandra/data/keyspace/table/snapshots/90566a10-fd79-11e4-a65b-bdf89f99ad2d/keyspace-table-jb-205530-Index.db
{code}
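
To see which snapshots account for the deleted entries, the saved lsof output can be grouped by snapshot directory. The following is only a rough sketch, not something run against the affected node: it assumes the output is saved in the file named openfiles used above, that deleted-but-still-mapped files carry DEL in the FD column, and that paths follow the .../snapshots/<id>/<file> layout. The function name count_deleted_by_snapshot is made up for illustration.

```python
import re
import sys
from collections import Counter

# Captures the snapshot directory from a path of the form
# <data dir>/snapshots/<snapshot id>/<sstable file> (assumed layout).
SNAPSHOT_RE = re.compile(r"(\S*/snapshots/[^/\s]+)/\S+")


def count_deleted_by_snapshot(lsof_lines):
    """Count lsof entries marked DEL, grouped by snapshot directory."""
    counts = Counter()
    for line in lsof_lines:
        # Assumption: deleted mapped files show "DEL" as a whole column.
        if "DEL" not in line.split():
            continue
        match = SNAPSHOT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # "openfiles" is the saved lsof output; another path may be passed instead.
    path = sys.argv[1] if len(sys.argv) > 1 else "openfiles"
    with open(path) as f:
        for snapshot, n in count_deleted_by_snapshot(f).most_common(10):
            print(n, snapshot)
```

If a handful of snapshot IDs dominate the counts, that would point at specific repair sessions whose snapshots were removed from disk while the process still held them open.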

> Snapshot file descriptors not getting purged (possible fd leak)
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-9382
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9382
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Mark Curtis
>            Assignee: Yuki Morishita
>
> OpsCenter has a repair service which runs a lot of small range repairs. Each repair generates a snapshot, as is normal. The cluster was showing a steady increase in disk usage over the course of a couple of days, and the only way to work around the issue was to restart the node.
> On further inspection, the lsof output for the Cassandra process was still showing file descriptors for snapshots that no longer existed on the file system. For example:
> {code}
> ava    5822 cassandra  DEL    REG             202,32                 7359833 /media/ephemeral1/cassandra/data/somekeyspace/table1/snapshots/669a3a30-f3d3-11e4-bec6-3f6c4fb06498/somekeyspace-table1-jb-897689-Data.db
> {code}
> We also took a heap dump, which showed essentially the same thing: lots of references to these file handles. We checked the logs for any errors, especially ones relating to repairs that might have failed, but found nothing.
> The repair service logs in OpsCenter also showed that all repairs (1000s of them) had completed successfully, again indicating that there was no repair issue.
> I have not yet been able to reproduce the issue locally on a test box. The cluster that this original issue appeared on was a production cluster with the following spec:
> cassandra_versions: 2.0.14.352
> cluster_cores : 8, 
> cluster_instance_types : i2.2xlarge
> cluster_os : Amazon linux amd64 
> node count: 4
> node java version: Oracle Java 1.7.0_51



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)