Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2010/11/17 10:17:17 UTC

[jira] Created: (CASSANDRA-1752) repair leaving FDs unclosed

repair leaving FDs unclosed
---------------------------

                 Key: CASSANDRA-1752
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1752
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Jonathan Ellis
             Fix For: 0.6.9


"We noticed that after a `nodetool repair` was ran, several of our nodes reported high disk usage; -- even one node hit 100% disk usage. After a restart of that node, disk usage drop instantly by 80 gigabytes -- well that was confusing, but we quickly formed the theory that Cassandra must of been holding open references to deleted file descriptors.

"Later, i found this node as an example, it is using about 8-10 gigabytes more than it should be -- 118 gigabytes reported by df, yet du reports only 106 gigabytes in the cassandra directory (nothing else on the mahcine). As you can see from the lsof listing, it is holding open FDs to files that no longer exist on the filesystem, and there are no open streams or as far as I can tell other reasons for the deleted sstable to be open.

"This seems to be related to running a repair, as we haven't seen it in any other situations before."

A quick check of FileStreamTask shows that the obvious base is covered:
{code}
        finally
        {
            try
            {
                raf.close();
            }
            catch (IOException e)
            {
                throw new AssertionError(e);
            }
        }
{code}

So either the transfer loop never finishes and thus never reaches that finally block (in which case, why isn't it showing up in outbound streams?), or something else is the problem.
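
For context, the finally block only helps if control ever reaches it, which is the first theory above. A minimal, self-contained sketch of that failure shape (placeholder names and signature, not the actual FileStreamTask source): a transfer loop stalled inside transferTo() would hold the descriptor open indefinitely.

{code}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

final class TransferSketch
{
    static void transfer(String path, long start, long length, WritableByteChannel out) throws IOException
    {
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        try
        {
            FileChannel fc = raf.getChannel();
            long sent = 0;
            while (sent < length)
            {
                // a stalled peer can park this call indefinitely,
                // and then the finally block is never reached
                sent += fc.transferTo(start + sent, length - sent, out);
            }
        }
        finally
        {
            raf.close(); // closes the FD only if the loop ever exits
        }
    }
}
{code}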

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-1752.
---------------------------------------

    Resolution: Fixed

re-committed, thanks



[jira] Commented: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Tyler Hobbs (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964886#action_12964886 ] 

Tyler Hobbs commented on CASSANDRA-1752:
----------------------------------------

This appears to be a large part of the problem: http://bugs.sun.com/view_bug.do?bug_id=4724038
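
For reference, that JDK bug means a MappedByteBuffer cannot be explicitly unmapped: the mapping keeps the file's disk space pinned until the buffer is garbage collected, even after the descriptor used to create it is closed. A small self-contained demonstration of the behavior (assumes it is run with the path of an existing, non-empty file):

{code}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final class MmapPinSketch
{
    public static void main(String[] args) throws IOException
    {
        File f = new File(args[0]);
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        MappedByteBuffer buf = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
        raf.close(); // the descriptor used to create the mapping is closed...
        f.delete();  // ...and the directory entry is gone, but lsof still shows
                     // the file as "(deleted)" and df does not drop, because the
                     // mapping cannot be released until 'buf' is garbage collected
        System.out.println(buf.get(0)); // keep 'buf' strongly reachable
    }
}
{code}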



[jira] Updated: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Tyler Hobbs (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Hobbs updated CASSANDRA-1752:
-----------------------------------

    Attachment: 1752-0.6.txt



[jira] Commented: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Tyler Hobbs (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965034#action_12965034 ] 

Tyler Hobbs commented on CASSANDRA-1752:
----------------------------------------

The temporary files that are streamed get deleted whenever the node receives a message saying that the file was streamed successfully.  There isn't a need for SSTableReaders at all in this case; only the names of the files produced by the anticompaction are needed for streaming.  The fix here is simply to close the SSTableWriter without opening an SSTableReader after anticompaction, and to return the list of filenames for use with streaming instead.  This way, if waitForStreamCompletion() hangs indefinitely, there are no SSTableReaders around to keep the FDs open.
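
A rough sketch of the shape of that change, using hypothetical stand-in types rather than the actual 0.6 SSTableWriter API:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class StreamPrepSketch
{
    // stand-in for the writer produced by anticompaction
    interface TempTableWriter
    {
        void close() throws IOException;   // flush and release the FD
        String getFilename();
    }

    // before: this step opened readers on the new files; after: just names
    static List<String> prepareForStreaming(List<TempTableWriter> writers) throws IOException
    {
        List<String> filenames = new ArrayList<String>();
        for (TempTableWriter writer : writers)
        {
            writer.close();                      // no reader is ever opened
            filenames.add(writer.getFilename()); // a name is all streaming needs
        }
        return filenames; // if waitForStreamCompletion() later hangs, there are
                          // no readers in scope to keep deleted files' FDs open
    }
}
{code}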



[jira] Updated: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Tyler Hobbs (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Hobbs updated CASSANDRA-1752:
-----------------------------------

    Attachment: 1752-0.6-v3.txt

Unit tests are fixed in the v3 patch.



[jira] Commented: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964903#action_12964903 ] 

Jonathan Ellis commented on CASSANDRA-1752:
-------------------------------------------

But the unmapping is supposed to take place at finalization time, which is also when we actually issue the unlink.
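
To make the deferred-delete pattern being described concrete, here is an illustrative sketch (hypothetical class, not the actual 0.6 code): the unlink is issued in finalize(), on the assumption that once the reader is unreachable its mapping is being reclaimed as well.

{code}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final class DeferredDeletingReader
{
    private final File file;
    private final MappedByteBuffer buffer; // pins the file's space while reachable

    DeferredDeletingReader(File file) throws IOException
    {
        this.file = file;
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        try
        {
            this.buffer = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
        }
        finally
        {
            raf.close();
        }
    }

    @Override
    protected void finalize() throws Throwable
    {
        // the reader and its buffer become unreachable together, so unlinking
        // here is the earliest point at which the space can actually be freed
        file.delete();
        super.finalize();
    }
}
{code}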



[jira] Commented: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965339#action_12965339 ] 

Jonathan Ellis commented on CASSANDRA-1752:
-------------------------------------------

can't we leave the timing and logging code inside the helper compaction method to reduce duplication in its callers?



[jira] Commented: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965930#action_12965930 ] 

Hudson commented on CASSANDRA-1752:
-----------------------------------

Integrated in Cassandra-0.6 #14 (See [https://hudson.apache.org/hudson/job/Cassandra-0.6/14/])
    avoid opening readers on anticompacted to-be-streamed temporary files
patch by thobbs; reviewed by mdennis and jbellis for CASSANDRA-1752




[jira] Commented: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965340#action_12965340 ] 

Jonathan Ellis commented on CASSANDRA-1752:
-------------------------------------------

where do the temporary files get deleted post-stream with this patch?



[jira] Reopened: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis reopened CASSANDRA-1752:
---------------------------------------




[jira] Updated: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Tyler Hobbs (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Hobbs updated CASSANDRA-1752:
-----------------------------------

    Attachment: 1752-0.6-v2.txt

Cleaned up version of patch attached.



[jira] Commented: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Tyler Hobbs (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964912#action_12964912 ] 

Tyler Hobbs commented on CASSANDRA-1752:
----------------------------------------

Ah, when StreamOut.transferSSTables() blocks on waitForStreamCompletion(), the list of SSTableReaders is still in scope, so they aren't garbage collected.
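
That closes the loop on the mechanism: finalization-time cleanup only runs once the readers become unreachable, and a local variable held across a blocking wait prevents exactly that. A minimal sketch of the failure mode, reusing the hypothetical DeferredDeletingReader from the sketch above:

{code}
import java.util.List;
import java.util.concurrent.CountDownLatch;

final class StreamScopeSketch
{
    static void transferSSTables(List<DeferredDeletingReader> sstables,
                                 CountDownLatch streamCompletion) throws InterruptedException
    {
        // ... enqueue a FileStreamTask for each sstable here ...

        streamCompletion.await(); // blocks until the receiver acks every file;
                                  // meanwhile 'sstables' is a live local, every
                                  // reader stays reachable, finalize() never runs,
                                  // and the deleted files' space is never freed
    }
}
{code}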



[jira] Commented: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965767#action_12965767 ] 

Jonathan Ellis commented on CASSANDRA-1752:
-------------------------------------------

reverted -- tests fail to build



[jira] Assigned: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis reassigned CASSANDRA-1752:
-----------------------------------------

    Assignee: Tyler Hobbs



[jira] Commented: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965779#action_12965779 ] 

Hudson commented on CASSANDRA-1752:
-----------------------------------

Integrated in Cassandra-0.6 #13 (See [https://hudson.apache.org/hudson/job/Cassandra-0.6/13/])
    avoid opening readers on anticompacted to-be-streamed temporary files
patch by thobbs; reviewed by mdennis and jbellis for CASSANDRA-1752




[jira] Commented: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Tyler Hobbs (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965534#action_12965534 ] 

Tyler Hobbs commented on CASSANDRA-1752:
----------------------------------------

Files are deleted post-stream in StreamOutManager.finishAndStartNext().  I'll clean up the code a bit and post a new patch shortly.



[jira] Commented: (CASSANDRA-1752) repair leaving FDs unclosed

Posted by "Matthew F. Dennis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965088#action_12965088 ] 

Matthew F. Dennis commented on CASSANDRA-1752:
----------------------------------------------

+1 

(but CompactionManager.java lines 355-358 are superfluous, given the loop check immediately after them and the return added at the end of the method)
