You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@cassandra.apache.org by "Benjamin Coverston (JIRA)" <ji...@apache.org> on 2011/03/02 21:24:37 UTC

[jira] Created: (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
-------------------------------------------------------------------------------------------------------

                 Key: CASSANDRA-2261
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
    Affects Versions: 0.7.2, 0.6.11
            Reporter: Benjamin Coverston


When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.

One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.

If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).



-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Pavel Yaskevich (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich updated CASSANDRA-2261:
---------------------------------------

    Attachment: CASSANDRA-2261-v3.patch

changes in v3:

# LeveledManifest now returns list with suspected SSTables filtered out if original candidates happen to have at least one suspected SSTables (originally it was just aborting compaction by returning empty list).

# Blacklisting test now supportes both SizeTiered and Leveled strategies. Also for(;;) replaced with bounded loop to avoid the situation when we can get into infinite loop if forceMajorCompaction does not fail.

# Styling and nits is fixed.
                
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1.1
>
>         Attachments: 2261-v2.patch, 2261.patch, CASSANDRA-2261-v3.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Coverston updated CASSANDRA-2261:
------------------------------------------

    Attachment: 2261.txt

Blacklists SSTables that have been identified to be corrupt and prevents them from participating in compactions.

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.11, 0.7.2
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>              Labels: not_a_pony
>         Attachments: 2261.txt
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011375#comment-13011375 ] 

Jonathan Ellis commented on CASSANDRA-2261:
-------------------------------------------

Weird, I'm getting patch failures on trunk already.  Was the revision done against updated trunk?

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.5
>
>         Attachments: 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Brandon Williams (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218564#comment-13218564 ] 

Brandon Williams commented on CASSANDRA-2261:
---------------------------------------------

+1
                
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1.1
>
>         Attachments: 2261-v2.patch, 2261.patch, CASSANDRA-2261-v3.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Pavel Yaskevich (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218539#comment-13218539 ] 

Pavel Yaskevich edited comment on CASSANDRA-2261 at 2/28/12 7:53 PM:
---------------------------------------------------------------------

changes in v3:

  - LeveledManifest now returns list with suspected SSTables filtered out if original candidates happen to have at least one suspected SSTables (originally it was just aborting compaction by returning empty list).

  - Blacklisting test now supportes both SizeTiered and Leveled strategies. Also for(;;) replaced with bounded loop to avoid the situation when we can get into infinite loop if forceMajorCompaction does not fail.

  - Styling and nits is fixed.
                
      was (Author: xedin):
    changes in v3:

# LeveledManifest now returns list with suspected SSTables filtered out if original candidates happen to have at least one suspected SSTables (originally it was just aborting compaction by returning empty list).

# Blacklisting test now supportes both SizeTiered and Leveled strategies. Also for(;;) replaced with bounded loop to avoid the situation when we can get into infinite loop if forceMajorCompaction does not fail.

# Styling and nits is fixed.
                  
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1.1
>
>         Attachments: 2261-v2.patch, 2261.patch, CASSANDRA-2261-v3.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Pavel Yaskevich (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Yaskevich reassigned CASSANDRA-2261:
------------------------------------------

    Assignee: Pavel Yaskevich  (was: Benjamin Coverston)
    
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Pavel Yaskevich
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1.1
>
>         Attachments: 2261-v2.patch, 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Coverston updated CASSANDRA-2261:
------------------------------------------

    Attachment:     (was: 2261.txt)

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.5
>
>         Attachments: 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Coverston updated CASSANDRA-2261:
------------------------------------------

    Comment: was deleted

(was: Adding patch, tested this with corrupted indexes and was able to successfully blacklist an SSTable s.t. it would not participate in future compactions.
)

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6.11, 0.7.2
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>              Labels: not_a_pony
>         Attachments: 2261.txt
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215652#comment-13215652 ] 

Benjamin Coverston commented on CASSANDRA-2261:
-----------------------------------------------

I intend to, however I have not had time to work on it since your last comment. I won't be offended if it gets re-assigned. It's probably going to be at least another few weeks before I can look at it.
                
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1.1
>
>         Attachments: 2261-v2.patch, 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Coverston updated CASSANDRA-2261:
------------------------------------------

    Attachment:     (was: 2261.patch)

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.5
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Pavel Yaskevich (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215623#comment-13215623 ] 

Pavel Yaskevich commented on CASSANDRA-2261:
--------------------------------------------

Ben, are you going to work on this?
                
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1.1
>
>         Attachments: 2261-v2.patch, 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Coverston updated CASSANDRA-2261:
------------------------------------------

    Attachment:     (was: 2261.txt)

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.4
>
>         Attachments: 2261.txt
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Nick Bailey (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003434#comment-13003434 ] 

Nick Bailey commented on CASSANDRA-2261:
----------------------------------------

This is basically the same concept as CASSANDRA-2229 right? Should we close that since we have a patch here?

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.4
>
>         Attachments: 2261.txt
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2261:
--------------------------------------

             Reviewer: stuhood
             Priority: Minor  (was: Major)
    Affects Version/s:     (was: 0.6.11)
                           (was: 0.7.2)
                       0.6
        Fix Version/s: 0.7.4

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.4
>
>         Attachments: 2261.txt
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Coverston updated CASSANDRA-2261:
------------------------------------------

    Attachment:     (was: 2261-v2.patch)
    
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1
>
>         Attachments: 2261-v2.patch, 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011377#comment-13011377 ] 

Jonathan Ellis commented on CASSANDRA-2261:
-------------------------------------------

Oh, I guess this targets 0.7.  Are we that sure it won't cause problems?

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.5
>
>         Attachments: 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170461#comment-13170461 ] 

Benjamin Coverston commented on CASSANDRA-2261:
-----------------------------------------------

happily
                
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1
>
>         Attachments: 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011736#comment-13011736 ] 

Jonathan Ellis commented on CASSANDRA-2261:
-------------------------------------------

I'd prefer to keep it in memory anyway. Retrying every server restart in case it's some kind of transient failure seems like a good thing.

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.5
>
>         Attachments: 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Coverston updated CASSANDRA-2261:
------------------------------------------

    Attachment: 2261-v2.patch

Patch re-written for the 1.0.x branch.
                
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1
>
>         Attachments: 2261-v2.patch, 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Pavel Yaskevich (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193195#comment-13193195 ] 

Pavel Yaskevich commented on CASSANDRA-2261:
--------------------------------------------

Patch overall looks good, has some code styling issues in files (LeveledManifest.java, SizeTieredCompactionStrategy.java).

And when you comment logger.debug(...) in ColumnFamilyStore.java:1332 you will be able to see following exception (one exception for each of the Keyspace1-Standard1 SSTables) shown by CompactionsTest.testBlacklisting():

{nofromat}
    [junit] java.lang.RuntimeException: SSTableScanner(file=RandomAccessReader(filePath='/Users/xedin/work/java/cassandra/build/test/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-hc-5-Data.db', skipIOCache=true) sstable=SSTableReader(path='build/test/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-hc-5-Data.db') exhausted=false) failed to provide next columns from KeyScanningIterator(finishedAt:0)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:193)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:146)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:138)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:38)
    [junit] 	at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:149)
    [junit] 	at org.apache.cassandra.utils.MergeIterator$ManyToOne.<init>(MergeIterator.java:90)
    [junit] 	at org.apache.cassandra.utils.MergeIterator.get(MergeIterator.java:47)
    [junit] 	at org.apache.cassandra.db.compaction.CompactionIterable.iterator(CompactionIterable.java:79)
    [junit] 	at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:129)
    [junit] 	at org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:260)
    [junit] 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    [junit] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    [junit] 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    [junit] 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    [junit] 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    [junit] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    [junit] 	at java.lang.Thread.run(Thread.java:680)
    [junit] Caused by: java.io.EOFException
    [junit] 	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:399)
    [junit] 	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:377)
    [junit] 	at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:324)
    [junit] 	at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:391)
    [junit] 	at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:373)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:173)
    [junit] 	... 16 more
{noformat}

Could you please investigate that cases them?
                
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1
>
>         Attachments: 2261-v2.patch, 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011034#comment-13011034 ] 

Benjamin Coverston commented on CASSANDRA-2261:
-----------------------------------------------

Thanks for the review, I'll make the code style changes and remove the modified default. Sorry about that.

I'll also add a warning.

Thanks Stu!

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.5
>
>         Attachments: 2261.txt
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Coverston updated CASSANDRA-2261:
------------------------------------------

    Attachment: 2261.patch

Revised patch

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.5
>
>         Attachments: 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Coverston updated CASSANDRA-2261:
------------------------------------------

    Attachment: 2261-v2.patch
    
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1
>
>         Attachments: 2261-v2.patch, 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2261:
--------------------------------------

    Fix Version/s:     (was: 0.7.5)
                   0.8

Let's target 0.8.  It looks like we're eating exception information -- an IOException that gets rethrown/caught as SSTableIOException will not be logged.

What does adding SSTableIOException buy us?

IMO the ideal fix would be, audit our uses of IOE to make sure we're not laundering them (by re-throwing as RTE or IOError) where we want to catch them, and just catch IOE for uses like this.

(I'd also like to audit to *add* laundering to IOError for unrecoverable problems where they happen, instead of polluting the call heirarchy until we arbitrarily launder somewhere else.  And maybe even launder to AssertionError for "can't happen" IOE like where we are reading from a stream we know is in-memory.)

But, I am not a fan of checked exceptions so maybe that is not the Real Java Solution.

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.8
>
>         Attachments: 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011716#comment-13011716 ] 

Benjamin Coverston commented on CASSANDRA-2261:
-----------------------------------------------

I'm fine rebasing this to trunk.

Unless we want to make a different change. I added the component type so that the failure information would be durable, but if that's not as important I can keep the information in-memory.

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.5
>
>         Attachments: 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Jonathan Ellis (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170457#comment-13170457 ] 

Jonathan Ellis commented on CASSANDRA-2261:
-------------------------------------------

I don't suppose you'd care to rebase to trunk?
                
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1
>
>         Attachments: 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Coverston updated CASSANDRA-2261:
------------------------------------------

    Attachment: 2261.patch

Added a new patch that keeps the corrupted SSTable list resident and also added a new test.

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.5
>
>         Attachments: 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011594#comment-13011594 ] 

Stu Hood commented on CASSANDRA-2261:
-------------------------------------

I'd prefer not to add a new component type to 0.7, since it wouldn't be backwards compatible within the minor versions.

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.5
>
>         Attachments: 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003441#comment-13003441 ] 

Jonathan Ellis commented on CASSANDRA-2261:
-------------------------------------------

Done.

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.4
>
>         Attachments: 2261.txt
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008082#comment-13008082 ] 

Stu Hood commented on CASSANDRA-2261:
-------------------------------------

* Are the changes to the example config files necessary? Snapshot before compaction is not a good default, etc.
* Code style is a little wonky... particularly braces around try/catch and spaces between for( and if(
* When failing to mark an SSTable as suspect, throwing an IOError would mask the root cause: should probably log a warning instead

Thanks Ben!

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.5
>
>         Attachments: 2261.txt
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2261:
--------------------------------------

    Affects Version/s:     (was: 0.6)
        Fix Version/s:     (was: 0.8)
                       1.0

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.0
>
>         Attachments: 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Jonathan Ellis (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2261:
--------------------------------------

    Reviewer: xedin  (was: stuhood)
    
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1
>
>         Attachments: 2261-v2.patch, 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Pavel Yaskevich (Issue Comment Edited) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193195#comment-13193195 ] 

Pavel Yaskevich edited comment on CASSANDRA-2261 at 1/25/12 6:42 PM:
---------------------------------------------------------------------

Patch overall looks good, has some code styling issues in files (LeveledManifest.java, SizeTieredCompactionStrategy.java).

And when you comment logger.debug(...) in ColumnFamilyStore.java:1332 you will be able to see following exception (one exception for each of the Keyspace1-Standard1 SSTables) shown by CompactionsTest.testBlacklisting():

{noformat}
    [junit] java.lang.RuntimeException: SSTableScanner(file=RandomAccessReader(filePath='/Users/xedin/work/java/cassandra/build/test/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-hc-5-Data.db', skipIOCache=true) sstable=SSTableReader(path='build/test/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-hc-5-Data.db') exhausted=false) failed to provide next columns from KeyScanningIterator(finishedAt:0)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:193)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:146)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:138)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:38)
    [junit] 	at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:149)
    [junit] 	at org.apache.cassandra.utils.MergeIterator$ManyToOne.<init>(MergeIterator.java:90)
    [junit] 	at org.apache.cassandra.utils.MergeIterator.get(MergeIterator.java:47)
    [junit] 	at org.apache.cassandra.db.compaction.CompactionIterable.iterator(CompactionIterable.java:79)
    [junit] 	at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:129)
    [junit] 	at org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:260)
    [junit] 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    [junit] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    [junit] 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    [junit] 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    [junit] 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    [junit] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    [junit] 	at java.lang.Thread.run(Thread.java:680)
    [junit] Caused by: java.io.EOFException
    [junit] 	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:399)
    [junit] 	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:377)
    [junit] 	at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:324)
    [junit] 	at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:391)
    [junit] 	at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:373)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:173)
    [junit] 	... 16 more
{noformat}

Could you please investigate that cases them?
                
      was (Author: xedin):
    Patch overall looks good, has some code styling issues in files (LeveledManifest.java, SizeTieredCompactionStrategy.java).

And when you comment logger.debug(...) in ColumnFamilyStore.java:1332 you will be able to see following exception (one exception for each of the Keyspace1-Standard1 SSTables) shown by CompactionsTest.testBlacklisting():

{nofromat}
    [junit] java.lang.RuntimeException: SSTableScanner(file=RandomAccessReader(filePath='/Users/xedin/work/java/cassandra/build/test/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-hc-5-Data.db', skipIOCache=true) sstable=SSTableReader(path='build/test/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-hc-5-Data.db') exhausted=false) failed to provide next columns from KeyScanningIterator(finishedAt:0)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:193)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:146)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:138)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:38)
    [junit] 	at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:149)
    [junit] 	at org.apache.cassandra.utils.MergeIterator$ManyToOne.<init>(MergeIterator.java:90)
    [junit] 	at org.apache.cassandra.utils.MergeIterator.get(MergeIterator.java:47)
    [junit] 	at org.apache.cassandra.db.compaction.CompactionIterable.iterator(CompactionIterable.java:79)
    [junit] 	at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:129)
    [junit] 	at org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:260)
    [junit] 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    [junit] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    [junit] 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    [junit] 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    [junit] 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    [junit] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    [junit] 	at java.lang.Thread.run(Thread.java:680)
    [junit] Caused by: java.io.EOFException
    [junit] 	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:399)
    [junit] 	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:377)
    [junit] 	at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:324)
    [junit] 	at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:391)
    [junit] 	at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:373)
    [junit] 	at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:173)
    [junit] 	... 16 more
{noformat}

Could you please investigate that cases them?
                  
> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 1.1
>
>         Attachments: 2261-v2.patch, 2261.patch
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Updated: (CASSANDRA-2261) During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.

Posted by "Benjamin Coverston (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Coverston updated CASSANDRA-2261:
------------------------------------------

    Attachment: 2261.txt

Cleaned up patch for 2261

> During Compaction, Corrupt SSTables with rows that cause failures should be identified and blacklisted.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2261
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Benjamin Coverston
>            Assignee: Benjamin Coverston
>            Priority: Minor
>              Labels: not_a_pony
>             Fix For: 0.7.4
>
>         Attachments: 2261.txt
>
>
> When a compaction of a set of SSTables fails because of corruption it will continue to try to compact that SSTable causing pending compactions to build up.
> One way to mitigate this problem would be to log the error, then identify the specific SSTable that caused the failure, subsequently blacklisting that SSTable and ensuring that it is no longer included in future compactions. For this we could simply store the problematic SSTable's name in memory.
> If it's not possible to identify the SSTable that caused the issue, then perhaps blacklisting the (ordered) permutation of SSTables to be compacted together is something that can be done to solve this problem in a more general case, and avoid issues where two (or more) SSTables have trouble compacting a particular row. For this option we would probably want to store the lists of the bad combinations in the system table somewhere s.t. these can survive a node failure (there have been a few cases where I have seen a compaction cause a node failure).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira