Posted to commits@cassandra.apache.org by "Elden Bishop (JIRA)" <ji...@apache.org> on 2013/01/31 22:17:12 UTC
[jira] [Created] (CASSANDRA-5210) DB is randomly and undetectably corrupted during high traffic column family flushes

Elden Bishop created CASSANDRA-5210:
---------------------------------------

             Summary: DB is randomly and undetectably corrupted during high traffic column family flushes 
                 Key: CASSANDRA-5210
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5210
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.2.1, 1.2.0, 1.1.9, 1.1.8, 1.1.7, 1.1.6, 1.1.5, 1.1.4, 1.1.3, 1.1.2, 1.1.1, 1.1.0, 0.8.10, 0.8.9, 0.8.8, 0.8.7, 0.8.6, 0.8.5, 0.8.4, 0.8.3, 0.8.2, 0.8.1
         Environment: Cassandra 0.8+, OS/X, java version "1.6.0_37" 
            Reporter: Elden Bishop


Writes during high traffic column family flushes corrupt the DB and make slice queries return incorrect data.

Any multi-column write on any version of Cassandra can put the DB in a state where some columns cannot be read alongside other columns.

For example:

{{
// *** for any NON-NULL column (eg. col_a=>AAA)
cqlsh> SELECT 'col_a' FROM test WHERE KEY='row_a';
   returns:     'AAA'

// *** it can disappear when queried alongside another column
cqlsh> SELECT 'col_a', 'col_b' FROM test WHERE KEY='row_a';
   returns:      null,   'BBB' // *** col_a is MISSING

// *** but it depends on the other columns
cqlsh> SELECT 'col_a', 'col_b', 'col_c' FROM test WHERE KEY='row_a';
   returns:     'AAA',   'BBB',   'CCC' // *** col_a is BACK
}}

Once in this state, the database is corrupt and essentially returns random data depending on which columns you query together. Single-column queries always return correct results, so they cannot be used to verify the data. No errors are logged during corruption, and it is impossible to detect without querying all combinations of all columns.
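To give a sense of why checking all combinations is impractical, here is a rough count of the multi-column queries that would be needed. It is only a sketch: it assumes the 10000-column row used in the reproduction steps, and the function name combinations_to_check is made up for illustration.

```python
import math

NUM_COLUMNS = 10000  # assumed: the column pool used in the reproduction steps


def combinations_to_check(num_columns, max_query_size):
    """Count the distinct multi-column queries of size 2..max_query_size."""
    return sum(math.comb(num_columns, k) for k in range(2, max_query_size + 1))


if __name__ == "__main__":
    # Column pairs alone already number in the tens of millions.
    print(combinations_to_check(NUM_COLUMNS, 2))
    # Allowing queries of up to four columns is far larger still.
    print(combinations_to_check(NUM_COLUMNS, 4))
```

Even limiting the check to pairs of columns means roughly fifty million queries per row.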

To reproduce:

1. Unzip a distribution of Cassandra and create a test.test column family.
2. In a loop, alternate between updating row 'a' and updating a random row.
   Write a random value to four random columns (out of 10000). Keep track
   of all columns set in row 'a'.
3. On each pass through the loop, query four random columns (out of 10000)
   from row 'a'. If a column that is known to be set comes back null, print
   the columns that were requested in that query.
4. The DB is now corrupt: it returns the column when queried by itself but
   returns null when the column is queried alongside the columns that
   triggered the error. This is a permanent condition.
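
The loop in steps 2 and 3 could be sketched roughly as follows. This is not the Groovy script mentioned in this report, only an illustration of the procedure. InMemoryStore is a hypothetical stand-in with made-up write/read methods; to exercise a real cluster it would have to be replaced with an actual Cassandra client. The column and row naming (col_N, row_N) is also assumed.

```python
import random

NUM_COLUMNS = 10000  # size of the column pool, from the steps above
COLS_PER_OP = 4      # columns written or read per operation


class InMemoryStore:
    """Hypothetical stand-in for a Cassandra client; it is always consistent."""

    def __init__(self):
        self.rows = {}

    def write(self, row_key, columns):
        self.rows.setdefault(row_key, {}).update(columns)

    def read(self, row_key, column_names):
        row = self.rows.get(row_key, {})
        return {name: row.get(name) for name in column_names}


def random_columns(rng):
    """Pick COLS_PER_OP distinct column names out of NUM_COLUMNS."""
    return ["col_%d" % i for i in rng.sample(range(NUM_COLUMNS), COLS_PER_OP)]


def run(store, iterations, seed=0):
    rng = random.Random(seed)
    known = {}     # every column written to row 'a', with its latest value
    failures = []  # (missing column, columns requested in that query)
    for i in range(iterations):
        # Step 2: alternate between row 'a' and a random row.
        row_key = "a" if i % 2 == 0 else "row_%d" % rng.randrange(1000000)
        values = {name: str(rng.random()) for name in random_columns(rng)}
        store.write(row_key, values)
        if row_key == "a":
            known.update(values)
        # Step 3: multi-column read from row 'a'.
        requested = random_columns(rng)
        result = store.read("a", requested)
        for name in requested:
            if name in known and result.get(name) is None:
                failures.append((name, requested))
    return failures


if __name__ == "__main__":
    for column, requested in run(InMemoryStore(), 1000):
        print("MISSING", column, "when querying", requested)
```

Against the in-memory stand-in the loop reports nothing, since that store never loses a column. Any line printed against a real store would identify a column and the query that, per step 4, should keep returning null for it.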


Observations: The bug manifests only directly after a high-traffic column family flush appears in the log. This is a correlation based solely on watching the log. There are no errors or warnings of any kind.

Workaround: Any multi-column read is potentially invalid, and corruption is virtually undetectable. The only workaround is to never write or read more than a single column in a query.

I have a simple Groovy script that can trigger the error. I have verified the behavior on Cassandra versions as old as 0.8.1.

