Posted to commits@cassandra.apache.org by "Benedict (JIRA)" <ji...@apache.org> on 2015/08/27 10:15:46 UTC

[jira] [Comment Edited] (CASSANDRA-10109) Windows dtest 3.0: ttl_test.py failures

    [ https://issues.apache.org/jira/browse/CASSANDRA-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14716257#comment-14716257 ] 

Benedict edited comment on CASSANDRA-10109 at 8/27/15 8:14 AM:
---------------------------------------------------------------

So, thinking about it, all we really want to do is ensure that clients don't see temporary files (i.e. incomplete files, or files that we may yet abort). On startup we don't have to worry about this; we shouldn't ever have to retry on startup, since the state on disk will not be changing, and we should avoid any need to retry on reads. Retries worry me. So, I propose the following:

{noformat}
Online listings:
- List data files
- List txn logs (must come after listing the data files, to ensure we have seen all txn logs covering those files; this step should only be done if SecureDirectoryStream is not supported)
- Read txn logs
- If the commit/abort record is present, just apply that; don't worry about missing files, since we're actively mutating the state and it should be expected that some may be involved in later transactions
- Otherwise, if all tracked files are present (or only the last entry is missing, but is NEW), the txn is in progress (so treat it as aborted)
- If some files are missing (note, here we should check disk rather than our in-memory listing IFF the listing does not contain a file), that implies the transaction has since completed
-- re-read the transaction and look for its current state
-- if the txn log is missing, we can safely do nothing
-- if the last record is now present, apply the commit/abort logic above
-- if none of these hold we must have a bug, so throw an exception
{noformat}
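
The decision steps above can be sketched roughly as follows. This is only an illustration in Python, not the Cassandra implementation: the names TxnLog, apply_outcome, temporary_files, exists_on_disk and reread_log are hypothetical, and it assumes that "applying" a commit means hiding the OLD files while applying an abort means hiding the NEW ones.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical record kinds; the real txn log format is not described here.
NEW, OLD, COMMIT, ABORT = "NEW", "OLD", "COMMIT", "ABORT"


@dataclass
class TxnLog:
    records: List[Tuple[str, str]]   # (kind, file name) entries, in log order
    outcome: Optional[str] = None    # COMMIT, ABORT, or None if not yet written


def apply_outcome(log: TxnLog, outcome: str) -> Set[str]:
    # Assumption: a commit obsoletes OLD files, an abort obsoletes NEW files.
    hidden_kind = OLD if outcome == COMMIT else NEW
    return {name for kind, name in log.records if kind == hidden_kind}


def temporary_files(log: TxnLog,
                    listing: Set[str],
                    exists_on_disk: Callable[[str], bool],
                    reread_log: Callable[[], Optional[TxnLog]]) -> Set[str]:
    """Return the files that an online listing should hide from clients."""
    # Commit/abort record present: just apply it, ignoring missing files.
    if log.outcome is not None:
        return apply_outcome(log, log.outcome)

    # Consult the disk only for files the in-memory listing does not contain.
    missing = [i for i, (_, name) in enumerate(log.records)
               if name not in listing and not exists_on_disk(name)]
    last = len(log.records) - 1
    only_last_new_missing = missing == [last] and log.records[last][0] == NEW

    # Everything tracked is present: txn still in progress, treat as aborted.
    if not missing or only_last_new_missing:
        return apply_outcome(log, ABORT)

    # Files are missing, so the txn should have completed since we read it.
    current = reread_log()
    if current is None:
        return set()                 # log already removed: nothing to do
    if current.outcome is not None:
        return apply_outcome(current, current.outcome)
    raise RuntimeError("txn log has missing files but no commit/abort record")
```

The disk check and the log re-read are passed in as callables so that the decision logic stays separate from the file system access; a caller would subtract the returned set from its data file listing.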

We can do this because we require that clients safely cope with missing files however we perform listings, since the files are actively being mutated and can disappear at any time, whether new or old.

On startup, however, our current logic is fine. But we don't need to retry; we should just fail if we encounter an unrecoverable exception. This should simplify things.


was (Author: benedict):
So, thinking about it, all we really want to do is ensure that clients don't see temporary files (i.e. incomplete files, or files that we may yet abort). On startup we don't have to worry about this; we shouldn't ever have to retry on startup, since the state on disk will not be changing, and we should avoid any necessity on reads. Retries worry me. So, I propose the following:

{noformat}
Online listings:
- List data files
- List txn logs  (must be after to ensure we have seen all txn logs covering the files - this step should only be done if not SecureDirectoryStream)
- Read txn logs
- If the commit/abort record is present, just apply that; don't worry about missing files, since we're actively mutating the state and it should be expected that some may be involved in later transactions
- otherwise, if all tracked files are present, txn is in progress (so treat as aborted)
- If some files are missing (note, here we should check disk rather than our in memory listing IFF the listing does not contain a file), that implies the transaction has since completed
-- re-read the transaction and look for its current state
-- if the txn log is missing, we can safely do nothing
-- if the last record is now present, apply the logic
-- if neither of these two hold we must have a bug, so throw an exception
{noformat}

We can do this because we require that clients safely cope with missing files however we perform listings, since they're actively being mutated and can disappear at any time; new or old.

On startup, however, our current logic is fine. But we don't need to retry; we should just fail if we encounter an unrecoverable exception. This should simplify things.

> Windows dtest 3.0: ttl_test.py failures
> ---------------------------------------
>
>                 Key: CASSANDRA-10109
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10109
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Joshua McKenzie
>            Assignee: Stefania
>              Labels: Windows
>             Fix For: 3.0.0 rc1
>
>
> ttl_test.py:TestTTL.update_column_ttl_with_default_ttl_test2
> ttl_test.py:TestTTL.update_multiple_columns_ttl_test
> ttl_test.py:TestTTL.update_single_column_ttl_test
> Errors locally are different from those on CI yesterday. Yesterday on CI we had timeouts and general node hangs. Today, on all 3 tests when run locally, I see:
> {noformat}
> Traceback (most recent call last):
>   File "c:\src\cassandra-dtest\dtest.py", line 532, in tearDown
>     raise AssertionError('Unexpected error in %s node log: %s' % (node.name, errors))
> AssertionError: Unexpected error in node1 node log: ['ERROR [main] 2015-08-17 16:53:43,120 NoSpamLogger.java:97 - This platform does not support atomic directory streams (SecureDirectoryStream); race conditions when loading sstable files could occurr']
> {noformat}
> This traces back to the commit for CASSANDRA-7066 today by [~Stefania] and [~benedict].  Stefania - care to take this ticket and also look further into whether or not we're going to have issues with 7066 on Windows? That error message certainly *sounds* like it's not a good thing.


