You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@cassandra.apache.org by "Stu Hood (JIRA)" <ji...@apache.org> on 2011/04/25 03:22:05 UTC

[jira] [Created] (CASSANDRA-2552) ReadResponseResolver Race

ReadResponseResolver Race
-------------------------

                 Key: CASSANDRA-2552
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Stu Hood
            Assignee: Stu Hood
             Fix For: 0.8.0


When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
# Add new response
# Check response set size
# Check that data is present

I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case where the condition is not properly triggered:
{noformat}
INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
{noformat}
The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026210#comment-13026210 ] 

Stu Hood commented on CASSANDRA-2552:
-------------------------------------

+1

I still think a Map is overkill here, but I can't reproduce a race with the v2 algorithm.

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.1
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7.6
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, 2552-v2-07.txt, 2552-v2.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2552:
--------------------------------------

    Affects Version/s: 0.7.1
        Fix Version/s:     (was: 0.8.0)
                       0.7.6

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.1
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7.6
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, 2552-v2-07.txt, 2552-v2.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025878#comment-13025878 ] 

Jonathan Ellis commented on CASSANDRA-2552:
-------------------------------------------

I'm not sure I'm a fan of the ARR solution.  Wouldn't it be similar complexity (one O(N) operation per message received) to keep NBHM and implement getMessageCount as an iterate-entries operation?  (the O(N) op in ARR is of course the search-for-free-slot in append.)

I'm -0 on changing RSRR away from LBQ when LBQ is known to work fine in practice.

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025102#comment-13025102 ] 

Stu Hood commented on CASSANDRA-2552:
-------------------------------------

Chris pointed out that the example passes if you replace NBHM with CHM, but I don't think NBHM is necessarily to blame here: each thread views a locally consistent copy, likely due to Cliff's use of sun.misc.Unsafe references.

It's possible that a similar race applies to RangeSliceResponseResolver, but I think changes to LBQ (like CHM) will be broadcast to all threads.

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2552:
--------------------------------------

    Attachment:     (was: 2552-v2.txt)

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025518#comment-13025518 ] 

Jonathan Ellis commented on CASSANDRA-2552:
-------------------------------------------

You're right: size() is implemented as a org.cliffc.high_scale_lib.Counter object, which says

{code}
  // Add the given value to current counter value.  Concurrent updates will
  // not be lost, but addAndGet or getAndAdd are not implemented because but
  // the total counter value is not atomically updated.
  //public void add( long x );

...

  // Current value of the counter.  Since other threads are updating furiously
  // the value is only approximate, but it includes all counts made by the
  // current thread.  Requires a pass over all the striped counters.
  //public long get();
{code}

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-2552:
--------------------------------

    Description: 
When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
# Add new response
# Check response set size
# Check that data is present

I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
{noformat}
INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
{noformat}
The last line shows that both results were present, and that one of them was holding data.

  was:
When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
# Add new response
# Check response set size
# Check that data is present

I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case where the condition is not properly triggered:
{noformat}
INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
{noformat}
The last line shows that both results were present, and that one of them was holding data.


> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025490#comment-13025490 ] 

Stu Hood commented on CASSANDRA-2552:
-------------------------------------

> I.e., for at least one thread, both update operations will have completed when the iterator is created
Note that the race I observed via the debug output for that test was actually on the size() operation, which doesn't put any such guarantees in its javadocs.

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026000#comment-13026000 ] 

Jonathan Ellis commented on CASSANDRA-2552:
-------------------------------------------

the correctness criterion is that once the messages are received, at least one thread running response will see that both blockfor and data are satisfied; this meets that need.

note that received(-messages-that-count-towards-blockfor) is NOT the same as size (see: DRC) so you need a separate variable anyway even with ARR.

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.1
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7.6
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, 2552-v2-07.txt, 2552-v2.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025867#comment-13025867 ] 

Jonathan Ellis commented on CASSANDRA-2552:
-------------------------------------------

bq. is there a reason RSRR can't inherit ARR or does it just predate that refactoring?

To answer my own ARR assumes we're returning Rows, which would be easy to fix, and that Messages turn into ReadResponse objects, which would be harder since we'd need to have a <T extends ISerializable> interface where ISerializeable gave us a Serializer class declaring "void serialize(T, outputstream) and T deserialize(inputstream)", i.e., we start to get into fixing ICompactSerializer and all the mess that would be.

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024787#comment-13024787 ] 

Jonathan Ellis commented on CASSANDRA-2552:
-------------------------------------------

Hard to tell exactly what's going on here w/o knowing where your logging was added.

In particular it's important to note that we don't prevent responses from being processed after we've already given up and decided to call a timeout (but before we've torn down the request callback).

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025882#comment-13025882 ] 

Jonathan Ellis commented on CASSANDRA-2552:
-------------------------------------------

bq. Wouldn't it be similar complexity (one O(N) operation per message received) to keep NBHM and implement getMessageCount as an iterate-entries operation?

We can actually do size-by-iteration for basically free (with a little refactoring), since we're already iterating for isDataPresent. We can just push the iteration into the callers who care about size-and-data-present and do it with one loop.

But if I am honest that is premature optimization.  We are already using the AtomicInteger approach in DatacenterReadCallback.  I'll submit a patch to standardize on that.


> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025847#comment-13025847 ] 

Jonathan Ellis commented on CASSANDRA-2552:
-------------------------------------------

is there a reason RSRR can't inherit ARR or does it just predate that refactoring?

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-2552:
--------------------------------

    Attachment: 0001-Move-Resolvers-to-atomic-append-count.txt

Attaching a patch that replaces NBHM with an AtomicReferenceArray that is appended to and counted atomically. This patch eliminated the timeouts we were seeing.

CHM may also be a legitimate solution, but it feels a bit like an abuse of a map.

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025337#comment-13025337 ] 

Jonathan Ellis commented on CASSANDRA-2552:
-------------------------------------------

That sure sounds like a NBHM bug to me. The javadoc says,

bq. Retrievals reflect the results of the most recently completed update operations holding upon their onset... Similarly, Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration.


> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025692#comment-13025692 ] 

Sylvain Lebresne commented on CASSANDRA-2552:
---------------------------------------------

I am no expert of the Java Memory Model, but I can't find anything that preclude this behavior in the CHM docs either (there really is not much on the size function). So I would have liked the CHM solution if we could be sure it always fix that problem (I would have liked it because it was a one line change and I think maps are here to be "abused"), but as far as I can tell, it may well only make the bug much less frequent or fix it only on some architecture (the code of CHM seems to indicate it is safe but it's complicated enough that I wouldn't bet my life on it).

Note that if that's true, LBQ too could well allow for a race here without breaking it's specification (it seems to use a AtomicInteger for the size internally so it is trivially ok, but if the spec doesn't force anything, I suppose that could change).

So I suppose if we want to do right by the spec, we should probably update both AbstractRowResolver and RangeSliceResponseResolver (note that using an AtomicInteger to count the number of responses could be slightly simpler, but I'm fine with an AtomicReferenceArray). 

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025086#comment-13025086 ] 

Stu Hood edited comment on CASSANDRA-2552 at 4/26/11 6:19 AM:
--------------------------------------------------------------

Here is a cut down testcase that reproduces the race: it looks like two threads can race on step 2 such that neither accounts for the item added by the other, and both think the set of responses is too small.

I have a patch that makes append + size an atomic operation: I'll post it as soon as I clean it up a bit.

      was (Author: stuhood):
    Here is a cut down testcase that reproduces the race: it looks like two threads can race on step 2 such that they neither accounts for the item added by the other, and both think the set of responses is too small.

I have a patch that makes append + size an atomic operation: I'll post it as soon as I clean it up a bit.
  
> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-2552:
--------------------------------

    Attachment: ResolveRaceTest.java

Here is a cut down testcase that reproduces the race: it looks like two threads can race on step 2 such that they neither accounts for the item added by the other, and both think the set of responses is too small.

I have a patch that makes append + size an atomic operation: I'll post it as soon as I clean it up a bit.

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stu Hood updated CASSANDRA-2552:
--------------------------------

    Reviewer: stuhood
    Assignee: Jonathan Ellis  (was: Stu Hood)

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.1
>            Reporter: Stu Hood
>            Assignee: Jonathan Ellis
>             Fix For: 0.7.6
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, 2552-v2-07.txt, 2552-v2.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025337#comment-13025337 ] 

Jonathan Ellis edited comment on CASSANDRA-2552 at 4/26/11 5:22 PM:
--------------------------------------------------------------------

That sure sounds like a NBHM bug to me. The javadoc says,

bq. Retrievals reflect the results of the most recently completed update operations holding upon their onset... Similarly, Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration.

I.e., for at least one thread, *both* update operations will have completed when the iterator is created, so it should see all the entries.

(Will review the actual patch shortly, I'm just saying I think we should report a bug too.)

      was (Author: jbellis):
    That sure sounds like a NBHM bug to me. The javadoc says,

bq. Retrievals reflect the results of the most recently completed update operations holding upon their onset... Similarly, Iterators and Enumerations return elements reflecting the state of the hash table at some point at or since the creation of the iterator/enumeration.

  
> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2552:
--------------------------------------

    Attachment: 2552-v2.txt

v2 w/ AtomicInteger approach

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, 2552-v2.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026306#comment-13026306 ] 

Hudson commented on CASSANDRA-2552:
-----------------------------------

Integrated in Cassandra-0.7 #459 (See [https://builds.apache.org/hudson/job/Cassandra-0.7/459/])
    fix incorrect use ofNBHM.size in ReadCallback
patch by jbellis; reviewed by stuhood for CASSANDRA-2552


> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.1
>            Reporter: Stu Hood
>            Assignee: Jonathan Ellis
>             Fix For: 0.7.6, 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, 2552-v2-07.txt, 2552-v2.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2552:
--------------------------------------

    Attachment: 2552-v2-07.txt

and a v2 for 0.7

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, 2552-v2-07.txt, 2552-v2.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025995#comment-13025995 ] 

Stu Hood commented on CASSANDRA-2552:
-------------------------------------

Although I didn't reproduce a race between size() and isDataPresent(), isn't that one still possible? IMO, two operations that are atomic independently shouldn't be trusted to compose. The point of the ARR wasn't to improve runtime, it was simply to make all three steps atomic.

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.1
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.7.6
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, 2552-v2-07.txt, 2552-v2.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Stu Hood (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024705#comment-13024705 ] 

Stu Hood commented on CASSANDRA-2552:
-------------------------------------

I have a patch ready that I believe fixes this: testing it out before posting.

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CASSANDRA-2552) ReadResponseResolver Race

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CASSANDRA-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2552:
--------------------------------------

    Attachment: 2552-v2.txt

updated v2 fixes AsyncRepairCallback and RepairCallback as well

> ReadResponseResolver Race
> -------------------------
>
>                 Key: CASSANDRA-2552
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2552
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Stu Hood
>            Assignee: Stu Hood
>             Fix For: 0.8.0
>
>         Attachments: 0001-Move-Resolvers-to-atomic-append-count.txt, 2552-v2.txt, ResolveRaceTest.java
>
>
> When receiving a response, ReadResponseResolver uses a 3 step process to decide whether to trigger the condition that enough responses have arrived:
> # Add new response
> # Check response set size
> # Check that data is present
> I think that these steps must have been reordered by the compiler in some cases, because I was able to reproduce a case for a QUORUM read where the condition is not properly triggered:
> {noformat}
> INFO [RequestResponseStage:15] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=false in 2 messages
> INFO [RequestResponseStage:8] 2011-04-25 00:26:53,514 ReadResponseResolver.java (line 87) post append for 1087367065: hasData=true in 1 messages
> INFO [pool-1-thread-54] 2011-04-25 00:27:03,516 StorageProxy.java (line 623) Read timeout: java.util.concurrent.TimeoutException: ReadResponseResolver@1087367065(/10.34.131.109=false,/10.34.132.122=true,)
> {noformat}
> The last line shows that both results were present, and that one of them was holding data.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira