You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@qpid.apache.org by Alan Conway <ac...@redhat.com> on 2013/01/29 21:01:26 UTC

Review Request: HA Fix race condition in rejecting connections.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/
-----------------------------------------------------------

Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.


Description
-------

Sporadic failure of test_failover_python was caused by a race in rejecting
connections. There was a very small window where work could be done by a
connection after it was rejected.


Diffs
-----

  /trunk/qpid/cpp/src/qpid/ha/BackupConnectionExcluder.h 1439431 
  /trunk/qpid/cpp/src/qpid/ha/ConnectionObserver.h 1439431 
  /trunk/qpid/cpp/src/qpid/ha/ConnectionObserver.cpp 1439431 
  /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 

Diff: https://reviews.apache.org/r/9137/diff/


Testing
-------


Thanks,

Alan Conway


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Alan Conway <ac...@redhat.com>.

> On Jan. 29, 2013, 9:59 p.m., Andrew Stitcher wrote:
> > /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp, line 99
> > <https://reviews.apache.org/r/9137/diff/2/?file=253082#file253082line99>
> >
> >     I don't think this should be protected by !readError as the read and write sides are independent.
> >     
> >     [Although it may well be impossible to get an abort if we've previously got a readError]

Done, now reads:

    if (!readError) {
        aio->requestCallback(boost::bind(&AsynchIOHandler::eof, this, _1));
    }
    aio->queueWriteClose();


- Alan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15809
-----------------------------------------------------------


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Andrew Stitcher <as...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15809
-----------------------------------------------------------



/trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp
<https://reviews.apache.org/r/9137/#comment34053>

    I don't think this should be protected by !readError as the read and write sides are independent.
    
    [Although it may well be impossible to get an abort if we've previously got a readError]


- Andrew Stitcher


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Alan Conway <ac...@redhat.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/
-----------------------------------------------------------

(Updated Jan. 30, 2013, 6:40 p.m.)


Review request for qpid and Andrew Stitcher.


Description
-------

HA Fix race condition in rejecting connections.

Sporadic failure of test_failover_python was caused by a race in rejecting
connections. There was a very small window where work could be done by a
connection after it was rejected.


Diffs
-----

  /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 

Diff: https://reviews.apache.org/r/9137/diff/


Testing
-------

make check, manual heartbeat test


Thanks,

Alan Conway


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Alan Conway <ac...@redhat.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/
-----------------------------------------------------------

(Updated Jan. 30, 2013, 6:40 p.m.)


Review request for qpid and Andrew Stitcher.


Changes
-------

Put the correct Andrew in the people list.


Description
-------

HA Fix race condition in rejecting connections.

Sporadic failure of test_failover_python was caused by a race in rejecting
connections. There was a very small window where work could be done by a
connection after it was rejected.


Diffs
-----

  /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 

Diff: https://reviews.apache.org/r/9137/diff/


Testing
-------

make check, manual heartbeat test


Thanks,

Alan Conway


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Alan Conway <ac...@redhat.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/
-----------------------------------------------------------

(Updated Jan. 30, 2013, 6:39 p.m.)


Review request for qpid and Andrew Stitcher.


Description
-------

HA Fix race condition in rejecting connections.

Sporadic failure of test_failover_python was caused by a race in rejecting
connections. There was a very small window where work could be done by a
connection after it was rejected.


Diffs
-----

  /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 

Diff: https://reviews.apache.org/r/9137/diff/


Testing
-------

make check, manual heartbeat test


Thanks,

Alan Conway


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Andrew Stitcher <as...@apache.org>.

> On Jan. 30, 2013, 12:28 p.m., Gordon Sim wrote:
> > Is abort() the right call to make? What about close()? It looks to me like the 'purpose' of abort is to trigger a simulated eof call on the connection processing thread from e.g. the heartbeat timer thread. Since in this case we are already on the connection processing thread, why would close() not do the job (this is not the same as issuing a clean connection.close sequence I don't believe).
> > 
> > One thing to remember with any change to the IO code is that different 'transports' (ssl, rdma) and platforms (windows) may involve different codepaths.
> 
> Alan Conway wrote:
>     Clients do not fail-over if the connection is closed politely, which is the objective here.
> 
> Gordon Sim wrote:
>     Understood and to be clear I'm not suggesting what I would call a 'polite' close (i.e. connection-close; connection-close-ok handshake). However calling qpid::amqp_0_10::Connection::close() should result in the IO layer detecting that the upper layer wants to close and doing so (see qpid::sys::AsynchIOHandler::idle() for example).
>     
>     The behaviour of these calls is not well defined certainly. My only concern is we seem to be modifying what abort() is supposed to do and at least from a skim of the code it seems close() might do what is required (certainly it should result in aio->queueWriteClose() being called in AsynchIOHandler).
> 
> Andrew Stitcher wrote:
>     It is currently true on trunk that that ssl and tcp share the code paths at this level but rdma does not have abort() implemented at all.
>     
>     I tend to agree that overloading abort() is not entirely the correct semantic although does seem pretty close - abort() says "abort this connection; I think it has failed but is still connected for some reason". The correct semantic here could be "close this connection with error".
>     
>     I wonder if it is possible to hook into the authentication in a similar way to ACL to reject the connection there with a failure code.
> 
> Alan Conway wrote:
>     I'd argue that this fixes a bug in abort() - you shouldn't be able to write to an aborted connection. It all works nicely now, unless there's a compelling reason to take a different approach I'd like to leave it as-is. Semantically it seems fine, abort() means terminate the connection abruptly without the close handshake, which is exactly what the HA code wants to do.
> 
> Gordon Sim wrote:
>     I would argue that at present a key part of the role of abort() in practice is to be called on another thread, hence the requesting of a callback to eof on the IO processing thread which is likely the source of the original problem (i.e. abort returning doesn't mean the eof has actually been handled). Is this change safe if called from a thread other than the connections IO thread? It seems to me that the right fix is to provide an 'inline' way of doing the same thing as abort(). (I'm suggesting calling qpid::amqp_0_10::Connection::close() _may_ be that).

It is safe to call queueWriteClosed() in any thread. So I think this change is:

1. Safe.
2. Correct, in that a more robust semantic for abort() is indeed to not transmit anything further on this connection (even if the original case for abort() did not require this)

I'm saying this from the POV of the IO code though, not necessarily the application logic.


- Andrew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15837
-----------------------------------------------------------


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Gordon Sim <gs...@redhat.com>.

> On Jan. 30, 2013, 12:28 p.m., Gordon Sim wrote:
> > Is abort() the right call to make? What about close()? It looks to me like the 'purpose' of abort is to trigger a simulated eof call on the connection processing thread from e.g. the heartbeat timer thread. Since in this case we are already on the connection processing thread, why would close() not do the job (this is not the same as issuing a clean connection.close sequence I don't believe).
> > 
> > One thing to remember with any change to the IO code is that different 'transports' (ssl, rdma) and platforms (windows) may involve different codepaths.
> 
> Alan Conway wrote:
>     Clients do not fail-over if the connection is closed politely, which is the objective here.
> 
> Gordon Sim wrote:
>     Understood and to be clear I'm not suggesting what I would call a 'polite' close (i.e. connection-close; connection-close-ok handshake). However calling qpid::amqp_0_10::Connection::close() should result in the IO layer detecting that the upper layer wants to close and doing so (see qpid::sys::AsynchIOHandler::idle() for example).
>     
>     The behaviour of these calls is not well defined certainly. My only concern is we seem to be modifying what abort() is supposed to do and at least from a skim of the code it seems close() might do what is required (certainly it should result in aio->queueWriteClose() being called in AsynchIOHandler).
> 
> Andrew Stitcher wrote:
>     It is currently true on trunk that that ssl and tcp share the code paths at this level but rdma does not have abort() implemented at all.
>     
>     I tend to agree that overloading abort() is not entirely the correct semantic although does seem pretty close - abort() says "abort this connection; I think it has failed but is still connected for some reason". The correct semantic here could be "close this connection with error".
>     
>     I wonder if it is possible to hook into the authentication in a similar way to ACL to reject the connection there with a failure code.
> 
> Alan Conway wrote:
>     I'd argue that this fixes a bug in abort() - you shouldn't be able to write to an aborted connection. It all works nicely now, unless there's a compelling reason to take a different approach I'd like to leave it as-is. Semantically it seems fine, abort() means terminate the connection abruptly without the close handshake, which is exactly what the HA code wants to do.
> 
> Gordon Sim wrote:
>     I would argue that at present a key part of the role of abort() in practice is to be called on another thread, hence the requesting of a callback to eof on the IO processing thread which is likely the source of the original problem (i.e. abort returning doesn't mean the eof has actually been handled). Is this change safe if called from a thread other than the connections IO thread? It seems to me that the right fix is to provide an 'inline' way of doing the same thing as abort(). (I'm suggesting calling qpid::amqp_0_10::Connection::close() _may_ be that).
> 
> Andrew Stitcher wrote:
>     It is safe to call queueWriteClosed() in any thread. So I think this change is:
>     
>     1. Safe.
>     2. Correct, in that a more robust semantic for abort() is indeed to not transmit anything further on this connection (even if the original case for abort() did not require this)
>     
>     I'm saying this from the POV of the IO code though, not necessarily the application logic.

I think the change is odd. Requesting a callback to eof if we are already on the IO thread seems an odd thing to do. I think it would be clearer to distinguish between closing a connection abruptly from the IO thread (the new use case), and requesting an eof to be simulated on the IO thread from some other thread (the original use case). Partly this is down to abort being a bad name for the original use case in my view.

However, if you are sure its safe (baring in mind there are no tests for heartbeats, concurrent shutdown by both ends etc), then I don't object to it being committed.


- Gordon


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15837
-----------------------------------------------------------


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Gordon Sim <gs...@redhat.com>.

> On Jan. 30, 2013, 12:28 p.m., Gordon Sim wrote:
> > Is abort() the right call to make? What about close()? It looks to me like the 'purpose' of abort is to trigger a simulated eof call on the connection processing thread from e.g. the heartbeat timer thread. Since in this case we are already on the connection processing thread, why would close() not do the job (this is not the same as issuing a clean connection.close sequence I don't believe).
> > 
> > One thing to remember with any change to the IO code is that different 'transports' (ssl, rdma) and platforms (windows) may involve different codepaths.
> 
> Alan Conway wrote:
>     Clients do not fail-over if the connection is closed politely, which is the objective here.
> 
> Gordon Sim wrote:
>     Understood and to be clear I'm not suggesting what I would call a 'polite' close (i.e. connection-close; connection-close-ok handshake). However calling qpid::amqp_0_10::Connection::close() should result in the IO layer detecting that the upper layer wants to close and doing so (see qpid::sys::AsynchIOHandler::idle() for example).
>     
>     The behaviour of these calls is not well defined certainly. My only concern is we seem to be modifying what abort() is supposed to do and at least from a skim of the code it seems close() might do what is required (certainly it should result in aio->queueWriteClose() being called in AsynchIOHandler).
> 
> Andrew Stitcher wrote:
>     It is currently true on trunk that that ssl and tcp share the code paths at this level but rdma does not have abort() implemented at all.
>     
>     I tend to agree that overloading abort() is not entirely the correct semantic although does seem pretty close - abort() says "abort this connection; I think it has failed but is still connected for some reason". The correct semantic here could be "close this connection with error".
>     
>     I wonder if it is possible to hook into the authentication in a similar way to ACL to reject the connection there with a failure code.
> 
> Alan Conway wrote:
>     I'd argue that this fixes a bug in abort() - you shouldn't be able to write to an aborted connection. It all works nicely now, unless there's a compelling reason to take a different approach I'd like to leave it as-is. Semantically it seems fine, abort() means terminate the connection abruptly without the close handshake, which is exactly what the HA code wants to do.

I would argue that at present a key part of the role of abort() in practice is to be called on another thread, hence the requesting of a callback to eof on the IO processing thread which is likely the source of the original problem (i.e. abort returning doesn't mean the eof has actually been handled). Is this change safe if called from a thread other than the connections IO thread? It seems to me that the right fix is to provide an 'inline' way of doing the same thing as abort(). (I'm suggesting calling qpid::amqp_0_10::Connection::close() _may_ be that).


- Gordon


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15837
-----------------------------------------------------------


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Gordon Sim <gs...@redhat.com>.

> On Jan. 30, 2013, 12:28 p.m., Gordon Sim wrote:
> > Is abort() the right call to make? What about close()? It looks to me like the 'purpose' of abort is to trigger a simulated eof call on the connection processing thread from e.g. the heartbeat timer thread. Since in this case we are already on the connection processing thread, why would close() not do the job (this is not the same as issuing a clean connection.close sequence I don't believe).
> > 
> > One thing to remember with any change to the IO code is that different 'transports' (ssl, rdma) and platforms (windows) may involve different codepaths.
> 
> Alan Conway wrote:
>     Clients do not fail-over if the connection is closed politely, which is the objective here.

Understood and to be clear I'm not suggesting what I would call a 'polite' close (i.e. connection-close; connection-close-ok handshake). However calling qpid::amqp_0_10::Connection::close() should result in the IO layer detecting that the upper layer wants to close and doing so (see qpid::sys::AsynchIOHandler::idle() for example).

The behaviour of these calls is not well defined certainly. My only concern is we seem to be modifying what abort() is supposed to do and at least from a skim of the code it seems close() might do what is required (certainly it should result in aio->queueWriteClose() being called in AsynchIOHandler).


- Gordon


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15837
-----------------------------------------------------------


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Andrew Stitcher <as...@apache.org>.

> On Jan. 30, 2013, 12:28 p.m., Gordon Sim wrote:
> > Is abort() the right call to make? What about close()? It looks to me like the 'purpose' of abort is to trigger a simulated eof call on the connection processing thread from e.g. the heartbeat timer thread. Since in this case we are already on the connection processing thread, why would close() not do the job (this is not the same as issuing a clean connection.close sequence I don't believe).
> > 
> > One thing to remember with any change to the IO code is that different 'transports' (ssl, rdma) and platforms (windows) may involve different codepaths.
> 
> Alan Conway wrote:
>     Clients do not fail-over if the connection is closed politely, which is the objective here.
> 
> Gordon Sim wrote:
>     Understood and to be clear I'm not suggesting what I would call a 'polite' close (i.e. connection-close; connection-close-ok handshake). However calling qpid::amqp_0_10::Connection::close() should result in the IO layer detecting that the upper layer wants to close and doing so (see qpid::sys::AsynchIOHandler::idle() for example).
>     
>     The behaviour of these calls is not well defined certainly. My only concern is we seem to be modifying what abort() is supposed to do and at least from a skim of the code it seems close() might do what is required (certainly it should result in aio->queueWriteClose() being called in AsynchIOHandler).
> 
> Andrew Stitcher wrote:
>     It is currently true on trunk that that ssl and tcp share the code paths at this level but rdma does not have abort() implemented at all.
>     
>     I tend to agree that overloading abort() is not entirely the correct semantic although does seem pretty close - abort() says "abort this connection; I think it has failed but is still connected for some reason". The correct semantic here could be "close this connection with error".
>     
>     I wonder if it is possible to hook into the authentication in a similar way to ACL to reject the connection there with a failure code.
> 
> Alan Conway wrote:
>     I'd argue that this fixes a bug in abort() - you shouldn't be able to write to an aborted connection. It all works nicely now, unless there's a compelling reason to take a different approach I'd like to leave it as-is. Semantically it seems fine, abort() means terminate the connection abruptly without the close handshake, which is exactly what the HA code wants to do.
> 
> Gordon Sim wrote:
>     I would argue that at present a key part of the role of abort() in practice is to be called on another thread, hence the requesting of a callback to eof on the IO processing thread which is likely the source of the original problem (i.e. abort returning doesn't mean the eof has actually been handled). Is this change safe if called from a thread other than the connections IO thread? It seems to me that the right fix is to provide an 'inline' way of doing the same thing as abort(). (I'm suggesting calling qpid::amqp_0_10::Connection::close() _may_ be that).
> 
> Andrew Stitcher wrote:
>     It is safe to call queueWriteClosed() in any thread. So I think this change is:
>     
>     1. Safe.
>     2. Correct, in that a more robust semantic for abort() is indeed to not transmit anything further on this connection (even if the original case for abort() did not require this)
>     
>     I'm saying this from the POV of the IO code though, not necessarily the application logic.
> 
> Gordon Sim wrote:
>     I think the change is odd. Requesting a callback to eof if we are already on the IO thread seems an odd thing to do. I think it would be clearer to distinguish between closing a connection abruptly from the IO thread (the new use case), and requesting an eof to be simulated on the IO thread from some other thread (the original use case). Partly this is down to abort being a bad name for the original use case in my view.
>     
>     However, if you are sure its safe (baring in mind there are no tests for heartbeats, concurrent shutdown by both ends etc), then I don't object to it being committed.

We could add a new abortOnIOThread() call and rename the existing to abortArbitraryThread() (or some such names) but the latter subsumes the former after this change so why would we want to do that?

To me closing the connection abruptly has the same meaning irrespective of the thread it comes from. The fact that the eof machinery is reused "under the hood" is irrelevant to my mind the call is meant to abort the connection.


- Andrew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15837
-----------------------------------------------------------


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Alan Conway <ac...@redhat.com>.

> On Jan. 30, 2013, 12:28 p.m., Gordon Sim wrote:
> > Is abort() the right call to make? What about close()? It looks to me like the 'purpose' of abort is to trigger a simulated eof call on the connection processing thread from e.g. the heartbeat timer thread. Since in this case we are already on the connection processing thread, why would close() not do the job (this is not the same as issuing a clean connection.close sequence I don't believe).
> > 
> > One thing to remember with any change to the IO code is that different 'transports' (ssl, rdma) and platforms (windows) may involve different codepaths.

Clients do not fail-over if the connection is closed politely, which is the objective here.


- Alan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15837
-----------------------------------------------------------


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Andrew Stitcher <as...@apache.org>.

> On Jan. 30, 2013, 12:28 p.m., Gordon Sim wrote:
> > Is abort() the right call to make? What about close()? It looks to me like the 'purpose' of abort is to trigger a simulated eof call on the connection processing thread from e.g. the heartbeat timer thread. Since in this case we are already on the connection processing thread, why would close() not do the job (this is not the same as issuing a clean connection.close sequence I don't believe).
> > 
> > One thing to remember with any change to the IO code is that different 'transports' (ssl, rdma) and platforms (windows) may involve different codepaths.
> 
> Alan Conway wrote:
>     Clients do not fail-over if the connection is closed politely, which is the objective here.
> 
> Gordon Sim wrote:
>     Understood and to be clear I'm not suggesting what I would call a 'polite' close (i.e. connection-close; connection-close-ok handshake). However calling qpid::amqp_0_10::Connection::close() should result in the IO layer detecting that the upper layer wants to close and doing so (see qpid::sys::AsynchIOHandler::idle() for example).
>     
>     The behaviour of these calls is not well defined certainly. My only concern is we seem to be modifying what abort() is supposed to do and at least from a skim of the code it seems close() might do what is required (certainly it should result in aio->queueWriteClose() being called in AsynchIOHandler).

It is currently true on trunk that that ssl and tcp share the code paths at this level but rdma does not have abort() implemented at all.

I tend to agree that overloading abort() is not entirely the correct semantic although does seem pretty close - abort() says "abort this connection; I think it has failed but is still connected for some reason". The correct semantic here could be "close this connection with error".

I wonder if it is possible to hook into the authentication in a similar way to ACL to reject the connection there with a failure code.


- Andrew


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15837
-----------------------------------------------------------


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Gordon Sim <gs...@redhat.com>.

> On Jan. 30, 2013, 12:28 p.m., Gordon Sim wrote:
> > Is abort() the right call to make? What about close()? It looks to me like the 'purpose' of abort is to trigger a simulated eof call on the connection processing thread from e.g. the heartbeat timer thread. Since in this case we are already on the connection processing thread, why would close() not do the job (this is not the same as issuing a clean connection.close sequence I don't believe).
> > 
> > One thing to remember with any change to the IO code is that different 'transports' (ssl, rdma) and platforms (windows) may involve different codepaths.
> 
> Alan Conway wrote:
>     Clients do not fail-over if the connection is closed politely, which is the objective here.
> 
> Gordon Sim wrote:
>     Understood and to be clear I'm not suggesting what I would call a 'polite' close (i.e. connection-close; connection-close-ok handshake). However calling qpid::amqp_0_10::Connection::close() should result in the IO layer detecting that the upper layer wants to close and doing so (see qpid::sys::AsynchIOHandler::idle() for example).
>     
>     The behaviour of these calls is not well defined certainly. My only concern is we seem to be modifying what abort() is supposed to do and at least from a skim of the code it seems close() might do what is required (certainly it should result in aio->queueWriteClose() being called in AsynchIOHandler).
> 
> Andrew Stitcher wrote:
>     It is currently true on trunk that that ssl and tcp share the code paths at this level but rdma does not have abort() implemented at all.
>     
>     I tend to agree that overloading abort() is not entirely the correct semantic although does seem pretty close - abort() says "abort this connection; I think it has failed but is still connected for some reason". The correct semantic here could be "close this connection with error".
>     
>     I wonder if it is possible to hook into the authentication in a similar way to ACL to reject the connection there with a failure code.
> 
> Alan Conway wrote:
>     I'd argue that this fixes a bug in abort() - you shouldn't be able to write to an aborted connection. It all works nicely now, unless there's a compelling reason to take a different approach I'd like to leave it as-is. Semantically it seems fine, abort() means terminate the connection abruptly without the close handshake, which is exactly what the HA code wants to do.
> 
> Gordon Sim wrote:
>     I would argue that at present a key part of the role of abort() in practice is to be called on another thread, hence the requesting of a callback to eof on the IO processing thread which is likely the source of the original problem (i.e. abort returning doesn't mean the eof has actually been handled). Is this change safe if called from a thread other than the connections IO thread? It seems to me that the right fix is to provide an 'inline' way of doing the same thing as abort(). (I'm suggesting calling qpid::amqp_0_10::Connection::close() _may_ be that).
> 
> Andrew Stitcher wrote:
>     It is safe to call queueWriteClosed() in any thread. So I think this change is:
>     
>     1. Safe.
>     2. Correct, in that a more robust semantic for abort() is indeed to not transmit anything further on this connection (even if the original case for abort() did not require this)
>     
>     I'm saying this from the POV of the IO code though, not necessarily the application logic.
> 
> Gordon Sim wrote:
>     I think the change is odd. Requesting a callback to eof if we are already on the IO thread seems an odd thing to do. I think it would be clearer to distinguish between closing a connection abruptly from the IO thread (the new use case), and requesting an eof to be simulated on the IO thread from some other thread (the original use case). Partly this is down to abort being a bad name for the original use case in my view.
>     
>     However, if you are sure its safe (baring in mind there are no tests for heartbeats, concurrent shutdown by both ends etc), then I don't object to it being committed.
> 
> Andrew Stitcher wrote:
>     We could add a new abortOnIOThread() call and rename the existing to abortArbitraryThread() (or some such names) but the latter subsumes the former after this change so why would we want to do that?
>     
>     To me closing the connection abruptly has the same meaning irrespective of the thread it comes from. The fact that the eof machinery is reused "under the hood" is irrelevant to my mind the call is meant to abort the connection.
>

I think closing abruptly while processing IO is a different thing from requesting simulation of an eof by an arbitrary thread. The expectations are different in each case (as exemplified buy this new use case). If you were implementing the former you would not do it by requesting a callback. So, while this may work and be safe, it is not in my personal opinion particularly clear code. Then again this interface between the broker and the IO code is rather a mess in general, so if you are sure its safe, as I say I have no objection. (i.e. feel free to ship it).


- Gordon


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15837
-----------------------------------------------------------


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Alan Conway <ac...@redhat.com>.

> On Jan. 30, 2013, 12:28 p.m., Gordon Sim wrote:
> > Is abort() the right call to make? What about close()? It looks to me like the 'purpose' of abort is to trigger a simulated eof call on the connection processing thread from e.g. the heartbeat timer thread. Since in this case we are already on the connection processing thread, why would close() not do the job (this is not the same as issuing a clean connection.close sequence I don't believe).
> > 
> > One thing to remember with any change to the IO code is that different 'transports' (ssl, rdma) and platforms (windows) may involve different codepaths.
> 
> Alan Conway wrote:
>     Clients do not fail-over if the connection is closed politely, which is the objective here.
> 
> Gordon Sim wrote:
>     Understood and to be clear I'm not suggesting what I would call a 'polite' close (i.e. connection-close; connection-close-ok handshake). However calling qpid::amqp_0_10::Connection::close() should result in the IO layer detecting that the upper layer wants to close and doing so (see qpid::sys::AsynchIOHandler::idle() for example).
>     
>     The behaviour of these calls is not well defined certainly. My only concern is we seem to be modifying what abort() is supposed to do and at least from a skim of the code it seems close() might do what is required (certainly it should result in aio->queueWriteClose() being called in AsynchIOHandler).
> 
> Andrew Stitcher wrote:
>     It is currently true on trunk that that ssl and tcp share the code paths at this level but rdma does not have abort() implemented at all.
>     
>     I tend to agree that overloading abort() is not entirely the correct semantic although does seem pretty close - abort() says "abort this connection; I think it has failed but is still connected for some reason". The correct semantic here could be "close this connection with error".
>     
>     I wonder if it is possible to hook into the authentication in a similar way to ACL to reject the connection there with a failure code.

I'd argue that this fixes a bug in abort() - you shouldn't be able to write to an aborted connection. It all works nicely now, unless there's a compelling reason to take a different approach I'd like to leave it as-is. Semantically it seems fine, abort() means terminate the connection abruptly without the close handshake, which is exactly what the HA code wants to do.


- Alan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15837
-----------------------------------------------------------


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Gordon Sim <gs...@redhat.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15837
-----------------------------------------------------------


Is abort() the right call to make? What about close()? It looks to me like the 'purpose' of abort is to trigger a simulated eof call on the connection processing thread from e.g. the heartbeat timer thread. Since in this case we are already on the connection processing thread, why would close() not do the job (this is not the same as issuing a clean connection.close sequence I don't believe).

One thing to remember with any change to the IO code is that different 'transports' (ssl, rdma) and platforms (windows) may involve different codepaths.

- Gordon Sim


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Andrew Stitcher <as...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15855
-----------------------------------------------------------

Ship it!


I'm happy to have this code change in trunk, even if abort() is not the best way to solve the underlying problem (not saying it isn't).

- Andrew Stitcher


On Jan. 29, 2013, 9:49 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 9:49 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> HA Fix race condition in rejecting connections.
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> make check, manual heartbeat test
> 
> 
> Thanks,
> 
> Alan Conway
> 
>


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Alan Conway <ac...@redhat.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/
-----------------------------------------------------------

(Updated Jan. 29, 2013, 9:49 p.m.)


Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.


Changes
-------

Dropped exception approach, fixed abort() instead.


Description (updated)
-------

HA Fix race condition in rejecting connections.

Sporadic failure of test_failover_python was caused by a race in rejecting
connections. There was a very small window where work could be done by a
connection after it was rejected.


Diffs (updated)
-----

  /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 

Diff: https://reviews.apache.org/r/9137/diff/


Testing (updated)
-------

make check, manual heartbeat test


Thanks,

Alan Conway


Re: Review Request: HA Fix race condition in rejecting connections.

Posted by Andrew Stitcher <as...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9137/#review15803
-----------------------------------------------------------



/trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp
<https://reviews.apache.org/r/9137/#comment34013>

    Don't do this!
    
    If there's not an error don't allow the exception to propagate here in the first place.
    
    I don't want exceptions that aren't really exceptional ending up here, which is the import of this change.


- Andrew Stitcher


On Jan. 29, 2013, 8:01 p.m., Alan Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9137/
> -----------------------------------------------------------
> 
> (Updated Jan. 29, 2013, 8:01 p.m.)
> 
> 
> Review request for qpid and Andrew-Duplicate-Accct-Inactiv Konwinski.
> 
> 
> Description
> -------
> 
> Sporadic failure of test_failover_python was caused by a race in rejecting
> connections. There was a very small window where work could be done by a
> connection after it was rejected.
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/ha/BackupConnectionExcluder.h 1439431 
>   /trunk/qpid/cpp/src/qpid/ha/ConnectionObserver.h 1439431 
>   /trunk/qpid/cpp/src/qpid/ha/ConnectionObserver.cpp 1439431 
>   /trunk/qpid/cpp/src/qpid/sys/AsynchIOHandler.cpp 1439431 
> 
> Diff: https://reviews.apache.org/r/9137/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Alan Conway
> 
>