You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@qpid.apache.org by Chug Rolke <cr...@redhat.com> on 2013/11/19 20:53:49 UTC

Review Request 15677: Windows client error closing AMQP 1.0 connection

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15677/
-----------------------------------------------------------

Review request for qpid, Andrew Stitcher and Steve Huston.


Bugs: QPID-5363
    https://issues.apache.org/jira/browse/QPID-5363


Repository: qpid


Description
-------

Windows fails with an access violation about 10% of the time while closing AMQP 1.0 connections:

STACK
=====
>        qpidmessagingd.dll!qpid::messaging::amqp::TcpTransport::close()  Line 123 + 0x10 bytes        C++
         qpidmessagingd.dll!qpid::messaging::amqp::ConnectionContext::close()  Line 134 + 0x29 bytes        C++
         qpidmessagingd.dll!qpid::messaging::amqp::ConnectionHandle::close()  Line 62        C++
         qpidmessagingd.dll!qpid::messaging::Connection::close()  Line 78 + 0x24 bytes        C++
         hello_world.exe!main(int argc=4, char * * argv=0x0053a688)  Line 51 + 0xe bytes        C++
         hello_world.exe!__tmainCRTStartup()  Line 586 + 0x19 bytes        C
         hello_world.exe!mainCRTStartup()  Line 403        C

CODE
====
void TcpTransport::close()
{
    QPID_LOG(debug, id << " TcpTransport closing...");
    if (aio)
        aio->queueWriteClose();    <=======
}

FAILURE
=======
aio's vftable is all 0xdddddddd, indicating that it has been deleted.

THE FIX
=======
Use some locks to protect closing the aio object.

DISCUSSION
==========
The locks in the diff appear to work ok in that the code passes thousands of runs. Are they OK theoretically?

Include in 0.26 release
-----------------------
Please indicate your approval or not in QPID-5363.


Diffs
-----

  trunk/qpid/cpp/src/qpid/messaging/amqp/TcpTransport.h 1543532 
  trunk/qpid/cpp/src/qpid/messaging/amqp/TcpTransport.cpp 1543532 

Diff: https://reviews.apache.org/r/15677/diff/


Testing
-------

Running HelloWorld in a loop in multiple windows works ok.

Passes unit tests


Thanks,

Chug Rolke


Re: Review Request 15677: Windows client error closing AMQP 1.0 connection

Posted by Gordon Sim <gs...@redhat.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15677/#review29162
-----------------------------------------------------------

Ship it!


The same fix is needed in SslTransport I believe. Though that may not be used on windows and may not actually show up he same crash, the code as is does have a race condition so even for Linux it is a bug in my view.

- Gordon Sim


On Nov. 19, 2013, 7:53 p.m., Chug Rolke wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15677/
> -----------------------------------------------------------
> 
> (Updated Nov. 19, 2013, 7:53 p.m.)
> 
> 
> Review request for qpid, Andrew Stitcher and Steve Huston.
> 
> 
> Bugs: QPID-5363
>     https://issues.apache.org/jira/browse/QPID-5363
> 
> 
> Repository: qpid
> 
> 
> Description
> -------
> 
> Windows fails with an access violation about 10% of the time while closing AMQP 1.0 connections:
> 
> STACK
> =====
> >        qpidmessagingd.dll!qpid::messaging::amqp::TcpTransport::close()  Line 123 + 0x10 bytes        C++
>          qpidmessagingd.dll!qpid::messaging::amqp::ConnectionContext::close()  Line 134 + 0x29 bytes        C++
>          qpidmessagingd.dll!qpid::messaging::amqp::ConnectionHandle::close()  Line 62        C++
>          qpidmessagingd.dll!qpid::messaging::Connection::close()  Line 78 + 0x24 bytes        C++
>          hello_world.exe!main(int argc=4, char * * argv=0x0053a688)  Line 51 + 0xe bytes        C++
>          hello_world.exe!__tmainCRTStartup()  Line 586 + 0x19 bytes        C
>          hello_world.exe!mainCRTStartup()  Line 403        C
> 
> CODE
> ====
> void TcpTransport::close()
> {
>     QPID_LOG(debug, id << " TcpTransport closing...");
>     if (aio)
>         aio->queueWriteClose();    <=======
> }
> 
> FAILURE
> =======
> aio's vftable is all 0xdddddddd, indicating that it has been deleted.
> 
> THE FIX
> =======
> Use some locks to protect closing the aio object.
> 
> DISCUSSION
> ==========
> The locks in the diff appear to work ok in that the code passes thousands of runs. Are they OK theoretically?
> 
> Include in 0.26 release
> -----------------------
> Please indicate your approval or not in QPID-5363.
> 
> 
> Diffs
> -----
> 
>   trunk/qpid/cpp/src/qpid/messaging/amqp/TcpTransport.h 1543532 
>   trunk/qpid/cpp/src/qpid/messaging/amqp/TcpTransport.cpp 1543532 
> 
> Diff: https://reviews.apache.org/r/15677/diff/
> 
> 
> Testing
> -------
> 
> Running HelloWorld in a loop in multiple windows works ok.
> 
> Passes unit tests
> 
> 
> Thanks,
> 
> Chug Rolke
> 
>


Re: Review Request 15677: Windows client error closing AMQP 1.0 connection

Posted by Gordon Sim <gs...@redhat.com>.

> On Nov. 19, 2013, 8:08 p.m., Andrew Stitcher wrote:
> > I don't think the locks can deadlock, so from that point of view they are ok theoretically.
> > 
> > But there are some very important open questions in my mind:
> > 1. Why there are multiple callers of close and is there a good reason for it.
> > 2. Why don't we see crashes under Linux too, and does this fix work there?
> > 3. Also this looks like cut'n'paste programming from code elsewhere in the tree - are you sure that the original doesn't also have the bug (and if not why not?)

1. I can see two hypothetical reasons for the problem. The first is that the disconnected callback is called by the windows IO layer when the remote peer - i.e. the broker - closes the socket, whereas on linux the eof callback is called. If, as seems to be the case for linux, a disconnected callback is always followed by a closed callback, then socketClosed(), and therefore queueForDeletion() would be called twice. The second possibility is disconnection by the remote peer concurrent with a close request from the application. Certainly the code as it is now (prior to this patch) has a race condition in it that doesn't handle that concurrency adequately. The patch here addresses both of these hypothetical problems. Some extra tracing could shed some light in which (if either) of these is happening.

2. The fix does work on linux. My guess as to why we don't see the same crashes is due to differences in the underlying AsynchIO implementations. Perhaps its as simple as timing related. It does seem that on linux the disconnected event isn't called, at least in the common cases. Perhaps the posix implementation is more tolerant of double calls to queueForDeletion()? 

3. The code here is largely copied from qpid::client::TcpConnector, differing mainly in the encode and decode. There is some other locking in that class that would I think prevent the race. The same concerns around the disconnected callback handler apply, but I'm wondering whether that is only ever actually called in response to abort()?.


- Gordon


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15677/#review29132
-----------------------------------------------------------


On Nov. 19, 2013, 7:53 p.m., Chug Rolke wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15677/
> -----------------------------------------------------------
> 
> (Updated Nov. 19, 2013, 7:53 p.m.)
> 
> 
> Review request for qpid, Andrew Stitcher and Steve Huston.
> 
> 
> Bugs: QPID-5363
>     https://issues.apache.org/jira/browse/QPID-5363
> 
> 
> Repository: qpid
> 
> 
> Description
> -------
> 
> Windows fails with an access violation about 10% of the time while closing AMQP 1.0 connections:
> 
> STACK
> =====
> >        qpidmessagingd.dll!qpid::messaging::amqp::TcpTransport::close()  Line 123 + 0x10 bytes        C++
>          qpidmessagingd.dll!qpid::messaging::amqp::ConnectionContext::close()  Line 134 + 0x29 bytes        C++
>          qpidmessagingd.dll!qpid::messaging::amqp::ConnectionHandle::close()  Line 62        C++
>          qpidmessagingd.dll!qpid::messaging::Connection::close()  Line 78 + 0x24 bytes        C++
>          hello_world.exe!main(int argc=4, char * * argv=0x0053a688)  Line 51 + 0xe bytes        C++
>          hello_world.exe!__tmainCRTStartup()  Line 586 + 0x19 bytes        C
>          hello_world.exe!mainCRTStartup()  Line 403        C
> 
> CODE
> ====
> void TcpTransport::close()
> {
>     QPID_LOG(debug, id << " TcpTransport closing...");
>     if (aio)
>         aio->queueWriteClose();    <=======
> }
> 
> FAILURE
> =======
> aio's vftable is all 0xdddddddd, indicating that it has been deleted.
> 
> THE FIX
> =======
> Use some locks to protect closing the aio object.
> 
> DISCUSSION
> ==========
> The locks in the diff appear to work ok in that the code passes thousands of runs. Are they OK theoretically?
> 
> Include in 0.26 release
> -----------------------
> Please indicate your approval or not in QPID-5363.
> 
> 
> Diffs
> -----
> 
>   trunk/qpid/cpp/src/qpid/messaging/amqp/TcpTransport.h 1543532 
>   trunk/qpid/cpp/src/qpid/messaging/amqp/TcpTransport.cpp 1543532 
> 
> Diff: https://reviews.apache.org/r/15677/diff/
> 
> 
> Testing
> -------
> 
> Running HelloWorld in a loop in multiple windows works ok.
> 
> Passes unit tests
> 
> 
> Thanks,
> 
> Chug Rolke
> 
>


Re: Review Request 15677: Windows client error closing AMQP 1.0 connection

Posted by Andrew Stitcher <as...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15677/#review29132
-----------------------------------------------------------


I don't think the locks can deadlock, so from that point of view they are ok theoretically.

But there are some very important open questions in my mind:
1. Why there are multiple callers of close and is there a good reason for it.
2. Why don't we see crashes under Linux too, and does this fix work there?
3. Also this looks like cut'n'paste programming from code elsewhere in the tree - are you sure that the original doesn't also have the bug (and if not why not?)

- Andrew Stitcher


On Nov. 19, 2013, 7:53 p.m., Chug Rolke wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15677/
> -----------------------------------------------------------
> 
> (Updated Nov. 19, 2013, 7:53 p.m.)
> 
> 
> Review request for qpid, Andrew Stitcher and Steve Huston.
> 
> 
> Bugs: QPID-5363
>     https://issues.apache.org/jira/browse/QPID-5363
> 
> 
> Repository: qpid
> 
> 
> Description
> -------
> 
> Windows fails with an access violation about 10% of the time while closing AMQP 1.0 connections:
> 
> STACK
> =====
> >        qpidmessagingd.dll!qpid::messaging::amqp::TcpTransport::close()  Line 123 + 0x10 bytes        C++
>          qpidmessagingd.dll!qpid::messaging::amqp::ConnectionContext::close()  Line 134 + 0x29 bytes        C++
>          qpidmessagingd.dll!qpid::messaging::amqp::ConnectionHandle::close()  Line 62        C++
>          qpidmessagingd.dll!qpid::messaging::Connection::close()  Line 78 + 0x24 bytes        C++
>          hello_world.exe!main(int argc=4, char * * argv=0x0053a688)  Line 51 + 0xe bytes        C++
>          hello_world.exe!__tmainCRTStartup()  Line 586 + 0x19 bytes        C
>          hello_world.exe!mainCRTStartup()  Line 403        C
> 
> CODE
> ====
> void TcpTransport::close()
> {
>     QPID_LOG(debug, id << " TcpTransport closing...");
>     if (aio)
>         aio->queueWriteClose();    <=======
> }
> 
> FAILURE
> =======
> aio's vftable is all 0xdddddddd, indicating that it has been deleted.
> 
> THE FIX
> =======
> Use some locks to protect closing the aio object.
> 
> DISCUSSION
> ==========
> The locks in the diff appear to work ok in that the code passes thousands of runs. Are they OK theoretically?
> 
> Include in 0.26 release
> -----------------------
> Please indicate your approval or not in QPID-5363.
> 
> 
> Diffs
> -----
> 
>   trunk/qpid/cpp/src/qpid/messaging/amqp/TcpTransport.h 1543532 
>   trunk/qpid/cpp/src/qpid/messaging/amqp/TcpTransport.cpp 1543532 
> 
> Diff: https://reviews.apache.org/r/15677/diff/
> 
> 
> Testing
> -------
> 
> Running HelloWorld in a loop in multiple windows works ok.
> 
> Passes unit tests
> 
> 
> Thanks,
> 
> Chug Rolke
> 
>