You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@qpid.apache.org by Kenneth Giusti <kg...@apache.org> on 2012/08/06 19:51:40 UTC

Review Request: Race condition in LinkRegistry.cpp

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6396/
-----------------------------------------------------------

Review request for qpid, Alan Conway, Gordon Sim, and Ted Ross.


Description
-------

Occasionally, the cluster tests will fail in the test_federation_multilink_failover test with the following error:

cluster_tests.ShortTests.test_federation_multilink_failover ................................................................................................................... fail
Error during test:  Traceback (most recent call last):
    File "/home/kgiusti/Desktop/work/qpid/trunk/build/qpid/src/tests/python/commands/qpid-python-test", line 340, in run
      phase()
    File "/home/kgiusti/Desktop/work/qpid/trunk/qpid/cpp/src/tests/cluster_tests.py", line 992, in test_federation_multilink_failover
      assert self._verify_federation(src_cluster[1], "FedX/two", dst_cluster[1], "destQ2")
  AssertionError

The problem is due to a race condition in the LinkRegistry code.  When a new connection event occurs for a federation Link, the LinkRegistry attempts to find a Link instance that is attempting to connect to the remote in order to assign the connection.  The problem is due to the fact that the search for the target link is done under a lock, but the assignment is done outside of the lock (to prevent lock inversion).

The proposed fix has LinkRegistry hold all disconnected Links in a separate container, and perform the search of that container (and the removal on match) while holding a lock.


This addresses bug QPID-4193.
    https://issues.apache.org/jira/browse/QPID-4193


Diffs
-----

  /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.h 1368984 
  /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.cpp 1368984 
  /trunk/qpid/cpp/src/qpid/cluster/Connection.cpp 1368984 
  /trunk/qpid/cpp/src/qpid/cluster/UpdateClient.cpp 1368984 

Diff: https://reviews.apache.org/r/6396/diff/


Testing
-------

Federation and cluster unit tests.
Ran test_federation_multilink_failover repeatedly with no crash.


Thanks,

Kenneth Giusti


Re: Review Request: Race condition in LinkRegistry.cpp

Posted by Gordon Sim <gs...@redhat.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6396/#review10011
-----------------------------------------------------------

Ship it!


Ship It!

- Gordon Sim


On Aug. 6, 2012, 5:51 p.m., Kenneth Giusti wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6396/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2012, 5:51 p.m.)
> 
> 
> Review request for qpid, Alan Conway, Gordon Sim, and Ted Ross.
> 
> 
> Description
> -------
> 
> Occasionally, the cluster tests will fail in the test_federation_multilink_failover test with the following error:
> 
> cluster_tests.ShortTests.test_federation_multilink_failover ................................................................................................................... fail
> Error during test:  Traceback (most recent call last):
>     File "/home/kgiusti/Desktop/work/qpid/trunk/build/qpid/src/tests/python/commands/qpid-python-test", line 340, in run
>       phase()
>     File "/home/kgiusti/Desktop/work/qpid/trunk/qpid/cpp/src/tests/cluster_tests.py", line 992, in test_federation_multilink_failover
>       assert self._verify_federation(src_cluster[1], "FedX/two", dst_cluster[1], "destQ2")
>   AssertionError
> 
> The problem is due to a race condition in the LinkRegistry code.  When a new connection event occurs for a federation Link, the LinkRegistry attempts to find a Link instance that is attempting to connect to the remote in order to assign the connection.  The problem is due to the fact that the search for the target link is done under a lock, but the assignment is done outside of the lock (to prevent lock inversion).
> 
> The proposed fix has LinkRegistry hold all disconnected Links in a separate container, and perform the search of that container (and the removal on match) while holding a lock.
> 
> 
> This addresses bug QPID-4193.
>     https://issues.apache.org/jira/browse/QPID-4193
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.h 1368984 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/Connection.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/UpdateClient.cpp 1368984 
> 
> Diff: https://reviews.apache.org/r/6396/diff/
> 
> 
> Testing
> -------
> 
> Federation and cluster unit tests.
> Ran test_federation_multilink_failover repeatedly with no crash.
> 
> 
> Thanks,
> 
> Kenneth Giusti
> 
>


Re: Review Request: Race condition in LinkRegistry.cpp

Posted by Alan Conway <ac...@redhat.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6396/#review9991
-----------------------------------------------------------

Ship it!


Ship It!

- Alan Conway


On Aug. 6, 2012, 5:51 p.m., Kenneth Giusti wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6396/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2012, 5:51 p.m.)
> 
> 
> Review request for qpid, Alan Conway, Gordon Sim, and Ted Ross.
> 
> 
> Description
> -------
> 
> Occasionally, the cluster tests will fail in the test_federation_multilink_failover test with the following error:
> 
> cluster_tests.ShortTests.test_federation_multilink_failover ................................................................................................................... fail
> Error during test:  Traceback (most recent call last):
>     File "/home/kgiusti/Desktop/work/qpid/trunk/build/qpid/src/tests/python/commands/qpid-python-test", line 340, in run
>       phase()
>     File "/home/kgiusti/Desktop/work/qpid/trunk/qpid/cpp/src/tests/cluster_tests.py", line 992, in test_federation_multilink_failover
>       assert self._verify_federation(src_cluster[1], "FedX/two", dst_cluster[1], "destQ2")
>   AssertionError
> 
> The problem is due to a race condition in the LinkRegistry code.  When a new connection event occurs for a federation Link, the LinkRegistry attempts to find a Link instance that is attempting to connect to the remote in order to assign the connection.  The problem is due to the fact that the search for the target link is done under a lock, but the assignment is done outside of the lock (to prevent lock inversion).
> 
> The proposed fix has LinkRegistry hold all disconnected Links in a separate container, and perform the search of that container (and the removal on match) while holding a lock.
> 
> 
> This addresses bug QPID-4193.
>     https://issues.apache.org/jira/browse/QPID-4193
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.h 1368984 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/Connection.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/UpdateClient.cpp 1368984 
> 
> Diff: https://reviews.apache.org/r/6396/diff/
> 
> 
> Testing
> -------
> 
> Federation and cluster unit tests.
> Ran test_federation_multilink_failover repeatedly with no crash.
> 
> 
> Thanks,
> 
> Kenneth Giusti
> 
>


Re: Review Request: Race condition in LinkRegistry.cpp

Posted by Alan Conway <ac...@redhat.com>.

> On Aug. 7, 2012, 4:51 p.m., Alan Conway wrote:
> > Where is "the assignment is done outside of the lock" in the old code? The assignment in notifyConnection seems to have the same locking in both old and new code.
> 
> Kenneth Giusti wrote:
>     You're correct, the call to link::established() still happens outside the lock (it has to, to avoid lock inversion).
>     
>     What the patch does is prevent simultainous calls into LinkRegistry::notifyConnection (from parallel threads) from being able to mistakenly select the _same_ pending Link.  It does this by holding all "pending" Links in a container, and removing them - while locked - as they are selected.  Notice the call to pendingLinks.erase() while the lock is held.
>     
>     This prevents two threads from selecting the same pending Link.
> 
> Alan Conway wrote:
>     So it is possible for multiple connections to match the same link? In which case, how do you know you've selected the correct one?
> 
> Kenneth Giusti wrote:
>     Yes - it is possible to have multiple links running between the same two brokers.  You can create such links via management (QMF) by assigning unique names (since they will have the same remote address).
>     
>     The link really doesn't care which connection it gets as long as the connection is to the right destination.  So if two links to the same destination request connections at the same moment, it really doesn't matter which link gets assigned which connection.
>     
>     As an aside: a -much- better solution would be to have the Link that requested the connection get notified of the connection directly, rather having the notification go through the LinkRegistry, which then has to find a Link that is pending a connection to the remote.  The LinkRegistry really shouldn't be in the middle of the connection assignment.  I looked at doing that awhile back, but the implementation started getting complex....

OK, gotcha. Agreed with your aside. Ship it:)


- Alan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6396/#review9965
-----------------------------------------------------------


On Aug. 6, 2012, 5:51 p.m., Kenneth Giusti wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6396/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2012, 5:51 p.m.)
> 
> 
> Review request for qpid, Alan Conway, Gordon Sim, and Ted Ross.
> 
> 
> Description
> -------
> 
> Occasionally, the cluster tests will fail in the test_federation_multilink_failover test with the following error:
> 
> cluster_tests.ShortTests.test_federation_multilink_failover ................................................................................................................... fail
> Error during test:  Traceback (most recent call last):
>     File "/home/kgiusti/Desktop/work/qpid/trunk/build/qpid/src/tests/python/commands/qpid-python-test", line 340, in run
>       phase()
>     File "/home/kgiusti/Desktop/work/qpid/trunk/qpid/cpp/src/tests/cluster_tests.py", line 992, in test_federation_multilink_failover
>       assert self._verify_federation(src_cluster[1], "FedX/two", dst_cluster[1], "destQ2")
>   AssertionError
> 
> The problem is due to a race condition in the LinkRegistry code.  When a new connection event occurs for a federation Link, the LinkRegistry attempts to find a Link instance that is attempting to connect to the remote in order to assign the connection.  The problem is due to the fact that the search for the target link is done under a lock, but the assignment is done outside of the lock (to prevent lock inversion).
> 
> The proposed fix has LinkRegistry hold all disconnected Links in a separate container, and perform the search of that container (and the removal on match) while holding a lock.
> 
> 
> This addresses bug QPID-4193.
>     https://issues.apache.org/jira/browse/QPID-4193
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.h 1368984 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/Connection.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/UpdateClient.cpp 1368984 
> 
> Diff: https://reviews.apache.org/r/6396/diff/
> 
> 
> Testing
> -------
> 
> Federation and cluster unit tests.
> Ran test_federation_multilink_failover repeatedly with no crash.
> 
> 
> Thanks,
> 
> Kenneth Giusti
> 
>


Re: Review Request: Race condition in LinkRegistry.cpp

Posted by Kenneth Giusti <kg...@apache.org>.

> On Aug. 7, 2012, 4:51 p.m., Alan Conway wrote:
> > Where is "the assignment is done outside of the lock" in the old code? The assignment in notifyConnection seems to have the same locking in both old and new code.

You're correct, the call to link::established() still happens outside the lock (it has to, to avoid lock inversion).

What the patch does is prevent simultainous calls into LinkRegistry::notifyConnection (from parallel threads) from being able to mistakenly select the _same_ pending Link.  It does this by holding all "pending" Links in a container, and removing them - while locked - as they are selected.  Notice the call to pendingLinks.erase() while the lock is held.

This prevents two threads from selecting the same pending Link.


- Kenneth


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6396/#review9965
-----------------------------------------------------------


On Aug. 6, 2012, 5:51 p.m., Kenneth Giusti wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6396/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2012, 5:51 p.m.)
> 
> 
> Review request for qpid, Alan Conway, Gordon Sim, and Ted Ross.
> 
> 
> Description
> -------
> 
> Occasionally, the cluster tests will fail in the test_federation_multilink_failover test with the following error:
> 
> cluster_tests.ShortTests.test_federation_multilink_failover ................................................................................................................... fail
> Error during test:  Traceback (most recent call last):
>     File "/home/kgiusti/Desktop/work/qpid/trunk/build/qpid/src/tests/python/commands/qpid-python-test", line 340, in run
>       phase()
>     File "/home/kgiusti/Desktop/work/qpid/trunk/qpid/cpp/src/tests/cluster_tests.py", line 992, in test_federation_multilink_failover
>       assert self._verify_federation(src_cluster[1], "FedX/two", dst_cluster[1], "destQ2")
>   AssertionError
> 
> The problem is due to a race condition in the LinkRegistry code.  When a new connection event occurs for a federation Link, the LinkRegistry attempts to find a Link instance that is attempting to connect to the remote in order to assign the connection.  The problem is due to the fact that the search for the target link is done under a lock, but the assignment is done outside of the lock (to prevent lock inversion).
> 
> The proposed fix has LinkRegistry hold all disconnected Links in a separate container, and perform the search of that container (and the removal on match) while holding a lock.
> 
> 
> This addresses bug QPID-4193.
>     https://issues.apache.org/jira/browse/QPID-4193
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.h 1368984 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/Connection.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/UpdateClient.cpp 1368984 
> 
> Diff: https://reviews.apache.org/r/6396/diff/
> 
> 
> Testing
> -------
> 
> Federation and cluster unit tests.
> Ran test_federation_multilink_failover repeatedly with no crash.
> 
> 
> Thanks,
> 
> Kenneth Giusti
> 
>


Re: Review Request: Race condition in LinkRegistry.cpp

Posted by Alan Conway <ac...@redhat.com>.

> On Aug. 7, 2012, 4:51 p.m., Alan Conway wrote:
> > Where is "the assignment is done outside of the lock" in the old code? The assignment in notifyConnection seems to have the same locking in both old and new code.
> 
> Kenneth Giusti wrote:
>     You're correct, the call to link::established() still happens outside the lock (it has to, to avoid lock inversion).
>     
>     What the patch does is prevent simultainous calls into LinkRegistry::notifyConnection (from parallel threads) from being able to mistakenly select the _same_ pending Link.  It does this by holding all "pending" Links in a container, and removing them - while locked - as they are selected.  Notice the call to pendingLinks.erase() while the lock is held.
>     
>     This prevents two threads from selecting the same pending Link.

So it is possible for multiple connections to match the same link? In which case, how do you know you've selected the correct one?


- Alan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6396/#review9965
-----------------------------------------------------------


On Aug. 6, 2012, 5:51 p.m., Kenneth Giusti wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6396/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2012, 5:51 p.m.)
> 
> 
> Review request for qpid, Alan Conway, Gordon Sim, and Ted Ross.
> 
> 
> Description
> -------
> 
> Occasionally, the cluster tests will fail in the test_federation_multilink_failover test with the following error:
> 
> cluster_tests.ShortTests.test_federation_multilink_failover ................................................................................................................... fail
> Error during test:  Traceback (most recent call last):
>     File "/home/kgiusti/Desktop/work/qpid/trunk/build/qpid/src/tests/python/commands/qpid-python-test", line 340, in run
>       phase()
>     File "/home/kgiusti/Desktop/work/qpid/trunk/qpid/cpp/src/tests/cluster_tests.py", line 992, in test_federation_multilink_failover
>       assert self._verify_federation(src_cluster[1], "FedX/two", dst_cluster[1], "destQ2")
>   AssertionError
> 
> The problem is due to a race condition in the LinkRegistry code.  When a new connection event occurs for a federation Link, the LinkRegistry attempts to find a Link instance that is attempting to connect to the remote in order to assign the connection.  The problem is due to the fact that the search for the target link is done under a lock, but the assignment is done outside of the lock (to prevent lock inversion).
> 
> The proposed fix has LinkRegistry hold all disconnected Links in a separate container, and perform the search of that container (and the removal on match) while holding a lock.
> 
> 
> This addresses bug QPID-4193.
>     https://issues.apache.org/jira/browse/QPID-4193
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.h 1368984 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/Connection.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/UpdateClient.cpp 1368984 
> 
> Diff: https://reviews.apache.org/r/6396/diff/
> 
> 
> Testing
> -------
> 
> Federation and cluster unit tests.
> Ran test_federation_multilink_failover repeatedly with no crash.
> 
> 
> Thanks,
> 
> Kenneth Giusti
> 
>


Re: Review Request: Race condition in LinkRegistry.cpp

Posted by Kenneth Giusti <kg...@apache.org>.

> On Aug. 7, 2012, 4:51 p.m., Alan Conway wrote:
> > Where is "the assignment is done outside of the lock" in the old code? The assignment in notifyConnection seems to have the same locking in both old and new code.
> 
> Kenneth Giusti wrote:
>     You're correct, the call to link::established() still happens outside the lock (it has to, to avoid lock inversion).
>     
>     What the patch does is prevent simultainous calls into LinkRegistry::notifyConnection (from parallel threads) from being able to mistakenly select the _same_ pending Link.  It does this by holding all "pending" Links in a container, and removing them - while locked - as they are selected.  Notice the call to pendingLinks.erase() while the lock is held.
>     
>     This prevents two threads from selecting the same pending Link.
> 
> Alan Conway wrote:
>     So it is possible for multiple connections to match the same link? In which case, how do you know you've selected the correct one?

Yes - it is possible to have multiple links running between the same two brokers.  You can create such links via management (QMF) by assigning unique names (since they will have the same remote address).

The link really doesn't care which connection it gets as long as the connection is to the right destination.  So if two links to the same destination request connections at the same moment, it really doesn't matter which link gets assigned which connection.

As an aside: a -much- better solution would be to have the Link that requested the connection get notified of the connection directly, rather having the notification go through the LinkRegistry, which then has to find a Link that is pending a connection to the remote.  The LinkRegistry really shouldn't be in the middle of the connection assignment.  I looked at doing that awhile back, but the implementation started getting complex....  


- Kenneth


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6396/#review9965
-----------------------------------------------------------


On Aug. 6, 2012, 5:51 p.m., Kenneth Giusti wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6396/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2012, 5:51 p.m.)
> 
> 
> Review request for qpid, Alan Conway, Gordon Sim, and Ted Ross.
> 
> 
> Description
> -------
> 
> Occasionally, the cluster tests will fail in the test_federation_multilink_failover test with the following error:
> 
> cluster_tests.ShortTests.test_federation_multilink_failover ................................................................................................................... fail
> Error during test:  Traceback (most recent call last):
>     File "/home/kgiusti/Desktop/work/qpid/trunk/build/qpid/src/tests/python/commands/qpid-python-test", line 340, in run
>       phase()
>     File "/home/kgiusti/Desktop/work/qpid/trunk/qpid/cpp/src/tests/cluster_tests.py", line 992, in test_federation_multilink_failover
>       assert self._verify_federation(src_cluster[1], "FedX/two", dst_cluster[1], "destQ2")
>   AssertionError
> 
> The problem is due to a race condition in the LinkRegistry code.  When a new connection event occurs for a federation Link, the LinkRegistry attempts to find a Link instance that is attempting to connect to the remote in order to assign the connection.  The problem is due to the fact that the search for the target link is done under a lock, but the assignment is done outside of the lock (to prevent lock inversion).
> 
> The proposed fix has LinkRegistry hold all disconnected Links in a separate container, and perform the search of that container (and the removal on match) while holding a lock.
> 
> 
> This addresses bug QPID-4193.
>     https://issues.apache.org/jira/browse/QPID-4193
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.h 1368984 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/Connection.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/UpdateClient.cpp 1368984 
> 
> Diff: https://reviews.apache.org/r/6396/diff/
> 
> 
> Testing
> -------
> 
> Federation and cluster unit tests.
> Ran test_federation_multilink_failover repeatedly with no crash.
> 
> 
> Thanks,
> 
> Kenneth Giusti
> 
>


Re: Review Request: Race condition in LinkRegistry.cpp

Posted by Alan Conway <ac...@redhat.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6396/#review9965
-----------------------------------------------------------


Where is "the assignment is done outside of the lock" in the old code? The assignment in notifyConnection seems to have the same locking in both old and new code.

- Alan Conway


On Aug. 6, 2012, 5:51 p.m., Kenneth Giusti wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/6396/
> -----------------------------------------------------------
> 
> (Updated Aug. 6, 2012, 5:51 p.m.)
> 
> 
> Review request for qpid, Alan Conway, Gordon Sim, and Ted Ross.
> 
> 
> Description
> -------
> 
> Occasionally, the cluster tests will fail in the test_federation_multilink_failover test with the following error:
> 
> cluster_tests.ShortTests.test_federation_multilink_failover ................................................................................................................... fail
> Error during test:  Traceback (most recent call last):
>     File "/home/kgiusti/Desktop/work/qpid/trunk/build/qpid/src/tests/python/commands/qpid-python-test", line 340, in run
>       phase()
>     File "/home/kgiusti/Desktop/work/qpid/trunk/qpid/cpp/src/tests/cluster_tests.py", line 992, in test_federation_multilink_failover
>       assert self._verify_federation(src_cluster[1], "FedX/two", dst_cluster[1], "destQ2")
>   AssertionError
> 
> The problem is due to a race condition in the LinkRegistry code.  When a new connection event occurs for a federation Link, the LinkRegistry attempts to find a Link instance that is attempting to connect to the remote in order to assign the connection.  The problem is due to the fact that the search for the target link is done under a lock, but the assignment is done outside of the lock (to prevent lock inversion).
> 
> The proposed fix has LinkRegistry hold all disconnected Links in a separate container, and perform the search of that container (and the removal on match) while holding a lock.
> 
> 
> This addresses bug QPID-4193.
>     https://issues.apache.org/jira/browse/QPID-4193
> 
> 
> Diffs
> -----
> 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.h 1368984 
>   /trunk/qpid/cpp/src/qpid/broker/LinkRegistry.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/Connection.cpp 1368984 
>   /trunk/qpid/cpp/src/qpid/cluster/UpdateClient.cpp 1368984 
> 
> Diff: https://reviews.apache.org/r/6396/diff/
> 
> 
> Testing
> -------
> 
> Federation and cluster unit tests.
> Ran test_federation_multilink_failover repeatedly with no crash.
> 
> 
> Thanks,
> 
> Kenneth Giusti
> 
>