Posted to dev@qpid.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2014/12/19 04:19:13 UTC

[jira] [Commented] (QPID-6278) HA broker abort in TXN soak test

    [ https://issues.apache.org/jira/browse/QPID-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252852#comment-14252852 ] 

ASF subversion and git services commented on QPID-6278:
-------------------------------------------------------

Commit 1646618 from [~aconway] in branch 'qpid/trunk'
[ https://svn.apache.org/r1646618 ]

QPID-6278: HA broker abort in TXN soak test

The crash appears to be a race condition in async completion, exposed by the HA
TX code as follows:

1. Message received and placed on tx-replication queue, completion delayed till backups ack.
   Completion count goes up for each backup then down as each backup acks.
2. Prepare received, message placed on primary's local persistent queue.
   Completion count goes up one then down one for local store completion (null store in this case).

The race is something like this:
- last backup ack arrives (on backup IO thread) and drops completion count to 0.
- prepare arrives (on client thread) null store bumps count to 1 and immediately drops to 0.
- both threads try to invoke the completion callback, one deletes it while the other is still invoking.

The old completion logic assumed that only one thread can see the atomic counter
go to 0.  It does not handle the count going to 0 in one thread and concurrently
being increased and decreased back to 0 in another. This case is introduced by
HA transactions because the same message is put onto a tx-replication queue and
then put again onto another persistent local queue, so there are two cycles of
completion.

The new logic fixes this: only one call to the completion callback is possible in all cases.

Also fixed missing lock in ha/Primary.cpp.

>  HA broker abort in TXN soak test
> ---------------------------------
>
>                 Key: QPID-6278
>                 URL: https://issues.apache.org/jira/browse/QPID-6278
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.30
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>         Attachments: ha-tx-race.diff
>
>
> see also https://bugzilla.redhat.com/show_bug.cgi?id=1145386
> I have a repeatable crash in primary HA broker, by doing a soak test on TXNs.
> This is with trunk code new as of an hour ago:
>   
> URL: https://svn.apache.org/repos/asf/qpid/trunk/qpid/cpp
> Repository Root: https://svn.apache.org/repos/asf
> Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
> Revision: 1626916
> Node Kind: directory
> Schedule: normal
> Last Changed Author: aconway
> Last Changed Rev: 1626887
> I did a standard build, first of proton and then of qpidd -- except that I had them install themselves in /usr instead of /usr/local .
> Here are the scripts I use.
> script 1
> starting the HA cluster
> {
> #! /bin/bash
> export PYTHONPATH=/home/mick/trunk/qpid/python
> QPIDD=/usr/sbin/qpidd
> QPID_HA=/home/mick/trunk/qpid/tools/src/py/qpid-ha
> # This is where I put the log files.
> rm -rf /tmp/mick
> mkdir /tmp/mick
> for N in 1 2 3
> do
>   $QPIDD                                                          \
>     --auth=no                                                     \
>     --no-module-dir                                               \
>     --load-module /usr/lib64/qpid/daemon/ha.so                    \
>     --log-enable debug+:ha::                                      \
>     --ha-cluster=yes                                              \
>     --ha-replicate=all                                            \
>     --ha-brokers-url=localhost:5801,localhost:5802,localhost:5803 \
>     --ha-public-url=localhost:5801,localhost:5802,localhost:5803  \
>     -p 580$N                                                      \
>     --data-dir /tmp/mick/data_$N                                  \
>     --log-to-file /tmp/mick/qpidd_$N.log                          \
>     --mgmt-enable=yes                                             \
>     -d
>   echo "============================================"
>   echo "started broker $N from $QPIDD"
>   echo "============================================"
>   sleep 1
> done
> # Now promote one broker to primary.
> echo "Promoting broker 5801..."
> ${QPID_HA} promote --cluster-manager -b localhost:5801
> echo "done."
> }
> script 2
> create the tx queues, and load the first one with 1000 messages
> {
>   #! /bin/bash
> TXTEST2=/usr/libexec/qpid/tests/qpid-txtest2
> echo "Loading data to queues..."
> ${TXTEST2} --init=yes --transfer=no --check=no                           \
>            --port 5801                                                   \
>            --total-messages 1000 --connection-options '{reconnect:true}' \
>            --messages-per-tx 10 --tx-count 100                           \
>            --queue-base-name=tx --fetch-timeout=1
> }
> script 3
> now beat the heck out of the TXN code
> {
>   #! /bin/bash
> TXTEST2=/usr/libexec/qpid/tests/qpid-txtest2
> echo "starting transfers..."
> ${TXTEST2} --init=no --transfer=yes --check=no                           \
>            --port 5801                                                   \
>            --total-messages 5000000 --connection-options '{reconnect:true}' \
>            --messages-per-tx 10 --tx-count 500000                          \
>            --queue-base-name=tx --fetch-timeout=1
> }
> I do *not* do any failovers.  Just let that TXN-exercising script run until the primary broker dies.  
> It took quite a while.  In my most recent test, I got through something like 300,000 transactions (10 messages each) before the broker became brokest.
> I then tried the same test on a standalone broker and it got all the way through.
> Here is the traceback:
> #0  0x0000003186a328a5 in raise () from /lib64/libc.so.6
> #1  0x0000003186a34085 in abort () from /lib64/libc.so.6
> #2  0x0000003186a2ba1e in __assert_fail_base () from /lib64/libc.so.6
> #3  0x0000003186a2bae0 in __assert_fail () from /lib64/libc.so.6
> #4  0x00007f6bb72b4f16 in operator-> (this=0x7f6b48378060, sync=<value optimized out>)
>     at /usr/include/boost/smart_ptr/intrusive_ptr.hpp:166
> #5  qpid::broker::SessionState::IncompleteIngressMsgXfer::completed (this=0x7f6b48378060, 
>     sync=<value optimized out>) at /home/mick/trunk/qpid/cpp/src/qpid/broker/SessionState.cpp:409
> #6  0x00007f6bb726d670 in invokeCallback (this=<value optimized out>, msg=<value optimized out>)
>     at /home/mick/trunk/qpid/cpp/src/qpid/broker/AsyncCompletion.h:117
> #7  finishCompleter (this=<value optimized out>, msg=<value optimized out>)
>     at /home/mick/trunk/qpid/cpp/src/qpid/broker/AsyncCompletion.h:158
> #8  enqueueComplete (this=<value optimized out>, msg=<value optimized out>)
>     at /home/mick/trunk/qpid/cpp/src/qpid/broker/PersistableMessage.h:76
> #9  qpid::broker::NullMessageStore::enqueue (this=<value optimized out>, msg=<value optimized out>)
>     at /home/mick/trunk/qpid/cpp/src/qpid/broker/NullMessageStore.cpp:97
> #10 0x00007f6bb71f4578 in qpid::broker::Queue::enqueue (this=0x7f6b4801ef90, ctxt=0x7f6b6821bdf0, msg=...)
>     at /home/mick/trunk/qpid/cpp/src/qpid/broker/Queue.cpp:910
> #11 0x00007f6bb71f46db in qpid::broker::Queue::TxPublish::prepare (this=0x7f6b48435c70, 
>     ctxt=<value optimized out>) at /home/mick/trunk/qpid/cpp/src/qpid/broker/Queue.cpp:159
> #12 0x00007f6bb72c8b3d in qpid::broker::TxBuffer::prepare (this=0x7f6b68549120, ctxt=0x7f6b6821bdf0)
>     at /home/mick/trunk/qpid/cpp/src/qpid/broker/TxBuffer.cpp:42
> #13 0x00007f6bb72c9dbe in qpid::broker::TxBuffer::startCommit (this=0x7f6b68549120, 
>     store=<value optimized out>) at /home/mick/trunk/qpid/cpp/src/qpid/broker/TxBuffer.cpp:73
> #14 0x00007f6bb7298c74 in qpid::broker::SemanticState::commit (this=0x7f6b6c567fb8, store=0x2460440)
>     at /home/mick/trunk/qpid/cpp/src/qpid/broker/SemanticState.cpp:198
> #15 0x00007f6bb6c5886e in invoke<qpid::framing::AMQP_ServerOperations::TxHandler> (this=0x7f6b8bffd4a0, 
>     body=<value optimized out>) at /home/mick/trunk/qpid/cpp/build/src/qpid/framing/TxCommitBody.h:53
> #16 qpid::framing::AMQP_ServerOperations::TxHandler::Invoker::visit (this=0x7f6b8bffd4a0, 
>     body=<value optimized out>) at /home/mick/trunk/qpid/cpp/build/src/qpid/framing/ServerInvoker.cpp:582
> #17 0x00007f6bb6c5cc41 in qpid::framing::AMQP_ServerOperations::Invoker::visit (this=0x7f6b8bffd670, body=...)
>     at /home/mick/trunk/qpid/cpp/build/src/qpid/framing/ServerInvoker.cpp:278
> #18 0x00007f6bb72b504c in invoke<qpid::broker::SessionAdapter> (this=<value optimized out>, 
>     method=0x7f6b68130790) at /home/mick/trunk/qpid/cpp/src/qpid/framing/Invoker.h:67
> #19 qpid::broker::SessionState::handleCommand (this=<value optimized out>, method=0x7f6b68130790)
>     at /home/mick/trunk/qpid/cpp/src/qpid/broker/SessionState.cpp:198
> #20 0x00007f6bb72b6235 in qpid::broker::SessionState::handleIn (this=0x7f6b6c567df0, frame=...)
>     at /home/mick/trunk/qpid/cpp/src/qpid/broker/SessionState.cpp:295
> #21 0x00007f6bb6cd5291 in qpid::amqp_0_10::SessionHandler::handleIn (this=0x7f6b6c4e2120, f=...)
>     at /home/mick/trunk/qpid/cpp/src/qpid/amqp_0_10/SessionHandler.cpp:93
> #22 0x00007f6bb722692b in operator() (this=0x7f6b500ab840, frame=...)
>     at /home/mick/trunk/qpid/cpp/src/qpid/framing/Handler.h:39
> #23 qpid::broker::ConnectionHandler::handle (this=0x7f6b500ab840, frame=...)
>     at /home/mick/trunk/qpid/cpp/src/qpid/broker/ConnectionHandler.cpp:94
> #24 0x00007f6bb7221ba8 in qpid::broker::amqp_0_10::Connection::received (this=0x7f6b500ab660, frame=...)
>     at /home/mick/trunk/qpid/cpp/src/qpid/broker/amqp_0_10/Connection.cpp:198
> #25 0x00007f6bb71aea4d in qpid::amqp_0_10::Connection::decode (this=0x7f6b5005d770, 
>     buffer=<value optimized out>, size=<value optimized out>)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@qpid.apache.org
For additional commands, e-mail: dev-help@qpid.apache.org