Posted to users@qpid.apache.org by Igor Natanzon <ig...@gmail.com> on 2019/05/28 19:34:46 UTC

[Broker-J 7.1.3] Message loss during failover

Hi, we have a 3-node cluster defined, with MASTER set to SYNC and REPLICAS
set to WRITE_NO_SYNC.
Today we did a stress test, sending 200k+ messages on each of 3 queues.
Some time during the transmission I performed a failover of Master to
another node (RCO_1_FIX_VHN). The node was in 'waiting' state for about 20
seconds before it became a master.

Once the queues had emptied, we noticed that 4 messages were missing. In
the Qpid server log, I see only the following entries:

2019-05-28 13:20:56,581 WARN  [Broker-Config]
(o.a.q.s.v.b.BDBHAVirtualHostNodeImpl) - Transfer master did not complete
within 100ms. Node may still be elected master at a later time.
...
2019-05-28 13:21:27,842 INFO  [VirtualHostNode-RCO_1_FIX_VHN-Config]
(o.a.q.s.v.SynchronousMessageStoreRecoverer) - Discarded 1 orphaned
message(s).

There are no other errors or issues in any logs.

I'm not sure what the orphaned message is, and I'm not sure if I need to
set all replicas to be SYNC in addition to the master to handle this
scenario. Is there anything I can look at to track down what happened to
the missing messages?

Thanks!

Re: [Broker-J 7.1.3] Message loss during failover

Posted by Oleksandr Rudyy <or...@gmail.com>.

Hi Igor,

Thanks for the update. Please let us know if you are able to reproduce
the problem with JMS client version 0.43.

Kind Regards,
Alex

On Fri, 14 Jun 2019 at 21:03, Igor Natanzon <ig...@gmail.com> wrote:

> Hi Alex, we were using 0.41 client and AMQP 1.0. I since switched to 0.43
> client (with updated Netty/Proton) and so far haven't been able to
> replicate the issue. We do use sync publishing.
> I am still testing trying to replicate this problem, but so far without
> success.
>
> Thanks!
>

Re: [Broker-J 7.1.3] Message loss during failover

Posted by Igor Natanzon <ig...@gmail.com>.

Hi Alex, we were using the 0.41 client with AMQP 1.0. I have since switched
to the 0.43 client (with updated Netty/Proton) and so far haven't been able
to reproduce the issue. We do use sync publishing.
I am still trying to reproduce the problem, but so far without success.

Thanks!

Re: [Broker-J 7.1.3] Message loss during failover

Posted by Oleksandr Rudyy <or...@gmail.com>.

Hi Igor,

Could you please clarify which client you are using to publish messages and
how exactly the messages were published?

The log entry about the orphaned message indicates that during Virtual Host
recovery (after the node became Master), no queue entry record was found
for the message in the message store:

2019-05-28 13:21:27,842 INFO  [VirtualHostNode-RCO_1_FIX_VHN-Config]
(o.a.q.s.v.SynchronousMessageStoreRecoverer) - Discarded 1 orphaned
message(s).

As a result, the broker discarded that message.

The above might happen when the BDB JE transaction for the message header
and content was replicated, but the enqueue operation was not. It looks
like the switch to the new Master occurred after the message arrived at
the broker but before the enqueue operation finished.
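
To make the recovery behaviour described above concrete, here is a minimal
toy model. It is not Broker-J code: the class, the method, and the data
layout are all hypothetical, and it assumes only what is stated above,
namely that a stored message without a matching queue entry record is
treated as orphaned and discarded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: hypothetical names, not Broker-J classes.
final class OrphanRecoveryModel {

    /**
     * Returns the ids of stored messages that have no queue entry record.
     *
     * @param storedMessages message id -> content, as replicated to the new master
     * @param queueEntries   ids of messages whose enqueue record was replicated
     */
    static List<Long> findOrphans(Map<Long, String> storedMessages, Set<Long> queueEntries) {
        List<Long> orphans = new ArrayList<>();
        // TreeMap gives a stable, ascending iteration order over the ids.
        for (Long id : new TreeMap<>(storedMessages).keySet()) {
            if (!queueEntries.contains(id)) {
                orphans.add(id);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        // Message 3 was stored, but the failover happened before its enqueue
        // record was replicated.
        Map<Long, String> stored = Map.of(1L, "a", 2L, "b", 3L, "c");
        Set<Long> enqueued = Set.of(1L, 2L);
        List<Long> orphans = findOrphans(stored, enqueued);
        System.out.println("Discarded " + orphans.size() + " orphaned message(s).");
    }
}
```

In this model, message 3 is the only one reported as orphaned, which
mirrors the "Discarded 1 orphaned message(s)" log entry.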

When a message is published asynchronously, the client's send operation
does not wait for the message to arrive at the broker, so the message can
be lost. For example, this might happen when the message was published
using the legacy JMS client for AMQP 0-x. By default, that client publishes
messages asynchronously.

If synchronous publishing mode is used, the publishing operation should
fail with an exception. To exclude any possibility of message loss, you
need to use transactions.
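
As a rough illustration of the transactional approach, the sketch below
shows the client-side pattern: a message counts as published only once
commit returns, and a failed commit is followed by a rollback and a resend.
TxSession and FlakySession are hypothetical stand-ins written for this
example. In JMS the corresponding pieces would be a Session created with
Session.SESSION_TRANSACTED and its commit() and rollback() methods, which
report failures through the checked JMSException rather than the unchecked
exceptions assumed here.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: TxSession is a hypothetical stand-in for a transacted session.
final class TransactedPublishSketch {

    interface TxSession {
        void send(String message);
        void commit();    // assumed to return only once the broker accepted the work
        void rollback();
    }

    /**
     * Sends the message in its own transaction and retries until a commit
     * succeeds or maxAttempts is exhausted. Returns the number of attempts used.
     */
    static int publishReliably(TxSession session, String message, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                session.send(message);
                session.commit();
                return attempt;
            } catch (RuntimeException e) {
                last = e;
                session.rollback(); // drop the uncommitted work before resending
            }
        }
        throw new IllegalStateException("Gave up after " + maxAttempts + " attempts", last);
    }

    /** In-memory fake whose first few commits fail, imitating a failover window. */
    static final class FlakySession implements TxSession {
        final List<String> committed = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();
        private int failuresLeft;

        FlakySession(int failures) {
            this.failuresLeft = failures;
        }

        @Override
        public void send(String message) {
            pending.add(message);
        }

        @Override
        public void commit() {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new IllegalStateException("commit failed");
            }
            committed.addAll(pending);
            pending.clear();
        }

        @Override
        public void rollback() {
            pending.clear();
        }
    }
}
```

One caveat with this pattern: if a commit actually succeeded on the broker
but the response was lost, the resend can produce a duplicate, so the
consuming side may need to tolerate or detect duplicates.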

Kind Regards,
Alex
