You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@qpid.apache.org by Bryan Dixon <br...@bldixon.net> on 2020/01/24 16:30:07 UTC

Broker-J 7.0.2 BDB_HA not replicating messages correctly

We've been using Broker-J 7.0.2 BDB_HA successfully (on Windows servers) for
a couple years now without issue.  I was working on upgrading to Broker-J
7.1.6 and performing some tests in our Test Broker-J cluster prior to the
upgrade and noticed that BDB_HA message replication wasn't working properly
and we've never had an issue like this before.  By not working properly,
what is happening is when publishing messages to the master virtual host
node and then forcing (via the admin console) another virtual host node to
become the master, the messages I just added are not in the queue in the new
master.  And if I publish messages in the new master, those messages aren't
in the queue in the next node I make the master.

One difference in our Test environment from our Dev and Production
environment (which all use 3 nodes/servers) is that a couple months ago we
replaced 2 nodes (2 VMs - swapped out Windows 2008 Server with Windows 2016
Server for those 2 VMs).   For that process I just changed the Virtual Host
Node to add the 2 new VMs and then removed the old nodes (old Windows2008
VMs).  Everything appeared to be fine after that change but I didn't run the
tests I am running now so perhaps that is what is causing this issue.

To see if this replication issue was for our existing BDB_HA virtual host
(the VH that existed prior to the Windows VM replacement discussed
previously) I created a brand new virtual host (dixonbroker) in our Test
Qpid environment and added a virtual host node on each of our VMs in Test. 
I am still getting the same behavior where messages aren't replicating
correctly to each node when that node becomes master.  To help with
troubleshooting I used this new dixonbroker virtual host for the tests I've
documented below.

I've attached log files from all 3 nodes with some debugging enabled (I can
add more debugging and run the tests again if it helps).    


Here are the steps of my test:
1) Node1 was master.  I created a new durable queue named app_test.  I have
a little Java app using qpid-jms-client .31 (I tried .44 and it still
produced the same bad results).  I use Spring's JmsTemplate to publish 50
messages.  This was done at 11:00 (to help see the flow in the log files).
2) I made Node2 the master and there were no messages in app_test queue (see
log message 2020-01-24 11:00:00,243 INFO  [Queue Recoverer : app_test (vh:
dixonbroker)] (q.m.t.recovered) - [Broker]
[vh(/dixonbroker)/ms(ProvidedBDBMessageStore)] TXN-1005 : Recovered 0
messages for queue app_test in the node2 log file).
3) I made Node3 the master.  The 50 messages DID exist in the app_test
queue.  I then added 50 more messages to app_test queue (so total of 100)
while Node3 was the master.  This was at 11:02.
4) I made Node1 the master.  It only had 50 messages in app_test queue -
should have had 100.
5) I made Node2 the master. It still didn't have any messages in app_test
queue - should have had 100.



qpid-node1.log
<http://qpid.2158936.n2.nabble.com/file/t396334/qpid-node1.log>  
qpid-node2.log
<http://qpid.2158936.n2.nabble.com/file/t396334/qpid-node2.log>  
qpid-node3.log
<http://qpid.2158936.n2.nabble.com/file/t396334/qpid-node3.log>  



--
Sent from: http://qpid.2158936.n2.nabble.com/Apache-Qpid-users-f2158936.html

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
For additional commands, e-mail: users-help@qpid.apache.org

Re: Broker-J 7.0.2 BDB_HA not replicating messages correctly

Posted by Oleksandr Rudyy <or...@gmail.com>.

Hi Bryan,

I believe I found the root cause of the issue with asynchronous recovery in
BDB HA.
The fix have been pushed into a master branch as part of JIRA QPID-8407 [1].
I will merge the changes into 7.1.x branch shortly for the following their
inclusion into 7.1.8 release.

Kind regards,
Alex

[1] https://issues.apache.org/jira/browse/QPID-8407


On Mon, 27 Jan 2020 at 19:47, Bryan Dixon <br...@bldixon.net> wrote:

> Thanks Alex.  I removed that context variable and replication did indeed
> work
> correctly.  What's odd is that we've been using that context variable for
> quite a while and I hadn't noticed any issue with replication.  In fact I
> ran these same tests multiple times in our Dev and Prod environments and
> they always worked successfully.  The only difference with this Test
> environment from our Dev and Prod ones is that I did switch out 2 of the
> virtual host nodes.  So maybe that will help you guys track down the bug.
>
> Bryan
>
>
>
> --
> Sent from:
> http://qpid.2158936.n2.nabble.com/Apache-Qpid-users-f2158936.html
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
> For additional commands, e-mail: users-help@qpid.apache.org
>
>

Re: Broker-J 7.0.2 BDB_HA not replicating messages correctly

Posted by Bryan Dixon <br...@bldixon.net>.

Thanks Alex.  I removed that context variable and replication did indeed work
correctly.  What's odd is that we've been using that context variable for
quite a while and I hadn't noticed any issue with replication.  In fact I
ran these same tests multiple times in our Dev and Prod environments and
they always worked successfully.  The only difference with this Test
environment from our Dev and Prod ones is that I did switch out 2 of the
virtual host nodes.  So maybe that will help you guys track down the bug.

Bryan



--
Sent from: http://qpid.2158936.n2.nabble.com/Apache-Qpid-users-f2158936.html

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
For additional commands, e-mail: users-help@qpid.apache.org

Re: Broker-J 7.0.2 BDB_HA not replicating messages correctly

Posted by Oleksandr Rudyy <or...@gmail.com>.

Hi Bryan,
It is indeed a defect with an asynchronous message recovery.
Somehow the asynchronous recover does not recover the replicated messages.

Thanks for reporting the issue.
I will have a look into what is causing it.
In meantime you can try switching back to synchronous recoverer if it is
feasible. I have not seen this issue with the synchronous recoverer.

Kind Regards,
Alex


On Mon, 27 Jan 2020 at 17:25, Bryan Dixon <br...@bldixon.net> wrote:

> Alex, I always use the highest durability sync policy for Master and
> Replica:
> SYNC.
>
> Also, for the virtual host, I am using this custom context variable:
> use_async_message_store_recovery=true
>
> We use that because we used to have some queues with a large amount of
> messages.  I wanted to bring it up in case it matters.  But we've been
> using
> that setting for a while and have never had this issue.
>
>
> In node2's log file you can see the following messages after I made node2
> the master that concern me:
>
> 2020-01-24 11:00:00,211 DEBUG [Feeder Output for dixon01]
> (c.s.j.r.i.n.Feeder) - dixon02 Feeder output thread for replica dixon01
> started at VLSN 7,024 master at 7,023 (DTVLSN:7,022) VLSN delta=-1
> socket=(dixon01(1))com.sleepycat.je.rep.utilint.net
> .SimpleDataChannel@34d03b0c
> 2020-01-24 11:00:00,243 INFO  [StateChange-dixonbroker:dixon02]
> (q.m.h.role_changed) - [Broker] [grp(/dixonbroker)] HA-1010 : Role change
> reported: Node : 'dixon02' (spgmqtst2:5011) : from 'WAITING' to 'MASTER'
> 2020-01-24 11:00:00,243 INFO  [Queue Recoverer : app_test (vh:
> dixonbroker)]
> (q.m.t.recovered) - [Broker] [vh(/dixonbroker)/ms(ProvidedBDBMessageStore)]
> TXN-1005 : Recovered 0 messages for queue app_test
>
>
> See that last log message - Recovered 0 messages for queue app_test.
>  Isn't
> that an issue?  I would expect to see 50 messages recovered.  Perhaps I'm
> misunderstanding the log messages.
>
> Thanks
> Bryan
>
>
>
>
> --
> Sent from:
> http://qpid.2158936.n2.nabble.com/Apache-Qpid-users-f2158936.html
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
> For additional commands, e-mail: users-help@qpid.apache.org
>
>

Re: Broker-J 7.0.2 BDB_HA not replicating messages correctly

Posted by Bryan Dixon <br...@bldixon.net>.

Alex, I always use the highest durability sync policy for Master and Replica:
SYNC.

Also, for the virtual host, I am using this custom context variable:
use_async_message_store_recovery=true

We use that because we used to have some queues with a large amount of
messages.  I wanted to bring it up in case it matters.  But we've been using
that setting for a while and have never had this issue.


In node2's log file you can see the following messages after I made node2
the master that concern me:

2020-01-24 11:00:00,211 DEBUG [Feeder Output for dixon01]
(c.s.j.r.i.n.Feeder) - dixon02 Feeder output thread for replica dixon01
started at VLSN 7,024 master at 7,023 (DTVLSN:7,022) VLSN delta=-1
socket=(dixon01(1))com.sleepycat.je.rep.utilint.net.SimpleDataChannel@34d03b0c
2020-01-24 11:00:00,243 INFO  [StateChange-dixonbroker:dixon02]
(q.m.h.role_changed) - [Broker] [grp(/dixonbroker)] HA-1010 : Role change
reported: Node : 'dixon02' (spgmqtst2:5011) : from 'WAITING' to 'MASTER'
2020-01-24 11:00:00,243 INFO  [Queue Recoverer : app_test (vh: dixonbroker)]
(q.m.t.recovered) - [Broker] [vh(/dixonbroker)/ms(ProvidedBDBMessageStore)]
TXN-1005 : Recovered 0 messages for queue app_test


See that last log message - Recovered 0 messages for queue app_test.   Isn't
that an issue?  I would expect to see 50 messages recovered.  Perhaps I'm
misunderstanding the log messages.

Thanks
Bryan




--
Sent from: http://qpid.2158936.n2.nabble.com/Apache-Qpid-users-f2158936.html

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
For additional commands, e-mail: users-help@qpid.apache.org

Re: Broker-J 7.0.2 BDB_HA not replicating messages correctly

Posted by Oleksandr Rudyy <or...@gmail.com>.

Hi Bryan,

I briefly looked through the provided logs.
Based on information reported in the logs the "node1" (which was in MASTER
role) sent the replicated data to both node2 and node3 and ended up the
replication on VLSN  7,024.

2020-01-24 10:59:59,554 DEBUG [ServiceDispatcherPool_dixon01(-1)]
(c.s.j.r.i.n.MasterTransfer) - dixon01 Start Master Transfer for 60000
msec, targeting: [dixon02]
2020-01-24 10:59:59,554 DEBUG [ServiceDispatcherPool_dixon01(-1)]
(c.s.j.r.i.n.MasterTransfer) - dixon01 Master Transfer progress: dixon02,
7,023, phase: 2, endVLSN: 7,023
...
2020-01-24 11:00:00,195 DEBUG [Feeder Output for dixon02]
(c.s.j.r.i.n.Feeder) - dixon01 Feeder output for replica dixon02 shutdown.
feeder VLSN: 7,024 currentTxnEndVLSN: 7,023
2020-01-24 11:00:00,195 DEBUG [Feeder Output for dixon03]
(c.s.j.r.i.n.Feeder) - dixon01 Feeder output for replica dixon03 shutdown.
feeder VLSN: 7,024 currentTxnEndVLSN: 7,023
...
2020-01-24 11:00:00,195 DEBUG [MASTER dixon01(1)]
(c.s.j.r.i.n.FeederManager) - dixon01 Feeder manager exited. CurrentTxnEnd
VLSN: 7,023

Based on collected logs for "node2", it seems that "node2" received the
replicated data up to VLSN 7023:
2020-01-24 11:00:00,196 DEBUG [REPLICA dixon02(2)] (c.s.j.r.i.n.Replica) -
dixon02 ReplicaOutputThread soft shutdown initiated.
2020-01-24 11:00:00,196 DEBUG [ReplicaOutputThread] (c.s.j.r.i.RepImpl) -
dixon02 Thread[ReplicaOutputThread,5,main]exited. Singleton acks sent:49
Grouped acks sent:4 Max pending acks:3
2020-01-24 11:00:00,196 DEBUG [REPLICA dixon02(2)] (c.s.j.r.i.n.Replica) -
dixon02 Exiting inner Replica loop.
2020-01-24 11:00:00,196 DEBUG [REPLICA dixon02(2)] (c.s.j.r.i.n.Replica) -
dixon02 Replica stats - Lag waits: 0 Lag wait time: 0ms.  VLSN waits: 0 Lag
wait time: 0ms.

The node resumed VLSN from 7,024 as expected and started replication feed
from VLSN 7,024 to both node3 and node1:

2020-01-24 11:00:00,196 DEBUG [Feeder Input]
(c.s.j.r.s.FeederReplicaHandshake) - dixon02 Feeder-replica dixon03
handshake completed. Replica dixon03 Versions JE: 7.4.5 Log: 14 Protocol: 6
Stream Log: 13
2020-01-24 11:00:00,196 DEBUG [Feeder Input for dixon03]
(c.s.j.r.s.FeederReplicaSyncup) - dixon02 Feeder-replica dixon03 syncup
started. Feeder range: first=1 last=7,023 sync=7,023 txnEnd=7,023
...
2020-01-24 11:00:00,211 DEBUG [Feeder Input for dixon03]
(c.s.j.r.s.FeederReplicaSyncup) - dixon02 Feeder-replica dixon03 start
stream at VLSN: 7,024
2020-01-24 11:00:00,211 DEBUG [Feeder Input for dixon03]
(c.s.j.r.s.FeederReplicaSyncup) - dixon02 Feeder-replica dixon03 syncup
ended. Elapsed time: 15ms
...
2020-01-24 11:00:00,211 DEBUG [Feeder Input]
(c.s.j.r.s.FeederReplicaHandshake) - dixon02 Feeder-replica dixon01
handshake completed. Replica dixon01 Versions JE: 7.4.5 Log: 14 Protocol: 6
Stream Log: 13
2020-01-24 11:00:00,211 DEBUG [Feeder Input for dixon01]
(c.s.j.r.s.FeederReplicaSyncup) - dixon02 Feeder-replica dixon01 syncup
started. Feeder range: first=1 last=7,023 sync=7,023 txnEnd=7,023
...
2020-01-24 11:00:00,211 DEBUG [Feeder Input for dixon01]
(c.s.j.r.s.FeederReplicaSyncup) - dixon02 Feeder-replica dixon01 start
stream at VLSN: 7,024
2020-01-24 11:00:00,211 DEBUG [Feeder Input for dixon01]
(c.s.j.r.s.FeederReplicaSyncup) - dixon02 Feeder-replica dixon01 syncup
ended. Elapsed time: 0ms

The logs above suggest that replication was successful. The node2 should
have received 50 messages published into queue "app_test" on "dixonbroker".

Just out of curiosity (I did not find that in the logs), what are the sync
policies for master/replicas and acknowledgement policy is configured in
your group?
I just would like to verify that replicated data is properly synced into
disk. Perhaps, you might need to use stronger syncing policy....

Kind Regards,
Alex

On Fri, 24 Jan 2020 at 16:30, Bryan Dixon <br...@bldixon.net> wrote:

> We've been using Broker-J 7.0.2 BDB_HA successfully (on Windows servers)
> for
> a couple years now without issue.  I was working on upgrading to Broker-J
> 7.1.6 and performing some tests in our Test Broker-J cluster prior to the
> upgrade and noticed that BDB_HA message replication wasn't working properly
> and we've never had an issue like this before.  By not working properly,
> what is happening is when publishing messages to the master virtual host
> node and then forcing (via the admin console) another virtual host node to
> become the master, the messages I just added are not in the queue in the
> new
> master.  And if I publish messages in the new master, those messages aren't
> in the queue in the next node I make the master.
>
> One difference in our Test environment from our Dev and Production
> environment (which all use 3 nodes/servers) is that a couple months ago we
> replaced 2 nodes (2 VMs - swapped out Windows 2008 Server with Windows 2016
> Server for those 2 VMs).   For that process I just changed the Virtual Host
> Node to add the 2 new VMs and then removed the old nodes (old Windows2008
> VMs).  Everything appeared to be fine after that change but I didn't run
> the
> tests I am running now so perhaps that is what is causing this issue.
>
> To see if this replication issue was for our existing BDB_HA virtual host
> (the VH that existed prior to the Windows VM replacement discussed
> previously) I created a brand new virtual host (dixonbroker) in our Test
> Qpid environment and added a virtual host node on each of our VMs in Test.
> I am still getting the same behavior where messages aren't replicating
> correctly to each node when that node becomes master.  To help with
> troubleshooting I used this new dixonbroker virtual host for the tests I've
> documented below.
>
> I've attached log files from all 3 nodes with some debugging enabled (I can
> add more debugging and run the tests again if it helps).
>
>
> Here are the steps of my test:
> 1) Node1 was master.  I created a new durable queue named app_test.  I have
> a little Java app using qpid-jms-client .31 (I tried .44 and it still
> produced the same bad results).  I use Spring's JmsTemplate to publish 50
> messages.  This was done at 11:00 (to help see the flow in the log files).
> 2) I made Node2 the master and there were no messages in app_test queue
> (see
> log message 2020-01-24 11:00:00,243 INFO  [Queue Recoverer : app_test (vh:
> dixonbroker)] (q.m.t.recovered) - [Broker]
> [vh(/dixonbroker)/ms(ProvidedBDBMessageStore)] TXN-1005 : Recovered 0
> messages for queue app_test in the node2 log file).
> 3) I made Node3 the master.  The 50 messages DID exist in the app_test
> queue.  I then added 50 more messages to app_test queue (so total of 100)
> while Node3 was the master.  This was at 11:02.
> 4) I made Node1 the master.  It only had 50 messages in app_test queue -
> should have had 100.
> 5) I made Node2 the master. It still didn't have any messages in app_test
> queue - should have had 100.
>
>
>
> qpid-node1.log
> <http://qpid.2158936.n2.nabble.com/file/t396334/qpid-node1.log>
> qpid-node2.log
> <http://qpid.2158936.n2.nabble.com/file/t396334/qpid-node2.log>
> qpid-node3.log
> <http://qpid.2158936.n2.nabble.com/file/t396334/qpid-node3.log>
>
>
>
> --
> Sent from:
> http://qpid.2158936.n2.nabble.com/Apache-Qpid-users-f2158936.html
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
> For additional commands, e-mail: users-help@qpid.apache.org
>
>