Posted to users@activemq.apache.org by Neha Sareen <ne...@oracle.com> on 2018/07/17 21:01:49 UTC

Potential message loss seen with HA topology in Artemis 2.6.2 on failback

Hi,

 

We are setting up a cluster of 6 brokers using Artemis 2.6.2.

 

The cluster has 3 groups.

- Each group has one master, and one slave broker pair.

- The HA uses replication.

- Each master broker configuration has the flag 'check-for-live-server' set to true.

- Each slave broker configuration has the flag 'allow-failback' set to true.

- We use static connectors for allowing cluster topology discovery.

- Each broker's static connector list includes the connectors to the other 5 servers in the cluster.

- Each broker declares its acceptor.

- Each broker exports its own connector information via the 'connector-ref' configuration element.

- The acceptor and the connector URLs for each broker are identical with respect to host and port information (a rough broker.xml sketch follows below).
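
On the master side, the relevant broker.xml pieces look roughly like this (connector names, hosts, and ports are placeholders rather than our actual values, and unrelated settings are omitted):

   <connectors>
      <connector name="broker1-connector">tcp://host1:61616</connector>
      <connector name="broker2-connector">tcp://host2:61616</connector>
      <!-- plus connectors for the other brokers in the cluster -->
   </connectors>

   <acceptors>
      <acceptor name="broker1-acceptor">tcp://host1:61616</acceptor>
   </acceptors>

   <cluster-connections>
      <cluster-connection name="my-cluster">
         <connector-ref>broker1-connector</connector-ref>
         <static-connectors>
            <connector-ref>broker2-connector</connector-ref>
            <!-- plus the remaining servers, i.e. the other 5 in total -->
         </static-connectors>
      </cluster-connection>
   </cluster-connections>

   <ha-policy>
      <replication>
         <master>
            <check-for-live-server>true</check-for-live-server>
         </master>
      </replication>
   </ha-policy>

The slaves use the matching <slave> replication policy with 'allow-failback' set to true.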

 

We have a standalone test application that creates producers and consumers to send and receive messages, respectively, using a transacted JMS session.
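
Roughly, the test client does something like the following (broker URL, queue name, and message counts here are only illustrative):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class TransactedSendReceive {
    public static void main(String[] args) throws JMSException {
        // HA connection factory pointing at a master/slave pair (placeholder hosts).
        ConnectionFactory cf = new ActiveMQConnectionFactory(
                "(tcp://host1:61616,tcp://host2:61616)?ha=true&reconnectAttempts=-1");
        try (Connection connection = cf.createConnection()) {
            connection.start();
            // Transacted session: work only becomes durable/visible on commit().
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            Queue queue = session.createQueue("exampleQueue");

            MessageProducer producer = session.createProducer(queue);
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);
            for (int i = 0; i < 9; i++) {
                producer.send(session.createTextMessage("message-" + i));
            }
            session.commit();   // sends become persistent on the broker here

            MessageConsumer consumer = session.createConsumer(queue);
            for (int i = 0; i < 9; i++) {
                TextMessage m = (TextMessage) consumer.receive(5_000);
                System.out.println(m != null ? m.getText() : "<receive timed out>");
            }
            session.commit();   // acknowledges the receives
        }
    }
}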

 

We are trying to execute an automatic failover test case followed by failback as follows:

Test Case 1

Step 1: Master & Standby alive

Step 2: Producer sends messages, say 9 messages

Step 3: Kill Master

Step 4: Producer sends messages, say another 9 messages

Step 5: Kill Standby

Step 6: Start Master

Step 7: Start Standby.

What we see is that the Standby syncs with the Master, discarding its internal state, and we are able to consume only 9 messages, leading to a loss of 9 messages.

 

 

Test Case 2

Step 1: Master & Standby alive

Step 2: Producer sends messages

Step 3: Kill Master

Step 4: Producer sends messages

Step 5: Kill Standby

Step 6: Start Standby (it waits for the Master)

Step 7: Start Master (question: does it wait for the slave?)

Step 8: Consume messages

 

Can someone provide any insights here regarding the potential message loss?

Also, are there alternative topologies we could use here to get around this issue?

 

Thanks

Neha

 

 

RE: Potential message loss seen with HA topology in Artemis 2.6.2 on failback

Posted by Udayan Sahu <ud...@oracle.com>.
I also thought that bringing up the slave before the master would solve the problem, but it didn't...

The slave waits with the message:
"AMQ221109: Apache ActiveMQ Artemis Backup Server version 2.6.2 [null] started, waiting live to fail before it gets active"

As soon as the master is started, it says:

AMQ221024: Backup server ActiveMQServerImpl::serverUUID=e0c8c135-8834-11e8-a326-0a0027000014 is synchronized with live-server.
AMQ221031: backup announced


As we want failback functionality, we have used the following in the slave:

    <max-saved-replicated-journals-size>0</max-saved-replicated-journals-size>

I have a strong feeling that this may be messing things up; please confirm.
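
For reference, in our slave broker.xml that element sits inside the replication <slave> policy, roughly like this (other settings omitted):

   <ha-policy>
      <replication>
         <slave>
            <allow-failback>true</allow-failback>
            <max-saved-replicated-journals-size>0</max-saved-replicated-journals-size>
         </slave>
      </replication>
   </ha-policy>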

Thanks

--- Udayan Sahu


Re: Potential message loss seen with HA topology in Artemis 2.6.2 on failback

Posted by Clebert Suconic <cl...@gmail.com>.
You could have another passive backup that would step in when M1 is killed; it could then become the backup.

But if the node is alone and you killed it, you need to start it first.


Re: Potential message loss seen with HA topology in Artemis 2.6.2 on failback

Posted by Clebert Suconic <cl...@gmail.com>.
At the moment you have to start the server that was live most recently first.

I know there's a task to compare the age of the journals before synchronizing, but it's not done yet.


RE: Potential message loss seen with HA topology in Artemis 2.6.2 on failback

Posted by Udayan Sahu <ud...@oracle.com>.
It's a simple HA subsystem, with a simple ask for a replicated state system: it should start from the last committed state…

 

Step 1: Master (M1) & Standby (S1) alive

Step 2: Producer sends 10 messages -> M1 receives them and replicates them to S1

Step 3: Kill Master (M1) -> S1 becomes the new Master

Step 4: Producer sends 10 messages -> S1 receives the messages, but they are not replicated since M1 is down

Step 5: Kill Standby (S1)

Step 6: Start Master (M1)

Step 7: Start Standby (S1) (it syncs with Master (M1), discarding its internal state)

This is wrong. M1 should sync with S1 since S1 represents the current state of the queue.

 

How can we protect the Step 4 messages from being lost? We are using a transacted session and calling commit to make sure messages are persisted.

 

--- Udayan Sahu

 

 


Re: Potential message loss seen with HA topology in Artemis 2.6.2 on failback

Posted by Clebert Suconic <cl...@gmail.com>.
HA is about preserving the journals between failures.

When you read and send messages you may still have a failure during the reading. I would need to understand what you do in case of a failure with your consumer and producer.

Retries on send and duplicate detection are key for your case.

You could also play with XA and a transaction manager.
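
A rough sketch of what retrying the send with duplicate detection could look like on the client side (the helper, retry count, and back-off are illustrative; "_AMQ_DUPL_ID" is the duplicate-detection property the broker checks, so resending after an unconfirmed commit is not applied twice):

import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;

final class ReliableSender {
    // Resend with the same _AMQ_DUPL_ID so a retry after a failed or
    // unconfirmed commit cannot create a duplicate on the broker.
    static void sendWithRetry(Session session, MessageProducer producer,
                              String text, String dupId)
            throws JMSException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                TextMessage m = session.createTextMessage(text);
                m.setStringProperty("_AMQ_DUPL_ID", dupId);  // same id on every attempt
                producer.send(m);
                session.commit();
                return;
            } catch (JMSException e) {
                try { session.rollback(); } catch (JMSException ignored) { }
                if (attempt >= 5) throw e;        // give up after a few attempts
                Thread.sleep(1_000L * attempt);   // back off before retrying
            }
        }
    }
}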
