Posted to users@activemq.apache.org by warm-sun <wa...@tutanota.com> on 2019/06/19 22:05:28 UTC

Re: Artemis Disaster Recovery options

I have a very similar scenario to the original post. (Multi data center
replication is required)
I have read all the documentation -- but am unclear about a couple of
points:

1) Red Hat AMQ 7 (which uses Artemis under the hood) recommends, in its
"configuring broker" documentation, NOT using [HA replication] across
data centers.
What is the Artemis position (not for AMQ, but when using Artemis directly)?
Is this HA replication always synchronous/blocking, or is there an
asynchronous version too?
If the network goes down between master and slave, what happens to the
service the production master brokers provide (does it block clients)?

2) For high performance scenarios: is it still the recommendation to use
asynchronous DRBD
(https://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device) across
data centers?

3) Using asynchronous replication can lead to small message loss and
imperfect replication of the journal. How resilient is Artemis to these
small corruptions of the journal? Can it start the broker and ignore the
"corrupt"/incomplete replica blocks?

4) Is there any existing documentation on this? This is what I found:
https://www.linbit.com/downloads/tech-guides/DRBD8_ActiveMQ_HA_and_DR_on_RHEL7.pdf
(ActiveMQ but not Artemis)
https://www.rabbitmq.com/pacemaker.html (DRBD as well)




--
Sent from: http://activemq.2283324.n4.nabble.com/ActiveMQ-User-f2341805.html

Re: Artemis Disaster Recovery options

Posted by Justin Bertram <jb...@apache.org>.

> So when the network is down between master and slave (e.g. the slave's
network card fails)... the master will keep ACK-ing messages it receives in
this case?

In general, yes. As I said before, though, the master can be configured to
initiate a quorum vote, in which case it will shut itself down if it finds
that it's isolated.

> If so -- is there some timeout setting that tells the master how long it
should wait before considering the slave as failed?

As soon as the network between the master and slave fails then the master
considers the slave as failed. This is not configurable.

> Aren't [messages, acks, etc] stored as transactions within the broker
journal

Only records which *need* to be stored as part of a transaction are stored
as such. Lots of records stand on their own, not as a part of a
transaction. All that said, I'm not sure it's really relevant.

> So if there is an incomplete/corrupt transaction in the journal wouldn't
the broker just roll it back and ignore it?

Essentially, yes. At least that's what I understand from reading the code
[1].
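
To make the "roll it back and ignore it" idea concrete, here is a minimal
Python sketch. It is not Artemis code and does not mirror the real journal
format: the Record type, the record kinds ("add", "tx_add", "tx_commit") and
load_journal are all hypothetical names invented for illustration. It only
shows the general technique of dropping transactional records whose commit
marker was never found on reload, while keeping standalone records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    kind: str                     # hypothetical kinds: "add", "tx_add", "tx_commit"
    payload: str = ""
    tx_id: Optional[int] = None   # only set for transactional records


def load_journal(records):
    """Return (surviving payloads, ids of transactions that were dropped)."""
    loaded = []
    pending = {}  # tx_id -> payloads seen so far, in order
    for rec in records:
        if rec.kind == "add":
            # Standalone record: it is kept as soon as it is read.
            loaded.append(rec.payload)
        elif rec.kind == "tx_add":
            pending.setdefault(rec.tx_id, []).append(rec.payload)
        elif rec.kind == "tx_commit":
            # Commit marker found: the transaction's records become visible.
            loaded.extend(pending.pop(rec.tx_id, []))
    # Whatever is still pending never saw a commit marker, so it is ignored.
    return loaded, sorted(pending)


journal = [
    Record("add", "m1"),
    Record("tx_add", "m2", tx_id=7),
    Record("tx_commit", tx_id=7),
    Record("tx_add", "m3", tx_id=8),  # the commit for tx 8 never reached the disk
]
loaded, rolled_back = load_journal(journal)
print(loaded)       # ['m1', 'm2']
print(rolled_back)  # [8]
```

In this toy model an incomplete transaction costs you its messages but does
not stop the load. Whether the real journal tolerates every kind of partial
write produced by an external replication tool is a separate question that
this sketch does not answer.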


Justin

[1]
https://github.com/apache/activemq-artemis/blob/master/artemis-journal/src/main/java/org/apache/activemq/artemis/core/journal/impl/JournalImpl.java#L488


Re: Artemis Disaster Recovery options

Posted by Clebert Suconic <cl...@gmail.com>.

Ohhh, you're talking about rsync. Right now the journal reuses files once
they have been reclaimed. Perhaps if we disabled reclaim for your case,
every write would go to a new file; nothing would need to be reused, and you
wouldn't have this case of corruption.

On Mon, Jun 24, 2019 at 2:40 PM Clebert Suconic
<cl...@gmail.com> wrote:
>
> I don't understand what you mean by small corruptions of the journal.
> We write to the backup and wait for confirmation, so clients are
> blocked until the backup has a copy of the data.
>
> --
> Clebert Suconic



-- 
Clebert Suconic

Re: Artemis Disaster Recovery options

Posted by Clebert Suconic <cl...@gmail.com>.

I don't understand what you mean by small corruptions of the journal.
We write to the backup and wait for confirmation, so clients are
blocked until the backup has a copy of the data.


On Sun, Jun 23, 2019 at 8:24 PM warm-sun <wa...@tutanota.com> wrote:
>
> >>> Technically speaking, replication is asynchronous. However, the broker
> will not send a response to the client until it has received a reply from
> the slave that the data has been received.
> ...
> >>> If the network between the master and slave goes down then by default
> >>> the master continues like nothing happened.
>
> So when the network is down between master and slave (e.g. the slave's
> network card fails)... the master will keep ACK-ing messages it receives
> in this case?
> If so -- is there some timeout setting that tells the master how long it
> should wait before considering the slave as failed?
>
> -----
> >>> For what it's worth, I've found that data integrity and high performance
> >>> are generally at odds with each other.
> Agreed. A balance has to be found. We will be replicating with the built-in
> replication on the LAN within the same data centers... and replicating via
> asynchronous DRBD across data centers (best effort -- but not guaranteed)
>
> -----
> >>> I'm not sure the kind of data loss you're describing has ever been
> >>> tested. As far as I know, Artemis expects the data it writes to the
> >>> journal to still be in the journal when it is re-loaded. In general, I
> >>> would expect that any solution designed to ensure data integrity would
> >>> consider message loss or imperfect replication an unacceptable failure.
>
> A small amount of message loss may be acceptable across data center
> replication -- but corrupt journals that a broker cannot use may not be. A
> couple of questions on this very important point:
> 1)
> Aren't [messages, acks, etc] stored as transactions within the broker
> journal ie:
> begin
>   message
> end
>
> So if there is an incomplete/corrupt transaction in the journal wouldn't the
> broker just roll it back and ignore it?
> This would lead to message loss but not journal corruption. Is this how the
> journal / broker work?
>
> 2)
> What if only journal blocks that were full were replicated to the other data
> center... would this ensure an uncorrupted journal? ie does the broker /
> journal only store complete messages at the end of journal blocks (as the
> journal block is getting full)? Another way to say this: can the journal
> store 1/2 a message at the end of 1 journal block and the other 1/2 at the
> next journal block?
>
> Case 1) would seem to be the preferable solution (imho)... is a feature
> request for this likely to be implemented?



-- 
Clebert Suconic

Re: Artemis Disaster Recovery options

Posted by warm-sun <wa...@tutanota.com>.

>>> Technically speaking, replication is asynchronous. However, the broker
will not send a response to the client until it has received a reply from
the slave that the data has been received.
...
>>> If the network between the master and slave goes down then by default
>>> the master continues like nothing happened.

So when the network is down between master and slave (e.g. the slave's
network card fails)... the master will keep ACK-ing messages it receives in
this case? If so -- is there some timeout setting that tells the master how
long it should wait before considering the slave as failed?

-----
>>> For what it's worth, I've found that data integrity and high performance
>>> are generally at odds with each other.
Agreed. A balance has to be found. We will be replicating with the built-in
replication on the LAN within the same data centers... and replicating via
asynchronous DRBD across data centers (best effort -- but not guaranteed)

-----
>>> I'm not sure the kind of data loss you're describing has ever been
>>> tested. As far as I know, Artemis expects the data it writes to the
>>> journal to still be in the journal when it is re-loaded. In general, I
>>> would expect that any solution designed to ensure data integrity would
>>> consider message loss or imperfect replication an unacceptable failure.

A small amount of message loss may be acceptable across data center
replication -- but corrupt journals that a broker cannot use may not be. A
couple of questions on this very important point:
1)
Aren't [messages, acks, etc] stored as transactions within the broker
journal, i.e.:
begin
  message
end

So if there is an incomplete/corrupt transaction in the journal wouldn't the
broker just roll it back and ignore it?
This would lead to message loss but not journal corruption. Is this how the
journal / broker work?

2)
What if only journal blocks that were full were replicated to the other data
center... would this ensure an uncorrupted journal? I.e. does the broker /
journal only store complete messages at the end of journal blocks (as the
journal block is getting full)? Another way to say this: can the journal
store half a message at the end of one journal block and the other half at
the start of the next journal block?

Case 1) would seem to be the preferable solution (imho)... is a feature
request for this likely to be implemented?




Re: Artemis Disaster Recovery options

Posted by Justin Bertram <jb...@apache.org>.

> Red Hat AMQ 7 (which uses Artemis under the hood) recommends, in its
"configuring broker" documentation, NOT using [HA replication] across
data centers. What is the Artemis position (not for AMQ, but when using
Artemis directly)?

Replication was designed to be used across a low-latency, high-performance
network connection which generally is not the kind of connection you have
between data centers. It doesn't matter whether you're using Artemis
standalone or embedded into something else.

> Is this HA replication always: synchronous/blocking or is there an
asynchronous version too?

Technically speaking, replication is asynchronous. However, the broker will
not send a response to the client until it has received a reply from the
slave that the data has been received. Therefore, once the client receives
the response from the broker it is assured that both the master and the
slave have the data.
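
As an illustration only, the following Python sketch models that ordering:
the replication packet is handed off without blocking the local write, but
the reply to the client is held until the backup has acknowledged. The
Master and Backup classes, their methods and the 5-second timeout are
hypothetical and are not part of any Artemis API.

```python
from concurrent.futures import ThreadPoolExecutor


class Backup:
    """Hypothetical stand-in for the slave broker."""

    def __init__(self):
        self.journal = []

    def replicate(self, message):
        self.journal.append(message)
        return True  # acknowledgement travelling back to the master


class Master:
    """Hypothetical stand-in for the master broker."""

    def __init__(self, backup):
        self.journal = []
        self.backup = backup
        self.pool = ThreadPoolExecutor(max_workers=1)

    def send(self, message):
        # Hand the packet to the replication channel without waiting for it...
        ack = self.pool.submit(self.backup.replicate, message)
        # ...write locally in the meantime...
        self.journal.append(message)
        # ...and only answer the client once the backup has acknowledged.
        ack.result(timeout=5)
        return "OK"

    def close(self):
        self.pool.shutdown()


backup = Backup()
master = Master(backup)
response = master.send("order-42")
master.close()
print(response, master.journal, backup.journal)
```

From the client's point of view the call behaves synchronously even though
the transfer to the backup is not a blocking write on the master.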

> If the network goes down between master and slave: what happens to the
service the prod master brokers provide (does it block clients)?

If the network between the master and slave goes down then by default the
master continues like nothing happened. The master can be configured to
initiate a quorum vote in this situation to determine whether or not it
should remain alive or shut itself down (e.g. in the case that it's
isolated).
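
A rough sketch of that decision, in Python, might look like the following.
The function names, the vote_enabled flag and the way votes are counted are
assumptions made for illustration; the real voting protocol and its
configuration are described in the Artemis high-availability documentation
and may differ in detail.

```python
def has_quorum(votes, quorum_size):
    """votes: replies from the other brokers the master could still reach.

    A reply of True means "I still see you as the live broker".
    """
    return sum(1 for vote in votes if vote) >= quorum_size


def on_replication_lost(votes, quorum_size, vote_enabled):
    """Decide what the master does when it loses its slave (hypothetical)."""
    if not vote_enabled:
        # Default behaviour described above: carry on like nothing happened.
        return "stay live"
    return "stay live" if has_quorum(votes, quorum_size) else "shut down"


print(on_replication_lost([], 2, vote_enabled=False))           # stay live
print(on_replication_lost([True, True], 2, vote_enabled=True))  # stay live
print(on_replication_lost([], 2, vote_enabled=True))            # shut down
```

The last call is the isolated case: the master hears from nobody, fails to
reach a quorum and stops, rather than risk two live brokers.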

> For high performance scenarios: is it still the recommendation to use
asynchronous DRBD (
https://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device) across
data centers?

If you're referring to what I said in the original response on this thread
then I think "recommendation" is too strong of a word. I was simply
offering an idea of what I believed might work.

For what it's worth, I've found that data integrity and high performance
are generally at odds with each other.

> Using asynchronous replication can lead to small message loss and
imperfect replication of the journal. How resilient is Artemis to these
small corruptions of the journal? Can it start the broker and ignore the
"corrupt"/incomplete replica blocks?

I'm not sure the kind of data loss you're describing has ever been tested.
As far as I know, Artemis expects the data it writes to the journal to
still be in the journal when it is re-loaded.

In general, I would expect that any solution designed to ensure data
integrity would consider message loss or imperfect replication an
unacceptable failure.

> Is there any existing documentation on this?

I'm not aware of any.

Justin
