Posted to issues@activemq.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/12/27 02:41:00 UTC
[jira] [Commented] (ARTEMIS-1570) SharedNothingBackup does not
replicate all journal from live
[ https://issues.apache.org/jira/browse/ARTEMIS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16304151#comment-16304151 ]
ASF GitHub Bot commented on ARTEMIS-1570:
-----------------------------------------
GitHub user shoukunhuai opened a pull request:
https://github.com/apache/activemq-artemis/pull/1742
ARTEMIS-1570 Flush appendExecutor before take journal snapshot
When the live server starts replication, it must make sure there are
no pending writes in the message and bindings journals; otherwise we may
lose journal records during initial replication.
So we need to flush the append executor after acquiring the StorageManager's
write lock and before acquiring the Journal's write lock.
We also set a 10-second timeout on the flush, the same as
Journal::flushExecutor. If the flush does not complete within 10 seconds,
we abort replication and the backup will try again later.
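
As a rough illustration of the flush-with-timeout idea described above (not the code in this pull request), the following sketch drains a single-threaded executor by queueing a marker task and waiting for it with a timeout. The class name FlushExecutorSketch, the flush helper, and the plain java.util.concurrent executor are assumptions made for the example; the real Artemis journal uses its own executor types.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: names here are illustrative and not taken from Artemis.
public class FlushExecutorSketch {

    /**
     * Waits until every task queued on a single-threaded, FIFO executor
     * before this call has finished. Returns false if the timeout expires first.
     */
    static boolean flush(ExecutorService singleThreadExecutor, long timeout, TimeUnit unit)
            throws InterruptedException {
        CountDownLatch marker = new CountDownLatch(1);
        // The marker runs only after all previously queued tasks have run.
        singleThreadExecutor.execute(marker::countDown);
        return marker.await(timeout, unit);
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService appendExecutor = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for a slow pending journal append.
            appendExecutor.execute(() -> sleepQuietly(200));
            boolean flushed = flush(appendExecutor, 10, TimeUnit.SECONDS);
            System.out.println(flushed
                    ? "flushed, safe to take the snapshot"
                    : "timed out, abort replication and let the backup retry");
        } finally {
            appendExecutor.shutdownNow();
        }
    }
}
```

The marker-task approach relies on the executor running tasks one at a time in submission order; with a multi-threaded pool the marker could complete while earlier tasks are still running.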
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/shoukunhuai/activemq-artemis flush-journal-executor
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/activemq-artemis/pull/1742.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1742
----
commit 3d4c45925f2fe274579b262cedd63103ef29cb4e
Author: shoukun <sh...@...>
Date: 2017-12-27T02:23:33Z
Flush appendExecutor before take journal snapshot
When the live server starts replication, it must make sure there are
no pending writes in the message and bindings journals; otherwise we may
lose journal records during initial replication.
So we need to flush the append executor after acquiring the StorageManager's
write lock and before acquiring the Journal's write lock.
We also set a 10-second timeout on the flush, the same as
Journal::flushExecutor. If the flush does not complete within 10 seconds,
we abort replication and the backup will try again later.
----
> SharedNothingBackup does not replicate all journal from live
> ------------------------------------------------------------
>
> Key: ARTEMIS-1570
> URL: https://issues.apache.org/jira/browse/ARTEMIS-1570
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker
> Affects Versions: 2.4.0
> Environment: I'm running the unit test on Windows.
> Reporter: shoukun huai
> Priority: Critical
> Attachments: SharedNothingReplicationTest.java
>
>
> I am trying to test replication while the live server is under heavy IO load.
> My JUnit test is attached.
> The test uses a slow message persister to simulate a live server that is busy with IO, so that JournalImpl's `appendExecutor` is busy.
> After starting the live server, the test sends 5 messages, each with a `delay` property of 5000 ms, then starts the backup server and waits until it is replicated. It then sends more messages without a delay.
> After all messages have been sent, it stops the live and backup servers and checks the message journal.
> The backup is missing 2 messages (journal entries).
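
The scenario in the quoted description can be approximated without a broker. The sketch below is a simplified model, not the attached SharedNothingReplicationTest: the class PendingAppendRaceSketch, the in-memory list standing in for the journal, and the latch standing in for the slow persister are all assumptions. It shows how a snapshot taken while appends are still queued misses those records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the race: none of these names come from Artemis.
public class PendingAppendRaceSketch {

    /** Returns {records seen by the snapshot, records in the journal once all appends ran}. */
    static int[] snapshotWithoutFlush(int pendingAppends) throws InterruptedException {
        List<String> journal = new CopyOnWriteArrayList<>();
        ExecutorService appendExecutor = Executors.newSingleThreadExecutor();
        CountDownLatch ioStalled = new CountDownLatch(1);
        try {
            // The first task blocks, standing in for a persister stuck on slow IO.
            appendExecutor.execute(() -> {
                try {
                    ioStalled.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // These appends are queued behind the stalled task.
            for (int i = 0; i < pendingAppends; i++) {
                String record = "record-" + i;
                appendExecutor.execute(() -> journal.add(record));
            }
            // Snapshot taken without flushing the executor first.
            List<String> snapshot = new ArrayList<>(journal);
            // Release the stalled task and let the queued appends complete.
            ioStalled.countDown();
            appendExecutor.shutdown();
            appendExecutor.awaitTermination(10, TimeUnit.SECONDS);
            return new int[] {snapshot.size(), journal.size()};
        } finally {
            appendExecutor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] result = snapshotWithoutFlush(5);
        System.out.println("snapshot=" + result[0] + " journal=" + result[1]);
    }
}
```

In this model the snapshot is always empty because every append is held back until after the copy; in the real test the number of missed records depends on timing.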