Posted to issues@geode.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2016/02/12 02:38:18 UTC

[jira] [Commented] (GEODE-657) Two backups occurring simultaneously can cause the BackupLock to be obtained but not released

    [ https://issues.apache.org/jira/browse/GEODE-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143864#comment-15143864 ] 

ASF subversion and git services commented on GEODE-657:
-------------------------------------------------------

Commit 3adb0b8e17436af555e1acbe2bf729bf72365305 in incubator-geode's branch refs/heads/develop from [~upthewaterspout]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=3adb0b8 ]

GEODE-657: Add an abort flag to the FinishBackupMessage

If the prepare phase of a backup fails, it can leave the backup in a
prepared state on several nodes. The FinishBackupMessage should always
be sent to clean up the backup. This change adds an abort flag to
FinishBackupMessage and always sends that message, regardless of
failures during the prepare phase.
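On the member side, the fix can be sketched roughly as follows. The class and method names below are hypothetical stand-ins (the real change is in the commit above); the point is that an aborting finish releases the BackupLock taken during prepare without writing any backup files.

```java
// Sketch of the member-side handling of the new abort flag on
// FinishBackupMessage. BackupManager here is a hypothetical stand-in
// for the member's backup state.
public class FinishBackupSketch {
    static class BackupManager {
        boolean locked;       // BackupLock taken during prepare
        boolean filesCopied;  // backup files written during finish

        void cleanup()      { locked = false; }                      // release the lock only
        void finishBackup() { filesCopied = true; locked = false; }  // copy files, then release
    }

    // Hypothetical stand-in for processing a FinishBackupMessage.
    static void process(BackupManager mgr, boolean abort) {
        if (abort) {
            mgr.cleanup();       // prepare failed elsewhere: just unwind the prepared state
        } else {
            mgr.finishBackup();  // normal completion
        }
    }

    public static void main(String[] args) {
        BackupManager aborted = new BackupManager();
        aborted.locked = true;
        process(aborted, true);
        assert !aborted.locked && !aborted.filesCopied;

        BackupManager finished = new BackupManager();
        finished.locked = true;
        process(finished, false);
        assert !finished.locked && finished.filesCopied;
        System.out.println("abort releases the lock without writing a backup");
    }
}
```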


> Two backups occurring simultaneously can cause the BackupLock to be obtained but not released
> ---------------------------------------------------------------------------------------------
>
>                 Key: GEODE-657
>                 URL: https://issues.apache.org/jira/browse/GEODE-657
>             Project: Geode
>          Issue Type: Bug
>          Components: persistence
>            Reporter: Barry Oglesby
>            Assignee: Dan Smith
>
> Backup is done in three phases, namely {{FlushToDiskRequest}}, {{PrepareBackupRequest}} and {{FinishBackupRequest}}.
> Here is some debugging in the locator showing the backup phases:
> {noformat}
> [warning 2015/12/01 14:37:38.783 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers STARTED
> [warning 2015/12/01 14:37:38.794 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send FlushToDiskRequest
> [warning 2015/12/01 14:37:38.977 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to FlushToDiskRequest
> [warning 2015/12/01 14:37:38.977 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send PrepareBackupRequest
> [warning 2015/12/01 14:37:38.984 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to PrepareBackupRequest
> [warning 2015/12/01 14:37:38.985 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send FinishBackupRequest
> [warning 2015/12/01 14:37:51.184 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to FinishBackupRequest
> [warning 2015/12/01 14:37:51.184 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers COMPLETED
> {noformat}
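> The sequence in the log above can be sketched as a coordinator loop. The three request names come from the issue; everything else (the Member class, the method names) is a simplified stand-in for the real distribution messages in com.gemstone.gemfire.admin.internal:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the three-phase backup coordination, assuming every
// member answers every request. The real coordinator sends
// FlushToDiskRequest, PrepareBackupRequest and FinishBackupRequest as
// distribution messages; here members are plain local objects.
public class BackupCoordinatorSketch {
    static class Member {
        boolean flushed, prepared, finished;
        void flushToDisk()   { flushed = true; }   // FlushToDiskRequest
        void prepareBackup() { prepared = true; }  // takes the BackupLock
        void finishBackup()  { finished = true; }  // copies files, releases the BackupLock
    }

    static void backupAllMembers(List<Member> members) {
        // Phase 1: flush all pending writes to disk.
        for (Member m : members) m.flushToDisk();
        // Phase 2: prepare -- each member takes its BackupLock here.
        for (Member m : members) m.prepareBackup();
        // Phase 3: finish -- each member completes the backup and releases the lock.
        for (Member m : members) m.finishBackup();
    }

    public static void main(String[] args) {
        List<Member> members = new ArrayList<>();
        for (int i = 0; i < 3; i++) members.add(new Member());
        backupAllMembers(members);
        for (Member m : members) {
            if (!(m.flushed && m.prepared && m.finished))
                throw new AssertionError("phase skipped");
        }
        System.out.println("all three phases completed on every member");
    }
}
```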
> Here is some debugging in one of the members. The others are similar.
> {noformat}
> [warning 2015/12/01 14:37:38.795 PST  <Pooled Message Processor 2> tid=0x66] FlushToDiskRequest createResponse received FlushToDiskRequest
> [warning 2015/12/01 14:37:38.974 PST  <Pooled Message Processor 2> tid=0x66] FlushToDiskRequest createResponse sending FlushToDiskResponse
> [warning 2015/12/01 14:37:38.978 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupRequest createResponse received PrepareBackupRequest
> [warning 2015/12/01 14:37:38.978 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupResponse createResponse: About to start backup
> [warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupResponse createResponse: About to prepare backup
> [warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] BackupLock lockForBackup invoked
> [warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupRequest createResponse sending PrepareBackupResponse
> [warning 2015/12/01 14:37:38.986 PST  <Pooled Message Processor 2> tid=0x66] FinishBackupRequest createResponse received FinishBackupRequest
> [warning 2015/12/01 14:37:39.970 PST  <Pooled Message Processor 2> tid=0x66] BackupLock unlockForBackup invoked
> [warning 2015/12/01 14:37:49.885 PST  <Pooled Message Processor 2> tid=0x66] FinishBackupRequest createResponse sending FinishBackupResponse
> {noformat}
> Processing the {{PrepareBackupRequest}} is where the {{BackupLock}} is taken (in {{BackupLock lockForBackup}}). Processing the {{FinishBackupRequest}} is where it is released (in {{BackupLock unlockForBackup}}).
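> That lock discipline can be sketched as below (hypothetical; the real BackupLock is a Geode-internal class). A Semaphore is used in the sketch because the prepare and finish requests arrive on different threads, so unlike a ReentrantLock the releasing thread need not be the acquiring thread:

```java
import java.util.concurrent.Semaphore;

// Sketch of the BackupLock discipline: lockForBackup is called while
// processing PrepareBackupRequest, unlockForBackup while processing
// FinishBackupRequest. If unlockForBackup is never called, no later
// backup on this member can ever acquire the permit.
public class BackupLockSketch {
    private final Semaphore backupPermit = new Semaphore(1);

    // Called during PrepareBackupRequest; returns false if a backup is
    // already in progress (the "Backup already in progress" case).
    public boolean lockForBackup() {
        return backupPermit.tryAcquire();
    }

    // Called during FinishBackupRequest; must always run eventually,
    // or the member is stuck holding the lock.
    public void unlockForBackup() {
        backupPermit.release();
    }

    public static void main(String[] args) {
        BackupLockSketch lock = new BackupLockSketch();
        if (!lock.lockForBackup()) throw new AssertionError();  // first backup takes the lock
        if (lock.lockForBackup()) throw new AssertionError();   // overlapping backup refused
        lock.unlockForBackup();                                 // finish releases it
        if (!lock.lockForBackup()) throw new AssertionError();  // next backup can start
        System.out.println("lock refused overlapping backup, then recovered");
    }
}
```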
> All members operate independently, so one member can complete its backup before the others. When that happens, the early member is available to start a new backup, but the others are not.
> In this case, the member successfully processes the {{PrepareBackupRequest}} and takes the {{BackupLock}}.
> The other members do not successfully process the {{PrepareBackupRequest}} and do not take the {{BackupLock}}:
> {noformat}
> [warning 2015/12/01 14:46:02.447 PST  <Pooled Message Processor 3> tid=0xd7] PrepareBackupRequest createResponse received PrepareBackupRequest
> [warning 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] PrepareBackupResponse createResponse: About to start backup
> [error 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] Error processing request class com.gemstone.gemfire.admin.internal.PrepareBackupRequest.
> java.io.IOException: Backup already in progress
> 	at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.startBackup(GemFireCacheImpl.java:4315)
> 	at com.gemstone.gemfire.admin.internal.PrepareBackupRequest.createResponse(PrepareBackupRequest.java:87)
> 	at com.gemstone.gemfire.internal.admin.remote.CliLegacyMessage.process(CliLegacyMessage.java:28)
> 	at com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:386)
> 	at com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:457)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:692)
> 	at com.gemstone.gemfire.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:963)
> 	at java.lang.Thread.run(Thread.java:745)
> [warning 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] PrepareBackupResponse createResponse: Returning exception java.io.IOException: Backup already in progress
> {noformat}
> Once this happens, the members are out of sync. One has taken the {{BackupLock}}; the others have not. The member that has taken the {{BackupLock}} will never release it.
> One solution would be to prevent any backup from starting if one is in progress (in the locator). Another would be to force the member to release the {{BackupLock}} if other members fail to get it. I think this would require a new message, though.
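> The second approach is essentially what the commit above implements: the coordinator always sends the finish message, flagged as an abort when the prepare phase failed, so any member that took the {{BackupLock}} releases it. A minimal sketch (hypothetical interface; the real messages carry more state):

```java
import java.util.List;

// Sketch of the coordinator-side fix: the finish message is sent
// unconditionally, with abort=true if any prepare failed, so no member
// is left holding the BackupLock.
public class AlwaysFinishSketch {
    interface Member {
        void prepareBackup() throws Exception;  // may throw "Backup already in progress"
        void finishBackup(boolean abort);       // always releases the BackupLock
    }

    static boolean backup(List<Member> members) {
        boolean abort = false;
        try {
            for (Member m : members) m.prepareBackup();
        } catch (Exception e) {
            abort = true;  // remember that the prepare phase failed
        } finally {
            // Sent regardless of failures during the prepare.
            for (Member m : members) m.finishBackup(abort);
        }
        return !abort;
    }

    public static void main(String[] args) {
        final int[] finishes = {0};
        final boolean[] sawAbort = {false};
        Member ok = new Member() {
            public void prepareBackup() {}
            public void finishBackup(boolean abort) { finishes[0]++; sawAbort[0] |= abort; }
        };
        Member failing = new Member() {
            public void prepareBackup() throws Exception {
                throw new java.io.IOException("Backup already in progress");
            }
            public void finishBackup(boolean abort) { finishes[0]++; sawAbort[0] |= abort; }
        };
        boolean success = backup(java.util.Arrays.asList(ok, failing));
        if (success) throw new AssertionError();
        if (finishes[0] != 2) throw new AssertionError();  // finish reached every member
        if (!sawAbort[0]) throw new AssertionError();      // and was flagged as an abort
        System.out.println("finish sent to all members with abort=true");
    }
}
```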



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)