You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "Dan Smith (JIRA)" <ji...@apache.org> on 2016/02/10 19:51:18 UTC

[jira] [Assigned] (GEODE-657) Two backups occurring simultaneously can cause the BackupLock to be obtained but not released

     [ https://issues.apache.org/jira/browse/GEODE-657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dan Smith reassigned GEODE-657:
-------------------------------

    Assignee: Dan Smith

> Two backups occurring simultaneously can cause the BackupLock to be obtained but not released
> ---------------------------------------------------------------------------------------------
>
>                 Key: GEODE-657
>                 URL: https://issues.apache.org/jira/browse/GEODE-657
>             Project: Geode
>          Issue Type: Bug
>          Components: persistence
>            Reporter: Barry Oglesby
>            Assignee: Dan Smith
>
> Backup is done in three phases, namely {{FlushToDiskRequest}}, {{PrepareBackupRequest}} and {{FinishBackupRequest}}.
> Here is some debugging in the locator showing the backup phases:
> {noformat}
> [warning 2015/12/01 14:37:38.783 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers STARTED
> [warning 2015/12/01 14:37:38.794 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send FlushToDiskRequest
> [warning 2015/12/01 14:37:38.977 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to FlushToDiskRequest
> [warning 2015/12/01 14:37:38.977 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send PrepareBackupRequest
> [warning 2015/12/01 14:37:38.984 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to PrepareBackupRequest
> [warning 2015/12/01 14:37:38.985 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send FinishBackupRequest
> [warning 2015/12/01 14:37:51.184 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to FinishBackupRequest
> [warning 2015/12/01 14:37:51.184 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers COMPLETED
> {noformat}
> Here is some debugging in one of the members. The others are similar.
> {noformat}
> [warning 2015/12/01 14:37:38.795 PST  <Pooled Message Processor 2> tid=0x66] FlushToDiskRequest createResponse received FlushToDiskRequest
> [warning 2015/12/01 14:37:38.974 PST  <Pooled Message Processor 2> tid=0x66] FlushToDiskRequest createResponse sending FlushToDiskResponse
> [warning 2015/12/01 14:37:38.978 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupRequest createResponse received PrepareBackupRequest
> [warning 2015/12/01 14:37:38.978 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupResponse createResponse: About to start backup
> [warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupResponse createResponse: About to prepare backup
> [warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] BackupLock lockForBackup invoked
> [warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupRequest createResponse sending PrepareBackupResponse
> [warning 2015/12/01 14:37:38.986 PST  <Pooled Message Processor 2> tid=0x66] FinishBackupRequest createResponse received FinishBackupRequest
> [warning 2015/12/01 14:37:39.970 PST  <Pooled Message Processor 2> tid=0x66] BackupLock unlockForBackup invoked
> [warning 2015/12/01 14:37:49.885 PST  <Pooled Message Processor 2> tid=0x66] FinishBackupRequest createResponse sending FinishBackupResponse
> {noformat}
> Processing the {{PrepareBackupRequest}} is where the {{BackupLock}} is taken (in {{BackupLock lockForBackup}}). Processing the {{FinishBackupRequest}} is where it is released (in {{BackupLock unlockForBackup}}).
> All members operate independently. If one completes its backup before the others, the issue can happen. That member is available to start a new backup, but the others aren't.
> In this case, the member successfully processes the {{PrepareBackupRequest}} and takes the {{BackupLock}}.
> The other members do not successfully process the {{PrepareBackupRequest}} and do not take the {{BackupLock}}:
> {noformat}
> [warning 2015/12/01 14:46:02.447 PST  <Pooled Message Processor 3> tid=0xd7] PrepareBackupRequest createResponse received PrepareBackupRequest
> [warning 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] PrepareBackupResponse createResponse: About to start backup
> [error 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] Error processing request class com.gemstone.gemfire.admin.internal.PrepareBackupRequest.
> java.io.IOException: Backup already in progress
> 	at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.startBackup(GemFireCacheImpl.java:4315)
> 	at com.gemstone.gemfire.admin.internal.PrepareBackupRequest.createResponse(PrepareBackupRequest.java:87)
> 	at com.gemstone.gemfire.internal.admin.remote.CliLegacyMessage.process(CliLegacyMessage.java:28)
> 	at com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:386)
> 	at com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:457)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:692)
> 	at com.gemstone.gemfire.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:963)
> 	at java.lang.Thread.run(Thread.java:745)
> [warning 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] PrepareBackupResponse createResponse: Returning exception java.io.IOException: Backup already in progress
> {noformat}
> Once this happens, the members are out of sync. One has taken the {{BackupLock}}; the others have not. The member that has taken the {{BackupLock}} will never release it.
> One solution would be to prevent any backup from starting if one is in progress (in the locator). Another would be to force the member to release the {{BackupLock}} if other members fail to get it. I think this would require a new message, though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)