You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "Barry Oglesby (JIRA)" <ji...@apache.org> on 2015/12/11 01:56:11 UTC
[jira] [Created] (GEODE-657) Two backups occurring simultaneously can cause the BackupLock to be obtained but not released

Barry Oglesby created GEODE-657:
-----------------------------------

             Summary: Two backups occurring simultaneously can cause the BackupLock to be obtained but not released
                 Key: GEODE-657
                 URL: https://issues.apache.org/jira/browse/GEODE-657
             Project: Geode
          Issue Type: Bug
          Components: persistence
            Reporter: Barry Oglesby


Backup is done in three phases, namely {{FlushToDiskRequest}}, {{PrepareBackupRequest}} and {{FinishBackupRequest}}.

Here is some debugging in the locator showing the backup phases:
{noformat}
[warning 2015/12/01 14:37:38.783 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers STARTED

[warning 2015/12/01 14:37:38.794 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send FlushToDiskRequest

[warning 2015/12/01 14:37:38.977 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to FlushToDiskRequest

[warning 2015/12/01 14:37:38.977 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send PrepareBackupRequest

[warning 2015/12/01 14:37:38.984 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to PrepareBackupRequest

[warning 2015/12/01 14:37:38.985 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send FinishBackupRequest

[warning 2015/12/01 14:37:51.184 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to FinishBackupRequest

[warning 2015/12/01 14:37:51.184 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers COMPLETED
{noformat}
Here is some debugging in one of the members. The others are similar.
{noformat}
[warning 2015/12/01 14:37:38.795 PST  <Pooled Message Processor 2> tid=0x66] FlushToDiskRequest createResponse received FlushToDiskRequest

[warning 2015/12/01 14:37:38.974 PST  <Pooled Message Processor 2> tid=0x66] FlushToDiskRequest createResponse sending FlushToDiskResponse

[warning 2015/12/01 14:37:38.978 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupRequest createResponse received PrepareBackupRequest

[warning 2015/12/01 14:37:38.978 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupResponse createResponse: About to start backup

[warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupResponse createResponse: About to prepare backup

[warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] BackupLock lockForBackup invoked

[warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] PrepareBackupRequest createResponse sending PrepareBackupResponse

[warning 2015/12/01 14:37:38.986 PST  <Pooled Message Processor 2> tid=0x66] FinishBackupRequest createResponse received FinishBackupRequest

[warning 2015/12/01 14:37:39.970 PST  <Pooled Message Processor 2> tid=0x66] BackupLock unlockForBackup invoked

[warning 2015/12/01 14:37:49.885 PST  <Pooled Message Processor 2> tid=0x66] FinishBackupRequest createResponse sending FinishBackupResponse
{noformat}
Processing the {{PrepareBackupRequest}} is where the {{BackupLock}} is taken (in {{BackupLock lockForBackup}}). Processing the {{FinishBackupRequest}} is where it is released (in {{BackupLock unlockForBackup}}).

All members operate independently. If one completes its backup before the others, the issue can happen. That member is available to start a new backup, but the others aren't.

In this case, the member successfully processes the {{PrepareBackupRequest}} and takes the {{BackupLock}}.

The other members do not successfully process the {{PrepareBackupRequest}} and do not take the {{BackupLock}}:
{noformat}
[warning 2015/12/01 14:46:02.447 PST  <Pooled Message Processor 3> tid=0xd7] PrepareBackupRequest createResponse received PrepareBackupRequest

[warning 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] PrepareBackupResponse createResponse: About to start backup

[error 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] Error processing request class com.gemstone.gemfire.admin.internal.PrepareBackupRequest.
java.io.IOException: Backup already in progress
	at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.startBackup(GemFireCacheImpl.java:4315)
	at com.gemstone.gemfire.admin.internal.PrepareBackupRequest.createResponse(PrepareBackupRequest.java:87)
	at com.gemstone.gemfire.internal.admin.remote.CliLegacyMessage.process(CliLegacyMessage.java:28)
	at com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:386)
	at com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:457)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:692)
	at com.gemstone.gemfire.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:963)
	at java.lang.Thread.run(Thread.java:745)

[warning 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] PrepareBackupResponse createResponse: Returning exception java.io.IOException: Backup already in progress
{noformat}
Once this happens, the members are out of sync. One has taken the {{BackupLock}}; the others have not. The member that has taken the {{BackupLock}} will never release it.

One solution would be to prevent any backup from starting if one is in progress (in the locator). Another would be to force the member to release the {{BackupLock}} if other members fail to get it. I think this would require a new message, though.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)