Posted to dev@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2018/10/22 18:47:00 UTC

[jira] [Created] (HBASE-21358) Snapshot procedure fails but SnapshotManager thinks it is still running

Andrew Purtell created HBASE-21358:
--------------------------------------

Summary: Snapshot procedure fails but SnapshotManager thinks it is still running
Key: HBASE-21358
URL: https://issues.apache.org/jira/browse/HBASE-21358
Project: HBase
Issue Type: Bug
Components: snapshots
Affects Versions: 1.3.2
Reporter: Andrew Purtell

A snapshot procedure fails due to a chaos test action, but the snapshot manager still thinks it is running. The test client spins needlessly, checking for something that will never complete. We give up eventually, but we could fail this a lot faster.

On the integration client we are checking and re-checking:

2018-10-20 01:06:11,718 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: Getting current status of snapshot from master...
2018-10-20 01:06:11,719 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: (#40) Sleeping: 8571ms while waiting for snapshot completion.
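
The check-and-sleep pattern in these client log lines can be modeled roughly as follows. This is a minimal sketch, not the HBaseAdmin implementation: SnapshotWaitSketch, Status, Outcome, Result, and waitForSnapshot are hypothetical names, and the poll count and sleep interval are arbitrary. It only illustrates that a wait loop ends immediately when a failure is reported, but runs to its retry limit when the answer is always "in progress".

```java
import java.util.function.Supplier;

// Hypothetical model of a client-side completion wait.
// None of these names come from HBaseAdmin; they exist only for illustration.
class SnapshotWaitSketch {
    enum Status { IN_PROGRESS, DONE, FAILED }
    enum Outcome { DONE, FAILED, TIMED_OUT }

    // How the wait ended and how many status checks it took.
    static final class Result {
        final Outcome outcome;
        final int polls;

        Result(Outcome outcome, int polls) {
            this.outcome = outcome;
            this.polls = polls;
        }
    }

    // Asks the (assumed) status source up to maxPolls times, sleeping between checks.
    static Result waitForSnapshot(Supplier<Status> master, int maxPolls, long sleepMillis) {
        int polls = 0;
        while (polls < maxPolls) {
            polls++;
            Status status = master.get();
            if (status == Status.DONE) {
                return new Result(Outcome.DONE, polls);
            }
            if (status == Status.FAILED) {
                // A reported failure ends the wait on the spot.
                return new Result(Outcome.FAILED, polls);
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt and stop waiting; reported as TIMED_OUT below.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return new Result(Outcome.TIMED_OUT, polls);
    }

    public static void main(String[] args) {
        // Status source that never changes its answer, as in the report.
        Result stuck = waitForSnapshot(() -> Status.IN_PROGRESS, 5, 1L);
        System.out.println("always in progress: " + stuck.outcome + " after " + stuck.polls + " polls");

        // Status source that reports the failure right away.
        Result failed = waitForSnapshot(() -> Status.FAILED, 5, 1L);
        System.out.println("failure reported: " + failed.outcome + " after " + failed.polls + " polls");
    }
}
```

In this model the only way for the client to stop early is for the status source to report something other than "in progress", which is why the fix has to come from the master side.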

This is what it looks like on the master side each time the client checks in:

2018-10-20 01:04:54,565 DEBUG [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100] master.MasterRpcServices: Checking to see if snapshot from request:{ ss=IntegrationTestBigLinkedList-it-1539997289258 table=IntegrationTestBigLinkedList type=FLUSH } is done
2018-10-20 01:04:54,565 DEBUG [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100] snapshot.SnapshotManager: Snapshoting '{ ss=IntegrationTestBigLinkedList-it-1539997289258 table=IntegrationTestBigLinkedList type=FLUSH }' is still in progress!

There is no running procedure for the snapshot. The procedure has failed. The snapshot manager takes no useful action afterward and continues to believe the snapshot is in progress.
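
The mismatch described here can be illustrated with a toy tracker. This is a sketch under stated assumptions, not HBase's SnapshotManager: SnapshotTrackerSketch and its methods are hypothetical, and the report does not establish where the failure notification is actually lost. The point is only that if nothing records the procedure's failure, an "is it done?" check keeps answering "not yet", whereas a recorded failure lets the check fail fast.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of in-progress snapshot tracking.
// Names are illustrative only and are not HBase's SnapshotManager API.
class SnapshotTrackerSketch {
    enum State { RUNNING, SUCCEEDED, FAILED }

    // One entry per snapshot the tracker believes it knows about.
    private final Map<String, State> states = new ConcurrentHashMap<>();

    void start(String snapshot) {
        states.put(snapshot, State.RUNNING);
    }

    // Must be called when the underlying procedure ends.
    // If it is never called on failure, the entry stays RUNNING indefinitely.
    void procedureFinished(String snapshot, boolean success) {
        states.put(snapshot, success ? State.SUCCEEDED : State.FAILED);
    }

    // Answers the completion check. A recorded failure is surfaced as an
    // exception and the entry is dropped, so callers stop polling.
    boolean isDone(String snapshot) {
        State state = states.get(snapshot);
        if (state == null) {
            throw new IllegalArgumentException("unknown snapshot: " + snapshot);
        }
        if (state == State.FAILED) {
            states.remove(snapshot);
            throw new IllegalStateException("snapshot failed: " + snapshot);
        }
        return state == State.SUCCEEDED;
    }

    public static void main(String[] args) {
        SnapshotTrackerSketch tracker = new SnapshotTrackerSketch();
        tracker.start("example-snapshot");

        // Failure never recorded: the check reports "not done" every time.
        System.out.println("done? " + tracker.isDone("example-snapshot"));

        // Failure recorded: the next check throws instead of reporting progress.
        tracker.procedureFinished("example-snapshot", false);
        try {
            tracker.isDone("example-snapshot");
        } catch (IllegalStateException e) {
            System.out.println("check failed fast: " + e.getMessage());
        }
    }
}
```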

I see related complaints from the hfile archiver task afterward: empty directories, failure to parse protobuf in descriptor files... It seems there was junk left over in the filesystem from the failed snapshot. The master was soon restarted by a chaos action, and I no longer see these complaints, so the partially complete snapshot may have been cleaned up.

This is with 1.3.2, but patched to include the multithreaded hfile archiving improvements from later versions.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)