Posted to issues@hbase.apache.org by "Tak Lon (Stephen) Wu (JIRA)" <ji...@apache.org> on 2018/11/12 21:10:01 UTC
[jira] [Commented] (HBASE-21358) Snapshot procedure fails but SnapshotManager thinks it is still running

    [ https://issues.apache.org/jira/browse/HBASE-21358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684364#comment-16684364 ] 

Tak Lon (Stephen) Wu commented on HBASE-21358:
----------------------------------------------

@[~apurtell] may I try this one?  

In addition, do you know if there is any way to test this problem on a live cluster? Or would it be fine if I wrote a test to cover this case?

> Snapshot procedure fails but SnapshotManager thinks it is still running
> -----------------------------------------------------------------------
>
>                 Key: HBASE-21358
>                 URL: https://issues.apache.org/jira/browse/HBASE-21358
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 1.3.2
>            Reporter: Andrew Purtell
>            Priority: Major
>
> A snapshot procedure fails due to chaotic test action but the snapshot manager still thinks it is running. The test client spins needlessly checking for something that will never actually complete. We give up eventually but we could be failing this a lot faster. 
> On the integration client we are checking and re-checking: 
> 2018-10-20 01:06:11,718 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: Getting current status of snapshot from master... 
> 2018-10-20 01:06:11,719 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: (#40) Sleeping: 8571ms while waiting for snapshot completion. 
> This is what it looks like on the master side each time the client checks in: 
> 2018-10-20 01:04:54,565 DEBUG [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100] master.MasterRpcServices: Checking to see if snapshot from request:{ ss=IntegrationTestBigLinkedList-it-1539997289258 table=IntegrationTestBigLinkedList type=FLUSH } is done 
> 2018-10-20 01:04:54,565 DEBUG [RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100] snapshot.SnapshotManager: Snapshoting '{ ss=IntegrationTestBigLinkedList-it-1539997289258 table=IntegrationTestBigLinkedList type=FLUSH }' is still in progress! 
> There is no running procedure for the snapshot. The procedure has failed. The snapshot manager does not take any useful action afterward but believes the snapshot to still be in progress.
> I see related complaints from the hfile archiver task afterward: empty directories, failure to parse protobuf in descriptor files... It seems there was junk in the filesystem left over from the failed snapshot. The master was soon restarted by chaos action, and now I don't see these complaints, so that partially complete snapshot may have been cleaned up.
> This is with 1.3.2, but patched to include the multithreaded hfile archiving improvements from later versions.
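
For anyone thinking about a test for this, here is a minimal sketch of the behaviour the description asks for: once a snapshot procedure fails, the status check should report the failure so the polling client can stop right away instead of sleeping through all of its retries. Everything in it (SnapshotStatusSketch, Tracker, waitForSnapshot, SnapshotFailedException) is a hypothetical stand-in written for illustration, not HBase's SnapshotManager or HBaseAdmin code, and it only models the state bookkeeping, not the real procedure framework.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model for illustration only; none of these names are HBase classes.
public class SnapshotStatusSketch {

    enum State { RUNNING, SUCCEEDED, FAILED }

    /** Raised to the poller when the snapshot can never complete. */
    static class SnapshotFailedException extends Exception {
        SnapshotFailedException(String msg) { super(msg); }
    }

    /** Tracks the last known state of each snapshot by name. */
    static class Tracker {
        private final Map<String, State> states = new ConcurrentHashMap<>();

        void start(String name) { states.put(name, State.RUNNING); }
        void succeed(String name) { states.put(name, State.SUCCEEDED); }

        // The key step: a failed procedure must update the tracked state,
        // otherwise isDone() keeps answering "still in progress".
        void fail(String name) { states.put(name, State.FAILED); }

        /** Returns true when done, false while running; throws if failed or unknown. */
        boolean isDone(String name) throws SnapshotFailedException {
            State s = states.get(name);
            if (s == null) {
                throw new SnapshotFailedException("Unknown snapshot: " + name);
            }
            if (s == State.FAILED) {
                throw new SnapshotFailedException("Snapshot failed: " + name);
            }
            return s == State.SUCCEEDED;
        }
    }

    /** Polls up to maxAttempts times; returns how many checks were needed. */
    static int waitForSnapshot(Tracker t, String name, int maxAttempts, long sleepMs)
            throws SnapshotFailedException, InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (t.isDone(name)) {
                return attempt;
            }
            Thread.sleep(sleepMs);
        }
        throw new SnapshotFailedException("Timed out waiting for " + name);
    }

    public static void main(String[] args) throws Exception {
        Tracker t = new Tracker();
        t.start("snap1");
        t.fail("snap1"); // simulate the procedure dying under chaos action
        try {
            waitForSnapshot(t, "snap1", 40, 10);
        } catch (SnapshotFailedException e) {
            System.out.println("Stopped polling: " + e.getMessage());
        }
    }
}
```

A real test would need to drive an actual snapshot procedure to failure and then assert that the status check throws rather than reporting the snapshot as in progress; the sketch only shows the shape of that assertion.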



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)