You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org> on 2016/10/07 09:29:21 UTC

[jira] [Updated] (HBASE-16788) Race in compacted file deletion between HStore close() and closeAndArchiveCompactedFiles()

     [ https://issues.apache.org/jira/browse/HBASE-16788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-16788:
-------------------------------------------
    Attachment: HBASE-16788_1.patch

[~ghelmling]
Pls check this patch attached here. 
Am now making the removecompactedFiles() single threaded. Not adding any synchronized or new lock. Because if we do that then when we do clearCompactedfiles we again need the write lock. So if we go with any new locks releasing the lock will be problem without causing deadlock. Hence went with wait-notify way. I think making removeCompactedFiles() single threaded may be fine because in the case described here, it is better the close() works on the updated list and its better we make the close wait. Because what the close is going to do is being done by another thread. 
We could easily move removecompactedFiles() inside readLock to avoid this problem but there are cases where we frequently obtain writeLocks and all those wil be blocked because removecompactedFiles() is quite heavy. 
One thing to see is if the current discharge handler thread is not completed and it comes again with the same set of file list, then how are we going to handle it? We could do one improvement here like if for some reason removeStoreFiles() in HRegionFileSystem throws an error and it happens only for say 1 among 10 files it tries to remove, then we should collect the failed list and only that has to be retried next time. Remaining can be removed. I can raise a JIRA for that. 

> Race in compacted file deletion between HStore close() and closeAndArchiveCompactedFiles()
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-16788
>                 URL: https://issues.apache.org/jira/browse/HBASE-16788
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.3.0
>            Reporter: Gary Helmling
>            Assignee: Gary Helmling
>            Priority: Blocker
>         Attachments: HBASE-16788_1.patch
>
>
> HBASE-13082 changed the way that compacted files are archived from being done inline on compaction completion to an async cleanup by the CompactedHFilesDischarger chore.  It looks like the changes to HStore to support this introduced a race condition in the compacted HFile archiving.
> In the following sequence, we can wind up with two separate threads trying to archive the same HFiles, causing a regionserver abort:
> # compaction completes normally and the compacted files are added to {{compactedfiles}} in HStore's DefaultStoreFileManager
> # *threadA*: CompactedHFilesDischargeHandler runs in a RS executor service, calling closeAndArchiveCompactedFiles()
> ## obtains HStore readlock
> ## gets a copy of compactedfiles
> ## releases readlock
> # *threadB*: calls HStore.close() as part of region close
> ## obtains HStore writelock
> ## calls DefaultStoreFileManager.clearCompactedfiles(), getting a copy of same compactedfiles
> # *threadA*: calls HStore.removeCompactedfiles(compactedfiles)
> ## archives files in {compactedfiles} in HRegionFileSystem.removeStoreFiles()
> ## call HStore.clearCompactedFiles()
> ## waits on write lock
> # *threadB*: continues with close()
> ## calls removeCompactedfiles(compactedfiles)
> ## calls HRegionFIleSystem.removeStoreFiles() -> HFileArchiver.archiveStoreFiles()
> ## receives FileNotFoundException because the files have already been archived by threadA
> ## throws IOException
> # RS aborts
> I think the combination of fetching the compactedfiles list and removing the files needs to be covered by locking.  Options I see are:
> * Modify HStore.closeAndArchiveCompactedFiles(): use writelock instead of readlock and move the call to removeCompactedfiles() inside the lock.  This means the read operations will be blocked while the files are being archived, which is bad.
> * Synchronize closeAndArchiveCompactedFiles() and modify close() to call it instead of calling removeCompactedfiles() directly
> * Add a separate lock for compacted files removal and use in closeAndArchiveCompactedFiles() and close()



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)