Posted to issues@hbase.apache.org by "Cosmin Lehene (JIRA)" <ji...@apache.org> on 2010/04/22 20:13:50 UTC

[jira] Updated: (HBASE-2437) Refactor HLog splitLog

     [ https://issues.apache.org/jira/browse/HBASE-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HBASE-2437:
---------------------------------

    Attachment: HBASE-2437_for_HBase-0.21_with_unit_tests_for_HDFS-0.21.patch

The patch is not final, so intended for trunk, but I'd appreciate a code review.

some of the changes:
* splitLog was refactored - the logic is easier to follow now
* logs are left in place if something goes wrong
* if a split is interrupted or crashes, the second split will start from zero (it still has all the original log files), and hence will delete any oldlogfile.log found under the regionserver.
* protect from zombie HRS that writes some more to hlog after split started (using recoverLog)
* protect from deleting a log file that was created by a zombie HRS after split has started.
* skip.errors=false means that whenever something goes wrong and we might lose edits, we abort, leaving the logs in place
* skip.errors=true tolerates some errors: if a corrupted hlog file is encountered, read what you can and continue, then archive the corrupted log file.
* deal with empty log files
* etc.
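
To make the skip.errors behaviour concrete, here is a minimal in-memory sketch. It is not code from the patch: the class SplitPolicySketch, its LogFile and SplitResult types, and the corruptAt field are all hypothetical, and it assumes the semantics given in the issue description (skip.errors off aborts and leaves everything in place; skip.errors on keeps what could be read and sets the corrupted file aside). Real splitting works on HDFS files, not lists.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPolicySketch {

    /** Hypothetical stand-in for one log file; corruptAt < 0 means it reads cleanly to the end. */
    static final class LogFile {
        final String name;
        final List<String> edits;
        final int corruptAt;

        LogFile(String name, List<String> edits, int corruptAt) {
            this.name = name;
            this.edits = edits;
            this.corruptAt = corruptAt;
        }
    }

    /** Outcome of one split attempt. */
    static final class SplitResult {
        final List<String> recoveredEdits = new ArrayList<>();
        final List<String> archivedLogs = new ArrayList<>();
        final List<String> corruptedLogs = new ArrayList<>();
        boolean aborted = false;
    }

    static SplitResult split(List<LogFile> logs, boolean skipErrors) {
        SplitResult result = new SplitResult();
        List<String> cleanLogs = new ArrayList<>();
        List<String> corruptLogs = new ArrayList<>();

        for (LogFile log : logs) {
            boolean hitCorruption = false;
            for (int i = 0; i < log.edits.size(); i++) {
                if (i == log.corruptAt) {
                    hitCorruption = true;
                    break; // nothing after this point in the file is readable
                }
                result.recoveredEdits.add(log.edits.get(i));
            }
            if (!hitCorruption) {
                cleanLogs.add(log.name); // also covers empty log files
            } else if (skipErrors) {
                corruptLogs.add(log.name); // keep what was read, set the file aside
            } else {
                // Strict mode: drop partial output and leave every log where it is.
                result.recoveredEdits.clear();
                result.aborted = true;
                return result;
            }
        }

        // Logs are only moved once every file has been split.
        result.archivedLogs.addAll(cleanLogs);
        result.corruptedLogs.addAll(corruptLogs);
        return result;
    }

    public static void main(String[] args) {
        List<LogFile> logs = new ArrayList<>();
        logs.add(new LogFile("log.1", List.of("a", "b"), -1));
        logs.add(new LogFile("log.2", List.of("c", "d", "e"), 1));

        System.out.println("tolerant recovered: " + split(logs, true).recoveredEdits);
        System.out.println("strict aborted: " + split(logs, false).aborted);
    }
}
```

In this model a retry after an aborted split simply starts again from the full set of original logs, which is why nothing is archived or deleted until the loop has finished.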

* added unit tests for the changes listed above
* unit test class has tools to generate log files, leave them open, corrupt them, etc.

The unit (or rather integration) tests are designed for hdfs-0.21, but could be adapted with small changes.

I initially did the refactoring trying to avoid the recoverLog method (which opens the file for append, then closes it, to make sure the file is closed) because it took too long to wait for the lease. However, if a regionserver that was considered dead (a zombie) keeps writing to those files, the only way to make sure we won't lose edits is to make sure the file is closed. (Trying to rename the file before splitting it still allows a writer thread to keep writing for a few seconds after the rename.) I created testLogCannotBeWrittenOnceParsed for this.


In unit tests I set the lease period for a file to 100ms in the setUp method to avoid waiting 60 seconds in the unit tests. 
getDFSCluster().getNamesystem().leaseManager.setLeasePeriod(100, 50000);

Apparently getNamesystem() is not available on hdfs-0.20.

Also, I use hflush() in the unit tests to write data to a log file and then leave it open. If the data is not flushed and the file is left open, the changes might not be seen by the reader. hflush() could be avoided if the open-file scenarios could be ignored.


> Refactor HLog splitLog
> ----------------------
>
>                 Key: HBASE-2437
>                 URL: https://issues.apache.org/jira/browse/HBASE-2437
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.21.0
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>             Fix For: 0.21.0
>
>         Attachments: HBASE-2437_for_HBase-0.21_with_unit_tests_for_HDFS-0.21.patch
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> HLog.splitLog got really long, complex, and hard to verify for correctness.
> I started to refactor it and also ported changes from hbase-2337 that deal with premature deletion of log files in case of errors. Further improvements will be possible, however the scope of this issue is to clean the code and make it behave correctly (i.e. not lose any edits).
> Added a suite of unit tests that might be ported to 0.20 as well.
> Added a setting (hbase.skip.errors - feel free to suggest a better name) that, when set to false, will make the process less tolerant to failures or corrupted files: if a log file is corrupted or an error stops the process from consistently splitting the log, the entire operation is aborted to avoid losing any edits. When hbase.skip.errors is on, any corrupted files will be partially parsed and then moved to the corrupted logs archive (see hbase-2337).
> Like hbase-2337, the splitLog method will first split all the logs and then proceed to archive them. If any split log file (oldlogfile.log) that is the result of an earlier splitLog attempt is found in the region directory, it will be deleted - this is safe since we won't move the original log files until the splitLog process completes.
