Posted to issues@hbase.apache.org by "Nick Dimiduk (JIRA)" <ji...@apache.org> on 2013/02/27 21:09:14 UTC
[jira] [Commented] (HBASE-5210) HFiles are missing from an incremental load
[ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588702#comment-13588702 ]
Nick Dimiduk commented on HBASE-5210:
-------------------------------------
Can this issue be reproduced in a more modern HBase? Can we close this as WON'T FIX as we sunset the 0.90 line?
> HFiles are missing from an incremental load
> -------------------------------------------
>
> Key: HBASE-5210
> URL: https://issues.apache.org/jira/browse/HBASE-5210
> Project: HBase
> Issue Type: Bug
> Components: mapreduce
> Affects Versions: 0.90.2
> Environment: HBase 0.90.2 with Hadoop-0.20.2 (with durable sync). RHEL 2.6.18-164.15.1.el5. 4 node cluster (1 master, 3 slaves)
> Reporter: Lawrence Simpson
> Attachments: HBASE-5210-crazy-new-getRandomFilename.patch
>
>
> We run an overnight map/reduce job that loads data from an external source and adds that data to an existing HBase table. The input files have been loaded into HDFS. The map/reduce job uses HFileOutputFormat (and the TotalOrderPartitioner) to create HFiles, which are subsequently added to the HBase table. On at least two separate occasions (that we know of), a range of output was missing for a given day. The range of keys for the missing values corresponded to that of a particular region. This implied that a complete HFile somehow went missing from the job. Further investigation revealed the following:
> Two different reducers (running in separate JVMs and thus separate class loaders) on the same server can end up using the same file names for their HFiles. The scenario is as follows:
> 1. Both reducers start near the same time.
> 2. The first reducer reaches the point where it wants to write its first file.
> 3. It uses the StoreFile class, which contains a static Random object that is initialized by default using a timestamp.
> 4. The file name is generated using the random number generator.
> 5. The file name is checked against other existing files.
> 6. The file is written into temporary files in a directory named after the reducer attempt.
> 7. The second reduce task reaches the same point, but its StoreFile class (which is now in the file system's cache) gets loaded within the time resolution of the OS and thus initializes its Random() object with the same seed as the first task.
> 8. The second task also checks for an existing file with the name generated by the random number generator and finds no conflict, because each task is writing files in its own temporary folder.
> 9. The first task finishes and gets its temporary files committed to the "real" folder specified for output of the HFiles.
> 10. The second task then reaches its own conclusion and commits its files (moveTaskOutputs). The released Hadoop code simply overwrites any files with the same name, with no warning message. The first task's HFiles just go missing.
>
> Note: The reducers here are NOT different attempts at the same reduce task. They are different reduce tasks, so data is really lost.
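The collision in steps 3 through 8 can be illustrated with a minimal Java sketch. It is not the HBase code: the class SeedCollisionDemo and the methods randomFileName and uniqueFileName are hypothetical names made up for this example, and the UUID-based variant is only one possible way to avoid depending on a time-based seed, not necessarily what the attached patch does.

```java
import java.util.Random;
import java.util.UUID;

class SeedCollisionDemo {
    // Hypothetical stand-in for the file-name generator described above:
    // the name is derived from the next value of the supplied Random.
    static String randomFileName(Random rng) {
        return Long.toHexString(rng.nextLong());
    }

    // One possible alternative (an assumption, not the attached patch):
    // a name that does not depend on a time-based seed.
    static String uniqueFileName() {
        return UUID.randomUUID().toString().replace("-", "");
    }

    public static void main(String[] args) {
        // Simulate two JVMs whose generators were seeded with the same timestamp.
        long seed = System.currentTimeMillis();
        Random firstReducer = new Random(seed);
        Random secondReducer = new Random(seed);

        String a = randomFileName(firstReducer);
        String b = randomFileName(secondReducer);
        System.out.println("same seed, same name: " + a.equals(b));

        System.out.println("UUID-based names differ: "
                + !uniqueFileName().equals(uniqueFileName()));
    }
}
```

Two java.util.Random instances constructed with the same seed produce the same sequence, so both simulated reducers derive an identical name even though neither sees the other's temporary directory.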
> I am currently testing a fix in which I have added code to the Hadoop
> FileOutputCommitter.moveTaskOutputs method to check for a conflict with
> an existing file in the final output folder and to rename the HFile if
> needed. This may not be appropriate for all uses of FileOutputFormat.
> So I have put this into a new class which is then used by a subclass of
> HFileOutputFormat. Subclassing of FileOutputCommitter itself was a bit
> more of a problem due to private declarations.
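The idea of checking for a conflict and renaming at commit time might look roughly like the sketch below. It is an illustration only, not the reporter's code and not the Hadoop FileOutputCommitter API: it uses java.nio.file on a local file system as a stand-in for HDFS, and the class NoOverwriteCommit, the method moveWithoutOverwrite, and the "_N" suffix scheme are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

class NoOverwriteCommit {
    // Moves src into destDir. If a file with the same name is already there,
    // retries with a numeric suffix instead of overwriting it.
    static Path moveWithoutOverwrite(Path src, Path destDir) throws IOException {
        String name = src.getFileName().toString();
        Path target = destDir.resolve(name);
        for (int attempt = 1; ; attempt++) {
            try {
                // Without REPLACE_EXISTING, Files.move fails if the target exists
                // (the default file system reports FileAlreadyExistsException).
                return Files.move(src, target);
            } catch (FileAlreadyExistsException e) {
                target = destDir.resolve(name + "_" + attempt);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("out");
        Path task1 = Files.createTempDirectory("task1");
        Path task2 = Files.createTempDirectory("task2");

        // Two tasks that happened to pick the same file name.
        Path f1 = Files.write(task1.resolve("12345"), "first".getBytes());
        Path f2 = Files.write(task2.resolve("12345"), "second".getBytes());

        System.out.println(moveWithoutOverwrite(f1, out).getFileName());
        System.out.println(moveWithoutOverwrite(f2, out).getFileName());
    }
}
```

Relying on the failed move rather than a separate exists() check keeps the test and the move in one step, which avoids a window in which another task could create the same name between the check and the rename.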
> I don't know if my approach is the best fix for the problem. If someone
> more knowledgeable than myself deems that it is, I will be happy to share
> what I have done and by that time I may have some information on the
> results.