Posted to issues@hbase.apache.org by "Jerry He (JIRA)" <ji...@apache.org> on 2014/03/30 01:04:16 UTC

[jira] [Commented] (HBASE-8073) HFileOutputFormat support for offline operation

    [ https://issues.apache.org/jira/browse/HBASE-8073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954518#comment-13954518 ] 

Jerry He commented on HBASE-8073:
---------------------------------

Hi, [~ndimiduk]

Sampling the input data to provide split points is a big plus, and will involve more work. 

What do you think of the following improvement, which would give an option to first get rid of the dependency on an online HBase?

Instead of relying on a live HTable obtained from the table name, we can just go to the file system to get the table descriptor and region infos.
From there we have all the info we need, e.g. compression, region boundaries, etc.  
In this case, we will only rely on hbase.rootdir and the table name.
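
Roughly, the lookup could be as simple as the sketch below (untested; I'm assuming FSTableDescriptors.getTableDescriptorFromFs and HRegionFileSystem.loadRegionInfoFileContent are accessible here, and skipping error handling):

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.regionserver.HRegionFileSystem;
import org.apache.hadoop.hbase.util.FSTableDescriptors;
import org.apache.hadoop.hbase.util.FSUtils;

public class OfflineTableInfo {

  /** Table descriptor (compression, bloom filters, ...) read straight from disk. */
  public static HTableDescriptor readDescriptor(Configuration conf, Path rootDir,
      TableName table) throws IOException {
    FileSystem fs = rootDir.getFileSystem(conf);
    return FSTableDescriptors.getTableDescriptorFromFs(fs, rootDir, table);
  }

  /** Region start keys collected from each region's .regioninfo file. */
  public static List<byte[]> readStartKeys(Configuration conf, Path rootDir,
      TableName table) throws IOException {
    FileSystem fs = rootDir.getFileSystem(conf);
    Path tableDir = FSUtils.getTableDir(rootDir, table);
    List<byte[]> startKeys = new ArrayList<byte[]>();
    for (FileStatus st : fs.listStatus(tableDir)) {
      // Skip non-region entries such as .tabledesc and .tmp.
      if (!st.isDirectory() || st.getPath().getName().startsWith(".")) {
        continue;
      }
      HRegionInfo hri = HRegionFileSystem.loadRegionInfoFileContent(fs, st.getPath());
      startKeys.add(hri.getStartKey());
    }
    return startKeys;
  }
}
{code}

With something like that, an offline variant of HFileOutputFormat.configureIncrementalLoad could take the root dir and table name instead of an HTable.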





> HFileOutputFormat support for offline operation
> -----------------------------------------------
>
>                 Key: HBASE-8073
>                 URL: https://issues.apache.org/jira/browse/HBASE-8073
>             Project: HBase
>          Issue Type: Sub-task
>          Components: mapreduce
>            Reporter: Nick Dimiduk
>
> When using HFileOutputFormat to generate HFiles, it inspects the region topology of the target table. The split points from that table are used to guide the TotalOrderPartitioner. If the target table does not exist, it is first created. This imposes an unnecessary dependence on an online HBase and existing table.
> If the table exists, it can be used. However, the job can be smarter. For example, if there's far more data going into the HFiles than the table currently contains, the table regions aren't very useful for data split points. Instead, the input data can be sampled to produce split points more meaningful to the dataset. LoadIncrementalHFiles is already capable of handling divergence between HFile boundaries and table regions, so this should not pose any additional burden at load time.
> The proper method of sampling the data likely requires a custom input format and an additional map-reduce job to perform the sampling. See a relevant implementation: https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch4/sampler/ReservoirSamplerInputFormat.java
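
For reference, the reservoir-sampling idea behind the linked input format, plus deriving split points from the sorted sample, boils down to something like this (a hypothetical standalone sketch, not the linked code; Bytes is org.apache.hadoop.hbase.util.Bytes):

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.hbase.util.Bytes;

/** Hypothetical: keep a uniform random sample of row keys, then derive split points. */
public class KeyReservoir {
  private final byte[][] reservoir;
  private final Random rng = new Random();
  private long seen = 0;

  public KeyReservoir(int size) {
    reservoir = new byte[size][];
  }

  /** Classic reservoir sampling: the i-th key survives with probability size/i. */
  public void add(byte[] rowKey) {
    seen++;
    if (seen <= reservoir.length) {
      reservoir[(int) (seen - 1)] = rowKey;
    } else {
      long slot = (long) (rng.nextDouble() * seen);
      if (slot < reservoir.length) {
        reservoir[(int) slot] = rowKey;
      }
    }
  }

  /** Sort the sample and pick numRegions - 1 evenly spaced keys as split points. */
  public List<byte[]> splitPoints(int numRegions) {
    int n = (int) Math.min(seen, reservoir.length);
    List<byte[]> splits = new ArrayList<byte[]>();
    if (n == 0) {
      return splits;
    }
    byte[][] sample = Arrays.copyOf(reservoir, n);
    Arrays.sort(sample, Bytes.BYTES_COMPARATOR);
    for (int i = 1; i < numRegions; i++) {
      splits.add(sample[i * n / numRegions]);
    }
    return splits;
  }
}
{code}

Writing numRegions - 1 such keys to the partition file consumed by TotalOrderPartitioner would replace the region-boundary lookup described above.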



--
This message was sent by Atlassian JIRA
(v6.2#6252)