You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Lars Hofhansl (JIRA)" <ji...@apache.org> on 2012/08/30 00:12:09 UTC

[jira] [Commented] (HBASE-3936) Incremental bulk load support for Increments

    [ https://issues.apache.org/jira/browse/HBASE-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13444470#comment-13444470 ] 

Lars Hofhansl commented on HBASE-3936:
--------------------------------------

This looks like a very specific usecase, Andy. Are you still interested in working on this? Should we keep it open?
                
> Incremental bulk load support for Increments
> --------------------------------------------
>
>                 Key: HBASE-3936
>                 URL: https://issues.apache.org/jira/browse/HBASE-3936
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>             Fix For: 0.96.0
>
>
> From http://hbase.apache.org/bulk-loads.html: "The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the data files into a running cluster. Using bulk load will use less CPU and network than going via the HBase API."
> I have been working with a specific implementation of, and can envision, a class of applications that reduce data into a large collection of counters, perhaps building projections of the data in many dimensions in the process. One can use Hadoop MapReduce as the engine to accomplish this for a given data set and use LoadIncrementalHFiles to move the result into place for live serving. MR is natural for summation over very large counter sets: emit counter increments for the data set and projections thereof in mappers, use combiners for partial aggregation, use reducers to do final summation into HFiles.
> However, it is not possible to then merge in a set of updates to an existing table built in the manner above without either 1) joining the table data and the update set into a large MR temporary set, followed by a complete rewrite of the table; or 2) posting all of the updates as Increments via the HBase API, impacting any other concurrent users of the HBase service, and perhaps taking 10-100 times longer than if updates could be computed directly into HFiles like the original import. Both of these alternatives are expensive in terms of CPU and time; one is also expensive in terms of disk.
> I propose adding incremental bulk load support for Increments. Here is a sketch of a possible implementation:
> * Add a KV type for Increment
> * Modify HFile main, LoadIncrementalHFiles, and others that work with HFiles directly to handle the new KV type
> * Bulk load API can move the files to be merged into the Stores as before.
> * Implement an alternate compaction algorithm or modify the existing. Need to identify Increments and apply them to an existing most recent version of a value, or create the value if it does not exist.
>   ** Use KeyValueHeap as is to merge value-sets by row as before.
>   ** For each row, use a KV-keyed Map for in memory update of values.
>   ** If there is an existing value and it is not a serialized long, ignore the Increment and log at INFO level.
>   ** Use the persistent HashMapWrapper from Hive's CommonJoinOperator, with an appropriate memory limit, so work for overlarge rows will spill to disk. Can be local disk, not HDFS.
> * Never return an Increment KV to a client doing a Get or Scan. 
>   ** Before the merge is complete, if we find an Increment KV when searching Store files for a value, continue searching back in the Store files until we find a Put KV for the value, adding up Increments as they are encountered, then applying them to the Put value; or until search ends, in which case the Increment is treated as a Put.
>   ** If there is an existing value and it is not a serialized long, ignore the Increment and log at INFO level.
> * As a beneficial side effect, with Increments as just another KV type we can unify Put and Increment handling.
> Because this is a core concern I'd prefer discussing this as a possible enhancement of core as opposed to a Coprocessor-based extension. However it could be possible to implement all but the KV changes within the Coprocessor framework.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira