Posted to dev@hbase.apache.org by "Andrew Kyle Purtell (Jira)" <ji...@apache.org> on 2022/06/12 18:51:00 UTC

[jira] [Resolved] (HBASE-3936) Incremental bulk load support for Increments

     [ https://issues.apache.org/jira/browse/HBASE-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Kyle Purtell resolved HBASE-3936.
----------------------------------------
    Resolution: Won't Fix

> Incremental bulk load support for Increments
> --------------------------------------------
>
>                 Key: HBASE-3936
>                 URL: https://issues.apache.org/jira/browse/HBASE-3936
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>
> From http://hbase.apache.org/bulk-loads.html: "The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the data files into a running cluster. Using bulk load will use less CPU and network than going via the HBase API."
> I have been working with one implementation of, and can envision a broader class of, applications that reduce data into a large collection of counters, perhaps building projections of the data in many dimensions in the process. One can use Hadoop MapReduce as the engine to accomplish this for a given data set and use LoadIncrementalHFiles to move the result into place for live serving. MR is a natural fit for summation over very large counter sets: emit counter increments for the data set and its projections in mappers, use combiners for partial aggregation, and use reducers to do the final summation into HFiles.
> However, it is not possible to then merge a set of updates into an existing table built in this manner without either 1) joining the table data and the update set into a large MR temporary set, followed by a complete rewrite of the table; or 2) posting all of the updates as Increments via the HBase API, impacting any other concurrent users of the HBase service, and perhaps taking 10-100 times longer than if the updates could be computed directly into HFiles like the original import. Both of these alternatives are expensive in terms of CPU and time; the first is also expensive in terms of disk.
> I propose adding incremental bulk load support for Increments. Here is a sketch of a possible implementation:
> * Add a KV type for Increment
> * Modify HFile main, LoadIncrementalHFiles, and others that work with HFiles directly to handle the new KV type
> * Bulk load API can move the files to be merged into the Stores as before.
> * Implement an alternate compaction algorithm or modify the existing one. It needs to identify Increments and apply them to the most recent existing version of a value, or create the value if it does not exist.
>   ** Use KeyValueHeap as is to merge value-sets by row as before.
>   ** For each row, use a KV-keyed Map for in-memory update of values.
>   ** If there is an existing value and it is not a serialized long, ignore the Increment and log at INFO level.
>   ** Use the persistent HashMapWrapper from Hive's CommonJoinOperator, with an appropriate memory limit, so that work for overlarge rows spills to disk. This can be local disk rather than HDFS.
> * Never return an Increment KV to a client doing a Get or Scan.
>   ** Before the merge is complete, if we find an Increment KV when searching Store files for a value, continue searching back through the Store files, adding up Increments as they are encountered, until we find a Put KV for the value, then apply the accumulated Increments to the Put value. If the search ends without finding a Put, the accumulated Increment is treated as a Put.
>   ** If there is an existing value and it is not a serialized long, ignore the Increment and log at INFO level.
> * As a beneficial side effect, with Increments as just another KV type we can unify Put and Increment handling.
> Because this is a core concern, I'd prefer discussing this as a possible enhancement of core as opposed to a Coprocessor-based extension. However, it could be possible to implement all but the KV changes within the Coprocessor framework.
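
The resolution rule described in the quoted proposal (sum Increments while searching back, apply them to the first Put found, treat them as a Put if none is found, and ignore them if the existing value is not a serialized long) can be illustrated with a minimal, self-contained Java sketch. It does not use any HBase classes: IncrementMergeSketch, Cell, Type, and resolve are hypothetical names invented for illustration, and the sketch assumes every Increment carries an 8-byte big-endian long and that the versions of one column arrive newest first.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical sketch only; none of these names exist in HBase. */
final class IncrementMergeSketch {
    private static final Logger LOG = Logger.getLogger("IncrementMergeSketch");

    /** Stand-in for the existing Put KV type and the proposed Increment KV type. */
    enum Type { PUT, INCREMENT }

    /** One version of a single column; a simplified stand-in for a KeyValue. */
    static final class Cell {
        final Type type;
        final byte[] value;
        Cell(Type type, byte[] value) { this.type = type; this.value = value; }
    }

    static Cell put(long v) { return new Cell(Type.PUT, toBytes(v)); }
    static Cell increment(long delta) { return new Cell(Type.INCREMENT, toBytes(delta)); }

    static byte[] toBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    static long toLong(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    /**
     * Resolves the versions of one column, ordered newest first, into the
     * value a client should see. Returns null if there are no versions.
     */
    static byte[] resolve(List<Cell> newestFirst) {
        long pending = 0;            // sum of Increments seen so far
        boolean sawIncrement = false;
        for (Cell c : newestFirst) {
            if (c.type == Type.INCREMENT) {
                pending += toLong(c.value);
                sawIncrement = true;
                continue;            // keep searching back for a Put
            }
            // The first Put ends the search.
            if (!sawIncrement) {
                return c.value;      // ordinary read, nothing to apply
            }
            if (c.value.length != Long.BYTES) {
                // Existing value is not a serialized long: ignore the Increments.
                LOG.info("Existing value is not a serialized long; ignoring Increments");
                return c.value;
            }
            return toBytes(toLong(c.value) + pending);
        }
        // Search ended without a Put: the summed Increments act as a Put.
        return sawIncrement ? toBytes(pending) : null;
    }
}
```

For example, resolving the versions Increment(2), Increment(5), Put(10), newest first, yields 17, and resolving Increment(3), Increment(4) with no underlying Put yields 7. A compaction following the proposal could apply the same fold per column and write the result out as a single Put, though how that would fit into the actual compaction code is not specified in the issue.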



--
This message was sent by Atlassian Jira
(v8.20.7#820007)