Posted to issues@hbase.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2010/05/20 21:03:17 UTC
[jira] Commented: (HBASE-1923) Bulk incremental load into an existing table

    [ https://issues.apache.org/jira/browse/HBASE-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869724#action_12869724 ] 

Todd Lipcon commented on HBASE-1923:
------------------------------------

h1. Basic Design:

h3. Changes to HFileOutputFormat:

Should only need changes during job initialization:

# Scan all regions of the table from .META.
# Configure TotalOrderPartitioner based on existing region key boundaries
# If the number of reducers > number of regions, we could
  (a) recursively split table until this is not true (degenerate case: incremental load into table with one row?)
  (b) simply split keyspace by taking the lexical "halfway" of the region (two HFiles go into one region in load stage)
  (c) add API to regionserver to get estimate of region midpoint (assuming that new data has similar distribution to old data)

I plan to do either (a) or (b) initially.
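
Option (b) can be sketched in a few lines. This is a minimal illustration in Python rather than HBase code: keys are plain byte strings, and {{lexical_midpoint}} and {{partition_boundaries}} are hypothetical helper names, not existing APIs. It assumes the region start keys have already been read from .META. and that an empty end key means "end of table". It ignores edge cases such as adjacent keys that differ only by trailing zero bytes.

```python
from typing import List


def lexical_midpoint(lo: bytes, hi: bytes) -> bytes:
    """Return a key roughly halfway between lo and hi in byte order.

    An empty hi is treated as "end of table" (all 0xff bytes).
    """
    # One extra byte of precision so that close keys still get a midpoint.
    n = max(len(lo), len(hi), 1) + 1
    lo_int = int.from_bytes(lo.ljust(n, b"\x00"), "big")
    if hi:
        hi_int = int.from_bytes(hi.ljust(n, b"\x00"), "big")
    else:
        hi_int = (1 << (8 * n)) - 1
    return ((lo_int + hi_int) // 2).to_bytes(n, "big")


def partition_boundaries(region_start_keys: List[bytes], num_reducers: int) -> List[bytes]:
    """Start with one partition per region, then halve partitions in turn
    until there are at least num_reducers partitions."""
    bounds = sorted(region_start_keys)  # assumed: first start key is b""
    i = 0
    while len(bounds) < num_reducers:
        lo = bounds[i]
        hi = bounds[i + 1] if i + 1 < len(bounds) else b""
        bounds.insert(i + 1, lexical_midpoint(lo, hi))
        i = (i + 2) % len(bounds)  # skip over the partition just created
    return bounds
```

With two regions starting at {{b""}} and {{b"m"}} and four reducers, this yields four sorted boundaries, two per region, so two HFiles would later go into each region in the load stage. All boundaries except the first would presumably be the split points handed to the partitioner.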

We should provide at least some sample code, if not good utility classes/methods to do this task.

h3. Job Running

Should be unaffected

h3. Data Loader

Note that the partitions output by the MR job no longer necessarily correspond to the region boundaries (regions could have split or merged). I think the algorithm looks like:

{code}
for each reducer output:
  inspect hfile to find lowest key and highest key
  look up region name/startkey/endkey corresponding to first key in hfile
  if HFile's low<->high is entirely contained within region's low<->high:
    send RPC to RS: loadIncremental(region name, "/path/to/hfile")
  else:
    # this is the inefficient path, if the region split during the MR job
    On the loading side, manually split the HFile into two physical HFiles
      in a tmp directory
    recurse on the split files
{code}
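
The loop above can be modelled on plain data to check the recursion. In this rough Python sketch an "HFile" is just a sorted list of row keys and a region is a (name, start, end) tuple; {{Region}}, {{find_region}} and {{plan_load}} are hypothetical names for illustration, and the real loader would issue the loadIncremental RPC and write physical split files instead of returning a plan. It assumes the regions are sorted, cover the whole keyspace, and use an empty end key for the last region.

```python
from bisect import bisect_left, bisect_right
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    name: str
    start: bytes  # inclusive
    end: bytes    # exclusive; b"" means end of table


def find_region(regions: List[Region], key: bytes) -> Region:
    """Return the region whose [start, end) range contains key."""
    starts = [r.start for r in regions]
    return regions[bisect_right(starts, key) - 1]


def plan_load(hfile: List[bytes], regions: List[Region]) -> List[Tuple[str, List[bytes]]]:
    """Return (region name, keys) pairs, splitting the file where it
    crosses a region boundary."""
    if not hfile:
        return []
    region = find_region(regions, hfile[0])
    if region.end == b"" or hfile[-1] < region.end:
        # Efficient path: the whole file fits inside one region.
        return [(region.name, hfile)]
    # Inefficient path: split at the region's end key and recurse.
    cut = bisect_left(hfile, region.end)
    return plan_load(hfile[:cut], regions) + plan_load(hfile[cut:], regions)
```

Because the first key always lies inside the region that was looked up, each split leaves two non-empty halves, so the recursion terminates.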

The "inefficient" path should occur in a minority of cases. In the future we can implement this path using reference files that would be cleaned up at the next compaction. I don't plan to do this in the first pass.


The above functionality would be implemented in a client-side script/program (currently a Ruby script, though I will probably just write it in Java).

h3. RegionServer Side

Need to implement the "loadIncremental" RPC. This function needs to do the following reasonably simple steps:
# ensure that the region being accessed is the same one being hosted (including timestamp, etc)
# move the HFile into the correct store directory
# briefly lock the storefile list and add the HFile

Probably need some other interaction with concurrent scanners, etc - will look at this carefully during implementation.
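
As a rough model of those three steps, the following Python sketch uses a local directory in place of the store directory and an in-memory list in place of the storefile list. {{RegionStore}} and {{load_incremental}} are hypothetical names, the region identity is compared as one opaque string, and interaction with concurrent scanners is not modelled at all.

```python
import os
import shutil
import threading
from typing import List


class RegionStore:
    """Toy stand-in for one store of a hosted region."""

    def __init__(self, region_name: str, store_dir: str) -> None:
        self.region_name = region_name
        self.store_dir = store_dir
        self.store_files: List[str] = []
        self._lock = threading.Lock()

    def load_incremental(self, region_name: str, hfile_path: str) -> str:
        # Step 1: refuse the load if the caller's view of the region is stale.
        if region_name != self.region_name:
            raise ValueError("region %r is not hosted here" % region_name)
        # Step 2: move the file into the store directory.
        dest = os.path.join(self.store_dir, os.path.basename(hfile_path))
        shutil.move(hfile_path, dest)
        # Step 3: hold the lock only long enough to publish the new file.
        with self._lock:
            self.store_files.append(dest)
        return dest
```

The identity check comes before the move so that a rejected request leaves the HFile where the client put it, which lets the loader look up the new region and retry.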


> Bulk incremental load into an existing table
> --------------------------------------------
>
>                 Key: HBASE-1923
>                 URL: https://issues.apache.org/jira/browse/HBASE-1923
>             Project: Hadoop HBase
>          Issue Type: New Feature
>          Components: client, mapred, regionserver, scripts
>    Affects Versions: 0.21.0
>            Reporter: anty.rao
>            Assignee: Todd Lipcon
>
> HBASE-48 is about bulk load of a new table; maybe it's more practical to bulk load against an existing table.
