Posted to issues@hbase.apache.org by "Nicolas Spiegelberg (JIRA)" <ji...@apache.org> on 2010/11/17 04:10:15 UTC

[jira] Commented: (HBASE-1861) Multi-Family support for bulk upload tools (HFileOutputFormat / loadtable.rb)

    [ https://issues.apache.org/jira/browse/HBASE-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932794#action_12932794 ] 

Nicolas Spiegelberg commented on HBASE-1861:
--------------------------------------------

IRC chat with Todd about this issue.  I figured that the areas I was confused about might confuse others as well.  Highlights:

1. Multi-family support is almost done, just need someone to verify (yay, me)
2. Remember that this code handles both the new-table case and the import (load into an existing table) case
3. HFile rolling (for HFileOutputFormat) only makes sense in the new table case
4. Splits are accounted for in LoadIncrementalHFiles
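
As a rough picture of the last highlight, here is a minimal Python sketch (not HBase code) of the behaviour Todd describes in the chat below: a file that straddles a region boundary is physically split at that boundary and the pieces go back on a queue.  Everything in it is an assumption made for illustration: an "hfile" is modelled as a sorted list of row keys, regions as a sorted list of start keys, and load_with_splits is a hypothetical name, not a method of LoadIncrementalHFiles.

```python
from bisect import bisect_left, bisect_right
from collections import deque


def load_with_splits(hfiles, region_start_keys):
    """Assign each hfile (a sorted list of row keys) to exactly one region.

    region_start_keys is the sorted list of region start keys; the first
    region is assumed to start at the empty key "". A file that crosses a
    region boundary is split at that boundary and both halves are re-queued.
    """
    queue = deque(hfiles)
    loaded = {}  # region start key -> list of hfiles loaded into that region
    while queue:
        rows = queue.popleft()
        if not rows:
            continue
        # Find the region that contains the first row of the file.
        idx = bisect_right(region_start_keys, rows[0]) - 1
        start = region_start_keys[idx]
        # End key of that region, or None if it is the last region.
        has_next = idx + 1 < len(region_start_keys)
        end = region_start_keys[idx + 1] if has_next else None
        if end is not None and rows[-1] >= end:
            # The file straddles the boundary: split it and re-queue both parts.
            cut = bisect_left(rows, end)
            queue.append(rows[:cut])
            queue.append(rows[cut:])
        else:
            loaded.setdefault(start, []).append(rows)
    return loaded
```

For example, with regions starting at "" and "m", a single file holding rows a, c, n, z ends up as one file per region.  In the multi-family case the same boundaries would have to be applied to every family's files, which is the "accounting" mentioned in the chat.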

--------------------------------------------------
[6:07pm] nspiegelberg: so, I'm trying to understand why multiple CF doesn't work already
[6:08pm] nspiegelberg: Also, since we're pre-splitting, I'm trying to find all the code where bulk importing tries to split for you
[6:08pm] tlipcon: i don't think it's that far off from working
[6:08pm] tlipcon: just needs a little "accounting" to make sure that the hfiles all split at the same boundaries
[6:08pm] tlipcon: I think it's just in LoadIncrementalHFiles
[6:08pm] tlipcon: HFOF itself will split on max.region.size boundaries I think
[6:09pm] nspiegelberg: from what I can tell, you're basically worried about when to roll HFiles and to make sure you don't roll them on the same Row
[6:09pm] tlipcon: right
[6:09pm] tlipcon: rolling HFiles is basically important when you underestimate the number of reducers you should be making
[6:10pm] nspiegelberg: don't you do 1 reducer/region?
[6:10pm] nspiegelberg: if there's no edits to a region, then that reducer is idle
[6:11pm] tlipcon: you're only thinking about pre-created case
[6:11pm] tlipcon: but HFOF also works for the new table case
[6:11pm] tlipcon: and with reducer skew in that case, you'd prefer one reducer to maybe make two regions
[6:12pm] nspiegelberg: wouldn't that be accomplished by doing a split in LoadIncrementalHFiles?
[6:12pm] tlipcon: yea, but that split is very slow
[6:12pm] tlipcon: it's a physical split
[6:12pm] tlipcon: it's only there to take care of the case where you've got some splits in between running the MR and loading the hfiles
[6:13pm] nspiegelberg: k.  it sounds like my life is simplified by pre-split regions
[6:13pm] nspiegelberg: just need to not mess up split case
[6:25pm] nspiegelberg: hey, I am still a little fuzzy on how you're handling the case where a split happens between configureIncrementalLoad() and bulkLoadHFile()
[6:26pm] tlipcon: nspiegelberg: the completebulkload (LoadIncrementalHFiles) deals with it
[6:26pm] tlipcon: it's ugly, it physically splits the hfile on the new boundary
[6:26pm] tlipcon: and adds the new ones to a queue
[6:28pm] nspiegelberg: so... if splitting doesn't happen until LoadIncrementalHFiles, why do you need to worry that you've hit a new row when you roll HFiles?
[6:29pm] nspiegelberg: it only makes sense to not roll HFiles until the next row when you want to use that as a split point
[6:30pm] tlipcon: it's for new tables
[6:30pm] tlipcon: for the actual incremental case, the rolling done by HFOF doesn't really buy you anything
[6:30pm] tlipcon: except that it minimizes the amount of work
[6:30pm] tlipcon: but for a new table, you might want 10 reducers, but maybe you have skew, so one of the reducers gets 5x as much data as the others
[6:31pm] tlipcon: it should still make regions that fit your region size
[6:31pm] nspiegelberg: well, in our case, we need to add a threshold to PutSortReducer so we don't try to put too many entries in an in-memory Map 
[6:31pm] nspiegelberg: so rolling the hfiles does make sense
[6:35pm] tlipcon: recall that reducer only runs per row
[6:35pm] tlipcon: it doesn't shove multiple rows in the map
[6:35pm] tlipcon: just multiple columns
[6:36pm] nspiegelberg: the Iterable<Put> is not a stream?
[6:36pm] • nspiegelberg is a MR n00b
[6:36pm] tlipcon: it is streamed, but all those puts are for the same row
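
To make the rolling discussion above concrete, here is a minimal Python sketch (not HBase code) of rolling output files only on row boundaries.  It assumes the input arrives one row at a time in sorted order, as the reducer does per the chat, and that a size threshold plays the role of the maximum region size.  The function name roll_on_row_boundaries and the max_bytes parameter are hypothetical.

```python
def roll_on_row_boundaries(rows, max_bytes):
    """Group (row_key, size_in_bytes) pairs into output files.

    rows must be sorted by row key with one entry per row. A new file is
    started only between rows, and only once the current file has reached
    max_bytes, so a single row is never spread across two files.
    """
    files = []
    current, current_size = [], 0
    for row_key, size in rows:
        # Roll before writing the next row if the current file is full.
        if current and current_size >= max_bytes:
            files.append(current)
            current, current_size = [], 0
        current.append(row_key)
        current_size += size
    if current:
        files.append(current)
    return files
```

Because each file starts on a fresh row, the first row key of every file is a usable split point for a new table, even when one skewed reducer produces several files.  For an existing table the boundaries are already fixed by the regions, which is why the rolling buys little there.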


> Multi-Family support for bulk upload tools (HFileOutputFormat / loadtable.rb)
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-1861
>                 URL: https://issues.apache.org/jira/browse/HBASE-1861
>             Project: HBase
>          Issue Type: Improvement
>          Components: mapreduce
>    Affects Versions: 0.20.0
>            Reporter: Jonathan Gray
>            Assignee: Nicolas Spiegelberg
>             Fix For: 0.92.0
>
>         Attachments: HBASE1861-incomplete.patch
>
>
> Add multi-family support to bulk upload tools from HBASE-48.
