Posted to user@hbase.apache.org by Jon Stewart <jo...@lightboxtechnologies.com> on 2011/05/24 23:26:08 UTC

One map task to two HFiles

I have a map task that's extracting documents from a flat file and
writing them into an HBase table as individual records; the key is
based off the path of the file (idempotent) but balances key-space
distribution with locality of reference. Additionally, I have a
secondary table where the key is the hash of a file's contents (e.g.,
MD5), and indexes back into the primary table (along with other data).
Rows aren't subject to deletion, which makes life easy.

I've successfully used HFileOutputFormat and KeyValueSortReducer on a
related task that prepopulates data into the secondary table and this
works great. I'd like to convert my extraction task over to writing
HFiles out in bulk, for both tables.

I have enough control over the keys for the primary table that the map
task could write rows to the primary table in order, making it
map-side only (assuming one HFile per task). The map task could then
emit KeyValue objects for the secondary hash table and let
HFileOutputFormat/KeyValueSortReducer do its thing.

The question is, how do I write an HFile from a map task?
HFile.Writer? What are the gotchas?

Thanks in advance,

Jon
-- 
Jon Stewart, Principal
(646) 719-0317 | jon@lightboxtechnologies.com | Arlington, VA

Re: One map task to two HFiles

Posted by Stack <st...@duboce.net>.
On Tue, May 24, 2011 at 9:04 PM, Stack <st...@duboce.net> wrote:
> Nothing fancier than open on map task init, append, append, append
> inside the map, and then be sure to close it in the map close.
> Study HFileOutputFormat I'd say.
>

But then also study the HFile loader script, LoadIncrementalHFiles.
Notice how it slices and dices the HFile output so that each piece
slots within existing region boundaries.
St.Ack
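
To make the "slot within existing region boundaries" idea concrete, here is a rough sketch of the bookkeeping involved: given the start keys of the existing regions, work out which region each sorted row key belongs to. This is not LoadIncrementalHFiles code. The class and method names are hypothetical, and plain Java strings stand in for HBase's byte[] row keys (HBase compares keys as unsigned bytes, which matches String ordering only for ASCII keys).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from HBase.
final class RegionSlotSketch {

    /**
     * Groups row keys by the region that would hold them.
     * regionStartKeys must contain "" (the start key of the first region).
     */
    static Map<String, List<String>> slot(List<String> regionStartKeys,
                                          List<String> sortedRowKeys) {
        TreeMap<String, List<String>> byRegion = new TreeMap<>();
        for (String start : regionStartKeys) {
            byRegion.put(start, new ArrayList<>());
        }
        for (String row : sortedRowKeys) {
            // The owning region is the one with the greatest start key <= row.
            String start = byRegion.floorKey(row);
            if (start == null) {
                throw new IllegalArgumentException("no region covers row: " + row);
            }
            byRegion.get(start).add(row);
        }
        return byRegion;
    }

    public static void main(String[] args) {
        Map<String, List<String>> slots = slot(
                List.of("", "g", "p"),
                List.of("apple", "grape", "kiwi", "plum"));
        slots.forEach((start, rows) ->
                System.out.println("region starting at '" + start + "': " + rows));
    }
}
```

A file whose keys all land in one group can be handed to that region as-is; a file that straddles a boundary is the case where the loader has to cut it up.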

Re: One map task to two HFiles

Posted by Stack <st...@duboce.net>.
On Tue, May 24, 2011 at 2:26 PM, Jon Stewart
<jo...@lightboxtechnologies.com> wrote:
> I have enough control over the keys for the primary table that the map
> task could write rows to the primary table in order, making it
> map-side only (assuming one HFile per task). The map task could then
> emit KeyValue objects for the secondary hash table and let
> HFileOutputFormat/KeyValueSortReducer do its thing.
>
> The question is, how do I write an HFile from a map task?
> HFile.Writer? What are the gotchas?
>

Nothing fancier than open on map task init, append, append, append
inside the map, and then be sure to close it in the map close.
Study HFileOutputFormat I'd say.

St.Ack
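
The open/append/close lifecycle described above can be sketched without any HBase dependency. Everything below is a hypothetical stand-in: OrderedWriter is not HFile.Writer, and ExtractTask only mirrors the setup/map/cleanup shape of a Hadoop Mapper. The one gotcha it tries to model is that an HFile writer expects keys in sorted order, so the sketch rejects a key that sorts before the previous one (compared as unsigned bytes, as HBase does).

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these classes are HBase or Hadoop code.
final class SortedAppendSketch {

    /** Stand-in for an HFile writer: accepts keys only in non-decreasing order. */
    static final class OrderedWriter implements Closeable {
        private final DataOutputStream out;
        private byte[] lastKey;

        OrderedWriter(OutputStream sink) {
            this.out = new DataOutputStream(sink);
        }

        void append(byte[] key, byte[] value) throws IOException {
            // Unsigned lexicographic comparison of the raw key bytes.
            if (lastKey != null && Arrays.compareUnsigned(lastKey, key) > 0) {
                throw new IOException("key out of order: " + new String(key, StandardCharsets.UTF_8));
            }
            // Length-prefixed key and value; a made-up layout, not the HFile format.
            out.writeInt(key.length);
            out.write(key);
            out.writeInt(value.length);
            out.write(value);
            lastKey = key.clone();
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }

    /** Mirrors a mapper's lifecycle: setup() once, map() per record, cleanup() once. */
    static final class ExtractTask {
        private final OutputStream sink;
        private OrderedWriter writer;

        ExtractTask(OutputStream sink) {
            this.sink = sink;
        }

        void setup() {
            writer = new OrderedWriter(sink);           // open on task init
        }

        void map(String rowKey, String document) throws IOException {
            writer.append(bytes(rowKey), bytes(document)); // append, append, append
        }

        void cleanup() throws IOException {
            writer.close();                              // be sure to close
        }
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ExtractTask task = new ExtractTask(sink);
        task.setup();
        task.map("a/1", "first document");
        task.map("a/2", "second document");
        task.cleanup();
        System.out.println("bytes written: " + sink.size());
    }
}
```

In a real job the writer would be opened against a per-task output path rather than an in-memory stream, and the resulting files would still have to go through the bulk-load step.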