Posted to user@hbase.apache.org by Leif Wickland <le...@gmail.com> on 2011/08/01 21:50:24 UTC

HFiles and MapReduce

A few questions about HFiles and MapReduce:

1. Is there any case where it's a bad idea to use HFileOutputFormat instead
of TableOutputFormat when writing to HBase from MapReduce?

2. What are the failure modes for LoadIncrementalHFiles.doBulkLoad?  Is it
possible some regions will be adopted and others fail?

3. I think I'd like to create HFiles as the output of my MapReduce, then use
the HFiles as the input to a MapReduce to calculate some new aggregates, and
then doBulkLoad on the HFiles.  Is there any easy way to use a directory of
HFiles as the input to a MapReduce?  Is this inadvisable?  It seems like
this would be a more sensible approach than scanning for columns with
timestamps in an interval to find the freshly written columns.

Thanks for any feedback.

Leif Wickland

Re: HFiles and MapReduce

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Mmmm I really need to learn Scala...

I would love to see a Java version of it, as past discussions on this
list show that people would use it.

Thanks,

J-D

On Tue, Aug 2, 2011 at 7:45 AM, Leif Wickland <le...@gmail.com> wrote:
> I ended up writing an HFileInputFormat in Scala.
> https://gist.github.com/1120311  Feel free to take a look and tell me if I
> did something obviously wrong.  Would you be interested in having a Java
> analog to add to the project?
>

Re: HFiles and MapReduce

Posted by Leif Wickland <le...@gmail.com>.
I ended up writing an HFileInputFormat in Scala.
https://gist.github.com/1120311  Feel free to take a look and tell me if I
did something obviously wrong.  Would you be interested in having a Java
analog to add to the project?

Re: HFiles and MapReduce

Posted by Jean-Daniel Cryans <jd...@apache.org>.
On Mon, Aug 1, 2011 at 2:15 PM, Leif Wickland <le...@gmail.com> wrote:
> Thanks for the reply, J-D.

Happy to help.

> Well, no, but I'm new to this stuff.

I can't think of any either :)

> What would recovery from that scenario look like?  Would the un-adopted
> HFiles remain in the directory that they were written to by the
> HFileOutputFormat?

Looking at the code, it doesn't seem like there's a rollback function,
so the un-adopted files would stay there. Any files that were
successfully loaded would already have been moved into their regions,
meaning you could just re-run the import on the same directory and it
would pick up the remaining files (provided the original error is
resolved).
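
For what it's worth, the re-run is just a second call to doBulkLoad on
the same output directory. A minimal, untested sketch; the table name
and path below are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class RerunBulkLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");     // made-up table name
    Path hfileDir = new Path("/tmp/hfile-output");   // dir HFileOutputFormat wrote to

    // Files that loaded on the first attempt were already moved into their
    // regions, so only the leftovers are still sitting in hfileDir and this
    // call simply picks up where the failed run stopped.
    new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
  }
}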

> It looks like Tatsuya Kawano offered to write an HFileInputFormat
> <http://mail-archives.apache.org/mod_mbox/hbase-dev/201101.mbox/%3CF02D6CF8-189A-4084-AD54-01EAC064C1CF@gmail.com%3E>
> back in January. Does anyone know if he ended up sharing that?

Not me.

J-D

Re: HFiles and MapReduce

Posted by Leif Wickland <le...@gmail.com>.
Thanks for the reply, J-D.


> > 1. Is there any case where it's a bad idea to use HFileOutputFormat
> instead
> > of TableOutputFormat when writing to HBase from MapReduce?
>
> Can you think of any?
>

Well, no, but I'm new to this stuff.


> That process isn't atomic, so yes, you could end up with a region
> failing for some reason (network issues, a bug, whatever). My
> understanding of the code is that it fails and returns immediately
> after any IOException.
>

What would recovery from that scenario look like?  Would the un-adopted
HFiles remain in the directory that they were written to by the
HFileOutputFormat?



> You'd need to write an HFileInputFormat, that's pretty much it.
>

It looks like Tatsuya Kawano offered to write an HFileInputFormat
<http://mail-archives.apache.org/mod_mbox/hbase-dev/201101.mbox/%3CF02D6CF8-189A-4084-AD54-01EAC064C1CF@gmail.com%3E>
back in January. Does anyone know if he ended up sharing that?

Re: HFiles and MapReduce

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Inline.

J-D

On Mon, Aug 1, 2011 at 12:50 PM, Leif Wickland <le...@gmail.com> wrote:
> 1. Is there any case where it's a bad idea to use HFileOutputFormat instead
> of TableOutputFormat when writing to HBase from MapReduce?

Can you think of any?

>
> 2. What are the failure modes for LoadIncrementalHFiles.doBulkLoad?  Is it
> possible some regions will be adopted and others fail?

That process isn't atomic, so yes, you could end up with a region
failing for some reason (network issues, a bug, whatever). My
understanding of the code is that it fails and returns immediately
after any IOException.
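
For context, here is roughly what the whole write-then-load flow looks
like. This is only a sketch, not code from the project: the mapper, the
table name, the column family and the paths are all made-up
placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

  // Made-up mapper: reads "rowkey<TAB>value" lines and writes each value
  // to family "f", qualifier "q".
  public static class LineToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");      // made-up table name
    Path out = new Path("/tmp/hfile-output");         // made-up paths

    Job job = new Job(conf, "write-hfiles");
    job.setJarByClass(BulkLoadSketch.class);
    job.setMapperClass(LineToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path("/tmp/input"));
    FileOutputFormat.setOutputPath(job, out);

    // Sets the reducer, total-order partitioner and output format so the job
    // writes one set of HFiles per existing region of the table.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Not atomic: files are handed to the region servers region by region,
      // and an IOException on one region makes the whole call throw, leaving
      // earlier regions loaded and the remaining HFiles still under 'out'.
      new LoadIncrementalHFiles(conf).doBulkLoad(out, table);
    }
  }
}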

>
> 3. I think I'd like to create HFiles as the output of my MapReduce, then use
> the HFiles as the input to a MapReduce to calculate some new aggregates, and
> then doBulkLoad on the HFiles.  Is there any easy way to use a directory of
> HFiles as the input to a MapReduce?  Is this inadvisable?  It seems like
> this would be a more sensible approach than scanning for columns with
> timestamps in an interval to find the freshly written columns.

You'd need to write an HFileInputFormat, that's pretty much it.
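
A rough, untested Java sketch of what such an HFileInputFormat could
look like, walking each HFile with HFileScanner and emitting (row key,
KeyValue) pairs. Note the reader-construction API differs between HBase
versions; the 0.92-style HFile.createReader is assumed here.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class HFileInputFormat
    extends FileInputFormat<ImmutableBytesWritable, KeyValue> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // hand each HFile to a single map task
  }

  @Override
  public RecordReader<ImmutableBytesWritable, KeyValue> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new HFileRecordReader();
  }

  private static class HFileRecordReader
      extends RecordReader<ImmutableBytesWritable, KeyValue> {

    private HFile.Reader reader;
    private HFileScanner scanner;
    private boolean seeked = false;
    private boolean hasFirst = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException {
      Path path = ((FileSplit) split).getPath();
      Configuration conf = context.getConfiguration();
      FileSystem fs = path.getFileSystem(conf);
      // 0.92-style reader construction; older releases build HFile.Reader directly.
      reader = HFile.createReader(fs, path, new CacheConfig(conf));
      reader.loadFileInfo();
      scanner = reader.getScanner(false, false);  // no block cache, no pread
      hasFirst = scanner.seekTo();                // position on the first KeyValue
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (!seeked) {          // first call: seekTo() already positioned us
        seeked = true;
        return hasFirst;
      }
      return scanner.next();  // advance to the following KeyValue
    }

    @Override
    public ImmutableBytesWritable getCurrentKey() {
      return new ImmutableBytesWritable(scanner.getKeyValue().getRow());
    }

    @Override
    public KeyValue getCurrentValue() {
      return scanner.getKeyValue();
    }

    @Override
    public float getProgress() {
      return 0;  // progress reporting omitted in this sketch
    }

    @Override
    public void close() throws IOException {
      if (reader != null) {
        reader.close();
      }
    }
  }
}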