Posted to user@hbase.apache.org by Jason Rutherglen <ja...@gmail.com> on 2011/05/06 21:27:20 UTC

MapReduce job reading directly from the HBase files in HDFS

Is there an issue open or any particular reason that an MR job needs to access
the HBase data directly from the region server? It seems possible to also
provide functionality such that MR can execute over the HFile(s) stored in
HDFS, thereby giving performance comparable to typical MR jobs that execute
against files in HDFS.

Jason

Re: MapReduce job reading directly from the HBase files in HDFS

Posted by Jason Rutherglen <ja...@gmail.com>.
Right, that's to be expected with bulk batch jobs. The alternative is
keeping duplicate files in HDFS, which are not easy to create or manage.
Snapshots in the HBase format will be fine.
On May 6, 2011 5:19 PM, "Bill Graham" <bi...@gmail.com> wrote:
> One big reason is that there will be updates in the memory store that
> aren't yet written to HFiles. You'll miss these.
>
> On Fri, May 6, 2011 at 12:27 PM, Jason Rutherglen <jason.rutherglen@gmail.com> wrote:
>
>> Is there an issue open or any particular reason that an MR job needs to
>> access the HBase data directly from the region server? It seems possible
>> to also provide functionality such that MR can execute over the HFile(s)
>> stored in HDFS, thereby giving performance comparable to typical MR jobs
>> that execute against files in HDFS.
>>
>> Jason
>>

Re: MapReduce job reading directly from the HBase files in HDFS

Posted by Bill Graham <bi...@gmail.com>.
One big reason is that there will be updates in the memory store that aren't
yet written to HFiles. You'll miss these.
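
To make the point above concrete, here is a minimal toy sketch in Python. It is not HBase code: `ToyStore`, `read_files_only`, and `scan` are hypothetical names invented for illustration, and plain dicts stand in for the memory store and the HFiles. It only models the idea that recent writes sit in memory until a flush, so a job that reads the flushed files alone sees a stale view.

```python
class ToyStore:
    """Toy model of one store: an in-memory buffer plus flushed files.

    Hypothetical illustration only; real HBase stores are far more involved.
    """

    def __init__(self):
        self.memstore = {}  # recent writes that are not yet on disk
        self.hfiles = []    # flushed, immutable files, oldest first

    def put(self, row, value):
        # Writes land in memory first.
        self.memstore[row] = value

    def flush(self):
        # Persist the in-memory buffer as a new file, then empty the buffer.
        if self.memstore:
            self.hfiles.append(dict(self.memstore))
            self.memstore.clear()

    def read_files_only(self):
        # What a job reading only the flushed files would see.
        merged = {}
        for hfile in self.hfiles:
            merged.update(hfile)  # newer files override older ones
        return merged

    def scan(self):
        # What a read through the server would see: files plus memory.
        merged = self.read_files_only()
        merged.update(self.memstore)
        return merged


store = ToyStore()
store.put("row1", "a")
store.flush()            # row1=a is now in a file
store.put("row2", "b")   # still only in memory
store.put("row1", "a2")  # update, still only in memory

files_view = store.read_files_only()   # misses row2 and the row1 update
server_view = store.scan()             # sees everything

store.flush()
after_flush_view = store.read_files_only()  # now matches server_view
```

In this toy model the file-only view catches up once everything has been flushed, which is roughly why reading a flushed, point-in-time copy of the files can be acceptable for batch jobs that tolerate slightly stale data.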

On Fri, May 6, 2011 at 12:27 PM, Jason Rutherglen <jason.rutherglen@gmail.com> wrote:

> Is there an issue open or any particular reason that an MR job needs to
> access the HBase data directly from the region server? It seems possible
> to also provide functionality such that MR can execute over the HFile(s)
> stored in HDFS, thereby giving performance comparable to typical MR jobs
> that execute against files in HDFS.
>
> Jason
>