Posted to user@hbase.apache.org by Saptarshi Guha <sa...@gmail.com> on 2010/11/18 05:30:48 UTC

TableInputFormat vs. a map of table regions (data locality)

Hello,

I'm fairly new to HBase and would appreciate your comments.

[1] One way to compute across an HBase dataset would be to run as many
maps as there are regions and, for each map, run a scan across that
region's row limits (within the map method). This approach does not use
TableInputFormat. In the reduce (if needed), write directly to the table
using Put.


[2] In the *second* approach I could use the TableInputFormat and
TableOutputFormat.
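
For concreteness, here is a rough sketch of how I picture approach [2]
being wired up with TableMapReduceUtil (the table name, column family
and mapper/reducer logic below are placeholders, not real code of mine):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class Approach2Sketch {

  // One map call per row handed out by TableInputFormat; the framework
  // creates one split per region, so the rows of a split come from a
  // single region.
  static class RowMapper extends TableMapper<ImmutableBytesWritable, Result> {
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(row, value); // identity map; real per-row logic goes here
    }
  }

  // The reducer emits Puts; TableOutputFormat writes them to the table.
  static class RowReducer
      extends TableReducer<ImmutableBytesWritable, Result, ImmutableBytesWritable> {
    protected void reduce(ImmutableBytesWritable key, Iterable<Result> values,
        Context ctx) throws IOException, InterruptedException {
      long count = 0;
      for (Result r : values) {
        count++;
      }
      Put put = new Put(key.get());
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(count));
      ctx.write(key, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "approach-2-sketch");
    job.setJarByClass(Approach2Sketch.class);

    Scan scan = new Scan();        // whole table; split per region by TIF
    scan.setCaching(500);
    scan.setCacheBlocks(false);    // avoid churning the block cache from MR

    TableMapReduceUtil.initTableMapperJob("mytable", scan, RowMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    TableMapReduceUtil.initTableReducerJob("mytable", RowReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}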

My hypotheses:

H1: As for TableOutputFormat, I think the two approaches are equivalent
performance-wise. Correct me if I'm wrong.

H2: As for TableInputFormat vs. approach [1]: a quick glance through the
TableSplit source reveals location information. At first blush I can
imagine that in approach [1] I scan from row_start to row_end while all
of that data resides on a machine different from the compute node on
which the split is being run. Since TableInputFormat (approach [2]) uses
region information, my guess (not sure at all) is that Hadoop MapReduce
will assign the computation to the node where the region lies, so when
the scan is issued the queries run against local data, achieving data
locality. So it makes sense to take advantage of (at the least) the
TableSplit information.
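
For instance, each TableSplit appears to carry the region's row range
together with the name of the region server hosting it, which is what
the scheduler would use for placement (the hostname and table below are
made up, and the exact constructor varies by HBase version):

import org.apache.hadoop.hbase.mapreduce.TableSplit;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitLocalitySketch {
  public static void main(String[] args) {
    // One split per region: [startRow, endRow) plus the hosting
    // region server; the framework tries to run the map task there.
    TableSplit split = new TableSplit(Bytes.toBytes("mytable"),
        Bytes.toBytes("row_start"), Bytes.toBytes("row_end"),
        "regionserver1.example.com");
    System.out.println(split.getLocations()[0]);  // regionserver1.example.com
  }
}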

Are my hypotheses correct?

Thanks
Joy

Re: TableInputFormat vs. a map of table regions (data locality)

Posted by Saptarshi Guha <sa...@gmail.com>.
Hi Lars,

Perfect. Thanks for confirming. I have some existing code to which I
want to add HBase support with minimal modifications to the original
code base. I think I need to provide an InputFormat that produces
TableSplits.
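
Something along these lines is what I had in mind, extending
TableInputFormatBase so that the region-aware splits (and their
locality hints) come for free. Just a sketch; the table name and the
configuration call are placeholders and may differ by HBase version:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableInputFormatBase;

// Reuses the stock split logic: one TableSplit per region, each
// carrying the hosting region server for locality-aware placement.
public class MyTableInputFormat extends TableInputFormatBase {
  public MyTableInputFormat() {
    try {
      setHTable(new HTable(HBaseConfiguration.create(), "mytable"));
      setScan(new Scan());  // narrow with start/stop rows, families, filters
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}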

On a side note, I feel the key and value types in the map, reduce, and
record reader methods should be interfaces and not classes (I guess
there is a reason for the change). Keys/values should conform to a
contract, but do they need to sit in a class hierarchy?

Cheers
Joy



Re: TableInputFormat vs. a map of table regions (data locality)

Posted by Lars George <la...@gmail.com>.
Hi Joy,

[1] is what [2] does. They are just a thin wrapper around the raw API.

And as Alex pointed out and you noticed too, [2] adds the benefit of
locality support. If you were to add that to [1], you would have [2].
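
Per split, the record reader boils down to the raw client calls you
would write yourself in [1], roughly like this (table name and row keys
are made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RawScanSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    // The split's start/end rows are the region boundaries.
    Scan scan = new Scan(Bytes.toBytes("row_start"), Bytes.toBytes("row_end"));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // per-row "map" logic goes here
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}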

Lars


Re: TableInputFormat vs. a map of table regions (data locality)

Posted by Alex Baranau <al...@gmail.com>.
What are the benefits you are looking for with the first option?
With TableInputFormat it'll start as many map tasks as you have regions,
and data processing will benefit from data locality. From the javadoc (
http://hbase.apache.org/docs/r0.20.6/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
):

"Reading from hbase, the TableInputFormat asks hbase for the list of regions
and makes a map-per-region or mapred.map.tasks maps, whichever is
smaller[...]. Maps will run on the adjacent TaskTracker if you are running a
TaskTracker and RegionServer per node."

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
