Posted to user@hbase.apache.org by sdnetwork <sd...@gmail.com> on 2012/04/04 13:41:12 UTC

hbase map/reduce questions

Hello,

I have started working with Hadoop / HBase, and I have a question about how
map/reduce work on an HTable is distributed across the different nodes of the
cluster.

If I understand correctly, the input is split by region (TableInputFormat), and
each map task is executed on the node that hosts the corresponding region.

But a row is always stored in a single region, so if I implement a custom
org.apache.hadoop.mapreduce.InputFormat that splits a single row and column
family (passed as parameters), will the job be executed on a single node
regardless of the number of column qualifiers?

If this is true I must change my data schema, or maybe I can manually
distribute the job across the cluster.

I cannot find documentation that clearly explains how the map tasks are
distributed across the cluster.
Does somebody have a pointer?

Thanks in advance.
-- 
View this message in context: http://old.nabble.com/hbase-map-reduce-questions-tp33554779p33554779.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: hbase map/reduce questions

Posted by arnaud but <sd...@gmail.com>.
Thank you very much, I will take a look at these links, but I think I
understand now; in fact, I did not know the role that getLocations() plays
in the distribution of the map tasks.

On 09/04/2012 19:45, Suraj Varma wrote:
> Take a look at InputSplit:
> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapreduce/InputSplit.java#InputSplit.getLocations%28%29
>
> Then take a look at how TableSplit is implemented (getLocations method
> in particular):
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.5/org/apache/hadoop/hbase/mapreduce/TableSplit.java#TableSplit.getLocations%28%29
>
> Also look at TableInputFormatBase#getSplits method to see how the
> region locations are populated.
>
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.4/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#TableInputFormatBase.getSplits%28org.apache.hadoop.hbase.mapreduce.JobContext%29
>
> In your case, if you want to run your maps on all available nodes
> regardless of the fact that only two of those nodes contain your
> regions ... you would implement a custom InputSplit that returns an
> empty String[] in the getLocations() method.
> --Suraj



Re: hbase map/reduce questions

Posted by Suraj Varma <sv...@gmail.com>.
Take a look at InputSplit:
http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapreduce/InputSplit.java#InputSplit.getLocations%28%29

Then take a look at how TableSplit is implemented (getLocations method
in particular):
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.5/org/apache/hadoop/hbase/mapreduce/TableSplit.java#TableSplit.getLocations%28%29

Also look at TableInputFormatBase#getSplits method to see how the
region locations are populated.

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.4/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#TableInputFormatBase.getSplits%28org.apache.hadoop.hbase.mapreduce.JobContext%29

In your case, if you want to run your maps on all available nodes
regardless of the fact that only two of those nodes contain your
regions ... you would implement a custom InputSplit that returns an
empty String[] in the getLocations() method.
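To make the mechanism concrete, here is a small self-contained toy model (plain Java; the Split class and the round-robin scheduler are invented for illustration and are not the real Hadoop API) of how a scheduler can use getLocations(): a split that advertises hosts is placed data-locally, while a split that returns an empty array can be placed on any node.

```java
import java.util.*;

// Toy model of locality-aware task placement (NOT the Hadoop scheduler):
// each split advertises preferred hosts via getLocations(); the scheduler
// honors the preference when present and falls back to round-robin over
// all nodes when the array is empty.
public class LocalitySketch {

    static class Split {
        private final String[] locations;
        Split(String... locations) { this.locations = locations; }
        String[] getLocations() { return locations; }
    }

    // Assign each split to a node, preferring its advertised locations.
    static List<String> schedule(List<Split> splits, List<String> nodes) {
        List<String> assignment = new ArrayList<>();
        int cursor = 0; // round-robin cursor for "run anywhere" splits
        for (Split s : splits) {
            String[] prefs = s.getLocations();
            if (prefs.length > 0) {
                assignment.add(prefs[0]);                           // data-local
            } else {
                assignment.add(nodes.get(cursor++ % nodes.size())); // any node
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node1", "node2", "node3", "node4");

        // Default TableSplit-like behavior: each split names its region server.
        List<Split> regionLocal = Arrays.asList(
                new Split("node1"), new Split("node1"), new Split("node2"));
        System.out.println(schedule(regionLocal, nodes)); // [node1, node1, node2]

        // Custom split with empty getLocations(): spread over the whole cluster.
        List<Split> anywhere = Arrays.asList(
                new Split(), new Split(), new Split(), new Split());
        System.out.println(schedule(anywhere, nodes)); // [node1, node2, node3, node4]
    }
}
```

In real code the same effect would come from an InputSplit subclass whose getLocations() returns new String[0], as described above.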
--Suraj

On Mon, Apr 9, 2012 at 1:29 AM, arnaud but <sd...@gmail.com> wrote:
> ok thanks,
>
>
>> Yes - if you do a custom split, and have sufficient map slots in your
>> cluster
>
> If I understand well, even if the rows are stored on only two nodes of my
> cluster, I can distribute the "map tasks" to the other nodes?
>
> E.g.: I have 10 nodes in the cluster, and I write a custom split that splits
> every 100 rows.
> All rows are stored on only two nodes; my map/reduce job generates 10 map
> tasks because I have 1000 rows.
> Will all nodes receive a map task to execute, or only the two
> nodes where the 1000 rows are stored?
>
>
>> you can parallelize the map tasks to run on other nodes as
>> well
>
> How can I do that? I do not see how I can programmatically specify that this
> split will be executed on a given node.

Re: hbase map/reduce questions

Posted by arnaud but <sd...@gmail.com>.
ok thanks,

 > Yes - if you do a custom split, and have sufficient map slots in your
 > cluster

If I understand well, even if the rows are stored on only two nodes of
my cluster, I can distribute the "map tasks" to the other nodes?

E.g.: I have 10 nodes in the cluster, and I write a custom split that splits
every 100 rows.
All rows are stored on only two nodes; my map/reduce job generates 10
map tasks because I have 1000 rows.
Will all nodes receive a map task to execute, or only the two
nodes where the 1000 rows are stored?

 > you can parallelize the map tasks to run on other nodes as
 > well

How can I do that? I do not see how I can programmatically specify that this
split will be executed on a given node.

On 08/04/2012 18:37, Suraj Varma wrote:
>> if i do a custom input that split the table by 100 rows, can i
>> distribute manually each part  on a node   regardless where the data
>> is ?
>
> Yes - if you do a custom split, and have sufficient map slots in your
> cluster, you can parallelize the map tasks to run on other nodes as
> well. But if you are using HBase as the sink / source, these map tasks
> will still reach back to the region server node holding that row. So -
> if you have all your rows in two nodes, all the map tasks will still
> reach out to those two nodes. Depending on what your map tasks are
> doing (intensive crunching vs I/O) this may or may not help with what
> you are doing.
> --Suraj



Re: hbase map/reduce questions

Posted by Suraj Varma <sv...@gmail.com>.
> If I write a custom input format that splits the table every 100 rows, can I
> manually assign each part to a node, regardless of where the data
> is?

Yes - if you do a custom split, and have sufficient map slots in your
cluster, you can parallelize the map tasks to run on other nodes as
well. But if you are using HBase as the sink / source, these map tasks
will still reach back to the region server node holding that row. So -
if you have all your rows in two nodes, all the map tasks will still
reach out to those two nodes. Depending on what your map tasks are
doing (intensive crunching vs I/O) this may or may not help with what
you are doing.
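The caveat above can be sketched with another toy model (invented names, not HBase code): wherever a map task runs, each row read is served by the region server that owns the row's region, so with all regions on two servers the read load still concentrates on those two servers even if the tasks are spread over ten nodes.

```java
import java.util.*;

// Toy model of HBase-as-source (NOT real HBase code): reads always go to
// the region server owning the row, regardless of where the map task runs.
public class RegionReadSketch {

    // Count how many reads each region server receives.
    static Map<String, Integer> readLoad(Map<String, String> rowOwner,
                                         List<String> rowsRead) {
        Map<String, Integer> load = new TreeMap<>(); // sorted for stable output
        for (String row : rowsRead) {
            load.merge(rowOwner.get(row), 1, Integer::sum);
        }
        return load;
    }

    public static void main(String[] args) {
        // 1000 rows whose regions are hosted by only two region servers.
        Map<String, String> owner = new HashMap<>();
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            String row = "row" + i;
            owner.put(row, (i % 2 == 0) ? "rs1" : "rs2");
            rows.add(row);
        }
        // Ten map tasks of 100 rows each may run on ten different nodes,
        // but every read still lands on rs1 or rs2.
        System.out.println(readLoad(owner, rows)); // {rs1=500, rs2=500}
    }
}
```

So spreading the tasks mainly helps CPU-bound maps; I/O-bound maps stay bottlenecked on the two region servers.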
--Suraj



On Thu, Apr 5, 2012 at 6:44 AM, Arnaud Le-roy <sd...@gmail.com> wrote:
> Yes, I know, but it is just an example; we could do the same example with
> one billion rows, although admittedly in that case the rows
> would be stored on all nodes.
>
> Maybe it is not possible to distribute the tasks manually across the cluster?
> And maybe it is not a good idea, but I would like to know, in order to
> design the best schema for my data.

Re: hbase map/reduce questions

Posted by Arnaud Le-roy <sd...@gmail.com>.
Yes, I know, but it is just an example; we could do the same example with
one billion rows, although admittedly in that case the rows
would be stored on all nodes.

Maybe it is not possible to distribute the tasks manually across the cluster?
And maybe it is not a good idea, but I would like to know, in order to
design the best schema for my data.

On 5 April 2012 at 15:08, Doug Meil <do...@explorysmedical.com> wrote:
>
> If you only have 1000 rows, why use MapReduce?

Re: hbase map/reduce questions

Posted by Doug Meil <do...@explorysmedical.com>.
If you only have 1000 rows, why use MapReduce?





On 4/5/12 6:37 AM, "Arnaud Le-roy" <sd...@gmail.com> wrote:

>But do you think that I can change the default behavior?
>
>For example, I have ten nodes in my cluster, and my table is stored on only
>two nodes; this table has 1000 rows.
>With the default behavior, only those two nodes will work on a map/reduce
>job, won't they?
>
>If I write a custom input format that splits the table every 100 rows, can I
>manually assign each part to a node, regardless of where the data
>is?



Re: hbase map/reduce questions

Posted by Arnaud Le-roy <sd...@gmail.com>.
But do you think that I can change the default behavior?

For example, I have ten nodes in my cluster, and my table is stored on only
two nodes; this table has 1000 rows.
With the default behavior, only those two nodes will work on a map/reduce
job, won't they?

If I write a custom input format that splits the table every 100 rows, can I
manually assign each part to a node, regardless of where the data
is?

On 5 April 2012 at 00:36, Doug Meil <do...@explorysmedical.com> wrote:
>
> The default behavior is that the input splits are where the data is stored.

Re: hbase map/reduce questions

Posted by Doug Meil <do...@explorysmedical.com>.
The default behavior is that the input splits are where the data is stored.




On 4/4/12 5:24 PM, "sdnetwork" <sd...@gmail.com> wrote:

>ok thanks,
>
>but I cannot find the information that tells me how the result of the split
>is distributed across the different nodes of the cluster:
>
>1) randomly?
>2) where the data is stored?



Re: hbase map/reduce questions

Posted by sdnetwork <sd...@gmail.com>.
ok thanks,

but I cannot find the information that tells me how the result of the split is
distributed across the different nodes of the cluster:

1) randomly?
2) where the data is stored?





Re: hbase map/reduce questions

Posted by Doug Meil <do...@explorysmedical.com>.
Hi there, you probably want to see this..

http://hbase.apache.org/book.html#splitter

... as well as this...

http://hbase.apache.org/book.html#regions.arch.locality

... as the latter describes data locality.




On 4/4/12 7:41 AM, "sdnetwork" <sd...@gmail.com> wrote:

>
>Hello,
>
>I have started working with Hadoop / HBase, and I have a question about how
>map/reduce work on an HTable is distributed across the different nodes of the
>cluster.
>
>If I understand correctly, the input is split by region (TableInputFormat),
>and each map task is executed on the node that hosts the corresponding region.
>
>But a row is always stored in a single region, so if I implement a custom
>org.apache.hadoop.mapreduce.InputFormat that splits a single row and column
>family (passed as parameters), will the job be executed on a single node
>regardless of the number of column qualifiers?
>
>If this is true I must change my data schema, or maybe I can manually
>distribute the job across the cluster.
>
>I cannot find documentation that clearly explains how the map tasks are
>distributed across the cluster.
>Does somebody have a pointer?
>
>Thanks in advance.