Posted to user@hbase.apache.org by john smith <js...@gmail.com> on 2009/08/20 18:42:36 UTC

Doubt in HBase

Hi all,

I have one small doubt. Kindly answer it even if it sounds silly.

I am using MapReduce with HBase in distributed mode. I have a table which
spans 5 region servers. I am using TableInputFormat to read the data
from the table in the map phase. When I run the program, how many map tasks
are created by default? Is it one per region server, or more?

Also, after the map tasks are over, the reduce task takes a bit more time. Is
it due to moving the map output across the regionservers, i.e., moving the
values of the same key to a particular reducer before it can start? Is
there any way I can optimize the code (e.g. by storing the data for the same
reducer nearby)?

Thanks :)

Re: Doubt in HBase

Posted by Jonathan Gray <jl...@streamy.com>.

What Amandeep said.

Also, one clarification for you.  You mentioned the reduce task moving 
map output across regionservers.  Remember, HBase is just a MapReduce 
input source or output sink.  The sort/shuffle/reduce is a part of 
Hadoop MapReduce and has nothing to do with HBase directly.  It is 
utilizing the JobTracker/TaskTrackers, not the RegionServers.

Like AK said, you can increase the number of reducers, or reduce the 
amount of data you output from the maps.
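
One common way to reduce the map output is to aggregate inside the map
task before emitting anything, which is roughly what a Hadoop combiner
does. The sketch below is plain Python, not Hadoop or HBase code; the
function names and the sample records are made up for illustration, and
it assumes the reduce step is a simple sum, so partial sums are safe.

```python
from collections import Counter


def map_naive(records):
    """Emit one (key, 1) pair per input record."""
    return [(key, 1) for key in records]


def map_with_local_aggregation(records):
    """Sum counts per key inside the map task, then emit one pair per key."""
    return sorted(Counter(records).items())


# Hypothetical input seen by a single map task.
records = ["a", "b", "a", "c", "a", "b"]

naive = map_naive(records)
combined = map_with_local_aggregation(records)

print(len(naive), "pairs shuffled without aggregation")
print(len(combined), "pairs shuffled with aggregation:", combined)
```

Fewer pairs leaving each map means less data to sort and copy during the
shuffle. This only applies when partial results can be merged later
without changing the answer.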

JG

Re: Doubt in HBase

Posted by Amandeep Khurana <am...@gmail.com>.

On Thu, Aug 20, 2009 at 9:42 AM, john smith <js...@gmail.com> wrote:

> Hi all,
>
> I have one small doubt. Kindly answer it even if it sounds silly.
>

No questions are silly. Don't worry.

>
> I am using MapReduce with HBase in distributed mode. I have a table which
> spans 5 region servers. I am using TableInputFormat to read the data
> from the table in the map phase. When I run the program, how many map tasks
> are created by default? Is it one per region server, or more?
>

If you set the number of map tasks to a high number, TableInputFormat
automatically spawns one map task for each region (not region server).
Otherwise, it'll spawn the number you have explicitly specified in the job.
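
A rough sketch of that rule, in plain Python rather than the HBase API.
The function name plan_map_tasks, the region start keys, and the server
names are all made up for illustration, and the min() cap is an
assumption based on the description above, not copied from the
TableInputFormat source.

```python
def plan_map_tasks(region_start_keys, requested_maps):
    """Return how many map tasks run, per the rule described above.

    Assumption: the job gets one map task per region unless fewer
    were explicitly requested, in which case the request wins.
    """
    return min(len(region_start_keys), requested_maps)


# Hypothetical table: 12 regions hosted on 5 region servers.
regions = {
    "rs1": ["", "b", "c"],
    "rs2": ["d", "e"],
    "rs3": ["f", "g", "h"],
    "rs4": ["i", "j"],
    "rs5": ["k", "l"],
}
all_regions = [key for keys in regions.values() for key in keys]

print(plan_map_tasks(all_regions, requested_maps=1000))  # one per region
print(plan_map_tasks(all_regions, requested_maps=4))     # capped by request
```

The point is that the count follows the number of regions, so a table on
5 region servers can still produce more than 5 map tasks.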

>
> Also, after the map tasks are over, the reduce task takes a bit more time.
> Is it due to moving the map output across the regionservers, i.e., moving
> the values of the same key to a particular reducer before it can start? Is
> there any way I can optimize the code (e.g. by storing the data for the
> same reducer nearby)?
>

Increase the number of reducers. Each reducer will have less data to move.
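
To see why, here is a small simulation in plain Python (not Hadoop code).
Hadoop's default partitioner sends a key to reducer "hash of key modulo
number of reducers"; the sketch mimics that with zlib.crc32 standing in
for Java's hashCode(). The key names and record count are hypothetical.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Pick a reducer for a key: hash modulo the reducer count.

    zlib.crc32 is only a stand-in for the Java hashCode() that
    Hadoop would use; the modulo step is the part that matters.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def records_per_reducer(keys, num_reducers):
    """Count how many map output records each reducer would fetch."""
    return Counter(partition(k, num_reducers) for k in keys)


# Hypothetical map output: 10,000 records with distinct keys.
keys = ["row-%05d" % i for i in range(10000)]

for n in (1, 4, 16):
    load = records_per_reducer(keys, n)
    print(n, "reducers -> busiest reducer gets", max(load.values()), "records")
```

With a single reducer, every record is copied to one task. With more
reducers the same records are spread out, so each one generally has less
to fetch and sort, as long as the keys are not heavily skewed.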

>
> Thanks :)
>