Posted to user@hbase.apache.org by Thibaut_ <tb...@blue.lu> on 2008/11/16 20:21:06 UTC

How to detect when the mapper is called the last time?

Hi, 

As each row of my HBase table can take a long time to process (waiting on
answers from other hosts), I would like to create a few threads to process
the data in parallel. The last call to the map function would then wait for
all threads to finish, returning only once everything is done and all
threads have exited.

How do I know when the last row is passed to my map function? (I'm
extending TableMap for my mapper, as the wiki examples do.) I didn't
find any function to check this.

Another possibility would be to create more map tasks and let the Hadoop
framework do the processing in parallel. However, I read somewhere that each
mapper gets an entire region. In my case, the data in each row is very
small, so each mapper could get millions of rows (with the default
region/block size).

What would you do?

Thanks,
Thibaut
-- 
View this message in context: http://www.nabble.com/How-to-detect-when-the-mapper-is-called-the-last-time--tp20528861p20528861.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: How to detect when the mapper is called the last time?

Posted by Thibaut_ <tb...@blue.lu>.
Thanks St.Ack,

That's exactly what I needed :-). I will modify the MultithreadedMapRunner
class and the Mapper interface to add setup/teardown logic.

Thanks,
Thibaut

-- 
View this message in context: http://www.nabble.com/How-to-detect-when-the-mapper-is-called-the-last-time--tp20528861p20530873.html


Re: How to detect when the mapper is called the last time?

Posted by Michael Stack <st...@duboce.net>.
Thibaut_ wrote:
> Hi, 
>
> As each row of my hbase table can take a lot of time to process (waiting on
> answers from other hosts), I would like to create a few threads to process
> that data in parallel. 
> I would then use the last call to the map function to
> wait for all threads to finish their job and only return the last call to
> the map function when everything is done and all threads exited.
>
>   
See MapRunner up in Hadoop.  It's the class that does the next, next,
next on the custom HBase RecordReader.  Sounds like you want your own
MapRunner so you can do some handling after we've run off the end of the
map's region.  You can set your own MapRunner on the JobConf.
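
A minimal sketch of that idea in plain Java, not Hadoop code: a runner loop
pulls rows until the input is exhausted, hands each row to a worker pool, and
returns only after every worker has finished. The class DrainingRunner, the
method runAll, and the Iterator/Consumer stand-ins for the RecordReader and
the map call are all hypothetical names chosen for illustration. In a real
job the same loop would presumably live in the run() method of a class
implementing Hadoop's MapRunnable, registered with JobConf.setMapRunnerClass.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class DrainingRunner {

    /**
     * Hands every row to rowHandler on a fixed-size worker pool and returns
     * only after all workers have finished. Returns the number of rows
     * submitted. The Iterator stands in for the RecordReader and the
     * Consumer for the map call (both are assumptions of this sketch).
     */
    public static <T> int runAll(Iterator<T> rows, Consumer<T> rowHandler, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int submitted = 0;
        try {
            // The "next, next, next" loop: runs until the input is exhausted.
            // Note: the pool's queue is unbounded here, so a very large input
            // would need a bounded queue or some other back-pressure.
            while (rows.hasNext()) {
                final T row = rows.next();
                pool.execute(() -> rowHandler.accept(row));
                submitted++;
            }
        } finally {
            // Stop accepting new work; already queued rows still get processed.
            pool.shutdown();
        }
        // Teardown: we have run off the end of the input, so wait for workers.
        if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
            pool.shutdownNow();
            throw new IllegalStateException("workers did not finish in time");
        }
        return submitted;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> rows = Arrays.asList("row1", "row2", "row3");
        AtomicInteger processed = new AtomicInteger();
        int submitted = runAll(rows.iterator(), r -> processed.incrementAndGet(), 2);
        System.out.println(submitted + " submitted, " + processed.get() + " processed");
    }
}
```

With this shape the mapper itself never needs to know which call is the last
one; the runner owns the end-of-input handling.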


> How do I know when the last row is passed to my mapper function? (I'm
> extending TableMap for my mapper as also done in the wiki examples) I didn't
> find any function to check this.
>
>   
I looked at overriding TableInputFormat to catch the end row, but the end
row is exclusive: the map never sees it, so you can't trigger your cleanup
when the end row shows up.

> Another possibility would be to create more mapper jobs and let the hadoop
> framework do the processing in parallel. However I read somewhere that each
> mapper gets an entire region. In my case, the data in each row is very
> small, so each mapper could get millions of rows (with the default
> region/block size). 
>
>   
Run in parallel if you can.  Yes, each map gets a region by default
(though it's possible to make a map cover more than one region
if you supply a number-of-splits < number-of-regions -- see getSplits in
TableInputFormatBase).

Do you have many regions?  Do you want to make it so you can split a 
region into pieces?  If so, override TableInputFormat or 
TableInputFormatBase and do your own getSplits implementation.
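
To illustrate the grouping case only (number-of-splits < number-of-regions),
here is a rough sketch of how adjacent regions could be folded into fewer
row ranges. It is not the TableInputFormatBase code: RegionGrouping, Range,
and group are hypothetical names, and row keys are modelled as Strings rather
than byte arrays. Each range is half-open, so its end row is never handed to
the map, and an empty end row is taken to mean "to the end of the table".
Splitting a single region into pieces would additionally need split points
inside the region, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RegionGrouping {

    /** Half-open row range [startRow, endRow); empty endRow = end of table. */
    public static final class Range {
        public final String startRow;
        public final String endRow;

        Range(String startRow, String endRow) {
            this.startRow = startRow;
            this.endRow = endRow;
        }

        @Override
        public String toString() {
            return "[" + startRow + ", " + endRow + ")";
        }
    }

    /**
     * Folds the regions, given by their sorted start keys, into at most
     * numSplits contiguous ranges of roughly equal region count.
     */
    public static List<Range> group(List<String> regionStartKeys, int numSplits) {
        List<Range> out = new ArrayList<>();
        int regions = regionStartKeys.size();
        if (regions == 0) {
            return out;
        }
        // Never produce more ranges than there are regions, and at least one.
        int n = Math.max(1, Math.min(numSplits, regions));
        for (int i = 0; i < n; i++) {
            int first = (int) ((long) i * regions / n);       // first region in this range
            int next = (int) ((long) (i + 1) * regions / n);  // first region of the next range
            String start = regionStartKeys.get(first);
            // The end row is the next range's start key, hence exclusive.
            String end = next < regions ? regionStartKeys.get(next) : "";
            out.add(new Range(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Five regions whose start keys are "", "b", "d", "f", "h".
        List<String> startKeys = Arrays.asList("", "b", "d", "f", "h");
        for (Range r : group(startKeys, 2)) {
            System.out.println(r);
        }
    }
}
```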

St.Ack

