Posted to user@hbase.apache.org by Rakhi Khatwani <ra...@gmail.com> on 2009/04/22 11:09:03 UTC

Custom Input Split

Hi,
     I have a table with N records,
     and I want to run a map reduce job with 4 maps and 0 reduces.
     Is there a way I can create my own custom input split so that I can
send 'n' records to each map?
    If there is a way, can I have a sample code snippet to gain a better
understanding?

Thanks
Raakhi.

Re: Custom Input Split

Posted by Lars George <la...@worldlingo.com>.
Rakhi,

Talking to the job counters is really easy. After the job finishes, do this:

    JobConf jobConf = createJob(...);
    // runJob() submits the job and blocks until it completes,
    // so no separate waitForCompletion() call is needed
    RunningJob job = JobClient.runJob(jobConf);
    Counters cnt = job.getCounters();
    // the enum constant must be the one your job increments;
    // for RowCounter that is its ROWS counter
    long val = cnt.getCounter(Counter.ROWS);

Lars


Re: Custom Input Split

Posted by Rakhi Khatwani <ra...@gmail.com>.
Hi Stack,
      Yes, I need the result to feed a program.
Thanks for the suggestions; I'll try out the Counter.ROWS approach
tomorrow.

Thanks,
Raakhi


Re: Custom Input Split

Posted by stack <st...@duboce.net>.
So you need the result to feed a program?

Maybe someone else knows how to ask a finished mapreduce job questions about
its counters? There must be a way?

Or, yeah, I suppose. I don't believe RowCounter writes the count to the
filesystem. You'd need to add that if you can't figure out a way to ask the
finished RowCounter job what the value of its Counter.ROWS counter was.

St.Ack


Re: Custom Input Split

Posted by Rakhi Khatwani <ra...@gmail.com>.
Hi St Ack,
          Well, I did go through the usage, where we are supposed to
supply 3 parameters: OutputDir, TableName, and Columns.
What I actually want is an int count of the number of
rows in the table.
I gather this program stores its output in some output directory... correct
me if I am going wrong.

Thanks,
Raakhi


Re: Custom Input Split

Posted by stack <st...@duboce.net>.
Oh, and the reason to use an MR job for counting rows is that if there are
many, a single process would take too long (if you know you have a small
table, use the 'count' command in the shell).

St.Ack


Re: Custom Input Split

Posted by Stack <sa...@gmail.com>.
If you run

./bin/hadoop jar hbase.jar rowcounter

it will emit usage. You are a smart fellow; I think you can take it
from there.

Stack




Re: Custom Input Split

Posted by Rakhi Khatwani <ra...@gmail.com>.
Hi Lars,
           Thanks for the suggestion; I also figured out my problem using
TableInputFormatBase.

My table had only one region, but I still wanted to split the input into
4 maps, so I am overriding the getSplits() method in
TableInputFormatBase.

One more question:
is there any method in the HBase API which can count the number of rows in a
table?
I tried googling it, and all I came across is a RowCounter class, which is a
mapreduce job to count the number of rows, but I really don't know how to
use it. Any suggestions?

thanks,
Raakhi



Re: Custom Input Split

Posted by Lars George <la...@worldlingo.com>.
Hi Rakhi,

This is all done in the TableInputFormatBase class, which you can extend 
and then override the getSplits() function:

http://hadoop.apache.org/hbase/docs/r0.19.1/api/org/apache/hadoop/hbase/mapred/TableInputFormatBase.html

This is where you can then specify how many rows per map are assigned.
Really straightforward as I see it. I have used it to implement
special "only use N regions" support where I can run a sample subset
against an MR job, for example only mapping 5 out of 8K regions of a table.

The default one will always split all regions into N maps, hence the
recommendation to set the number of maps to the number of regions in a
table. If you set it to something lower, it will split the regions
into a smaller number but with more rows per map, i.e. each map gets
more than one region to process.

Look into the source of the above class and it should be obvious - I hope.

Lars
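
For what it's worth, the row-range arithmetic behind such an override can be
sketched in plain Java. This is a hedged illustration only: `SplitSketch` and
`splitRanges` are made-up names, no HBase classes are used, and a real
implementation would subclass TableInputFormatBase and return one TableSplit
per computed range.

```java
// Sketch of the split arithmetic only (plain Java, no HBase dependencies).
public class SplitSketch {

    // Divide totalRows rows into numSplits contiguous [start, end) ranges,
    // spreading any remainder over the leading splits so sizes differ by
    // at most one row.
    static int[][] splitRanges(int totalRows, int numSplits) {
        int[][] ranges = new int[numSplits][2];
        int base = totalRows / numSplits;
        int remainder = totalRows % numSplits;
        int start = 0;
        for (int i = 0; i < numSplits; i++) {
            int size = base + (i < remainder ? 1 : 0);
            ranges[i][0] = start;       // inclusive start row index
            ranges[i][1] = start + size; // exclusive end row index
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // e.g. 10 rows over 4 maps -> [0,3) [3,6) [6,8) [8,10)
        for (int[] r : splitRanges(10, 4)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```

In an actual getSplits() override, each [start, end) pair would instead be a
pair of row keys marking where one map's scan begins and the next one's ends.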

