Posted to user@hbase.apache.org by john smith <js...@gmail.com> on 2010/01/08 04:51:09 UTC

MR in HBase

Hi all,

My requirement is that I must read two tables (belonging to the same
region server) in the same Map.

Normally TableMap supports only one table at a time, and right now I am
reading the entire second table in one of the maps. This is a big overhead.
Can anyone suggest a modification of TableMap, or a different approach,
that can read two tables at the same time? This would be
very useful to us!

Thanks
J-S

Re: MR in HBase

Posted by bharath v <bh...@gmail.com>.
John,

I implemented this some time back. My need was similar to yours: it
involved scanning more than one table in a Map at the same time, as in the
example you mentioned. You just need to follow the steps Mridul described.

You need to change the getSplits() and getRecordReader() functions so that
they can handle two tables at the same time:

if (split belongs to table1) {
    return the RecordReader/InputSplits for table1;
} else {
    return the ones for table2;
}

You also need to change the way TableMapReduceUtil initiates the Map job,
so write your own CustomTableMapReduceUtil.
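
The "split belongs to table1" test above only works if each split still knows its table after it has been serialized and shipped to the task. The following is a minimal sketch of that idea in plain Java, with no Hadoop or HBase dependency. TaggedSplit and its fields are hypothetical names invented for illustration; as far as I know HBase's own TableSplit already records its table name, so a real implementation would probably reuse that rather than define a new type.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

final class TaggedSplitDemo {
    public static void main(String[] args) throws IOException {
        TaggedSplit sent = new TaggedSplit("table2", "row-100", "row-200");
        TaggedSplit received = TaggedSplit.roundTrip(sent);
        // The receiving side picks the reader from the tag alone.
        if (received.table.equals("table1")) {
            System.out.println("use the record reader for table1");
        } else {
            System.out.println("use the record reader for table2");
        }
    }
}

// Hypothetical split type (not an HBase class): a row range plus its table name.
final class TaggedSplit {
    String table;
    String startRow;
    String endRow;

    // No-arg constructor so a receiver can create an empty split and fill it in.
    TaggedSplit() {}

    TaggedSplit(String table, String startRow, String endRow) {
        this.table = table;
        this.startRow = startRow;
        this.endRow = endRow;
    }

    // Write the tag together with the row range.
    void write(DataOutput out) throws IOException {
        out.writeUTF(table);
        out.writeUTF(startRow);
        out.writeUTF(endRow);
    }

    // Read the fields back in the same order they were written.
    void readFields(DataInput in) throws IOException {
        table = in.readUTF();
        startRow = in.readUTF();
        endRow = in.readUTF();
    }

    // Serialize and deserialize in memory, standing in for the trip to a task.
    static TaggedSplit roundTrip(TaggedSplit split) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        split.write(new DataOutputStream(bytes));
        TaggedSplit copy = new TaggedSplit();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }
}
```

The write()/readFields() pair mirrors the shape of Hadoop's Writable contract, but here it is only a local stand-in.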


Hope this helps in some way!

Thanks

V.Bharath
Btech-3rd year
IIIT-Hyderabad

On Mon, Jan 11, 2010 at 3:05 AM, Mridul Muralidharan
<mr...@yahoo-inc.com> wrote:

>
> Unfortunately I can't directly share our code, but for an example you
> can look at MultipleInputs and/or DelegatingInputFormat in Hadoop.
>
> If you want a more sophisticated example, please take a look at the Pig
> subproject in Hadoop (though its code might be a bit too complicated to
> investigate for this simple use case).
>
>
> Regards,
> Mridul
>

Re: MR in HBase

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Unfortunately I can't directly share our code, but for an example you
can look at MultipleInputs and/or DelegatingInputFormat in Hadoop.

If you want a more sophisticated example, please take a look at the Pig
subproject in Hadoop (though its code might be a bit too complicated to
investigate for this simple use case).
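
For readers who have not looked at those classes: the idea behind MultipleInputs, as I understand it, is that each input is registered together with its own mapper, and a delegating layer routes every record to the mapper registered for the input it came from. The sketch below models only that routing in plain Java. DelegatingMapperSketch, register() and map() are hypothetical names for illustration, not Hadoop APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class DelegatingMapperDemo {
    public static void main(String[] args) {
        DelegatingMapperSketch mapper = new DelegatingMapperSketch();
        mapper.register("A", row -> "fromA:" + row);
        mapper.register("B", row -> "fromB:" + row.toUpperCase());
        System.out.println(mapper.map("A", "r1")); // prints "fromA:r1"
        System.out.println(mapper.map("B", "r2")); // prints "fromB:R2"
    }
}

// Hypothetical model: one mapping function per input source (here, per table).
final class DelegatingMapperSketch {
    private final Map<String, Function<String, String>> mappers = new HashMap<>();

    // Associate an input source with the function that should process its rows.
    void register(String source, Function<String, String> mapper) {
        mappers.put(source, mapper);
    }

    // Route a row to the mapper registered for the source it was read from.
    String map(String source, String row) {
        Function<String, String> mapper = mappers.get(source);
        if (mapper == null) {
            throw new IllegalStateException("No mapper registered for " + source);
        }
        return mapper.apply(row);
    }
}
```

If both tables can share one mapper, this layer is unnecessary and only the input format needs to be composite.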

Regards,
Mridul

john smith wrote:
> Mridul,
>
> It seems feasible, but I am not 100% clear. Can you please show us
> your implementation in Hadoop so that we can get some idea and implement the
> same for HBase? Thanks for your help.
> 
> J-S
> 


Re: MR in HBase

Posted by john smith <js...@gmail.com>.
Mridul,

It seems feasible, but I am not 100% clear. Can you please show us
your implementation in Hadoop so that we can get some idea and implement the
same for HBase? Thanks for your help.

J-S

On Sat, Jan 9, 2010 at 12:26 AM, Mridul Muralidharan
<mr...@yahoo-inc.com>wrote:

>
> Hi,
>
>
> This is assuming there is no easier way to do it (someone from the HBase team
> can comment better!).
>
> But the usual way to handle this for MapReduce is to create a composite
> input format, which delegates to the underlying formats to generate the
> splits and the corresponding record readers based on the split.
>
>
> I have not done this for HBase though, but looking at
> TableInputFormatBase, it looks possible to implement.
>

Re: MR in HBase

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Hi,


This is assuming there is no easier way to do it (someone from the HBase
team can comment better!).

But the usual way to handle this for MapReduce is to create a composite
input format, which delegates to the underlying formats to generate the
splits and the corresponding record readers based on the split.


I have not done this for HBase though, but looking at
TableInputFormatBase, it looks possible to implement.

Specifically for HBase, something along the lines of:

--- start dirty pseudo code ---

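CustomTableInputFormat wraps one TableInputFormatBase delegate per table and
implements setConf() to configure the table(s) required.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormatBase;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public abstract class CustomTableInputFormat
    extends InputFormat<ImmutableBytesWritable, Result> implements Configurable {

   private Configuration conf;
   private TableInputFormatBase delegate1;  // input format for table 1
   private TableInputFormatBase delegate2;  // input format for table 2

   // Left to the implementer: build a configured input format per table,
   // and decide which table a given split came from.
   protected abstract TableInputFormatBase createTable1InputFormat(Configuration conf);
   protected abstract TableInputFormatBase createTable2InputFormat(Configuration conf);
   protected abstract boolean isForTable1(InputSplit split);
   protected abstract boolean isForTable2(InputSplit split);

   public void setConf(Configuration conf) {
     this.conf = conf;
     delegate1 = createTable1InputFormat(conf);
     delegate2 = createTable2InputFormat(conf);
   }

   public Configuration getConf() { return conf; }

   public List<InputSplit> getSplits(JobContext context) throws IOException {
     // Union of the splits of both tables.
     List<InputSplit> retval = new ArrayList<InputSplit>();
     retval.addAll(delegate1.getSplits(context));
     retval.addAll(delegate2.getSplits(context));
     return retval;
   }

   public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
       InputSplit split, TaskAttemptContext context)
       throws IOException, InterruptedException {
     // Hand each split back to the format that produced it.
     if (isForTable1(split)) return delegate1.createRecordReader(split, context);
     else if (isForTable2(split)) return delegate2.createRecordReader(split, context);
     else throw new IOException("Unexpected split: " + split);
   }

}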

--- end pseudo code ---
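
To try the composite pattern without a cluster, here is a minimal self-contained model of it in plain Java. It is only a sketch: Split, SingleTableFormat and CompositeFormat are hypothetical stand-ins for InputSplit, a per-table input format and the composite format, and the "record reader" is just an iterator over made-up row keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class CompositeDemo {
    public static void main(String[] args) {
        CompositeFormat format = new CompositeFormat(
            new SingleTableFormat("A", "a", "m"),   // table A with two regions
            new SingleTableFormat("B", "k"));       // table B with one region
        for (Split split : format.getSplits()) {
            Iterator<String> rows = format.createRecordReader(split);
            while (rows.hasNext()) System.out.println(rows.next());
        }
    }
}

// Hypothetical stand-in for an InputSplit: remembers which table it covers.
final class Split {
    final String table;
    final String startRow;
    Split(String table, String startRow) { this.table = table; this.startRow = startRow; }
}

// Hypothetical stand-in for a single-table input format.
final class SingleTableFormat {
    private final String table;
    private final List<String> regionStartRows;

    SingleTableFormat(String table, String... regionStartRows) {
        this.table = table;
        this.regionStartRows = Arrays.asList(regionStartRows);
    }

    String table() { return table; }

    // One split per region of this table.
    List<Split> getSplits() {
        List<Split> out = new ArrayList<>();
        for (String start : regionStartRows) out.add(new Split(table, start));
        return out;
    }

    // Fake reader: two made-up row keys per split.
    Iterator<String> createRecordReader(Split split) {
        String prefix = table + "/" + split.startRow;
        return Arrays.asList(prefix + "-row1", prefix + "-row2").iterator();
    }
}

// The composite: union of all delegates' splits, readers dispatched by table.
final class CompositeFormat {
    private final List<SingleTableFormat> delegates;

    CompositeFormat(SingleTableFormat... delegates) { this.delegates = Arrays.asList(delegates); }

    List<Split> getSplits() {
        List<Split> all = new ArrayList<>();
        for (SingleTableFormat d : delegates) all.addAll(d.getSplits());
        return all;
    }

    Iterator<String> createRecordReader(Split split) {
        for (SingleTableFormat d : delegates) {
            if (d.table().equals(split.table)) return d.createRecordReader(split);
        }
        throw new IllegalArgumentException("No delegate for table " + split.table);
    }
}
```

Using a list of delegates instead of two fixed fields is a choice made for this sketch; it extends to more than two tables without further changes.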

Regards,
Mridul

john smith wrote:
> Mridul,
>
> Can you be more clear? I didn't get you!
> 


Re: MR in HBase

Posted by john smith <js...@gmail.com>.
Mridul,

Can you be more clear? I didn't get you!

On Fri, Jan 8, 2010 at 6:13 PM, Mridul Muralidharan
<mr...@yahoo-inc.com>wrote:

>
>
> If you just want to scan both tables in your mapper, and assuming there is no
> easier way to do it, can you not write a composite input format which
> delegates to both tables' input formats?
>
>
> Regards,
> Mridul
>
>

Re: MR in HBase

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.

If you just want to scan both tables in your mapper, and assuming there is
no easier way to do it, can you not write a composite input format
which delegates to both tables' input formats?


Regards,
Mridul

john smith wrote:
> Stack,
> 
> The requirement is that I need to scan two tables, A and B, for an MR
> job. Order is not important. That is, the reduce phase receives keys
> from both A and B.
>
> Presently I am using TableMap for "A", and in one of the mappers I am
> reading the entire B using a scanner. But this is a big overhead, because
> non-local B data will be transferred (over the network) to the machine
> executing that Map phase. Instead, I was thinking of some kind of variant
> of TableMap which scans both A and B and emits the corresponding keys.
> Order is not at all important, and there are no random lookups. I need the
> entire B table's keys in some way or the other, with the least overhead!
>
> There is also one more solution I was thinking of. Suppose I am scanning
> some particular region using TableMap. I can get that region's name
> using some function in the API, then build a scanner on B over that
> particular region and emit all the keys from B. This doesn't require any
> network transfer of data. Is this solution feasible? If yes, any hints on
> what classes to use from the API?
> 
> Thanks ,
> J-S
> 


Re: MR in HBase

Posted by Andrew Purtell <ap...@apache.org>.
John,

Have you looked at Cascading? 

   http://www.cascading.org/ 

It sounds like you could use two HBase table backed inputs, then
make use of the filter and join type functions that Cascading provides,
and then use an HBase table backed output to collect the result, in a
way that is natural for that framework.

Best regards,

   - Andy




----- Original Message ----
> From: john smith <js...@gmail.com>
> To: hbase-user@hadoop.apache.org
> Sent: Fri, January 8, 2010 2:00:09 AM
> Subject: Re: MR in HBase
> 
> Stack,
> 
> The requirement is that I need to scan two tables, A and B, for an MR
> job. Order is not important. That is, the reduce phase receives keys
> from both A and B.
>
> Presently I am using TableMap for "A", and in one of the mappers I am
> reading the entire B using a scanner. But this is a big overhead, because
> non-local B data will be transferred (over the network) to the machine
> executing that Map phase. Instead, I was thinking of some kind of variant
> of TableMap which scans both A and B and emits the corresponding keys.
> Order is not at all important, and there are no random lookups. I need the
> entire B table's keys in some way or the other, with the least overhead!
>
> There is also one more solution I was thinking of. Suppose I am scanning
> some particular region using TableMap. I can get that region's name
> using some function in the API, then build a scanner on B over that
> particular region and emit all the keys from B. This doesn't require any
> network transfer of data. Is this solution feasible? If yes, any hints on
> what classes to use from the API?
> 
> Thanks ,
> J-S
> 

Re: MR in HBase

Posted by john smith <js...@gmail.com>.
Stack,

The requirement is that I need to scan two tables, A and B, for an MR
job. Order is not important. That is, the reduce phase receives keys
from both A and B.

Presently I am using TableMap for "A", and in one of the mappers I am
reading the entire B using a scanner. But this is a big overhead, because
non-local B data will be transferred (over the network) to the machine
executing that Map phase. Instead, I was thinking of some kind of variant
of TableMap which scans both A and B and emits the corresponding keys.
Order is not at all important, and there are no random lookups. I need the
entire B table's keys in some way or the other, with the least overhead!

There is also one more solution I was thinking of. Suppose I am scanning
some particular region using TableMap. I can get that region's name
using some function in the API, then build a scanner on B over that
particular region and emit all the keys from B. This doesn't require any
network transfer of data. Is this solution feasible? If yes, any hints on
what classes to use from the API?
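
To make the second idea concrete, here is a minimal sketch in plain Java of scanning B only over the key range of the region of A that a mapper is processing. Table B is modelled as a sorted in-memory map, and scanRange() is a hypothetical helper, not an HBase call. The sketch assumes the HBase convention that a region covers [startKey, endKey) and that an empty key means "unbounded". It says nothing about locality: whether the matching rows of B sit on the same machine depends on where B's regions are hosted, which is not modelled here.

```java
import java.util.SortedMap;
import java.util.TreeMap;

final class RegionRangeScanDemo {
    public static void main(String[] args) {
        TreeMap<String, String> tableB = new TreeMap<>();
        tableB.put("apple", "1");
        tableB.put("cherry", "2");
        tableB.put("mango", "3");
        tableB.put("xigua", "4");
        // Pretend this mapper is processing a region of A covering [c, n).
        System.out.println(RangeScan.scanRange(tableB, "c", "n").keySet()); // prints [cherry, mango]
    }
}

final class RangeScan {
    // Rows of the table with startRow <= key < stopRow; "" means unbounded on that side.
    static SortedMap<String, String> scanRange(
            TreeMap<String, String> table, String startRow, String stopRow) {
        if (startRow.isEmpty() && stopRow.isEmpty()) return table;
        if (startRow.isEmpty()) return table.headMap(stopRow);   // exclusive upper bound
        if (stopRow.isEmpty()) return table.tailMap(startRow);   // inclusive lower bound
        return table.subMap(startRow, stopRow);                  // [startRow, stopRow)
    }
}
```

Because the regions of A together cover the whole key space, running this once per region of A visits every row of B exactly once.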

Thanks ,
J-S

On Fri, Jan 8, 2010 at 10:46 AM, stack <st...@duboce.net> wrote:

> This is a little tough.  Do both tables have the same number of regions?  Are
> you walking through the two tables serially in your MapReduce job, or do you
> want to do random lookups into the second table dependent on the row you are
> currently processing in table one?
>
> St.Ack
>
>

Re: MR in HBase

Posted by stack <st...@duboce.net>.
This is a little tough.  Do both tables have the same number of regions?  Are
you walking through the two tables serially in your MapReduce job, or do you want
to do random lookups into the second table dependent on the row you are
currently processing in table one?

St.Ack


On Thu, Jan 7, 2010 at 7:51 PM, john smith <js...@gmail.com> wrote:

> Hi all,
>
> My requirement is that I must read two tables (belonging to the same
> region server) in the same Map.
>
> Normally TableMap supports only one table at a time, and right now I am
> reading the entire second table in one of the maps. This is a big overhead.
> Can anyone suggest a modification of TableMap, or a different approach,
> that can read two tables at the same time? This would be
> very useful to us!
>
> Thanks
> J-S
>