Posted to user@hbase.apache.org by Doug Meil <do...@explorysmedical.com> on 2011/08/04 17:18:57 UTC

Re: loading data in HBase table using APIs

David, thanks for the tip on this.  I just checked in a reorg to the
performance chapter and included this tip.

Stack does the website updating so it's not visible yet, but this tip is
in there.

Thanks!




On 7/18/11 6:18 PM, "Buttler, David" <bu...@llnl.gov> wrote:

>After a quick scan of the performance section, I didn't see what I
>consider to be a huge performance consideration:
>If at all possible, don't do a reduce on your puts.  The shuffle/sort
>part of the map/reduce paradigm is often useless if all you are trying to
>do is insert/update data in HBase.  From the OP's description it sounds
>like he doesn't need to have any kind of reduce phase [and may be a great
>candidate for bulk loading and the pre-creation of regions].  In any
>case, don't reduce if you can avoid it.
>
>Dave
>
>-----Original Message-----
>From: Doug Meil [mailto:doug.meil@explorysmedical.com]
>Sent: Sunday, July 17, 2011 4:40 PM
>To: user@hbase.apache.org
>Subject: Re: loading data in HBase table using APIs
>
>
>Hi there-
>
>Take a look at this for starters:
>http://hbase.apache.org/book.html#schema
>
>1)  double-check your row-keys (sanity check); that's covered in the Schema
>Design chapter.
>
>http://hbase.apache.org/book.html#performance
>
>
>2)  if not using bulk-load, pre-create regions; do this regardless of
>whether you are using MR or not.
>
>3)  if not using an MR job but using multiple threads with the Java API,
>take a look at HTableUtil.  It's on trunk, but that utility can help you.
>
>
>
>
>
>
>On 7/17/11 4:08 PM, "abhay ratnaparkhi" <ab...@gmail.com>
>wrote:
>
>>Hello,
>>
>>I am loading a lot of data into an HBase table.
>>I am using the HBase Java API to do this.
>>If I convert this code to a map-reduce job and use the *TableOutputFormat*
>>class, will I get any performance improvement?
>>
>>As I am not getting input data from an existing HBase table or HDFS files,
>>there will not be any input to the map task.
>>The only advantage is that multiple map tasks running simultaneously might
>>make processing faster.
>>
>>Thanks!
>>Regards,
>>Abhay
>
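The multi-threaded tip above (item 3) can be sketched with the plain client API of that era, HTableUtil's exact methods aside. A minimal sketch, assuming invented table, family, and row-key names: give each loader thread its own HTable with autoflush off, so puts batch up in a client-side write buffer instead of one RPC per put.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedLoader implements Runnable {
  private final Configuration conf;

  BufferedLoader(Configuration conf) { this.conf = conf; }

  public void run() {
    try {
      // HTable is not thread-safe: one instance per thread.
      HTable table = new HTable(conf, "myTable");   // hypothetical table name
      table.setAutoFlush(false);                    // buffer puts client-side
      table.setWriteBufferSize(12 * 1024 * 1024);   // 12 MB write buffer
      for (int i = 0; i < 100000; i++) {
        Put p = new Put(Bytes.toBytes(String.format("row-%08d", i)));
        p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
        table.put(p);                               // queued, not yet sent
      }
      table.flushCommits();                         // push the final batch
      table.close();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    for (int t = 0; t < 4; t++) {
      new Thread(new BufferedLoader(conf)).start();
    }
  }
}
```

This trades durability for throughput: buffered puts are lost if the client dies before a flush, which is usually acceptable for a re-runnable load job.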


Re: loading data in HBase table using APIs

Posted by Jean-Daniel Cryans <jd...@apache.org>.
> Can you give me an example where I can use TableOutputFormat to insert
> data into HBase (one that does not have a reduce step)?

Just set the output of your map to use TableOutputFormat; an example
comes with HBase:
https://github.com/apache/hbase/blob/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/Import.java

J-D
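What J-D describes can be sketched as a map-only job. The table and class names below are invented, and this assumes the HBase MapReduce API of that era, in which passing a null reducer to initTableReducerJob wires up TableOutputFormat and records the output table name in the job configuration:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyCopy {

  // Reads rows from the input table and emits one Put per row; the key
  // written to the context is the row key, the value is the Put itself.
  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(row.get());
      // ... add the cells you want to write, e.g. copied from 'value' ...
      context.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "map-only HBase load");
    job.setJarByClass(MapOnlyCopy.class);

    Scan scan = new Scan();
    TableMapReduceUtil.initTableMapperJob("inputTable", scan,
        CopyMapper.class, ImmutableBytesWritable.class, Put.class, job);

    // A null reducer class sets up TableOutputFormat against "outputTable".
    TableMapReduceUtil.initTableReducerJob("outputTable", null, job);
    job.setNumReduceTasks(0);  // map-only: Puts go straight to the table

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```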

Re: loading data in HBase table using APIs

Posted by abhay ratnaparkhi <ab...@gmail.com>.
Yes, I reviewed that.

I want to insert data into HBase (the source is an HBase table and the sink
is also an HBase table).
I do not need a reduce step.

Previously I used *IdentityTableReducer*, like below.

 TableMapReduceUtil.initTableMapperJob(inPutTableName, scan,
     SSIBulkLoaderMapper.class, ImmutableBytesWritable.class, Put.class, job);
 TableMapReduceUtil.initTableReducerJob(outPutTableName,
     IdentityTableReducer.class, job);

I don't need to use a reducer (as it is not necessary); I want to insert from
the map.
One way is to use the *HTable API* to insert data from the map. (This is
working.)
Another way is to use *TableOutputFormat*. (How do I use this? I tried doing
context.write(new ImmutableBytesWritable(Bytes.toBytes(OUTPUT_TABLE_NAME)),
p);  from the map, and it's not working.)

Can you give me an example where I can use TableOutputFormat to insert
data into HBase (one that does not have a reduce step)?

Thank You!
Abhay

On Thu, Aug 18, 2011 at 5:56 PM, Doug Meil <do...@explorysmedical.com> wrote:

>
> Have you reviewed this?
>
> http://hbase.apache.org/book.html#mapreduce.example
>
> I'm planning to add more examples in this chapter, but there is some
> sample code to review.
>
>
>

Re: loading data in HBase table using APIs

Posted by Doug Meil <do...@explorysmedical.com>.
Have you reviewed this?

http://hbase.apache.org/book.html#mapreduce.example

I'm planning to add more examples in this chapter, but there is some
sample code to review.



On 8/18/11 4:18 AM, "abhay ratnaparkhi" <ab...@gmail.com>
wrote:

>Thank you for all this information.
>Can you give me an example where I have only a map task and can put data
>into HBase from the map?
>I tried the following settings.
>
>          job = new Job(conf, "Bulk Processing - Only Map.");
>          job.setNumReduceTasks(0);
>          job.setJarByClass(MyBulkDataLoader.class);
>          //job.setMapOutputKeyClass(ImmutableBytesWritable.class);
>          //job.setMapOutputValueClass(ImmutableBytesWritable.class);
>          job.setOutputKeyClass(ImmutableBytesWritable.class);
>          job.setOutputValueClass(Put.class);
>          job.setOutputFormatClass(TableOutputFormat.class);
>          Scan scan = new Scan();
>          TableMapReduceUtil.initTableMapperJob(INPUT_TABLE_NAME, scan,
>              MyBulkLoaderMapper.class, ImmutableBytesWritable.class,
>              Put.class, job);
>          //TableMapReduceUtil.initTableReducerJob(OUTPUT_TABLE_NAME,
>          //    IdentityTableReducer.class, job);
>          LOG.info("Started " + INPUT_TABLE_NAME);
>          job.waitForCompletion(true);
>
>From the map class I am doing...
>context.write(new 
>ImmutableBytesWritable(Bytes.toBytes(OUTPUT_TABLE_NAME)),
>p);   //P is an instance of Put.
>
>Previously I was using "IdentityTableReducer". As a reduce step is not
>required for bulk loading, I only need to insert data into HBase through
>the map phase.
>Where can I give the output table name?
>If you can give me an example that has only a map task, with HBase as both
>source and sink, that will be helpful.
>
>Thank you.
>Abhay.


Re: loading data in HBase table using APIs

Posted by abhay ratnaparkhi <ab...@gmail.com>.
Thank you for all this information.
Can you give me an example where I have only a map task and can put data into
HBase from the map?
I tried the following settings.

          job = new Job(conf, "Bulk Processing - Only Map.");
          job.setNumReduceTasks(0);
          job.setJarByClass(MyBulkDataLoader.class);
          //job.setMapOutputKeyClass(ImmutableBytesWritable.class);
          //job.setMapOutputValueClass(ImmutableBytesWritable.class);
          job.setOutputKeyClass(ImmutableBytesWritable.class);
          job.setOutputValueClass(Put.class);
          job.setOutputFormatClass(TableOutputFormat.class);
          Scan scan = new Scan();
          TableMapReduceUtil.initTableMapperJob(INPUT_TABLE_NAME, scan,
              MyBulkLoaderMapper.class, ImmutableBytesWritable.class,
              Put.class, job);
          //TableMapReduceUtil.initTableReducerJob(OUTPUT_TABLE_NAME,
          //    IdentityTableReducer.class, job);
          LOG.info("Started " + INPUT_TABLE_NAME);
          job.waitForCompletion(true);

From the map class I am doing...
context.write(new ImmutableBytesWritable(Bytes.toBytes(OUTPUT_TABLE_NAME)),
p);   //P is an instance of Put.

Previously I was using "IdentityTableReducer". As a reduce step is not
required for bulk loading, I only need to insert data into HBase through the
map phase.
Where can I give the output table name?
If you can give me an example that has only a map task, with HBase as both
source and sink, that will be helpful.

Thank you.
Abhay.
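One way to answer the output-table question above, as a sketch with invented names: TableOutputFormat reads its target table from the job configuration (the TableOutputFormat.OUTPUT_TABLE key), so set it before constructing the Job rather than passing the table name as the map output key.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class OutputTableSetup {
  public static Job configure(String outputTable) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // TableOutputFormat takes the target table from the configuration,
    // not from the map output key.
    conf.set(TableOutputFormat.OUTPUT_TABLE, outputTable);
    Job job = new Job(conf, "map-only load into " + outputTable);
    job.setNumReduceTasks(0);                       // no reduce phase
    job.setOutputFormatClass(TableOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    // In the mapper, write the row key (not the table name) as the key:
    //   context.write(new ImmutableBytesWritable(put.getRow()), put);
    return job;
  }
}
```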
On Tue, Aug 9, 2011 at 4:51 AM, Stack <st...@duboce.net> wrote:

> The doc here suggests avoiding reduce:
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink
> St.Ack
>

Re: loading data in HBase table using APIs

Posted by Stack <st...@duboce.net>.
The doc here suggests avoiding reduce:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#sink
St.Ack

On Fri, Aug 5, 2011 at 2:19 AM, Doug Meil <do...@explorysmedical.com> wrote:
>
> It's not obvious to a lot of newer folks that an MR job can exist minus
> the R.
>

Re: loading data in HBase table using APIs

Posted by Doug Meil <do...@explorysmedical.com>.
It's not obvious to a lot of newer folks that an MR job can exist minus
the R.





On 8/4/11 5:52 PM, "Michael Segel" <mi...@hotmail.com> wrote:

>
>Uhm Silly question...
>
>Why would you ever need a reduce step when you're writing to an HBase
>table?
>
>Now I'm sure that there may be some fringe case, but in the past two
>years, I've never come across a case where you would need to do a reducer
>when you're writing to HBase.
>
>So what am I missing?
>
>
>


RE: loading data in HBase table using APIs

Posted by Michael Segel <mi...@hotmail.com>.
Uhm, silly question...

Why would you ever need a reduce step when you're writing to an HBase table?

Now I'm sure that there may be some fringe case, but in the past two years, I've never come across a case where you would need to do a reducer when you're writing to HBase.

So what am I missing?



> From: doug.meil@explorysmedical.com
> To: user@hbase.apache.org
> Date: Thu, 4 Aug 2011 11:18:57 -0400
> Subject: Re: loading data in HBase table using APIs
> 
> 
> David, thanks for the tip on this.  I just checked in a reorg to the
> performance chapter and included this tip.
> 
> Stack does the website updating so it's not visible yet, but this tip is
> in there.
> 
> Thanks!
> 
> 
> 
> 
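The "pre-creation of regions" advice that recurs in this thread can be sketched as follows. The table name, column family, and single-hex-digit key prefix scheme are assumptions (this particular scheme only sorts correctly for up to 16 regions); it relies on the HBaseAdmin.createTable(descriptor, splitKeys) overload from the client API of that era.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  // Split keys "1" .. "f" for row keys that start with one hex digit;
  // n-1 split keys yield n regions.
  static byte[][] hexSplits(int regions) {
    byte[][] splits = new byte[regions - 1][];
    for (int i = 1; i < regions; i++) {
      splits[i - 1] = Bytes.toBytes(String.format("%x", i));
    }
    return splits;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("myTable"); // invented name
    desc.addFamily(new HColumnDescriptor("cf"));
    // Create the table with 15 split keys => 16 regions, so the initial
    // load spreads across region servers instead of hitting one region.
    admin.createTable(desc, hexSplits(16));
  }
}
```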
>