Posted to user@hbase.apache.org by Stuart White <st...@gmail.com> on 2009/04/02 22:28:19 UTC

Bulk import - does sort order of input data affect success rate?

I, like many others, am having difficulty getting a mapred job that
bulk imports data into an HBase table to run successfully to
completion.

At this time, rather than get into specifics of my configuration, the
exceptions I'm receiving, etc., I wanted to ask a general question:

Should I expect my bulk import to be more likely to succeed if my data
is sorted by its key?
Or should I expect my bulk import to be more likely to succeed if my
data is randomized?
Or should I expect the ordering of my input data to have no effect on
my ability to successfully bulk import records?

Thanks.

Re: Bulk import - is the error general to both MapReduce and non-MapReduce programs?

Posted by Stuart White <st...@gmail.com>.
As I understand it, the problem I am facing is not specific to
MapReduce, so I would expect Ryan's code to apply equally to your
case.

On Thu, Apr 2, 2009 at 4:37 PM, Taylor, Ronald C <ro...@pnl.gov> wrote:
>
> Hello,
>
> I have been following this thread and have a question. I am new to HBase coding, and within the past few days I have written a standalone (not MapReduce-based) Java program to do a bulk upload into one HBase table. I believe I got the same error that you have been discussing. The program works fine on small uploads but fails with the error message you mention when I move to importing tens of thousands of rows. So I wanted to ask: has this import error been reported only for MapReduce-based programs, or is it more general? If it is more general, I assume it may be what affects my current import program, and I should try the doCommit() code shown below as a fix.
>  Cheers,
>  Ron Taylor
> ___________________________________________
> Ronald Taylor, Ph.D.
> Computational Biology & Bioinformatics Group
> Pacific Northwest National Laboratory
> 902 Battelle Boulevard
> P.O. Box 999, MSIN K7-90
> Richland, WA  99352 USA
> Office:  509-372-6568
> Email: ronald.taylor@pnl.gov
> www.pnl.gov
>
> -----Original Message-----
> From: Stuart White [mailto:stuart.white1@gmail.com]
> Sent: Thursday, April 02, 2009 1:37 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Bulk import - does sort order of input data affect success rate?
>
> On Thu, Apr 2, 2009 at 3:30 PM, Ryan Rawson <ry...@gmail.com> wrote:
>> The last thing - success should not be a function of sort order.
>>
>> However, speed will be related.
>
> How?  Sorted = faster, or Sorted = slower?
>
>>
>> One thing I found I had to do was:
>>    private void doCommit(HTable t, BatchUpdate update) throws IOException {
>>      boolean commited = false;
>>      while (!commited) {
>>        try {
>>          t.commit(update);
>>          commited = true;
>>        } catch (RetriesExhaustedException e) {
>>          // DAMN, ignore
>>        }
>>      }
>>    }
>>
>
> I'm running a mapred job, using TableOutputFormat to write the results to HBase.  For the code you've provided, was that for a custom output format?  Or a standalone (non-mapred) application?  I see the point you're making, I just don't understand where I'd put that code.
> Thanks!
>

RE: Bulk import - is the error general to both MapReduce and non-MapReduce programs?

Posted by "Taylor, Ronald C" <ro...@pnl.gov>.
 
Hello,

I have been following this thread and have a question. I am new to HBase coding, and within the past few days I have written a standalone (not MapReduce-based) Java program to do a bulk upload into one HBase table. I believe I got the same error that you have been discussing. The program works fine on small uploads but fails with the error message you mention when I move to importing tens of thousands of rows. So I wanted to ask: has this import error been reported only for MapReduce-based programs, or is it more general? If it is more general, I assume it may be what affects my current import program, and I should try the doCommit() code shown below as a fix.
  Cheers,
  Ron Taylor
___________________________________________
Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group
Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, MSIN K7-90
Richland, WA  99352 USA
Office:  509-372-6568
Email: ronald.taylor@pnl.gov
www.pnl.gov

-----Original Message-----
From: Stuart White [mailto:stuart.white1@gmail.com] 
Sent: Thursday, April 02, 2009 1:37 PM
To: hbase-user@hadoop.apache.org
Subject: Re: Bulk import - does sort order of input data affect success rate?

On Thu, Apr 2, 2009 at 3:30 PM, Ryan Rawson <ry...@gmail.com> wrote:
> The last thing - success should not be a function of sort order.
>
> However, speed will be related.

How?  Sorted = faster, or Sorted = slower?

>
> One thing I found I had to do was:
>    private void doCommit(HTable t, BatchUpdate update) throws IOException {
>      boolean commited = false;
>      while (!commited) {
>        try {
>          t.commit(update);
>          commited = true;
>        } catch (RetriesExhaustedException e) {
>          // DAMN, ignore
>        }
>      }
>    }
>

I'm running a mapred job, using TableOutputFormat to write the results to HBase.  For the code you've provided, was that for a custom output format?  Or a standalone (non-mapred) application?  I see the point you're making, I just don't understand where I'd put that code.
Thanks!

Re: Bulk import - does sort order of input data affect success rate?

Posted by Billy Pearson <bi...@sbcglobal.net>.

I found that using HRegionPartitioner on tables that are not new and have
multiple regions per server speeds things up. It might be worth looking
into making an HServerPartitioner, with one reduce per server, but that
would lose performance if the server has many spare cores to use.
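
As a rough illustration of the idea behind partitioning output by region (this is not HBase's actual HRegionPartitioner code), the sketch below maps a row key to the index of the region whose start key is the greatest one not exceeding it. The class name RegionLookup, the method regionFor, and the sample start keys are all hypothetical, and plain Java Strings stand in for HBase's byte-array row keys.

```java
import java.util.Arrays;

/** Hypothetical sketch: pick a partition number from a table's sorted region start keys. */
public class RegionLookup {
    // Sorted ascending; the first entry is "" (the start of the table).
    private final String[] startKeys;

    public RegionLookup(String[] sortedStartKeys) {
        this.startKeys = sortedStartKeys.clone();
    }

    /** Returns the index of the region whose key range contains rowKey. */
    public int regionFor(String rowKey) {
        int pos = Arrays.binarySearch(startKeys, rowKey);
        // Exact match: rowKey is itself a region start key. Otherwise
        // binarySearch returns -(insertionPoint) - 1, and the owning region
        // is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        RegionLookup lookup = new RegionLookup(new String[] {"", "g", "n", "t"});
        for (String key : new String[] {"apple", "g", "melon", "zebra"}) {
            System.out.println(key + " -> region " + lookup.regionFor(key));
        }
    }
}
```

With one reduce task per region, each reducer would then write to a single region; a per-server variant would group several region indexes under one partition number.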

Billy

----- Original Message ----- 
From: "Ryan Rawson" <ry...@public.gmane.org>
Newsgroups: gmane.comp.java.hadoop.hbase.user
To: <hb...@public.gmane.org>
Sent: Thursday, April 02, 2009 5:53 PM
Subject: Re: Bulk import - does sort order of input data affect success rate?


> hey,
>
> sorted = slower, randomized = faster.
>
> this is because if it is sorted in natural key order, you tend to hotspot in
> 1 or 2 regions.
>
> I don't use TableOutputFormat; I use direct commits from the map, with no
> reduce. That seems to be the most performant solution.
>
> have fun!
>
>
> On Thu, Apr 2, 2009 at 1:36 PM, Stuart White <st...@public.gmane.org> wrote:
>
>> On Thu, Apr 2, 2009 at 3:30 PM, Ryan Rawson <ry...@public.gmane.org> wrote:
>> > The last thing - success should not be a function of sort order.
>> >
>> > However, speed will be related.
>>
>> How?  Sorted = faster, or Sorted = slower?
>>
>> >
>> > One thing I found I had to do was:
>> >    private void doCommit(HTable t, BatchUpdate update) throws IOException {
>> >      boolean commited = false;
>> >      while (!commited) {
>> >        try {
>> >          t.commit(update);
>> >          commited = true;
>> >        } catch (RetriesExhaustedException e) {
>> >          // DAMN, ignore
>> >        }
>> >      }
>> >    }
>> >
>>
>> I'm running a mapred job, using TableOutputFormat to write the results
>> to HBase.  For the code you've provided, was that for a custom output
>> format?  Or a standalone (non-mapred) application?  I see the point
>> you're making, I just don't understand where I'd put that code.
>> Thanks!
>>
> 



Re: Bulk import - does sort order of input data affect success rate?

Posted by Ryan Rawson <ry...@gmail.com>.
hey,

sorted = slower, randomized = faster.

this is because if it is sorted in natural key order, you tend to hotspot in
1 or 2 regions.
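
To make the hotspotting argument concrete, here is a small simulation. It is only a model: regions are assumed to be equal-width ranges of integer row keys, and the class HotspotDemo and the method avgRegionsPerWindow are hypothetical names, not part of HBase. It counts how many distinct regions receive writes within each window of consecutive rows, once for sorted input and once for shuffled input.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Hypothetical model: regions are equal-width ranges of integer row keys. */
public class HotspotDemo {
    /** Average number of distinct regions written to per window of consecutive rows. */
    public static double avgRegionsPerWindow(List<Integer> keys, int keysPerRegion, int window) {
        int windows = 0;
        long totalDistinct = 0;
        for (int start = 0; start + window <= keys.size(); start += window) {
            Set<Integer> regions = new HashSet<>();
            for (int key : keys.subList(start, start + window)) {
                regions.add(key / keysPerRegion); // region that owns this key
            }
            totalDistinct += regions.size();
            windows++;
        }
        return (double) totalDistinct / windows;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            keys.add(i); // sorted input: 10,000 keys spread over 10 regions
        }
        double sorted = avgRegionsPerWindow(keys, 1_000, 100);
        Collections.shuffle(keys, new Random(42)); // randomized input
        double shuffled = avgRegionsPerWindow(keys, 1_000, 100);
        System.out.println("sorted:   " + sorted + " regions per 100 writes");
        System.out.println("shuffled: " + shuffled + " regions per 100 writes");
    }
}
```

In this model the sorted run sends every window of 100 writes to a single region, while the shuffled run spreads each window across nearly all ten, which is the load-spreading effect described above.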

I don't use TableOutputFormat; I use direct commits from the map, with no
reduce. That seems to be the most performant solution.

have fun!



Re: Bulk import - does sort order of input data affect success rate?

Posted by Stuart White <st...@gmail.com>.
On Thu, Apr 2, 2009 at 3:30 PM, Ryan Rawson <ry...@gmail.com> wrote:
> The last thing - success should not be a function of sort order.
>
> However, speed will be related.

How?  Sorted = faster, or Sorted = slower?

>
> One thing I found I had to do was:
>    private void doCommit(HTable t, BatchUpdate update) throws IOException {
>      boolean commited = false;
>      while (!commited) {
>        try {
>          t.commit(update);
>          commited = true;
>        } catch (RetriesExhaustedException e) {
>          // DAMN, ignore
>        }
>      }
>    }
>

I'm running a mapred job, using TableOutputFormat to write the results
to HBase.  For the code you've provided, was that for a custom output
format?  Or a standalone (non-mapred) application?  I see the point
you're making, I just don't understand where I'd put that code.
Thanks!

Re: Bulk import - does sort order of input data affect success rate?

Posted by Ryan Rawson <ry...@gmail.com>.
The last thing - success should not be a function of sort order.

However, speed will be related.

One thing I found I had to do was:
    private void doCommit(HTable t, BatchUpdate update) throws IOException {
      boolean committed = false;
      // Keep retrying until the commit goes through.
      while (!committed) {
        try {
          t.commit(update);
          committed = true;
        } catch (RetriesExhaustedException e) {
          // The client ran out of retries; ignore and try the commit again.
        }
      }
    }
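
The loop above retries immediately and without limit. One possible variation, sketched here under stated assumptions, caps the number of attempts and pauses between them. The HBase client classes are not reproduced, so the hypothetical Commit interface stands in for t.commit(update) and a plain IOException stands in for RetriesExhaustedException; BoundedRetry, commitWithRetry, and the pause values are illustrative names and numbers only.

```java
import java.io.IOException;

/** Hypothetical sketch of a bounded retry; Commit stands in for t.commit(update). */
public class BoundedRetry {
    /** Stand-in for the HBase call being retried. */
    public interface Commit {
        void run() throws IOException;
    }

    /**
     * Tries the commit up to maxAttempts times, doubling the pause after each
     * failure. Returns the number of attempts it took to succeed.
     */
    public static int commitWithRetry(Commit commit, int maxAttempts, long initialPauseMs)
            throws IOException, InterruptedException {
        long pauseMs = initialPauseMs;
        for (int attempt = 1; ; attempt++) {
            try {
                commit.run();
                return attempt;
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last failure
                }
                Thread.sleep(pauseMs);
                pauseMs *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated commit that fails twice before succeeding.
        int attempts = commitWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("region not available");
            }
        }, 5, 10);
        System.out.println("succeeded after " + attempts + " attempts");
    }
}
```

Whether a cap is appropriate depends on the job: an unbounded loop never drops a row but can hang forever, while a bounded one fails the task so the framework can reschedule it.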

good luck!
-ryan
