Posted to user@hbase.apache.org by Jim Twensky <ji...@gmail.com> on 2009/04/29 23:56:53 UTC

Re: Performance of hbase importing

Hi Ryan,

Have you got your new hardware? I was keeping an eye on your blog for the
past few days but I haven't seen any updates there, so I just decided to
ask you on the list. If you have some results, would you like to give us
some numbers along with hardware details?

Thanks,
Jim

On Thu, Jan 15, 2009 at 2:28 PM, Larry Compton
<la...@gmail.com> wrote:

> That explains it. Thanks!
>
> > On Thu, Jan 15, 2009 at 2:11 PM, Jean-Daniel Cryans
> > <jdcryans@apache.org> wrote:
>
> > Larry,
> >
> > This feature was added in 0.19.0, for which a release candidate is on
> > the way.
> >
> > J-D
> >
> > On Thu, Jan 15, 2009 at 2:03 PM, Larry Compton
> > <la...@gmail.com> wrote:
> >
> > > I'm interested in trying this, but I'm not seeing "setAutoFlush()" and
> > > "setWriteBufferSize()" in the "HTable" API (I'm using HBase 0.18.1).
> > >
> > > Larry
> > >
> > > On Sun, Jan 11, 2009 at 5:11 PM, Ryan Rawson <ry...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > New user of hbase here. I've been trolling about in IRC for a few
> > > > days, and been getting great help all around so far.
> > > >
> > > > The topic turns to importing data into hbase - I have largeish
> > > > datasets I want to evaluate hbase performance on, so I've been
> > > > working at importing said data.  I've managed to get some impressive
> > > > performance speedups, and I chronicled them here:
> > > >
> > > > http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html
> > > >
> > > > To summarize:
> > > > - Use the native HBase API in Java or Jython (or presumably any JVM
> > > > language)
> > > > - Disable table auto-flush and set a large write buffer (12 MB for me)
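> > > >
> > > > A rough sketch with the 0.19-era client API (the table and column
> > > > names here are just placeholders, not from my actual import):
> > > >
> > > >   import org.apache.hadoop.hbase.HBaseConfiguration;
> > > >   import org.apache.hadoop.hbase.client.HTable;
> > > >   import org.apache.hadoop.hbase.io.BatchUpdate;
> > > >   import org.apache.hadoop.hbase.util.Bytes;
> > > >
> > > >   HBaseConfiguration conf = new HBaseConfiguration();
> > > >   HTable table = new HTable(conf, "mytable");
> > > >   table.setAutoFlush(false);                  // buffer writes client-side
> > > >   table.setWriteBufferSize(12 * 1024 * 1024); // 12 MB write buffer
> > > >
> > > >   // each commit just queues the row; a batch only goes over the
> > > >   // wire when the buffer fills up
> > > >   BatchUpdate update = new BatchUpdate("row-key");
> > > >   update.put("data:col", Bytes.toBytes("some value"));
> > > >   table.commit(update);
> > > >
> > > >   table.flushCommits(); // push whatever is still buffered at the end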
> > > >
> > > > At this point I can import an 18 GB, 440m-row comma-separated flat
> > > > file in about 72 minutes using map-reduce - roughly 100k rows/sec.
> > > > This is on a 3-node cluster all running hdfs, hbase, and mapred with
> > > > 12 map tasks (4 per node).  This hardware is loaner DB hardware, so
> > > > once I get my real cluster I'll revise/publish new data.
> > > >
> > > > I look forward to meeting some of you next week at the hbase meetup
> > > > at powerset!
> > > >
> > > > -ryan
> > > >
> > >
> >
>
>
>
> --
> Larry Compton
> SRA International
> 240.373.5312 (APL)
> 443.742.2762 (cell)
>

Re: Performance of hbase importing

Posted by Ryan Rawson <ry...@gmail.com>.
Hey,

I wrote a reply to a different thread which encapsulates most of my recent
learning and understanding of how GC and the JVM impact large-scale data
import.

At this point, I have a 19-machine cluster with 30 TB of aggregate storage
on RAID 0 (2 disks/box).  I've devoted it to hbase 0.20 testing, and I've
been able to load a massive set of (real) data in.  Unlike previous data
sets, this one is both (a) huge and (b) made up of tiny rows.

One thing I am finding is that I end up with weird bottlenecks:
- The clients don't always seem to be able to push writes at maximal speed
- GC pauses are death (see the settings sketch below)
- The compaction thread limit might be holding things up, but I'm not sure
about this one yet
- In-memory complexity and size are stressing the JVM significantly
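
For reference, these are the kinds of GC settings we've been experimenting
with in hbase-env.sh (the heap size and occupancy threshold are just what
we're currently trying, not recommendations):

  export HBASE_OPTS="-Xmx4g -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"

A lower initiating-occupancy fraction makes CMS start collecting earlier,
trading background CPU for fewer stop-the-world pauses.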

The bottom line is that we are fighting against the JVM now - both with GC
problems and with general efficiency.  For example, a typical regionserver
can carry a memcache load of 1000-1500 MB.  That is a lot of outstanding
writes.

As for numbers, I generally want to see the following import performance
to be happy:
- 100-130k ops/sec across the 19 nodes
- 125-200 MB/sec of network traffic across all nodes
- 76 map tasks reading from mysql -> hbase
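
(Per node, that works out to roughly 5-7k ops/sec and 6-10 MB/sec of
network traffic.)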

This is currently sustainable with 3k regions for prolonged periods of
time.  I have an import that has run for 12 hours at these speeds.

Speed problems start to manifest themselves as dips in the network
performance graph.  The bigger dips (back when I was having maximal GC
pause problems) would bounce performance between 0 and 175 MB/sec.
Smaller ones could be due to I/O wait or other inefficiencies.

It's all about the GC pause!

-ryan

On Wed, Apr 29, 2009 at 2:56 PM, Jim Twensky <ji...@gmail.com> wrote:

> Hi Ryan,
>
> Have you got your new hardware? I was keeping an eye on your blog for the
> past few days but I haven't seen any updates there, so I just decided to
> ask you on the list. If you have some results, would you like to give us
> some numbers along with hardware details?
>
> Thanks,
> Jim