Posted to user@hbase.apache.org by Sean Bigdatafun <se...@gmail.com> on 2011/01/03 02:11:44 UTC

Re: Read/Write Performance

Has this cured the GC pauses at all? I do not see why turning on LZO would be
relevant (from your email, it sounds like you only saw the pauses after LZO
was turned on).

BTW, are you running CMS on an 8GB-heap JVM and seeing a 4-minute pause? That
sounds like a lot.
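
For what it is worth, here is a minimal sketch (in Python, since that is your
client language; the log path is made up and it assumes GC logging is on with
-verbose:gc -XX:+PrintGCDetails) of what to watch for in the regionserver GC
logs. "concurrent mode failure" and "promotion failed" mean CMS fell back to a
stop-the-world full collection, while a long CMS-concurrent-mark runs alongside
the application and should not by itself freeze the cluster:

import re
import sys

# Hypothetical default path; pass the real regionserver GC log as an argument.
GC_LOG = sys.argv[1] if len(sys.argv) > 1 else "gc-regionserver.log"

# JDK6 CMS lines look like: [CMS-concurrent-mark: 4.471/4.503 secs]
# (CPU seconds / wall-clock seconds).
MARK_RE = re.compile(r"CMS-concurrent-mark: ([\d.]+)/([\d.]+) secs")

def scan(path, mark_warn_secs=30.0):
    with open(path) as log:
        for lineno, line in enumerate(log, 1):
            if "concurrent mode failure" in line or "promotion failed" in line:
                # CMS lost the race: the JVM fell back to a stop-the-world
                # full collection, which is where multi-minute pauses come from.
                print("line %d: STW fallback: %s" % (lineno, line.strip()))
            match = MARK_RE.search(line)
            if match and float(match.group(2)) > mark_warn_secs:
                # Long concurrent marks run alongside the application, but they
                # do signal that the old generation is under pressure.
                print("line %d: long CMS-concurrent-mark (%s secs wall clock)"
                      % (lineno, match.group(2)))

if __name__ == "__main__":
    scan(GC_LOG)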

On Thu, Dec 30, 2010 at 1:51 PM, Wayne <wa...@gmail.com> wrote:

> Lesson learned...restart thrift servers *after* restarting hadoop+hbase.
>
> On Thu, Dec 30, 2010 at 3:39 PM, Wayne <wa...@gmail.com> wrote:
>
> > We have restarted with lzop compression, and now I am seeing some really
> > long and frequent stop-the-world pauses of the entire cluster. The load
> > requests for all regions go to zero except for the meta table region. No
> > data batches are getting in (no loads are occurring) and everything seems
> > frozen. It seems to last for 5+ seconds. Is this GC on the master or GC in
> > the meta region? What could cause everything to stop for several seconds? It
> > appears to happen on a recurring basis as well. I think we saw it before
> > switching to lzo but it seems much worse now (lasts longer and occurs more
> > frequently).
> >
> > Thanks.
> >
> >
> >
> > On Thu, Dec 30, 2010 at 12:20 PM, Wayne <wa...@gmail.com> wrote:
> >
> >> HBase Version 0.89.20100924, r1001068 w/ 8GB heap
> >>
> >> I plan to run for 1 week straight maxed out. I am worried about GC pauses,
> >> especially concurrent mode failures (does hbase/hadoop suffer these under
> >> extended load?). What should I be looking for in the gc log in terms of
> >> problem signs? The ParNews are quick but the CMS concurrent marks are taking
> >> as much as 4 mins with an average of 20-30 secs.
> >>
> >> Thanks.
> >>
> >>
> >>
> >> On Thu, Dec 30, 2010 at 12:00 PM, Stack <st...@duboce.net> wrote:
> >>
> >>> Oh, what versions are you using?
> >>> St.Ack
> >>>
> >>> On Thu, Dec 30, 2010 at 9:00 AM, Stack <st...@duboce.net> wrote:
> >>> > Keep going. Let it run longer.  Get the servers as loaded as you
> think
> >>> > they'll be in production.  Make sure the perf numbers are not because
> >>> > cluster is 'fresh'.
> >>> > St.Ack
> >>> >
> >>> > On Thu, Dec 30, 2010 at 5:51 AM, Wayne <wa...@gmail.com> wrote:
> >>> >> We finally got our cluster up and running and write performance looks
> >>> >> very good. We are getting sustained 8-10k writes/sec/node on a 10 node
> >>> >> cluster from Python through thrift. These are values written to 3 CFs so
> >>> >> actual hbase performance is 25-30k writes/sec/node. The nodes are
> >>> >> currently disk i/o bound (40-50% utilization) but hopefully once we get
> >>> >> lzop working this will go down. We have been running for 12 hours without
> >>> >> a problem. We hope to get lzop going today and then load all through the
> >>> >> long weekend.
> >>> >>
> >>> >> We plan to then test reads next week after we get some data in there.
> >>> >> Looks good so far! Below are our settings in case there are some
> >>> >> suggestions/concerns.
> >>> >>
> >>> >> Thanks for everyone's help. It is pretty exciting to get performance
> >>> >> like this from the start.
> >>> >>
> >>> >>
> >>> >> *Global*
> >>> >>
> >>> >> client.write.buffer = 10485760 (10MB = 5x default)
> >>> >>
> >>> >> optionalLogFlushInterval = 10000 (10 secs = 10x default)
> >>> >>
> >>> >> memstore.flush.size = 268435456 (256MB = 4x default)
> >>> >>
> >>> >> hregion.max.filesize = 1073741824 (1GB = 4x default)
> >>> >>
> >>> >> *Table*
> >>> >>
> >>> >> alter 'xxx', METHOD => 'table_att', DEFERRED_LOG_FLUSH => true
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Wed, Dec 29, 2010 at 12:55 AM, Stack <st...@duboce.net> wrote:
> >>> >>
> >>> >>> On Mon, Dec 27, 2010 at 11:47 AM, Wayne <wa...@gmail.com> wrote:
> >>> >>> > All data is written to 3 CFs. Basically 2 of the CFs are
> secondary
> >>> >>> indexes
> >>> >>> > (manually managed as normal CFs). It sounds like we should try
> hard
> >>> to
> >>> >>> get
> >>> >>> > as much out of thrift as we can before going to a lower level.
> >>> >>>
> >>> >>> Yes.
> >>> >>>
> >>> >>> > Writes need
> >>> >>> > to be "fast enough", but reads are more important in the end (and
> >>> are the
> >>> >>> > reason we are switching from a different solution). The numbers
> you
> >>> >>> quoted
> >>> >>> > below sound like they are in the ballpark of what we are looking
> to
> >>> do.
> >>> >>> >
> >>> >>>
> >>> >>> Even the tens per second that I threw in there to CMA?
> >>> >>>
> >>> >>> > Much of our data is cold, and we expect reads to be disk i/o
> based.
> >>> >>>
> >>> >>> OK.  FYI, we're not the best at this -- cache-miss cold reads --
> what
> >>> >>> w/ a network hop in the way and currently we'll open a socket per
> >>> >>> access.
> >>> >>>
> >>> >>> > Given
> >>> >>> > this is 8GB heap a good place to start on the data nodes (24GB
> >>> ram)? Is
> >>> >>> the
> >>> >>> > block cache managed on its own (being it won't blow up causing
> >>> OOM),
> >>> >>>
> >>> >>> It won't.  It's constrained.  It does our home-brewed sizeof.  By
> >>> >>> default, it's 0.2 of total heap.  If you think cache will help, you
> >>> >>> could go up from there: 0.4 or 0.5 of heap.
> >>> >>>
> >>> >>> > and if
> >>> >>> > we do not use it (block cache) should we go even lower for the
> heap
> >>> (we
> >>> >>> want
> >>> >>> > to avoid CMF and long GC pauses)?
> >>> >>>
> >>> >>> If you are going to be doing cache-miss most of the time and cold
> >>> >>> reads, then yes, you can do away with cache.
> >>> >>>
> >>> >>> In testing of 0.90.x I've been running w/ 1MB heaps with 1k regions
> >>> >>> but this is my trying to break stuff.
> >>> >>>
> >>> >>> > Are there any timeouts we need to tweak to
> >>> >>> > make the cluster more "accepting" of long GC pauses while under
> >>> sustained
> >>> >>> > load (7+ days of 10k/inserts/sec/node)?
> >>> >>> >
> >>> >>>
> >>> >>> If the zookeeper client times out, the regionserver will shut itself
> >>> >>> down.  In 0.90.0RC2, the client session timeout is set high -- 3
> >>> >>> minutes.  If you time that out, then that's pretty extreme...
> >>> >>> something badly wrong I'd say.  Here's a few notes on the config and
> >>> >>> others that you might want to twiddle (see previous section on
> >>> >>> required configs... make sure you've got those too):
> >>> >>>
> >>> >>>
> >>> >>> http://people.apache.org/~stack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations
> >>> >>>
> >>> >>>
> >>> >>> > Does LZO compression speed up reads/writes where there is excess
> >>> CPU to
> >>> >>> do
> >>> >>> > the compression? I assume it would lower disk i/o but increase
> CPU
> >>> a lot.
> >>> >>> Is
> >>> >>> > data compressed on the initial write or only after compaction?
> >>> >>> >
> >>> >>>
> >>> >>> LZO is pretty frictionless -- i.e. little CPU cost -- and yes,
> >>> usually
> >>> >>> helps speed things up (grab more in the one go).  What size are
> your
> >>> >>> records?  You might want to mess w/ hfile block sizes though the
> 64k
> >>> >>> default is usually good enough for all but very small cell sizes.
> >>> >>>
> >>> >>>
> >>> >>> > With the replication in the HDFS layer how are reads managed in
> >>> terms of
> >>> >>> > load balancing across region servers? Does HDFS know to spread
> >>> multiple
> >>> >>> > requests across the 3 region servers that contain the same data?
> >>> >>>
> >>> >>> You only read from one of the replicas, always the 'closest'.  If
> the
> >>> >>> DFSClient has trouble getting the first of the replicas, it moves
> on
> >>> >>> to the second, etc.
> >>> >>>
> >>> >>>
> >>> >>> > For example
> >>> >>> > with 10 data nodes if we have 50 concurrent readers with very
> >>> "random"
> >>> >>> key
> >>> >>> > requests we would expect to have 5 reads occurring on each data
> >>> node at
> >>> >>> the
> >>> >>> > same time. We plan to have a thrift server on each data node, so
> 5
> >>> >>> > concurrent readers will be connected to each thrift server at any
> >>> given
> >>> >>> time
> >>> >>> > (50 in aggregate across 10 nodes). We want to be sure everything
> is
> >>> >>> designed
> >>> >>> > to evenly spread this load to avoid any possible hot-spots.
> >>> >>> >
> >>> >>>
> >>> >>> This is different.  This is key design.  A thrift server will be
> >>> doing
> >>> >>> some subset of the key space.  If the requests are evenly
> distributed
> >>> >>> over all of the key space, then you should be fine; all thrift
> >>> servers
> >>> >>> will be evenly loaded.  If not, then there could be hot spots.
> >>> >>>
> >>> >>> We have a balancer that currently only counts regions per server,
> not
> >>> >>> regions per server plus hits per region so it could be the case
> that
> >>> a
> >>> >>> server by chance ends up carrying all of the hot regions.  HBase
> >>> >>> itself is not too smart dealing with this.  In 0.90.0, there is
> >>> >>> facility for manually moving regions -- i.e. closing in current
> >>> >>> location and moving the region off to another server w/ some outage
> >>> >>> while the move is happening (usually seconds) -- or you could split
> >>> >>> the hot region manually and then the daughters could be moved off
> to
> >>> >>> other servers... Primitive for now but should be better in next
> HBase
> >>> >>> versions.
> >>> >>>
> >>> >>> Have you been able to test w/ your data and your query pattern?
> >>> >>> That'll tell you way more than I ever could.
> >>> >>>
> >>> >>> Good luck,
> >>> >>> St.Ack
> >>> >>>
> >>> >>>
> >>> >>> >
> >>> >>> >
> >>> >>> > On Mon, Dec 27, 2010 at 1:49 PM, Stack <st...@duboce.net> wrote:
> >>> >>> >
> >>> >>> >> On Fri, Dec 24, 2010 at 5:09 AM, Wayne <wa...@gmail.com>
> wrote:
> >>> >>> >> > We are in the process of evaluating hbase in an effort to
> switch
> >>> from
> >>> >>> a
> >>> >>> >> > different nosql solution. Performance is of course an
> important
> >>> part
> >>> >>> of
> >>> >>> >> our
> >>> >>> >> > evaluation. We are a python shop and we are very worried that
> we
> >>> can
> >>> >>> not
> >>> >>> >> get
> >>> >>> >> > any real performance out of hbase using thrift (and must drop
> >>> down to
> >>> >>> >> java).
> >>> >>> >> > We are aware of the various lower level options for bulk
> insert
> >>> or
> >>> >>> java
> >>> >>> >> > based inserts with turning off WAL etc. but none of these are
> >>> >>> available
> >>> >>> >> to
> >>> >>> >> > us in python so are not part of our evaluation.
> >>> >>> >>
> >>> >>> >> I can understand python for continuous updates from your
> frontend
> >>> or
> >>> >>> >> whatever but you might consider hacking up a bit of java to make
> >>> >>> >> use of
> >>> >>> >> the bulk updater; you'll get upload rates orders of magnitude
> >>> beyond
> >>> >>> >> what you'd achieve going via the API via python (or java for
> that
> >>> >>> >> matter).  You can also do incremental updates using the bulk
> >>> loader.
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> We have a 10 node cluster
> >>> >>> >> > (24gb, 6 x 1TB, 16 core) that we are setting up as data/region
> >>> nodes, and
> >>> >>> we
> >>> >>> >> are
> >>> >>> >> > looking for suggestions on configuration as well as benchmarks
> >>> in
> >>> >>> terms
> >>> >>> >> of
> >>> >>> >> > expectations of performance. Below are some specific
> questions.
> >>> I
> >>> >>> realize
> >>> >>> >> > there are a million factors that help determine specific
> >>> performance
> >>> >>> >> > numbers, so any examples of performance from running clusters
> >>> would be
> >>> >>> >> great
> >>> >>> >> > as examples of what can be done.
> >>> >>> >>
> >>> >>> >> Yeah, you have been around the block obviously.  It's hard to give
> >>> >>> >> out 'numbers' since so many different factors are involved.
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> Again thrift seems to be our "problem" so
> >>> >>> >> > non java based solutions are preferred (do any non java based
> >>> shops
> >>> >>> run
> >>> >>> >> > large scale hbase clusters?). Our total production cluster
> size
> >>> is
> >>> >>> >> estimated
> >>> >>> >> > to be 50TB.
> >>> >>> >> >
> >>> >>> >>
> >>> >>> >> There are some substantial shops running non-java; e.g. the
> yfrog
> >>> >>> >> folks go via REST, the mozilla fellas are python over thrift,
> >>> >>> >> Stumbleupon is php over thrift.
> >>> >>> >>
> >>> >>> >> > Our data model is 3 CFs, one primary and 2 secondary indexes.
> >>> All
> >>> >>> writes
> >>> >>> >> go
> >>> >>> >> > to all 3 CFs and are grouped as a batch of row mutations which
> >>> should
> >>> >>> >> avoid
> >>> >>> >> > row locking issues.
> >>> >>> >> >
> >>> >>> >>
> >>> >>> >> A write updates 3CFs and secondary indices?  That's a relatively
> >>> >>> >> expensive Put.  You have to run w/ 3CFs?  It facilitates fast
> >>> >>> >> querying?
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> > What heap size is recommended for master, and for region
> servers
> >>> (24gb
> >>> >>> >> ram)?
> >>> >>> >>
> >>> >>> >> Master doesn't take much heap, at least not in the coming 0.90.0
> >>> HBase
> >>> >>> >> (Is that what you intend to run)?
> >>> >>> >>
> >>> >>> >> The more RAM you give the regionservers, the more cache your
> >>> cluster
> >>> >>> will
> >>> >>> >> have.
> >>> >>> >>
> >>> >>> >> What's more important to you, read or write times?
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> > What other settings can/should be tweaked in hbase to optimize
> >>> >>> >> performance
> >>> >>> >> > (we have looked at the wiki page)?
> >>> >>> >>
> >>> >>> >> That's a good place to start.  Take a look through this mailing
> >>> >>> >> list for others (it's time for a trawl of the mailing list and then
> >>> >>> >> a distilling of the findings into a re-edit of our perf page).
> >>> >>> >>
> >>> >>> >> > What is a good batch size for writes? We will start with 10k
> >>> >>> >> values/batch.
> >>> >>> >>
> >>> >>> >> Start small with defaults.  Make sure it's all running smoothly
> >>> >>> >> first.  Then ratchet it up.
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> > How many concurrent writers/readers can a single data node
> >>> handle with
> >>> >>> >> > evenly distributed load? Are there settings specific to this?
> >>> >>> >>
> >>> >>> >> How many clients are you going to have writing to HBase?
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> > What is "very good" read/write latency for a single put/get in
> >>> hbase
> >>> >>> >> using
> >>> >>> >> > thrift?
> >>> >>> >>
> >>> >>> >> "Very Good" would be < a few milliseconds.
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> > What is "very good" read/write throughput per node in hbase
> >>> using
> >>> >>> thrift?
> >>> >>> >> >
> >>> >>> >>
> >>> >>> >> Thousands of ops per second per regionserver (Sorry, can't be
> more
> >>> >>> >> specific than that).  If the Puts are multi-family + updates on
> >>> >>> >> secondary indices, hundreds -- maybe even tens... I'm not sure
> --
> >>> >>> >> rather than thousands.
> >>> >>> >>
> >>> >>> >> > We are looking to get performance numbers in the range of 10k
> >>> >>> aggregate
> >>> >>> >> > inserts/sec/node and read latency < 30ms/read with 3-4
> >>> concurrent
> >>> >>> >> > readers/node. Can our expectations be met with hbase through
> >>> thrift?
> >>> >>> Can
> >>> >>> >> > they be met with hbase through java?
> >>> >>> >> >
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> I wouldn't fixate on the thrift hop.  At SU we can do thousands of
> >>> >>> >> ops a second per node, no problem, from the PHP frontend over thrift.
> >>> >>> >>
> >>> >>> >> 10k inserts a second per node into a single CF might be doable.  If
> >>> >>> >> into 3CFs, then you need to recalibrate your expectations (I'd say).
> >>> >>> >>
> >>> >>> >> > Thanks in advance for any help, examples, or recommendations
> >>> that you
> >>> >>> can
> >>> >>> >> > provide!
> >>> >>> >> >
> >>> >>> >> Sorry, the above is light on recommendations (for reasons cited
> by
> >>> >>> >> Ryan above -- smile).
> >>> >>> >> St.Ack
> >>> >>> >>
> >>> >>> >
> >>> >>>
> >>> >>
> >>> >
> >>>
> >>
> >>
> >
>



-- 
--Sean

Re: Read/Write Performance

Posted by Stack <st...@duboce.net>.
On Sun, Jan 2, 2011 at 6:04 PM, Wayne <wa...@gmail.com> wrote:
> LZO did not seem to work well with the 1GB region size. It was causing
> several-minute pauses followed by 5 seconds of requests being processed and
> then again 30+ second pauses (is this GC, compaction, or splits?...all
> regions go to 0 requests except the META region, which has a few hundred
> requests).

What does the regionserver log say at this time?  There are a few
reasons we block incoming writes: we've not been compacting fast
enough, or memory is full.  You can mess with configs to take on more
before the barriers go up (IIRC, reading your configs, you'd upped the
memstore size to match the bigger 1GB regions?  If the logs say
blocking is because of too many storefiles, you could up the upper
count... hbase.hstore.blockingStoreFiles).

> Once we went back to the default region size the pauses seemed to
> go away. I still see all nodes going to zero requests except the META table
> region, but it only lasts for a few seconds at most. We also had the max
> open files problem (which has now been fixed), so I don't know whether that
> could have contributed.
>

Max open files?  Ulimit?

St.Ack

Re: Read/Write Performance

Posted by Wayne <wa...@gmail.com>.
LZO did not seem to work well with the 1GB region size. It was causing
several-minute pauses followed by 5 seconds of requests being processed and
then again 30+ second pauses (is this GC, compaction, or splits?...all
regions go to 0 requests except the META region, which has a few hundred
requests). Once we went back to the default region size the pauses seemed to
go away. I still see all nodes going to zero requests except the META table
region, but it only lasts for a few seconds at most. We also had the max
open files problem (which has now been fixed), so I don't know whether that
could have contributed.
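
In case it helps anyone else who hits this, here is a tiny sanity check (a
sketch using only the Python standard library; the 32768 threshold is just the
commonly suggested ballpark, not an official number) to confirm the raised
open-files limit actually applies to the user the datanode/regionserver
processes run as:

import resource
import socket

# Soft/hard limits on open file descriptors for the current user/process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("%s: nofile soft=%s hard=%s" % (socket.gethostname(), soft, hard))

# With thousands of regions/storefiles, HBase + HDFS easily blow past the
# usual 1024 default; 32768 is a commonly suggested ballpark.
if soft != resource.RLIM_INFINITY and soft < 32768:
    print("WARNING: open-files limit looks low for a regionserver host")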

I have been loading for several days, and since fixing the max open files
problem everything has been smooth. We are up to 2+ TB compressed and almost
5,000 regions on 10 nodes. Lzop sure helps pack a lot of data into a small
space. The writes seem to hover in the 7.5-8k writes/sec/node range. This
week we will test the reads. So far so good.
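
For anyone curious what the client side looks like, here is a stripped-down
sketch of roughly what we do (it assumes the HBase Thrift1 gateway on
localhost:9090 and the Python bindings generated from Hbase.thrift; the table,
CF, and column names are placeholders, not our real schema):

import time

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase
from hbase.ttypes import Mutation, BatchMutation

# One Thrift gateway per node in this setup; each client talks to its local one.
transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hbase.Client(protocol)
transport.open()

TABLE = "xxx"  # placeholder, as in the alter command earlier in the thread

def write_batch(rows):
    """rows: iterable of (row_key, value) pairs; each goes to all 3 CFs."""
    batch = []
    for key, value in rows:
        mutations = [Mutation(column="data:v", value=value),
                     Mutation(column="idx1:v", value=value),  # the two manually
                     Mutation(column="idx2:v", value=value)]  # managed index CFs
        batch.append(BatchMutation(row=key, mutations=mutations))
    client.mutateRows(TABLE, batch)

def timed_get(row_key):
    start = time.time()
    rows = client.getRow(TABLE, row_key)
    print("getRow(%r): %.1f ms" % (row_key, (time.time() - start) * 1000.0))
    return rows

write_batch([("row-%08d" % i, "value-%d" % i) for i in range(10000)])
timed_get("row-00000042")
transport.close()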

Thanks.


On Sun, Jan 2, 2011 at 8:11 PM, Sean Bigdatafun
<se...@gmail.com> wrote:

> Has this cured the GC pauses at all? I do not see why turning on LZO would be
> relevant (from your email, it sounds like you only saw the pauses after LZO
> was turned on).
>
> BTW, are you running CMS on an 8GB-heap JVM and seeing a 4-minute pause? That
> sounds like a lot.