Posted to user@hbase.apache.org by Wayne <wa...@gmail.com> on 2011/01/12 17:27:23 UTC

Recommended Node Size Limits

What is the recommended max node size in terms of data (using LZO
compression), region counts, and region sizes? I know there is no hard limit,
and that in theory it should depend only on the required load in terms of
concurrent reads/writes that the hardware can handle, but I do not think that
is true.
Based on real experience how many regions per node is too many? What max
region size is too big? Do production clusters have 3TB nodes with thousands
of very large regions? Where are the hidden cliffs that we should avoid?

Thanks.

Re: Recommended Node Size Limits

Posted by Edward Capriolo <ed...@gmail.com>.
On Sat, Jan 15, 2011 at 8:45 AM, Wayne <wa...@gmail.com> wrote:
> Not everyone is looking for a distributed memcache. Many of us are looking
> for a database that scales up and out, and for that there is only one
> choice. HBase does auto partitioning with regions; this is the genius of the
> original bigtable design. Regions are logical units small enough to be fast
> to copy around/replicate, compact, and access with random disk I/O.
> Cassandra has NOTHING like this. Cassandra partitions great to the node, but
> then the node is one logical unit with some very very very big CF files.
> What DBA can sleep at night with their data in 3x 350GB files? Our
> definition of big data is big data ON the node and big data across the
> nodes. Cassandra cannot handle large nodes; a 30-hour compaction is real on a
> 1TB node and is a serious problem at scale. HBase compresses with LZO, so
> your data size is already much smaller; you can up the region size and
> handle up to 100 "active" regions and a heck of a lot more inactive regions
> (a perfect fit for time series data), and they are all compacted and accessed
> individually. There is no common Java limitation here...
>
> I spent 6 months with Cassandra to get it to work perfectly, and then
> totally abandoned it due to fundamental design flaws like this, along with
> having to ask ALL copies of the data for a consistent read, which causes 3x
> the disk I/O on reads. Cassandra is not a good choice if
> consistency is important or if scaling UP nodes is important. For those of
> us looking to scale out/up a relational database (and we can not afford to
> put 50TB in RAM) HBase is the only choice in the nosql space. Cassandra is a
> great piece of software and the people behind it are top notch. Jonathan
> Ellis has personally helped me with many problems and helped to get our
> cluster to work flawlessly. Cassandra has many great uses and is a great
> choice if you can keep most data in cache, but it is no relational database
> replacement. I was bought into the hype and I hope others will be able to
> get real assessments of differences to make the best decision for their
> needs.
>
>
> On Fri, Jan 14, 2011 at 10:50 PM, Edward Capriolo <ed...@gmail.com>wrote:
>
>> On Fri, Jan 14, 2011 at 2:02 PM, Jonathan Gray <jg...@fb.com> wrote:
>> > How about Integer.MAX_VALUE (or I believe 0 works) to completely disable
>> splits?
>> >
>> > As far as what we are running with today, we do have clusters with regions
>> over 10GB and growing.  There has been a lot of work in the compaction logic
>> to make these large regions more efficient with IO (by not compacting
>> big/old files and such).
>> >
>> > JG
>> >
>> >> -----Original Message-----
>> >> From: Ted Dunning [mailto:tdunning@maprtech.com]
>> >> Sent: Friday, January 14, 2011 10:12 AM
>> >> To: user@hbase.apache.org
>> >> Subject: Re: Recommended Node Size Limits
>> >>
>> >> Way up = ??
>> >>
>> >> 1GB?
>> >>
>> >> 10GB?
>> >>
>> >> If 1GB, doesn't this mean that you are serving only 64GB of data per
>> node?
>> >>  That seems really, really small.
>> >>
>> >> On Fri, Jan 14, 2011 at 9:39 AM, Jonathan Gray <jg...@fb.com> wrote:
>> >>
>> >> > Then you can turn your split size way up, effectively preventing
>> >> > further splits.  Again, this is for randomly distributed requests.
>> >> >
>> >
>>
>> I would think both systems (Cassandra || HBase) have large-node
>> pitfalls. If you want >500GB of storage per node, low latency (<5ms), and
>> random lookups, you'd better have a lot of RAM!
>>
>> Current JVMs have diminishing returns past 24GB of heap memory, so in-JVM
>> caching for that volume is out. Both benefit from the VFS cache, but
>> both have cache churn from compacting.
>>
>> For applications that are mostly random read, the data-to-RAM ratio is
>> the biggest factor. I have been curious for a while about what people's
>> RAM-to-data ratio is.
>>
>


Definitely, Cassandra compaction on large disk volumes is less elegant
than the small-region design. I think the best alternative there is to
lose your RAID-0/RAID-5 and just run one Cassandra instance per JBOD
disk. There is some overhead and management with that, but hey, then the
largest compactions are not much of an issue any more. (It's something
like a virtualized solution.)

K. It seemed from your post asking "can HBase handle 1k regions per
server" that you were not sure if HBase could handle it either. However,
it sounds like you have it figured out.

Re: Recommended Node Size Limits

Posted by Wayne <wa...@gmail.com>.
Not everyone is looking for a distributed memcache. Many of us are looking
for a database that scales up and out, and for that there is only one
choice. HBase does auto partitioning with regions; this is the genius of the
original bigtable design. Regions are logical units small enough to be fast
to copy around/replicate, compact, and access with random disk I/O.
Cassandra has NOTHING like this. Cassandra partitions great to the node, but
then the node is one logical unit with some very very very big CF files.
What DBA can sleep at night with their data in 3x 350GB files? Our
definition of big data is big data ON the node and big data across the
nodes. Cassandra cannot handle large nodes; a 30-hour compaction is real on a
1TB node and is a serious problem at scale. HBase compresses with LZO, so
your data size is already much smaller; you can up the region size and
handle up to 100 "active" regions and a heck of a lot more inactive regions
(a perfect fit for time series data), and they are all compacted and accessed
individually. There is no common Java limitation here...

I spent 6 months with Cassandra to get it to work perfectly, and then
totally abandoned it due to fundamental design flaws like this, along with
having to ask ALL copies of the data for a consistent read, which causes 3x
the disk I/O on reads. Cassandra is not a good choice if
consistency is important or if scaling UP nodes is important. For those of
us looking to scale out/up a relational database (and we can not afford to
put 50TB in RAM) HBase is the only choice in the nosql space. Cassandra is a
great piece of software and the people behind it are top notch. Jonathan
Ellis has personally helped me with many problems and helped to get our
cluster to work flawlessly. Cassandra has many great uses and is a great
choice if you can keep most data in cache, but it is no relational database
replacement. I was bought into the hype and I hope others will be able to
get real assessments of differences to make the best decision for their
needs.


On Fri, Jan 14, 2011 at 10:50 PM, Edward Capriolo <ed...@gmail.com>wrote:

> On Fri, Jan 14, 2011 at 2:02 PM, Jonathan Gray <jg...@fb.com> wrote:
> > How about Integer.MAX_VALUE (or I believe 0 works) to completely disable
> splits?
> >
> > As far as what we are running with today, we do have clusters with regions
> over 10GB and growing.  There has been a lot of work in the compaction logic
> to make these large regions more efficient with IO (by not compacting
> big/old files and such).
> >
> > JG
> >
> >> -----Original Message-----
> >> From: Ted Dunning [mailto:tdunning@maprtech.com]
> >> Sent: Friday, January 14, 2011 10:12 AM
> >> To: user@hbase.apache.org
> >> Subject: Re: Recommended Node Size Limits
> >>
> >> Way up = ??
> >>
> >> 1GB?
> >>
> >> 10GB?
> >>
> >> If 1GB, doesn't this mean that you are serving only 64GB of data per
> node?
> >>  That seems really, really small.
> >>
> >> On Fri, Jan 14, 2011 at 9:39 AM, Jonathan Gray <jg...@fb.com> wrote:
> >>
> >> > Then you can turn your split size way up, effectively preventing
> >> > further splits.  Again, this is for randomly distributed requests.
> >> >
> >
>
> I would think both systems (Cassandra || HBase) have large-node
> pitfalls. If you want >500GB of storage per node, low latency (<5ms), and
> random lookups, you'd better have a lot of RAM!
>
> Current JVMs have diminishing returns past 24GB of heap memory, so in-JVM
> caching for that volume is out. Both benefit from the VFS cache, but
> both have cache churn from compacting.
>
> For applications that are mostly random read, the data-to-RAM ratio is
> the biggest factor. I have been curious for a while about what people's
> RAM-to-data ratio is.
>

Re: Recommended Node Size Limits

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Jan 14, 2011 at 2:02 PM, Jonathan Gray <jg...@fb.com> wrote:
> How about Integer.MAX_VALUE (or I believe 0 works) to completely disable splits?
>
> As far as what we are running with today, we do have clusters with regions over 10GB and growing.  There has been a lot of work in the compaction logic to make these large regions more efficient with IO (by not compacting big/old files and such).
>
> JG
>
>> -----Original Message-----
>> From: Ted Dunning [mailto:tdunning@maprtech.com]
>> Sent: Friday, January 14, 2011 10:12 AM
>> To: user@hbase.apache.org
>> Subject: Re: Recommended Node Size Limits
>>
>> Way up = ??
>>
>> 1GB?
>>
>> 10GB?
>>
>> If 1GB, doesn't this mean that you are serving only 64GB of data per node?
>>  That seems really, really small.
>>
>> On Fri, Jan 14, 2011 at 9:39 AM, Jonathan Gray <jg...@fb.com> wrote:
>>
>> > Then you can turn your split size way up, effectively preventing
>> > further splits.  Again, this is for randomly distributed requests.
>> >
>

I would think both systems (Cassandra || HBase) have large-node
pitfalls. If you want >500GB of storage per node, low latency (<5ms), and
random lookups, you'd better have a lot of RAM!

Current JVMs have diminishing returns past 24GB of heap memory, so in-JVM
caching for that volume is out. Both benefit from the VFS cache, but
both have cache churn from compacting.

For applications that are mostly random read, the data-to-RAM ratio is
the biggest factor. I have been curious for a while about what people's
RAM-to-data ratio is.

RE: Recommended Node Size Limits

Posted by Jonathan Gray <jg...@fb.com>.
How about Integer.MAX_VALUE (or I believe 0 works) to completely disable splits?
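
For reference, roughly what that looks like with the Java client (a sketch only, assuming the 0.90-era API; the table and family names are made up, and the same limit can also be set cluster-wide via hbase.hregion.max.filesize in hbase-site.xml):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

public class DisableSplitsSketch {
  public static void main(String[] args) {
    // Push the per-region max file size so high that a split is effectively never triggered.
    HTableDescriptor desc = new HTableDescriptor("mytable");   // hypothetical table name
    desc.addFamily(new HColumnDescriptor("cf"));               // hypothetical column family
    desc.setMaxFileSize(Long.MAX_VALUE);                       // or Integer.MAX_VALUE as noted above
  }
}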

As far as what we are running with today, we do have clusters with regions over 10GB and growing.  There has been a lot of work in the compaction logic to make these large regions more efficient with IO (by not compacting big/old files and such).

JG

> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@maprtech.com]
> Sent: Friday, January 14, 2011 10:12 AM
> To: user@hbase.apache.org
> Subject: Re: Recommended Node Size Limits
> 
> Way up = ??
> 
> 1GB?
> 
> 10GB?
> 
> If 1GB, doesn't this mean that you are serving only 64GB of data per node?
>  That seems really, really small.
> 
> On Fri, Jan 14, 2011 at 9:39 AM, Jonathan Gray <jg...@fb.com> wrote:
> 
> > Then you can turn your split size way up, effectively preventing
> > further splits.  Again, this is for randomly distributed requests.
> >

Re: Recommended Node Size Limits

Posted by Ted Dunning <td...@maprtech.com>.
Way up = ??

1GB?

10GB?

If 1GB, doesn't this mean that you are serving only 64GB of data per node?
 That seems really, really small.

On Fri, Jan 14, 2011 at 9:39 AM, Jonathan Gray <jg...@fb.com> wrote:

> Then you can turn your split size way up, effectively preventing further
> splits.  Again, this is for randomly distributed requests.
>

RE: Recommended Node Size Limits

Posted by Jonathan Gray <jg...@fb.com>.
One of the most important factors to look at is how the number of regions relates to how much heap is available for your RegionServers, and then how that will impact your expected MemStore flush sizes.  More than the total number of regions, this is about the number of regions that are actively written to.

Assuming I have randomly distributed writes evenly across all regions... If I have 2GB of heap available for all MemStores, and I have 1000 regions on each node, then this means I have ~2MB available for each region's MemStore.  This is really low.  You'll be flushing 2MB files and that will mean more compacting and less efficient utilization of IO.

If 4GB of heap for MemStores and 200 regions, then you get 20MB for each region.  This is approaching a more sane value.
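
As a back-of-envelope sketch of that relationship (illustrative Java only; it assumes the heap fraction reserved for MemStores is governed by hbase.regionserver.global.memstore.upperLimit, roughly 40% by default in this era):

public class FlushSizeMath {
  // expected flush size ~= (heap reserved for MemStores) / (actively written regions)
  static long expectedFlushBytes(long heapBytes, double memstoreFraction, int activeRegions) {
    return (long) (heapBytes * memstoreFraction) / activeRegions;
  }

  public static void main(String[] args) {
    // 5GB heap at the ~40% default => ~2GB for MemStores; 1000 regions => ~2MB flushes (too small).
    System.out.println(expectedFlushBytes(5L << 30, 0.4, 1000));
    // 10GB heap at 40% => ~4GB for MemStores; 200 regions => ~20MB flushes (closer to sane).
    System.out.println(expectedFlushBytes(10L << 30, 0.4, 200));
  }
}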

Again, things change significantly depending on the nature of your application, though it sounds like you do have very randomly distributed writes.

Working backwards, if I wanted an expected flush size of 64MB and I had 8GB of total heap, I might give 50% to the MemStores (so 4GB), and then I'd want to target (4096 / 64 = 64) regions per server.  This seems low but 64 shards per server should still be sufficient.  The BigTable paper describes ~100 shards/node in production.  One way to get this control over the number of regions is to pre-split your table at creation time (there is an API that allows you to define your keyspace and number of splits you want).  Then you can turn your split size way up, effectively preventing further splits.  Again, this is for randomly distributed requests.
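
A rough sketch of that recipe with the Java client (assumptions: the 0.90-era HBaseAdmin API, and a made-up table name, key range, and cluster size; the region count just mirrors the 4GB / 64MB math above):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitSketch {
  public static void main(String[] args) throws Exception {
    int servers = 10;                 // hypothetical cluster size
    int regionsPerServer = 64;        // 4GB of MemStore heap / 64MB target flush size

    HTableDescriptor desc = new HTableDescriptor("mytable");   // hypothetical table
    desc.addFamily(new HColumnDescriptor("cf"));               // hypothetical column family
    desc.setMaxFileSize(100L * 1024 * 1024 * 1024);            // turn the split size way up (100GB)

    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    // Pre-split the keyspace at creation time so the region count starts where you want it.
    admin.createTable(desc, Bytes.toBytes("00000000"), Bytes.toBytes("ffffffffffffffff"),
        servers * regionsPerServer);
  }
}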

Regions that don't take significant amounts of writes (or any writes at all) don't need to be considered here.  There's a low cost associated with serving higher numbers of inactive regions (1000s).

JG

> -----Original Message-----
> From: Wayne [mailto:wav100@gmail.com]
> Sent: Friday, January 14, 2011 6:40 AM
> To: user@hbase.apache.org
> Subject: Re: Recommended Node Size Limits
> 
> No suggestions? We are trying to size our production nodes and are afraid
> that 1k+ regions per node is "too" much. We spent 6 months with Cassandra
> before realizing that it does not scale *UP* at all and that more than 500GB
> per node is suicide in terms of read latency degradation and compaction (we
> had 30 hour compaction with 1TB nodes). We would prefer to not have to
> find out these "surprises" on our own with HBase. Any real world production
> experience in terms of sizing would be greatly appreciated. What criteria
> triggers more nodes other than concurrent read/write load?
> 
> Thanks.
> 
> On Wed, Jan 12, 2011 at 11:27 AM, Wayne <wa...@gmail.com> wrote:
> 
> > What is the recommended max node size in terms of data (using LZO
> > compression), region counts, and region sizes? I know there is no hard
> > limit, and that in theory it should depend only on the required load in
> > terms of concurrent reads/writes that the hardware can handle, but I do
> > not think that is true.
> > Based on real experience how many regions per node is too many? What
> > max region size is too big? Do production clusters have 3TB nodes with
> > thousands of very large regions? Where are the hidden cliffs that we
> should avoid?
> >
> > Thanks.
> >

Re: Recommended Node Size Limits

Posted by Wayne <wa...@gmail.com>.
No suggestions? We are trying to size our production nodes and are afraid
that 1k+ regions per node is "too" much. We spent 6 months with Cassandra
before realizing that it does not scale *UP* at all and that more than 500GB
per node is suicide in terms of read latency degradation and compaction (we
had 30 hour compaction with 1TB nodes). We would prefer to not have to find
out these "surprises" on our own with HBase. Any real world production
experience in terms of sizing would be greatly appreciated. What criteria
triggers more nodes other than concurrent read/write load?

Thanks.

On Wed, Jan 12, 2011 at 11:27 AM, Wayne <wa...@gmail.com> wrote:

> What is the recommended max node size in terms of data (using LZO
> compression), region counts, and region sizes? I know there is no hard limit,
> and that in theory it should depend only on the required load in terms of
> concurrent reads/writes that the hardware can handle, but I do not think
> that is true.
> Based on real experience how many regions per node is too many? What max
> region size is too big? Do production clusters have 3TB nodes with thousands
> of very large regions? Where are the hidden cliffs that we should avoid?
>
> Thanks.
>