Posted to user@hbase.apache.org by "M. C. Srivas" <mc...@gmail.com> on 2011/02/18 08:15:31 UTC

Re: Cluster Size/Node Density

I was reading this thread with interest. Here's my $.02

On Fri, Dec 17, 2010 at 12:29 PM, Wayne <wa...@gmail.com> wrote:

> Sorry, I am sure my questions were far too broad to answer.
>
> Let me *try* to ask more specific questions. Assuming all data requests are
> cold (random reading pattern) and everything comes from the disks (no block
> cache), what level of concurrency can HDFS handle?


Cold cache, random reads ==> totally governed by seeks, i.e., by the # of
spindles per box.

A SATA drive can do about 100 random seeks per second, i.e., about 100 reads/second



> Almost all of the load is
> controlled data processing, but we have to do a lot of work at night during
> the batch window so something in the 15-20,000 QPS range would meet current
> worst-case requirements. How many nodes would be required to effectively
> return data against a 50TB aggregate data store?


Assuming 12 drives per node and a cache hit rate of 0% (since it's mostly cold
cache), you will see about 12 * 100 = 1200 reads per second per node.
If your cache hit rate goes up to 25%, your read rate rises to about 1600
reads/sec/node.
Thus, 10 machines can serve about 12k-16k reads/sec, cold cache.

50TB of data on 10 machines => 5 TB/node. That might be a bit too much for each
region server (a RS can handle about 700 regions comfortably, each of 1GB). If
you push it, you might get 2TB per region server, i.e., about 25 machines in
total. If the data compresses 50%, then 12-13 nodes.

So, for your scenario, it's about 12-13 region servers with 12 drives each, and
you will comfortably cover your 15-20k QPS requirement, cold cache.
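
Laying the same back-of-the-envelope arithmetic out in one place (the 100
seeks/sec, 12 drives, 2TB-per-RS and 50% compression figures are the
assumptions above, not measurements):

  reads/sec per node, cold cache     = 12 drives * 100 seeks/sec   = 1,200
  reads/sec per node, 25% cache hits = 1,200 / (1 - 0.25)          ~ 1,600
  nodes needed for capacity          = (50TB * 0.5) / 2TB per RS   ~ 12-13
  cluster reads/sec at 12-13 nodes   = 12..13 * 1,200..1,600       ~ 14k-21k

So the capacity-driven 12-13 node count also lands in the 15-20k QPS range, as
long as the seek and compression assumptions hold.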

Does that help?



> Disk I/O assumedly starts
> to break down at a certain point in terms of concurrent readers/node/disk.
> We have in our control how many total concurrent readers there are, so if
> we
> can get 10ms response time with 100 readers that might be better than 100ms
> responses from 1000 concurrent readers.
>
> Thanks.
>
>
> On Fri, Dec 17, 2010 at 2:46 PM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
>
> > Hi Wayne,
> >
> > This question has such a large scope but is applicable to such a tiny
> > subset of workloads (e.g. yours) that fielding all the questions in
> > detail would probably end up just wasting everyone's cycles. So first
> > I'd like to clear up some confusion.
> >
> > > We would like some help with cluster sizing estimates. We have 15TB of
> > > currently relational data we want to store in hbase.
> >
> > There's the 3x replication factor, but you also have to account for the
> > fact that each value is stored with its row key, family name, qualifier and
> > timestamp. That could be a lot more data to store, but at the same
> > time you can use LZO compression to bring that down ~4x.
> >
> > > How many nodes, regions, etc. are we going to need?
> >
> > You don't really have control over regions; they are created for
> > you as your data grows.
> >
> > > What will our read latency be for 30 vs. 100? Sure we can pack 20
> > > nodes with 3TB of data each but will it take 1+s for every get?
> >
> > I'm not sure what kind of back-of-the-envelope calculations took you
> > to 1sec, but latency will be strictly determined by concurrency and
> > actual machine load. Even if you were able to pack 20TB in one node
> > but only used a tiny portion of it, you would still get sub-100ms
> > latencies. Or if you have only 10GB on that node but it's getting
> > hammered by 10000 clients, then you should expect much higher
> > latencies.
> >
> > > Will compaction run for 3 days?
> >
> > Which compactions? Major ones? If you don't insert new data in a
> > region, it won't be major compacted. Also if you have that much data,
> > I would set the time between major compactions to be longer than 1
> > day. Heck, since you are doing time series, this means you'll never
> > delete anything, right? So you might as well disable them.
> >
> > And now for the meaty part...
> >
> > The size of your dataset is only one part of the equation, the other
> > being the traffic you would be pushing to the cluster, which I think wasn't
> > covered at all in your email. Like I said previously, you can pack a
> > lot of data in a single node and can retrieve it really fast as long
> > as concurrency is low. Another thing is how random your reading
> > pattern is... can you even leverage the block cache at all? If yes,
> > then you can accept more concurrency; if not, then hitting HDFS is a
> > lot slower (and it's still not very good at handling many clients).
> >
> > Unfortunately, even if you told us exactly how many QPS you want to
> > push, we'd have a hard time recommending any number of nodes.
> >
> > What I would recommend then is to benchmark it. Try to grab 5-6
> > machines, load a subset of the data, generate traffic, see how it
> > behaves.
> >
> > Hope that helps,
> >
> > J-D
> >
> > On Fri, Dec 17, 2010 at 9:09 AM, Wayne <wa...@gmail.com> wrote:
> > > We would like some help with cluster sizing estimates. We have 15TB of
> > > currently relational data we want to store in hbase. Once that is
> > > replicated to a factor of 3 and stored with secondary indexes etc. we
> > > assume we will have 50TB+ of data. The data is basically data warehouse
> > > style time series data where much of it is cold; however, we want good
> > > read latency for access to all of it. Not memory-based latency, but
> > > < 25ms latency for small chunks of data.
> > >
> > > How many nodes, regions, etc. are we going to need? Assuming a typical
> > > 6 disk, 24GB RAM, 16 core data node, how many of these do we need to
> > > sufficiently manage this volume of data? Obviously there are a million
> > > "it depends", but the bigger drivers are: how much data can a node
> > > handle? How long will compaction take? How many regions can a node
> > > handle and how big can those regions get? Can we really have 1.5TB of
> > > data on a single node in 6,000 regions? What are the true drivers
> > > between more nodes vs. bigger nodes? Do we need 30 nodes to handle our
> > > 50TB of data or 100 nodes? What will our read latency be for 30 vs.
> > > 100? Sure we can pack 20 nodes with 3TB of data each but will it take
> > > 1+s for every get? Will compaction run for 3 days? How much data is
> > > really "too much" on an hbase data node?
> > >
> > > Any help or advice would be greatly appreciated.
> > >
> > > Thanks
> > >
> > > Wayne
> > >
> >
>

Re: Cluster Size/Node Density

Posted by Jean-Daniel Cryans <jd...@apache.org>.
That would be the second report I've seen in less than a week of someone
finding u23 less stable than u17. Interesting...

J-D

On Sat, Feb 19, 2011 at 9:43 AM, Wayne <wa...@gmail.com> wrote:
> What JVM is recommended for the new memstore allocator? We swtiched from u23
> back to u17 which helped a lot. Is this optimized for a specific JVM or does
> it not matter?
>
> On Fri, Feb 18, 2011 at 5:46 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>> On Fri, Feb 18, 2011 at 12:10 PM, Jean-Daniel Cryans
>> <jd...@apache.org>wrote:
>>
>> > The bigger the heap the longer the GC pause of the world when
>>
>> fragmentation requires it, 8GB is "safer".
>> >
>>
>> On my boxes, a stop-the-world on 8G heap is already around 80 seconds...
>> pretty catastrophic. Of course we've bumped the ZK timeout up to several
>> minutes these days, but it's just a bandaid.
>>
>>
>> >
>> > In 0.90.1 you can try enabling the new memstore allocator that seems
>> > to do a really good job, checkout the jira first:
>> > https://issues.apache.org/jira/browse/HBASE-3455
>> >
>> >
>> Yep. Hopefully will have time to do a blog post this weekend about it as
>> well. In my testing, try as I might, I can't get my region servers to do a
>> full GC anymore when this is enabled.
>>
>> -Todd
>>
>>
>> > On Fri, Feb 18, 2011 at 12:05 PM, Chris Tarnas <cf...@email.com> wrote:
>> > > Thank you , ad that bring me to my next question...
>> > >
>> > > What is the current recommendation on the max heap size for Hbase if
>> > > RAM
>> > on the server is not an issue? Right now I am at 8GB and have no issues,
>> > can
>> > I safely do 12GB? The servers have plenty of RAM (48GB) so that should
>> > not
>> > be an issue - I just want to minimize the risk that GC will cause
>> > problems.
>> > >
>> > > thanks again.
>> > > -chris
>> > >
>> > > On Feb 18, 2011, at 11:59 AM, Jean-Daniel Cryans wrote:
>> > >
>> > >> That's what I usually recommend, the bigger the flushed files the
>> > >> better. On the other hand, you only have so much memory to dedicate
>> > >> to
>> > >> the MemStore...
>> > >>
>> > >> J-D
>> > >>
>> > >> On Fri, Feb 18, 2011 at 11:50 AM, Chris Tarnas <cf...@email.com> wrote:
>> > >>> Would it be a good idea to raise the
>> > >>> hbase.hregion.memstore.flush.size
>> > if you have really large regions?
>> > >>>
>> > >>> -chris
>> > >>>
>> > >>> On Feb 18, 2011, at 11:43 AM, Jean-Daniel Cryans wrote:
>> > >>>
>> > >>>> Less regions, but it's often a good thing if you have a lot of data
>> > >>>> :)
>> > >>>>
>> > >>>> It's probably a good thing to bump the HDFS block size to 128 or
>> > >>>> 256MB
>> > >>>> since you know you're going to have huge-ish files.
>> > >>>>
>> > >>>> But anyway regarding penalties, I can't think of one that clearly
>> > >>>> comes out (unless you use a very small heap). The IO usage patterns
>> > >>>> will change, but unless you flush very small files all the time and
>> > >>>> need to recompact them into much bigger ones, then it shouldn't
>> > >>>> really
>> > >>>> be an issue.
>> > >>>>
>> > >>>> J-D
>> > >>>>
>> > >>>> On Fri, Feb 18, 2011 at 11:36 AM, Jason Rutherglen
>> > >>>> <ja...@gmail.com> wrote:
>> > >>>>>>  We are also using a 5Gb region size to keep our region
>> > >>>>>> counts in the 100-200 range/node per Jonathan Grey's
>> > >>>>>> recommendation.
>> > >>>>>
>> > >>>>> So there isn't a penalty incurred from increasing the max region
>> > >>>>> size
>> > >>>>> from 256MB to 5GB?
>> > >>>>>
>> > >>>
>> > >>>
>> > >
>> > >
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>
>

Re: Cluster Size/Node Density

Posted by Stack <sa...@gmail.com>.
It is not JVM version dependent.

Stack



On Feb 19, 2011, at 6:43, Wayne <wa...@gmail.com> wrote:

> What JVM is recommended for the new memstore allocator? We swtiched from u23
> back to u17 which helped a lot. Is this optimized for a specific JVM or does
> it not matter?
> 
> On Fri, Feb 18, 2011 at 5:46 PM, Todd Lipcon <to...@cloudera.com> wrote:
> 
>> On Fri, Feb 18, 2011 at 12:10 PM, Jean-Daniel Cryans <jdcryans@apache.org
>>> wrote:
>> 
>>> The bigger the heap the longer the GC pause of the world when
>> 
>> fragmentation requires it, 8GB is "safer".
>>> 
>> 
>> On my boxes, a stop-the-world on 8G heap is already around 80 seconds...
>> pretty catastrophic. Of course we've bumped the ZK timeout up to several
>> minutes these days, but it's just a bandaid.
>> 
>> 
>>> 
>>> In 0.90.1 you can try enabling the new memstore allocator that seems
>>> to do a really good job, checkout the jira first:
>>> https://issues.apache.org/jira/browse/HBASE-3455
>>> 
>>> 
>> Yep. Hopefully will have time to do a blog post this weekend about it as
>> well. In my testing, try as I might, I can't get my region servers to do a
>> full GC anymore when this is enabled.
>> 
>> -Todd
>> 
>> 
>>> On Fri, Feb 18, 2011 at 12:05 PM, Chris Tarnas <cf...@email.com> wrote:
>>>> Thank you , ad that bring me to my next question...
>>>> 
>>>> What is the current recommendation on the max heap size for Hbase if
>> RAM
>>> on the server is not an issue? Right now I am at 8GB and have no issues,
>> can
>>> I safely do 12GB? The servers have plenty of RAM (48GB) so that should
>> not
>>> be an issue - I just want to minimize the risk that GC will cause
>> problems.
>>>> 
>>>> thanks again.
>>>> -chris
>>>> 
>>>> On Feb 18, 2011, at 11:59 AM, Jean-Daniel Cryans wrote:
>>>> 
>>>>> That's what I usually recommend, the bigger the flushed files the
>>>>> better. On the other hand, you only have so much memory to dedicate to
>>>>> the MemStore...
>>>>> 
>>>>> J-D
>>>>> 
>>>>> On Fri, Feb 18, 2011 at 11:50 AM, Chris Tarnas <cf...@email.com> wrote:
>>>>>> Would it be a good idea to raise the
>> hbase.hregion.memstore.flush.size
>>> if you have really large regions?
>>>>>> 
>>>>>> -chris
>>>>>> 
>>>>>> On Feb 18, 2011, at 11:43 AM, Jean-Daniel Cryans wrote:
>>>>>> 
>>>>>>> Less regions, but it's often a good thing if you have a lot of data
>> :)
>>>>>>> 
>>>>>>> It's probably a good thing to bump the HDFS block size to 128 or
>> 256MB
>>>>>>> since you know you're going to have huge-ish files.
>>>>>>> 
>>>>>>> But anyway regarding penalties, I can't think of one that clearly
>>>>>>> comes out (unless you use a very small heap). The IO usage patterns
>>>>>>> will change, but unless you flush very small files all the time and
>>>>>>> need to recompact them into much bigger ones, then it shouldn't
>> really
>>>>>>> be an issue.
>>>>>>> 
>>>>>>> J-D
>>>>>>> 
>>>>>>> On Fri, Feb 18, 2011 at 11:36 AM, Jason Rutherglen
>>>>>>> <ja...@gmail.com> wrote:
>>>>>>>>> We are also using a 5Gb region size to keep our region
>>>>>>>>> counts in the 100-200 range/node per Jonathan Grey's
>> recommendation.
>>>>>>>> 
>>>>>>>> So there isn't a penalty incurred from increasing the max region
>> size
>>>>>>>> from 256MB to 5GB?
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>> 

Re: Cluster Size/Node Density

Posted by Wayne <wa...@gmail.com>.
What JVM is recommended for the new memstore allocator? We switched from u23
back to u17 which helped a lot. Is this optimized for a specific JVM or does
it not matter?

On Fri, Feb 18, 2011 at 5:46 PM, Todd Lipcon <to...@cloudera.com> wrote:

> On Fri, Feb 18, 2011 at 12:10 PM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
>
> > The bigger the heap the longer the GC pause of the world when
>
> fragmentation requires it, 8GB is "safer".
> >
>
> On my boxes, a stop-the-world on 8G heap is already around 80 seconds...
> pretty catastrophic. Of course we've bumped the ZK timeout up to several
> minutes these days, but it's just a bandaid.
>
>
> >
> > In 0.90.1 you can try enabling the new memstore allocator that seems
> > to do a really good job, checkout the jira first:
> > https://issues.apache.org/jira/browse/HBASE-3455
> >
> >
> Yep. Hopefully will have time to do a blog post this weekend about it as
> well. In my testing, try as I might, I can't get my region servers to do a
> full GC anymore when this is enabled.
>
> -Todd
>
>
> > On Fri, Feb 18, 2011 at 12:05 PM, Chris Tarnas <cf...@email.com> wrote:
> > > Thank you , ad that bring me to my next question...
> > >
> > > What is the current recommendation on the max heap size for Hbase if
> RAM
> > on the server is not an issue? Right now I am at 8GB and have no issues,
> can
> > I safely do 12GB? The servers have plenty of RAM (48GB) so that should
> not
> > be an issue - I just want to minimize the risk that GC will cause
> problems.
> > >
> > > thanks again.
> > > -chris
> > >
> > > On Feb 18, 2011, at 11:59 AM, Jean-Daniel Cryans wrote:
> > >
> > >> That's what I usually recommend, the bigger the flushed files the
> > >> better. On the other hand, you only have so much memory to dedicate to
> > >> the MemStore...
> > >>
> > >> J-D
> > >>
> > >> On Fri, Feb 18, 2011 at 11:50 AM, Chris Tarnas <cf...@email.com> wrote:
> > >>> Would it be a good idea to raise the
> hbase.hregion.memstore.flush.size
> > if you have really large regions?
> > >>>
> > >>> -chris
> > >>>
> > >>> On Feb 18, 2011, at 11:43 AM, Jean-Daniel Cryans wrote:
> > >>>
> > >>>> Less regions, but it's often a good thing if you have a lot of data
> :)
> > >>>>
> > >>>> It's probably a good thing to bump the HDFS block size to 128 or
> 256MB
> > >>>> since you know you're going to have huge-ish files.
> > >>>>
> > >>>> But anyway regarding penalties, I can't think of one that clearly
> > >>>> comes out (unless you use a very small heap). The IO usage patterns
> > >>>> will change, but unless you flush very small files all the time and
> > >>>> need to recompact them into much bigger ones, then it shouldn't
> really
> > >>>> be an issue.
> > >>>>
> > >>>> J-D
> > >>>>
> > >>>> On Fri, Feb 18, 2011 at 11:36 AM, Jason Rutherglen
> > >>>> <ja...@gmail.com> wrote:
> > >>>>>>  We are also using a 5Gb region size to keep our region
> > >>>>>> counts in the 100-200 range/node per Jonathan Grey's
> recommendation.
> > >>>>>
> > >>>>> So there isn't a penalty incurred from increasing the max region
> size
> > >>>>> from 256MB to 5GB?
> > >>>>>
> > >>>
> > >>>
> > >
> > >
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Cluster Size/Node Density

Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, Feb 18, 2011 at 12:10 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> The bigger the heap the longer the GC pause of the world when

fragmentation requires it, 8GB is "safer".
>

On my boxes, a stop-the-world pause on an 8G heap is already around 80 seconds...
pretty catastrophic. Of course we've bumped the ZK timeout up to several
minutes these days, but it's just a band-aid.
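
For reference, the knob being bumped here is zookeeper.session.timeout in
hbase-site.xml; a minimal sketch, with the value purely illustrative:

  <property>
    <name>zookeeper.session.timeout</name>
    <!-- region server ZK session timeout, in milliseconds (3 minutes shown) -->
    <value>180000</value>
  </property>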


>
> In 0.90.1 you can try enabling the new memstore allocator that seems
> to do a really good job, checkout the jira first:
> https://issues.apache.org/jira/browse/HBASE-3455
>
>
Yep. Hopefully I'll have time to do a blog post about it this weekend as
well. In my testing, try as I might, I can't get my region servers to do a
full GC anymore when this is enabled.

-Todd


> On Fri, Feb 18, 2011 at 12:05 PM, Chris Tarnas <cf...@email.com> wrote:
> > Thank you , ad that bring me to my next question...
> >
> > What is the current recommendation on the max heap size for Hbase if RAM
> on the server is not an issue? Right now I am at 8GB and have no issues, can
> I safely do 12GB? The servers have plenty of RAM (48GB) so that should not
> be an issue - I just want to minimize the risk that GC will cause problems.
> >
> > thanks again.
> > -chris
> >
> > On Feb 18, 2011, at 11:59 AM, Jean-Daniel Cryans wrote:
> >
> >> That's what I usually recommend, the bigger the flushed files the
> >> better. On the other hand, you only have so much memory to dedicate to
> >> the MemStore...
> >>
> >> J-D
> >>
> >> On Fri, Feb 18, 2011 at 11:50 AM, Chris Tarnas <cf...@email.com> wrote:
> >>> Would it be a good idea to raise the hbase.hregion.memstore.flush.size
> if you have really large regions?
> >>>
> >>> -chris
> >>>
> >>> On Feb 18, 2011, at 11:43 AM, Jean-Daniel Cryans wrote:
> >>>
> >>>> Less regions, but it's often a good thing if you have a lot of data :)
> >>>>
> >>>> It's probably a good thing to bump the HDFS block size to 128 or 256MB
> >>>> since you know you're going to have huge-ish files.
> >>>>
> >>>> But anyway regarding penalties, I can't think of one that clearly
> >>>> comes out (unless you use a very small heap). The IO usage patterns
> >>>> will change, but unless you flush very small files all the time and
> >>>> need to recompact them into much bigger ones, then it shouldn't really
> >>>> be an issue.
> >>>>
> >>>> J-D
> >>>>
> >>>> On Fri, Feb 18, 2011 at 11:36 AM, Jason Rutherglen
> >>>> <ja...@gmail.com> wrote:
> >>>>>>  We are also using a 5Gb region size to keep our region
> >>>>>> counts in the 100-200 range/node per Jonathan Grey's recommendation.
> >>>>>
> >>>>> So there isn't a penalty incurred from increasing the max region size
> >>>>> from 256MB to 5GB?
> >>>>>
> >>>
> >>>
> >
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Cluster Size/Node Density

Posted by Jean-Daniel Cryans <jd...@apache.org>.
The bigger the heap, the longer the stop-the-world GC pause when
fragmentation requires one; 8GB is "safer".

In 0.90.1 you can try enabling the new memstore allocator that seems
to do a really good job; check out the JIRA first:
https://issues.apache.org/jira/browse/HBASE-3455
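
A minimal sketch of how to turn it on in hbase-site.xml (double-check the
property name against the JIRA and release notes):

  <property>
    <name>hbase.hregion.memstore.mslab.enabled</name>
    <!-- enables the MemStore-Local Allocation Buffers from HBASE-3455 -->
    <value>true</value>
  </property>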

J-D

On Fri, Feb 18, 2011 at 12:05 PM, Chris Tarnas <cf...@email.com> wrote:
> Thank you , ad that bring me to my next question...
>
> What is the current recommendation on the max heap size for Hbase if RAM on the server is not an issue? Right now I am at 8GB and have no issues, can I safely do 12GB? The servers have plenty of RAM (48GB) so that should not be an issue - I just want to minimize the risk that GC will cause problems.
>
> thanks again.
> -chris
>
> On Feb 18, 2011, at 11:59 AM, Jean-Daniel Cryans wrote:
>
>> That's what I usually recommend, the bigger the flushed files the
>> better. On the other hand, you only have so much memory to dedicate to
>> the MemStore...
>>
>> J-D
>>
>> On Fri, Feb 18, 2011 at 11:50 AM, Chris Tarnas <cf...@email.com> wrote:
>>> Would it be a good idea to raise the hbase.hregion.memstore.flush.size if you have really large regions?
>>>
>>> -chris
>>>
>>> On Feb 18, 2011, at 11:43 AM, Jean-Daniel Cryans wrote:
>>>
>>>> Less regions, but it's often a good thing if you have a lot of data :)
>>>>
>>>> It's probably a good thing to bump the HDFS block size to 128 or 256MB
>>>> since you know you're going to have huge-ish files.
>>>>
>>>> But anyway regarding penalties, I can't think of one that clearly
>>>> comes out (unless you use a very small heap). The IO usage patterns
>>>> will change, but unless you flush very small files all the time and
>>>> need to recompact them into much bigger ones, then it shouldn't really
>>>> be an issue.
>>>>
>>>> J-D
>>>>
>>>> On Fri, Feb 18, 2011 at 11:36 AM, Jason Rutherglen
>>>> <ja...@gmail.com> wrote:
>>>>>>  We are also using a 5Gb region size to keep our region
>>>>>> counts in the 100-200 range/node per Jonathan Grey's recommendation.
>>>>>
>>>>> So there isn't a penalty incurred from increasing the max region size
>>>>> from 256MB to 5GB?
>>>>>
>>>
>>>
>
>

Re: Cluster Size/Node Density

Posted by Ted Dunning <td...@maprtech.com>.
Actually, having a smaller heap will decrease the risk of a catastrophic GC.
 It will probably also increase the likelihood of a full GC.

Having a larger heap will let you go longer without a full GC, but with a very
large heap a full GC may take your region server off-line long enough to be
considered a failure.  Then you will have cascading badness.
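
For reference, the region server heap under discussion is normally set in
conf/hbase-env.sh; a minimal sketch, assuming the stock scripts:

  # Heap for HBase daemons started by these scripts, value in MB (~8GB here)
  export HBASE_HEAPSIZE=8000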

On Fri, Feb 18, 2011 at 12:05 PM, Chris Tarnas <cf...@email.com> wrote:

> Thank you , ad that bring me to my next question...
>
> What is the current recommendation on the max heap size for Hbase if RAM on
> the server is not an issue? Right now I am at 8GB and have no issues, can I
> safely do 12GB? The servers have plenty of RAM (48GB) so that should not be
> an issue - I just want to minimize the risk that GC will cause problems.
>
> thanks again.
> -chris
>
> On Feb 18, 2011, at 11:59 AM, Jean-Daniel Cryans wrote:
>
> > That's what I usually recommend, the bigger the flushed files the
> > better. On the other hand, you only have so much memory to dedicate to
> > the MemStore...
> >
> > J-D
> >
> > On Fri, Feb 18, 2011 at 11:50 AM, Chris Tarnas <cf...@email.com> wrote:
> >> Would it be a good idea to raise the hbase.hregion.memstore.flush.size
> if you have really large regions?
> >>
> >> -chris
> >>
> >> On Feb 18, 2011, at 11:43 AM, Jean-Daniel Cryans wrote:
> >>
> >>> Less regions, but it's often a good thing if you have a lot of data :)
> >>>
> >>> It's probably a good thing to bump the HDFS block size to 128 or 256MB
> >>> since you know you're going to have huge-ish files.
> >>>
> >>> But anyway regarding penalties, I can't think of one that clearly
> >>> comes out (unless you use a very small heap). The IO usage patterns
> >>> will change, but unless you flush very small files all the time and
> >>> need to recompact them into much bigger ones, then it shouldn't really
> >>> be an issue.
> >>>
> >>> J-D
> >>>
> >>> On Fri, Feb 18, 2011 at 11:36 AM, Jason Rutherglen
> >>> <ja...@gmail.com> wrote:
> >>>>>  We are also using a 5Gb region size to keep our region
> >>>>> counts in the 100-200 range/node per Jonathan Grey's recommendation.
> >>>>
> >>>> So there isn't a penalty incurred from increasing the max region size
> >>>> from 256MB to 5GB?
> >>>>
> >>
> >>
>
>

Re: Cluster Size/Node Density

Posted by Chris Tarnas <cf...@email.com>.
Thank you, and that brings me to my next question...

What is the current recommendation on the max heap size for HBase if RAM on the server is not an issue? Right now I am at 8GB and have no issues; can I safely go to 12GB? The servers have plenty of RAM (48GB) so that should not be an issue - I just want to minimize the risk that GC will cause problems.

thanks again.
-chris

On Feb 18, 2011, at 11:59 AM, Jean-Daniel Cryans wrote:

> That's what I usually recommend, the bigger the flushed files the
> better. On the other hand, you only have so much memory to dedicate to
> the MemStore...
> 
> J-D
> 
> On Fri, Feb 18, 2011 at 11:50 AM, Chris Tarnas <cf...@email.com> wrote:
>> Would it be a good idea to raise the hbase.hregion.memstore.flush.size if you have really large regions?
>> 
>> -chris
>> 
>> On Feb 18, 2011, at 11:43 AM, Jean-Daniel Cryans wrote:
>> 
>>> Less regions, but it's often a good thing if you have a lot of data :)
>>> 
>>> It's probably a good thing to bump the HDFS block size to 128 or 256MB
>>> since you know you're going to have huge-ish files.
>>> 
>>> But anyway regarding penalties, I can't think of one that clearly
>>> comes out (unless you use a very small heap). The IO usage patterns
>>> will change, but unless you flush very small files all the time and
>>> need to recompact them into much bigger ones, then it shouldn't really
>>> be an issue.
>>> 
>>> J-D
>>> 
>>> On Fri, Feb 18, 2011 at 11:36 AM, Jason Rutherglen
>>> <ja...@gmail.com> wrote:
>>>>>  We are also using a 5Gb region size to keep our region
>>>>> counts in the 100-200 range/node per Jonathan Grey's recommendation.
>>>> 
>>>> So there isn't a penalty incurred from increasing the max region size
>>>> from 256MB to 5GB?
>>>> 
>> 
>> 


Re: Cluster Size/Node Density

Posted by Jean-Daniel Cryans <jd...@apache.org>.
That's what I usually recommend; the bigger the flushed files, the
better. On the other hand, you only have so much memory to dedicate to
the MemStore...
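
A minimal hbase-site.xml sketch of the setting in question; the 256MB value
is only an example, not a recommendation:

  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <!-- flush a region's MemStore to disk once it grows past this many bytes -->
    <value>268435456</value>
  </property>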

J-D

On Fri, Feb 18, 2011 at 11:50 AM, Chris Tarnas <cf...@email.com> wrote:
> Would it be a good idea to raise the hbase.hregion.memstore.flush.size if you have really large regions?
>
> -chris
>
> On Feb 18, 2011, at 11:43 AM, Jean-Daniel Cryans wrote:
>
>> Less regions, but it's often a good thing if you have a lot of data :)
>>
>> It's probably a good thing to bump the HDFS block size to 128 or 256MB
>> since you know you're going to have huge-ish files.
>>
>> But anyway regarding penalties, I can't think of one that clearly
>> comes out (unless you use a very small heap). The IO usage patterns
>> will change, but unless you flush very small files all the time and
>> need to recompact them into much bigger ones, then it shouldn't really
>> be an issue.
>>
>> J-D
>>
>> On Fri, Feb 18, 2011 at 11:36 AM, Jason Rutherglen
>> <ja...@gmail.com> wrote:
>>>>  We are also using a 5Gb region size to keep our region
>>>> counts in the 100-200 range/node per Jonathan Grey's recommendation.
>>>
>>> So there isn't a penalty incurred from increasing the max region size
>>> from 256MB to 5GB?
>>>
>
>

Re: Cluster Size/Node Density

Posted by Chris Tarnas <cf...@email.com>.
Would it be a good idea to raise the hbase.hregion.memstore.flush.size if you have really large regions?

-chris

On Feb 18, 2011, at 11:43 AM, Jean-Daniel Cryans wrote:

> Less regions, but it's often a good thing if you have a lot of data :)
> 
> It's probably a good thing to bump the HDFS block size to 128 or 256MB
> since you know you're going to have huge-ish files.
> 
> But anyway regarding penalties, I can't think of one that clearly
> comes out (unless you use a very small heap). The IO usage patterns
> will change, but unless you flush very small files all the time and
> need to recompact them into much bigger ones, then it shouldn't really
> be an issue.
> 
> J-D
> 
> On Fri, Feb 18, 2011 at 11:36 AM, Jason Rutherglen
> <ja...@gmail.com> wrote:
>>>  We are also using a 5Gb region size to keep our region
>>> counts in the 100-200 range/node per Jonathan Grey's recommendation.
>> 
>> So there isn't a penalty incurred from increasing the max region size
>> from 256MB to 5GB?
>> 


Re: Cluster Size/Node Density

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Fewer regions, but that's often a good thing if you have a lot of data :)

It's probably a good thing to bump the HDFS block size to 128 or 256MB
since you know you're going to have huge-ish files.

But anyway, regarding penalties, I can't think of one that clearly
stands out (unless you use a very small heap). The I/O usage patterns
will change, but unless you flush very small files all the time and
need to recompact them into much bigger ones, then it shouldn't really
be an issue.
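
A sketch of that change in hdfs-site.xml, using the dfs.block.size property
name from the Hadoop releases of that era (it only affects newly written
files):

  <property>
    <name>dfs.block.size</name>
    <!-- HDFS block size in bytes; 134217728 = 128MB -->
    <value>134217728</value>
  </property>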

J-D

On Fri, Feb 18, 2011 at 11:36 AM, Jason Rutherglen
<ja...@gmail.com> wrote:
>>  We are also using a 5Gb region size to keep our region
>> counts in the 100-200 range/node per Jonathan Grey's recommendation.
>
> So there isn't a penalty incurred from increasing the max region size
> from 256MB to 5GB?
>

Re: Cluster Size/Node Density

Posted by Jason Rutherglen <ja...@gmail.com>.
>  We are also using a 5Gb region size to keep our region
> counts in the 100-200 range/node per Jonathan Grey's recommendation.

So there isn't a penalty incurred from increasing the max region size
from 256MB to 5GB?

On Fri, Feb 18, 2011 at 10:12 AM, Wayne <wa...@gmail.com> wrote:
> We have managed to get a little more than 1k QPS to date with 10 nodes.
> Honestly we are not quite convinced that disk i/o seeks are our biggest
> bottleneck. Of course they should be...but waiting for RPC connections,
> network latency, thrift etc. all play into the time to get reads. The std
> dev. of read time is way too high for our comfort, but os cache and other
> things make it hard to benchmark this. Our average read time has jumped to
> ~65ms which is a lot more than we expected (we expected 30-40ms). Our reads
> are not as small as originally thought, but the 65ms still seems high. We
> would love to have added a scanOpenWithStop single rpc read call (open, get
> all rows, close scanner in one rpc call). We are moving to 10k disks (5 data
> disks per node) so once we get some of these nodes in place we can see how
> things compare. I suspect things won't change much...which will confirm disk
> i/o is not our only bottleneck.  I will be happy to be wrong...
>
> Based on the performance we have seen we expect to need 40 nodes. With some
> fine tuning the 40 nodes can deliver 5k QPS which is what we need to run our
> application processes. We have a substantial write load as well, so our
> planning and numbers allow spare cycles for dealing with significant writes
> at the same time. We have cache set way low to just be used by .META. and
> all our tables have cache turned off. Memcached will be sitting on top to
> help client access. We are also using a 5Gb region size to keep our region
> counts in the 100-200 range/node per Jonathan Grey's recommendation.
>
> We have thought about going to infiniband or 10g, but with our smaller node
> sizes we don't think it will make much difference. The cost of infiniband
> for all 40 nodes buys us another 6 nodes...money better spent on scaling
> out.
>
>
> On Fri, Feb 18, 2011 at 2:15 AM, M. C. Srivas <mc...@gmail.com> wrote:
>
>> I was reading this thread with interest. Here's my $.02
>>
>> On Fri, Dec 17, 2010 at 12:29 PM, Wayne <wa...@gmail.com> wrote:
>>
>>> Sorry, I am sure my questions were far too broad to answer.
>>>
>>> Let me *try* to ask more specific questions. Assuming all data requests
>>> are
>>> cold (random reading pattern) and everything comes from the disks (no
>>> block
>>> cache), what level of concurrency can HDFS handle?
>>
>>
>> Cold cache, random reads ==> totally governed by seeks, so governed by # of
>> spindles per box.
>>
>> A SATA drive can do about 100 random seeks per sec, ie, 100 reads/second
>>
>>
>>
>>> Almost all of the load is
>>> controlled data processing, but we have to do a lot of work at night
>>> during
>>> the batch window so something in the 15-20,000 QPS range would meet
>>> current
>>> worse case requirements. How many nodes would be required to effectively
>>> return data against a 50TB aggregate data store?
>>
>>
>> Assuming 12 drives per node, and a cache hit rate of 0% (since its most
>> cold cache), you will see about 12 * 100  = 1200 reads per second per node.
>> If your cache hit rate goes up to 25%, then, your read rate is 1600
>> reads/sec/node
>> Thus, 10 machines can serve about 12k-16k reads/sec, cold cache.
>>
>> 50TB of data on 10 machine => 5 TB/node. That might be bit too much for
>> each region server (a RS can do about 700 regions comfortably, each of 1G).
>> If you push, you might get 2TB/regionserver, or, 25 machines for total. If
>> the data compresses 50%, then 12-13 nodes.
>>
>> So, for your scenario, its about 12-13 RS's, with 12 drives each, and you
>> will comfortably do 24k QPS cold cache.
>>
>> Does that help?
>>
>>
>>
>>> Disk I/O assumedly starts
>>> to break down at a certain point in terms of concurrent readers/node/disk.
>>> We have in our control how many total concurrent readers there are, so if
>>> we
>>> can get 10ms response time with 100 readers that might be better than
>>> 100ms
>>> responses from 1000 concurrent readers.
>>>
>>> Thanks.
>>>
>>>
>>> On Fri, Dec 17, 2010 at 2:46 PM, Jean-Daniel Cryans <jdcryans@apache.org
>>> >wrote:
>>>
>>> > Hi Wayne,
>>> >
>>> > This question has such a large scope but is applicable to such a tiny
>>> > subset of workloads (eg yours) that fielding all the questions in
>>> > details would probably end up just wasting everyone's cycles. So first
>>> > I'd like to clear up some confusion.
>>> >
>>> > > We would like some help with cluster sizing estimates. We have 15TB of
>>> > > currently relational data we want to store in hbase.
>>> >
>>> > There's the 3x replication factor, but also you have to account that
>>> > each value is stored with it's row key, family name, qualifier and
>>> > timestamp. That could be a lot more data to store, but at the same
>>> > time you can use LZO compression to bring that down ~4x.
>>> >
>>> > > How many nodes, regions, etc. are we going to need?
>>> >
>>> > You don't really have the control over regions, they are created for
>>> > you as your data grows.
>>> >
>>> > > What will our read latency be for 30 vs. 100? Sure we can pack 20
>>> nodes
>>> > with 3TB
>>> > > of data each but will it take 1+s for every get?
>>> >
>>> > I'm not sure what kind of back-of-the-envelope calculations took you
>>> > to 1sec, but latency will be strictly determined by concurrency and
>>> > actual machine load. Even if you were able to pack 20TB in one onde
>>> > but using a tiny portion of it, you would still get sub 100ms
>>> > latencies. Or if you have only 10GB on that node but it's getting
>>> > hammered by 10000 clients, then you should expect much higher
>>> > latencies.
>>> >
>>> > > Will compaction run for 3 days?
>>> >
>>> > Which compactions? Major ones? If you don't insert new data in a
>>> > region, it won't be major compacted. Also if you have that much data,
>>> > I would set the time between major compactions to be bigger than 1
>>> > day. Heck, since you are doing time series, this means you'll never
>>> > delete anything right? So you might as well disable them.
>>> >
>>> > And now for the meaty part...
>>> >
>>> > The size of your dataset is only one part of the equation, the other
>>> > being traffic you would be pushing to the cluster which I think wasn't
>>> > covered at all in your email. Like I said previously, you can pack a
>>> > lot of data in a single node and can retrieve it really fast as long
>>> > as concurrency is low. Another thing is how random your reading
>>> > pattern is... can you even leverage the block cache at all? If yes,
>>> > then you can accept more concurrency, if not then hitting HDFS is a
>>> > lot slower (and it's still not very good at handling many clients).
>>> >
>>> > Unfortunately, even if you gave us exactly how many QPS you want to do
>>> > per second, we'd have a hard time recommending any number of nodes.
>>> >
>>> > What I would recommend then is to benchmark it. Try to grab 5-6
>>> > machines, load a subset of the data, generate traffic, see how it
>>> > behaves.
>>> >
>>> > Hope that helps,
>>> >
>>> > J-D
>>> >
>>> > On Fri, Dec 17, 2010 at 9:09 AM, Wayne <wa...@gmail.com> wrote:
>>> > > We would like some help with cluster sizing estimates. We have 15TB of
>>> > > currently relational data we want to store in hbase. Once that is
>>> > replicated
>>> > > to a factor of 3 and stored with secondary indexes etc. we assume will
>>> > have
>>> > > 50TB+ of data. The data is basically data warehouse style time series
>>> > data
>>> > > where much of it is cold, however want good read latency to get access
>>> to
>>> > > all of it. Not memory based latency but < 25ms latency for a small
>>> chunks
>>> > of
>>> > > data.
>>> > >
>>> > > How many nodes, regions, etc. are we going to need? Assuming a typical
>>> 6
>>> > > disk, 24GB ram, 16 core data node, how many of these do we need to
>>> > > sufficiently manage this volume of data? Obviously there are a million
>>> > "it
>>> > > depends", but the bigger drivers are how much data can a node handle?
>>> How
>>> > > long will compaction take? How many regions can a node handle and how
>>> big
>>> > > can those regions get? Can we really have 1.5TB of data on a single
>>> node
>>> > in
>>> > > 6,000 regions? What are the true drivers between more nodes vs. bigger
>>> > > nodes? Do we need 30 nodes to handle our 50GB of data or 100 nodes?
>>> What
>>> > > will our read latency be for 30 vs. 100? Sure we can pack 20 nodes
>>> with
>>> > 3TB
>>> > > of data each but will it take 1+s for every get? Will compaction run
>>> for
>>> > 3
>>> > > days? How much data is really "too much" on an hbase data node?
>>> > >
>>> > > Any help or advice would be greatly appreciated.
>>> > >
>>> > > Thanks
>>> > >
>>> > > Wayne
>>> > >
>>> >
>>>
>>
>>
>

Re: Cluster Size/Node Density

Posted by Wayne <wa...@gmail.com>.
We have managed to get a little more than 1k QPS to date with 10 nodes.
Honestly we are not quite convinced that disk I/O seeks are our biggest
bottleneck. Of course they should be... but waiting for RPC connections,
network latency, Thrift, etc. all play into the time to get reads. The std
dev. of read time is way too high for our comfort, but OS cache and other
things make it hard to benchmark this. Our average read time has jumped to
~65ms which is a lot more than we expected (we expected 30-40ms). Our reads
are not as small as originally thought, but the 65ms still seems high. We
would love to have a scanOpenWithStop-style single-RPC read call (open, get
all rows, and close the scanner in one RPC call). We are moving to 10k RPM disks (5 data
disks per node) so once we get some of these nodes in place we can see how
things compare. I suspect things won't change much... which would confirm disk
I/O is not our only bottleneck. I will be happy to be wrong...

Based on the performance we have seen, we expect to need 40 nodes. With some
fine tuning, the 40 nodes can deliver the 5k QPS that we need to run our
application processes. We have a substantial write load as well, so our
planning and numbers allow spare cycles for dealing with significant writes
at the same time. We have cache set way low to just be used by .META. and
all our tables have cache turned off. Memcached will be sitting on top to
help client access. We are also using a 5GB region size to keep our region
counts in the 100-200 per node range, per Jonathan Gray's recommendation.
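
Roughly what those settings look like in hbase-site.xml (the block cache
fraction is illustrative; the BLOCKCACHE flag itself is set per column family
in the shell):

  <property>
    <name>hbase.hregion.max.filesize</name>
    <!-- split regions at ~5GB instead of the much smaller default -->
    <value>5368709120</value>
  </property>
  <property>
    <name>hfile.block.cache.size</name>
    <!-- fraction of the heap given to the block cache; kept small on purpose -->
    <value>0.05</value>
  </property>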

We have thought about going to InfiniBand or 10GbE, but with our smaller node
sizes we don't think it will make much difference. The cost of InfiniBand
for all 40 nodes buys us another 6 nodes... money better spent on scaling
out.


On Fri, Feb 18, 2011 at 2:15 AM, M. C. Srivas <mc...@gmail.com> wrote:

> I was reading this thread with interest. Here's my $.02
>
> On Fri, Dec 17, 2010 at 12:29 PM, Wayne <wa...@gmail.com> wrote:
>
>> Sorry, I am sure my questions were far too broad to answer.
>>
>> Let me *try* to ask more specific questions. Assuming all data requests
>> are
>> cold (random reading pattern) and everything comes from the disks (no
>> block
>> cache), what level of concurrency can HDFS handle?
>
>
> Cold cache, random reads ==> totally governed by seeks, so governed by # of
> spindles per box.
>
> A SATA drive can do about 100 random seeks per sec, ie, 100 reads/second
>
>
>
>> Almost all of the load is
>> controlled data processing, but we have to do a lot of work at night
>> during
>> the batch window so something in the 15-20,000 QPS range would meet
>> current
>> worse case requirements. How many nodes would be required to effectively
>> return data against a 50TB aggregate data store?
>
>
> Assuming 12 drives per node, and a cache hit rate of 0% (since its most
> cold cache), you will see about 12 * 100  = 1200 reads per second per node.
> If your cache hit rate goes up to 25%, then, your read rate is 1600
> reads/sec/node
> Thus, 10 machines can serve about 12k-16k reads/sec, cold cache.
>
> 50TB of data on 10 machine => 5 TB/node. That might be bit too much for
> each region server (a RS can do about 700 regions comfortably, each of 1G).
> If you push, you might get 2TB/regionserver, or, 25 machines for total. If
> the data compresses 50%, then 12-13 nodes.
>
> So, for your scenario, its about 12-13 RS's, with 12 drives each, and you
> will comfortably do 24k QPS cold cache.
>
> Does that help?
>
>
>
>> Disk I/O assumedly starts
>> to break down at a certain point in terms of concurrent readers/node/disk.
>> We have in our control how many total concurrent readers there are, so if
>> we
>> can get 10ms response time with 100 readers that might be better than
>> 100ms
>> responses from 1000 concurrent readers.
>>
>> Thanks.
>>
>>
>> On Fri, Dec 17, 2010 at 2:46 PM, Jean-Daniel Cryans <jdcryans@apache.org
>> >wrote:
>>
>> > Hi Wayne,
>> >
>> > This question has such a large scope but is applicable to such a tiny
>> > subset of workloads (eg yours) that fielding all the questions in
>> > details would probably end up just wasting everyone's cycles. So first
>> > I'd like to clear up some confusion.
>> >
>> > > We would like some help with cluster sizing estimates. We have 15TB of
>> > > currently relational data we want to store in hbase.
>> >
>> > There's the 3x replication factor, but also you have to account that
>> > each value is stored with it's row key, family name, qualifier and
>> > timestamp. That could be a lot more data to store, but at the same
>> > time you can use LZO compression to bring that down ~4x.
>> >
>> > > How many nodes, regions, etc. are we going to need?
>> >
>> > You don't really have the control over regions, they are created for
>> > you as your data grows.
>> >
>> > > What will our read latency be for 30 vs. 100? Sure we can pack 20
>> nodes
>> > with 3TB
>> > > of data each but will it take 1+s for every get?
>> >
>> > I'm not sure what kind of back-of-the-envelope calculations took you
>> > to 1sec, but latency will be strictly determined by concurrency and
>> > actual machine load. Even if you were able to pack 20TB in one onde
>> > but using a tiny portion of it, you would still get sub 100ms
>> > latencies. Or if you have only 10GB on that node but it's getting
>> > hammered by 10000 clients, then you should expect much higher
>> > latencies.
>> >
>> > > Will compaction run for 3 days?
>> >
>> > Which compactions? Major ones? If you don't insert new data in a
>> > region, it won't be major compacted. Also if you have that much data,
>> > I would set the time between major compactions to be bigger than 1
>> > day. Heck, since you are doing time series, this means you'll never
>> > delete anything right? So you might as well disable them.
>> >
>> > And now for the meaty part...
>> >
>> > The size of your dataset is only one part of the equation, the other
>> > being traffic you would be pushing to the cluster which I think wasn't
>> > covered at all in your email. Like I said previously, you can pack a
>> > lot of data in a single node and can retrieve it really fast as long
>> > as concurrency is low. Another thing is how random your reading
>> > pattern is... can you even leverage the block cache at all? If yes,
>> > then you can accept more concurrency, if not then hitting HDFS is a
>> > lot slower (and it's still not very good at handling many clients).
>> >
>> > Unfortunately, even if you gave us exactly how many QPS you want to do
>> > per second, we'd have a hard time recommending any number of nodes.
>> >
>> > What I would recommend then is to benchmark it. Try to grab 5-6
>> > machines, load a subset of the data, generate traffic, see how it
>> > behaves.
>> >
>> > Hope that helps,
>> >
>> > J-D
>> >
>> > On Fri, Dec 17, 2010 at 9:09 AM, Wayne <wa...@gmail.com> wrote:
>> > > We would like some help with cluster sizing estimates. We have 15TB of
>> > > currently relational data we want to store in hbase. Once that is
>> > replicated
>> > > to a factor of 3 and stored with secondary indexes etc. we assume will
>> > have
>> > > 50TB+ of data. The data is basically data warehouse style time series
>> > data
>> > > where much of it is cold, however want good read latency to get access
>> to
>> > > all of it. Not memory based latency but < 25ms latency for a small
>> chunks
>> > of
>> > > data.
>> > >
>> > > How many nodes, regions, etc. are we going to need? Assuming a typical
>> 6
>> > > disk, 24GB ram, 16 core data node, how many of these do we need to
>> > > sufficiently manage this volume of data? Obviously there are a million
>> > "it
>> > > depends", but the bigger drivers are how much data can a node handle?
>> How
>> > > long will compaction take? How many regions can a node handle and how
>> big
>> > > can those regions get? Can we really have 1.5TB of data on a single
>> node
>> > in
>> > > 6,000 regions? What are the true drivers between more nodes vs. bigger
>> > > nodes? Do we need 30 nodes to handle our 50GB of data or 100 nodes?
>> What
>> > > will our read latency be for 30 vs. 100? Sure we can pack 20 nodes
>> with
>> > 3TB
>> > > of data each but will it take 1+s for every get? Will compaction run
>> for
>> > 3
>> > > days? How much data is really "too much" on an hbase data node?
>> > >
>> > > Any help or advice would be greatly appreciated.
>> > >
>> > > Thanks
>> > >
>> > > Wayne
>> > >
>> >
>>
>
>

Re: Cluster Size/Node Density

Posted by Ryan Rawson <ry...@gmail.com>.
And don't forget that reading that data from the RS does not use
compression, so you are limited to about 120 MB/sec of read bandwidth
per node, minus bandwidth used for HDFS replication and other
incidentals.
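
(The 120 MB/sec is just the wire speed: 1 Gb/s / 8 bits per byte = 125 MB/s
theoretical, so roughly 110-120 MB/s after protocol overhead, before
subtracting replication and other traffic.)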

GigE is just too damn slow.

I look forward to 10GbE; perhaps we'll start seeing 10GbE DC buildouts next year.

-ryan

On Thu, Feb 17, 2011 at 11:15 PM, M. C. Srivas <mc...@gmail.com> wrote:
> I was reading this thread with interest. Here's my $.02
>
> On Fri, Dec 17, 2010 at 12:29 PM, Wayne <wa...@gmail.com> wrote:
>
>> Sorry, I am sure my questions were far too broad to answer.
>>
>> Let me *try* to ask more specific questions. Assuming all data requests are
>> cold (random reading pattern) and everything comes from the disks (no block
>> cache), what level of concurrency can HDFS handle?
>
>
> Cold cache, random reads ==> totally governed by seeks, so governed by # of
> spindles per box.
>
> A SATA drive can do about 100 random seeks per sec, ie, 100 reads/second
>
>
>
>> Almost all of the load is
>> controlled data processing, but we have to do a lot of work at night during
>> the batch window so something in the 15-20,000 QPS range would meet current
>> worse case requirements. How many nodes would be required to effectively
>> return data against a 50TB aggregate data store?
>
>
> Assuming 12 drives per node, and a cache hit rate of 0% (since its most cold
> cache), you will see about 12 * 100  = 1200 reads per second per node.
> If your cache hit rate goes up to 25%, then, your read rate is 1600
> reads/sec/node
> Thus, 10 machines can serve about 12k-16k reads/sec, cold cache.
>
> 50TB of data on 10 machine => 5 TB/node. That might be bit too much for each
> region server (a RS can do about 700 regions comfortably, each of 1G). If
> you push, you might get 2TB/regionserver, or, 25 machines for total. If the
> data compresses 50%, then 12-13 nodes.
>
> So, for your scenario, its about 12-13 RS's, with 12 drives each, and you
> will comfortably do 24k QPS cold cache.
>
> Does that help?
>
>
>
>> Disk I/O assumedly starts
>> to break down at a certain point in terms of concurrent readers/node/disk.
>> We have in our control how many total concurrent readers there are, so if
>> we
>> can get 10ms response time with 100 readers that might be better than 100ms
>> responses from 1000 concurrent readers.
>>
>> Thanks.
>>
>>
>> On Fri, Dec 17, 2010 at 2:46 PM, Jean-Daniel Cryans <jdcryans@apache.org
>> >wrote:
>>
>> > Hi Wayne,
>> >
>> > This question has such a large scope but is applicable to such a tiny
>> > subset of workloads (eg yours) that fielding all the questions in
>> > details would probably end up just wasting everyone's cycles. So first
>> > I'd like to clear up some confusion.
>> >
>> > > We would like some help with cluster sizing estimates. We have 15TB of
>> > > currently relational data we want to store in hbase.
>> >
>> > There's the 3x replication factor, but also you have to account that
>> > each value is stored with it's row key, family name, qualifier and
>> > timestamp. That could be a lot more data to store, but at the same
>> > time you can use LZO compression to bring that down ~4x.
>> >
>> > > How many nodes, regions, etc. are we going to need?
>> >
>> > You don't really have the control over regions, they are created for
>> > you as your data grows.
>> >
>> > > What will our read latency be for 30 vs. 100? Sure we can pack 20 nodes
>> > with 3TB
>> > > of data each but will it take 1+s for every get?
>> >
>> > I'm not sure what kind of back-of-the-envelope calculations took you
>> > to 1sec, but latency will be strictly determined by concurrency and
>> > actual machine load. Even if you were able to pack 20TB in one onde
>> > but using a tiny portion of it, you would still get sub 100ms
>> > latencies. Or if you have only 10GB on that node but it's getting
>> > hammered by 10000 clients, then you should expect much higher
>> > latencies.
>> >
>> > > Will compaction run for 3 days?
>> >
>> > Which compactions? Major ones? If you don't insert new data in a
>> > region, it won't be major compacted. Also if you have that much data,
>> > I would set the time between major compactions to be bigger than 1
>> > day. Heck, since you are doing time series, this means you'll never
>> > delete anything right? So you might as well disable them.
>> >
>> > And now for the meaty part...
>> >
>> > The size of your dataset is only one part of the equation, the other
>> > being traffic you would be pushing to the cluster which I think wasn't
>> > covered at all in your email. Like I said previously, you can pack a
>> > lot of data in a single node and can retrieve it really fast as long
>> > as concurrency is low. Another thing is how random your reading
>> > pattern is... can you even leverage the block cache at all? If yes,
>> > then you can accept more concurrency, if not then hitting HDFS is a
>> > lot slower (and it's still not very good at handling many clients).
>> >
>> > Unfortunately, even if you gave us exactly how many QPS you want to do
>> > per second, we'd have a hard time recommending any number of nodes.
>> >
>> > What I would recommend then is to benchmark it. Try to grab 5-6
>> > machines, load a subset of the data, generate traffic, see how it
>> > behaves.
>> >
>> > Hope that helps,
>> >
>> > J-D
>> >
>> > On Fri, Dec 17, 2010 at 9:09 AM, Wayne <wa...@gmail.com> wrote:
>> > > We would like some help with cluster sizing estimates. We have 15TB of
>> > > currently relational data we want to store in hbase. Once that is
>> > replicated
>> > > to a factor of 3 and stored with secondary indexes etc. we assume will
>> > have
>> > > 50TB+ of data. The data is basically data warehouse style time series
>> > data
>> > > where much of it is cold, however want good read latency to get access
>> to
>> > > all of it. Not memory based latency but < 25ms latency for a small
>> chunks
>> > of
>> > > data.
>> > >
>> > > How many nodes, regions, etc. are we going to need? Assuming a typical
>> 6
>> > > disk, 24GB ram, 16 core data node, how many of these do we need to
>> > > sufficiently manage this volume of data? Obviously there are a million
>> > "it
>> > > depends", but the bigger drivers are how much data can a node handle?
>> How
>> > > long will compaction take? How many regions can a node handle and how
>> big
>> > > can those regions get? Can we really have 1.5TB of data on a single
>> node
>> > in
>> > > 6,000 regions? What are the true drivers between more nodes vs. bigger
>> > > nodes? Do we need 30 nodes to handle our 50GB of data or 100 nodes?
>> What
>> > > will our read latency be for 30 vs. 100? Sure we can pack 20 nodes with
>> > 3TB
>> > > of data each but will it take 1+s for every get? Will compaction run
>> for
>> > 3
>> > > days? How much data is really "too much" on an hbase data node?
>> > >
>> > > Any help or advice would be greatly appreciated.
>> > >
>> > > Thanks
>> > >
>> > > Wayne
>> > >
>> >
>>
>