Posted to user@hbase.apache.org by Anthony Urso <an...@cs.ucla.edu> on 2011/10/07 21:43:06 UTC

High throughput input, low latency output?

We have a use case that will require a ten to twenty EC2 node HBase
cluster to take several hundred million rows of input from a larger
number of EMR instances in daily bursts, and then serve those rows via
low latency random reads, say on the order of 300 or so rows per
second. Before we start coding, I thought it best to ask the experts
for their advice.

1) Is this something that HBase will be able to handle gracefully?
2) Does anyone have any pointers on how to tune HBase for performance
and stability under this load?
3) Would HBase perform better under this sort of load on twelve large
EC2 instances, six xlarge or three xxlarge?
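
For concreteness, a rough sketch of the write-then-read pattern we have
in mind (Java client; the table, column family, and row key names below
are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UseCaseSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "records");   // placeholder table name
        byte[] family = Bytes.toBytes("d");           // placeholder column family

        // Daily burst: the EMR jobs would write rows roughly like this
        // (or hand over HFiles for a bulk load).
        Put put = new Put(Bytes.toBytes("record-12345"));
        put.add(family, Bytes.toBytes("payload"), Bytes.toBytes("some-value"));
        table.put(put);

        // Serving: random single-row reads, ~300 rows/sec across the cluster.
        Get get = new Get(Bytes.toBytes("record-12345"));
        get.addColumn(family, Bytes.toBytes("payload"));
        Result result = table.get(get);
        byte[] payload = result.getValue(family, Bytes.toBytes("payload"));

        table.close();
      }
    }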

Thanks,
Anthony

Re: High throughput input, low latency output?

Posted by Stack <st...@duboce.net>.
On Sat, Oct 8, 2011 at 12:18 PM, Anthony Urso <an...@cs.ucla.edu> wrote:
> Is that because of the slow disk I/O?
>

If you are sharing the box, your cotenant could be trashing the i/o on you.

You are for sure sharing a network -- as best I understand AWS -- and
it can be oversubscribed from time to time (look back through this
list for others' input on hbase on ec2 to get the gist of what you are
in for running on ec2).


>> Any chance of caching working?  Are the reads totally random or will
>> there be 'hot' areas?  If so, you might have some hope.
>>
>
> Hopefully.  Do you mean external caching like memcache or OS-level disk caching?
>

I was talking more about the hbase block cache; if you are reading the
same values over and over then it will have an effect; reads served
from the cache will be low latency.
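
For illustration only, a minimal sketch (0.90-era Java client; table and
column names are made up) of the kind of repeated reads the block cache
helps with:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HotReadSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "records");   // placeholder table name
        byte[] family = Bytes.toBytes("d");           // placeholder column family
        // The first get for a row pulls its HFile block off HDFS into the
        // regionserver block cache; later gets for rows in that same block
        // are served from memory.  How much heap the cache gets is set
        // server-side via hfile.block.cache.size (a fraction of the heap).
        for (int i = 0; i < 1000; i++) {
          Get get = new Get(Bytes.toBytes("hot-row-" + (i % 10)));  // small 'hot' key set
          Result result = table.get(get);
          byte[] value = result.getValue(family, Bytes.toBytes("payload"));
        }
        table.close();
      }
    }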

St.Ack

Re: High throughput input, low latency output?

Posted by Anthony Urso <an...@cs.ucla.edu>.
On Fri, Oct 7, 2011 at 8:58 PM, Stack <st...@duboce.net> wrote:
> On Fri, Oct 7, 2011 at 12:43 PM, Anthony Urso <an...@cs.ucla.edu> wrote:
>> We have a use case that will require a ten to twenty EC2 node HBase
>> cluster to take several hundred million rows of input from a larger
>> number of EMR instances in daily bursts, and then serve those rows via
>> low latency random reads, say on the order of 300 or so rows per
>> second. Before we start coding, I thought it best to ask the experts
>> for their advice.
>>
>> 1) Is this something that HBase will be able to handle gracefully?
>
> You might have some chance if you were not on EC2.
>

Is that because of the slow disk I/O?

> Any chance of caching working?  Are the reads totally random or will
> there be 'hot' areas?  If so, you might have some hope.
>

Hopefully.  Do you mean external caching like memcache or OS-level disk caching?

>
>> 2) Does anyone have any pointers on how to tune HBase for performance
>> and stability under this load?
>
> See the performance section of the book up on hbase.org (though there
> should probably be EC2 caveats...)

TY.

>
>> 3) Would HBase perform better under this sort of load on twelve large
>> EC2 instances, six xlarge or three xxlarge?
>>
>
> The more nodes the better.  And if those nodes are not virtualized,
> better still.  But then there is the network and if it's saturated....
>
>
> Can you run some tests before you start coding?

Good idea.

> St.Ack
>

Re: High throughput input, low latency output?

Posted by Stack <st...@duboce.net>.
On Fri, Oct 7, 2011 at 12:43 PM, Anthony Urso <an...@cs.ucla.edu> wrote:
> We have a use case that will require a ten to twenty EC2 node HBase
> cluster to take several hundred million rows of input from a larger
> number of EMR instances in daily bursts, and then serve those rows via
> low latency random reads, say on the order of 300 or so rows per
> second. Before we start coding, I thought it best to ask the experts
> for their advice.
>
> 1) Is this something that HBase will be able to handle gracefully?

You might have some chance if you were not on EC2.

Any chance of caching working?  Are the reads totally random or will
there be 'hot' areas?  If so, you might have some hope.


> 2) Does anyone have any pointers on how to tune HBase for performance
> and stability under this load?

See the performance section of the book up on hbase.org (though there
should probably be EC2 caveats...)
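
One example of the kind of client-side knob covered there: buffering puts
on the ingest side so the daily burst goes over the wire in batches rather
than one RPC per row.  A sketch (the table name and buffer size below are
only placeholders, not recommendations):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IngestSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "records");   // placeholder table name
        table.setAutoFlush(false);                    // buffer puts client-side
        table.setWriteBufferSize(8 * 1024 * 1024);    // flush in ~8MB batches
        byte[] family = Bytes.toBytes("d");           // placeholder column family
        for (long i = 0; i < 1000000; i++) {
          Put put = new Put(Bytes.toBytes("record-" + i));
          put.add(family, Bytes.toBytes("payload"), Bytes.toBytes("value-" + i));
          table.put(put);
        }
        table.flushCommits();                         // push any remaining buffered puts
        table.close();
      }
    }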

> 3) Would HBase perform better under this sort of load on twelve large
> EC2 instances, six xlarge or three xxlarge?
>

The more nodes the better.  And if those nodes are not virtualized,
better still.  But then there is the network and if it's saturated....


Can you run some tests before you start coding?
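
For what it's worth, HBase ships an org.apache.hadoop.hbase.PerformanceEvaluation
tool for exactly this kind of test.  A crude single-client probe of random-read
latency might also look like the sketch below; the table name and row-key format
are assumptions, and the table would need to be loaded beforehand.

    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RandomReadProbe {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "records");   // assumes a pre-loaded test table
        Random random = new Random();
        int reads = 10000;
        long start = System.currentTimeMillis();
        for (int i = 0; i < reads; i++) {
          // assumes row keys record-0 .. record-999999 were loaded beforehand
          table.get(new Get(Bytes.toBytes("record-" + random.nextInt(1000000))));
        }
        long elapsedMs = System.currentTimeMillis() - start;
        System.out.println(reads + " random gets in " + elapsedMs + " ms ("
            + (reads * 1000L / Math.max(elapsedMs, 1)) + " reads/sec)");
        table.close();
      }
    }
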
St.Ack

Re: High throughput input, low latency output?

Posted by Matt Corgan <mc...@hotpads.com>.
We found that 2 cores are not enough to run hbase.  One core can easily get
tied up with a compaction while the other is doing garbage collection.  That
doesn't leave any headroom for gets/scans, especially on compressed data
and/or when several of them are happening at the same time.  Try to do all
of that at once and some of the other background tasks, like memstore
flushes, start choking.

We run c1.xlarge instances (8 cores, 8gb mem) and everything works well,
though there is not much room left for the block cache.

Matt

On Fri, Oct 7, 2011 at 12:43 PM, Anthony Urso <an...@cs.ucla.edu> wrote:

> We have a use case that will require a ten to twenty EC2 node HBase
> cluster to take several hundred million rows of input from a larger
> number of EMR instances in daily bursts, and then serve those rows via
> low latency random reads, say on the order of 300 or so rows per
> second. Before we start coding, I thought it best to ask the experts
> for their advice.
>
> 1) Is this something that HBase will be able to handle gracefully?
> 2) Does anyone have any pointers on how to tune HBase for performance
> and stability under this load?
> 3) Would HBase perform better under this sort of load on twelve large
> EC2 instances, six xlarge or three xxlarge?
>
> Thanks,
> Anthony
>