Posted to user@hbase.apache.org by Pradeep Gollakota <pr...@gmail.com> on 2013/12/11 02:19:42 UTC

Client API best practices for my use case

Hi All,

I'm trying to understand how different configurations will affect
performance for my use case. My table has the following schema: I'm
storing event logs in a single column family, and the row key is in the
format [company][timestamp][uuid].

My access pattern is fairly simple: every X, retrieve the last X's worth
of events. The X is typically small... e.g. every minute give me the last
minute of events, or every hour give me the last hour of events.
Occasionally, I might request historical data, e.g. give me all events
from August 2012. I need the queries requesting the most recent data to
be really fast and am OK with the historical queries being slow.
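To make the "last X's worth" pattern concrete, here is a minimal sketch of how the scan range could be computed from that row-key layout. It assumes, purely for illustration, a fixed-width 8-byte company id followed by an 8-byte big-endian timestamp (the thread doesn't specify the actual widths):

```java
import java.nio.ByteBuffer;

public class RowKeyRange {
    // Illustrative key prefix: [company: 8 bytes][timestamp: 8 bytes].
    // The real layout in the thread is [company][timestamp][uuid]; the uuid
    // suffix is omitted here because range bounds don't need it.
    static byte[] key(long companyId, long timestampMs) {
        return ByteBuffer.allocate(16).putLong(companyId).putLong(timestampMs).array();
    }

    // Build [startRow, stopRow) covering the last `windowMs` milliseconds
    // of events for one company. Big-endian longs sort correctly under
    // HBase's lexicographic byte comparison (for non-negative values).
    static byte[][] lastWindow(long companyId, long nowMs, long windowMs) {
        byte[] start = key(companyId, nowMs - windowMs);
        byte[] stop  = key(companyId, nowMs + 1); // stop row is exclusive
        return new byte[][] { start, stop };
    }

    public static void main(String[] args) {
        byte[][] range = lastWindow(42L, System.currentTimeMillis(), 60_000L);
        System.out.println(range[0].length + " " + range[1].length); // prints "16 16"
    }
}
```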

The configuration options I'm interested in are scanner caching and
block-cache usage. I noticed in the Java API for creating column families
that there is an option called "setCacheDataOnWrite". What does this do
exactly? It's also recommended that for sequential scans, the block cache
be disabled. How does scanner caching work? Is it per Scan, or is it a
shared cache? Does scanner caching use the same cache as the block cache?
If I have multiple Scans with caching enabled AND it's a shared cache, how
does eviction work? Ideally, I always want the most recently written data
to be in the cache with as few cache evictions as possible.

For my use case, if I want the best performance to be on the most recent
events, what configuration of block cache and scanner caching should I use?

Thanks in advance.
- Pradeep

RE: Client API best practices for my use case

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
1. Scanner caching has nothing to do with caching. It sets the number of rows the server transfers to the client per RPC call.
   Usually it helps to improve scanner performance. It used to be 1 by default, therefore one must set it to something larger than 1
to achieve good scanner performance. This setting is per Scan.
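A sketch of what point 1 looks like in client code (HBase client API of that era; requires hbase-client on the classpath, and the helper name is made up):

```java
import org.apache.hadoop.hbase.client.Scan;

public class ScannerCachingExample {
    // Hypothetical helper: build a Scan for a small "recent events" window.
    public static Scan recentScan(byte[] startRow, byte[] stopRow) {
        Scan scan = new Scan(startRow, stopRow);
        // Per-Scan setting: transfer up to 500 rows per RPC instead of the
        // old default of 1 — fewer round trips, at the cost of a bit more
        // memory on client and server.
        scan.setCaching(500);
        return scan;
    }
}
```

For a one-minute window of events, 500 is only a starting point; the right value depends on row size and how many rows a window typically holds.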

2. "setCacheDataOnWrite" is per table:column family. It tells HBase to cache data on write (surprisingly :)), but keep in mind that the HBase block cache
has its own caching/eviction optimization and won't allow you to fill the cache up with data that is being written - the priority is always to cache data
that is being read. Therefore, to guarantee that you have cached everything you need, you should periodically read the data (unless your data is less than 25% of the cache
space).
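For reference, this is roughly where the flag lives in the API of the time (the column-family name is hypothetical):

```java
import org.apache.hadoop.hbase.HColumnDescriptor;

public class CacheOnWriteExample {
    public static HColumnDescriptor eventsFamily() {
        HColumnDescriptor family = new HColumnDescriptor("e"); // hypothetical CF name
        // Ask HBase to put newly written blocks into the block cache,
        // subject to the eviction priorities described in point 4 below.
        family.setCacheDataOnWrite(true);
        return family;
    }
}
```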

3. The recommendation to disable the block cache for large scan operations is to avoid cache thrashing - that is normal cache optimization.
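This toggle is also per Scan; a sketch for the occasional historical query (helper name is made up):

```java
import org.apache.hadoop.hbase.client.Scan;

public class HistoricalScanExample {
    // Hypothetical helper: a full-history scan, e.g. "all events from
    // August 2012", that shouldn't disturb the cache of recent data.
    public static Scan historicalScan(byte[] startRow, byte[] stopRow) {
        Scan scan = new Scan(startRow, stopRow);
        // Read-once blocks from a large sequential scan would evict hot
        // recent blocks; skip caching them to avoid that thrashing.
        scan.setCacheBlocks(false);
        return scan;
    }
}
```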

4. Eviction is LRU (least recently used). There are 3 buckets in the block cache: 25% young gen, 50% tenured gen, 25% permanent gen.

When a block is cached for the first time, it goes into the young gen. It gets promoted to the tenured gen on a READ from the young gen. The permanent gen is used to cache
CFs which have IN_MEMORY = true.

Eviction is independent in all 3 buckets: when the cache is full, eviction starts in the young gen, then continues in the tenured gen, and, if there is still not enough free space,
in the permanent gen.
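The IN_MEMORY flag mentioned above is likewise set on the column-family descriptor; a minimal sketch (CF name is hypothetical):

```java
import org.apache.hadoop.hbase.HColumnDescriptor;

public class InMemoryFamilyExample {
    public static HColumnDescriptor hotFamily() {
        HColumnDescriptor family = new HColumnDescriptor("e"); // hypothetical CF name
        // IN_MEMORY = true: blocks from this family land in the block
        // cache's "permanent" bucket, which is evicted last.
        family.setInMemory(true);
        return family;
    }
}
```

Note this only raises cache priority; it does not pin the whole family in memory.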

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

