Posted to user@hbase.apache.org by Lin Ma <li...@gmail.com> on 2012/08/23 08:30:50 UTC

client cache for all region server information?

Hello HBase masters,

I am wondering whether, in the current implementation, each HBase client
caches all region server information, for example where each region server
is (the physical machine hosting it) and the row-key range each region
server manages. If so, two more questions:

- Will this put too much overhead (e.g. memory footprint) on each client?
- When is this information downloaded and cached on the client side, and
when is it refreshed? Is a refresh triggered only by a region server
change, i.e. a failed lookup from the client? For example, the client uses
its cache to ask machine A for region B, finds nothing there, and so has to
refresh the cache to see which machine now owns region B.

regards,
Lin

Re: client cache for all region server information?

Posted by Lin Ma <li...@gmail.com>.

Thanks for the detailed reply, Harsh.

Some further comments / thoughts,

1. For the Scan function used in a mapper/reducer, supposing we configure a
caching size of 500, I am not sure whether the 500 items returned in one
batch call must come from one region server, or whether they could come
from multiple region servers. If the latter, is the underlying
implementation smart enough to make one RPC call per region server?
2. For the batch-call API you referred to below, I am not sure whether one
call may span multiple region servers.
3. Supposing we call the batch-call API, what are the performance benefits?
I am not sure whether the underlying implementation is simple (just
wrapping individual calls one by one) or something fancier (one RPC call
per region server accessed).
4. One minor comment about the wording on this page =>
http://hbase.apache.org/book.html#perf.hbase.client.caching: I think the
Scan setting means retrieving items in batches; it does not mean caching.
Using the word caching is confusing, since caching makes me think of
reusing something next time without fetching it again, but actually the
next time we Scan, the region server still needs to fetch the 500 items
again. Please feel free to correct me if I am wrong.
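
Questions 2 and 3 are about whether a list of Gets that spans several
region servers can be sent as one request per server. The sketch below is
a toy model of that grouping step only, not HBase client code: the region
layout, the server names and the helper functions are all made up for
illustration, and it makes no claim about how the real client is written.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical region layout: (start_key, server). Each region covers the
# keys from its start key up to the next region's start key.
REGIONS = [("", "rs1"), ("g", "rs2"), ("p", "rs1"), ("t", "rs3")]
START_KEYS = [start for start, _ in REGIONS]


def server_for(row_key):
    """Return the server hosting the region whose range contains row_key."""
    idx = bisect_right(START_KEYS, row_key) - 1
    return REGIONS[idx][1]


def group_by_server(row_keys):
    """Bucket row keys per server, so each server needs only one request."""
    groups = defaultdict(list)
    for key in row_keys:
        groups[server_for(key)].append(key)
    return dict(groups)


batch = ["apple", "kiwi", "zebra", "grape", "quince", "banana"]
groups = group_by_server(batch)
for server, keys in sorted(groups.items()):
    print(server, keys)
```

In this model the six row keys end up in three buckets, so a client that
groups this way would make three round trips instead of six.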

regards,
Lin

On Tue, Aug 28, 2012 at 2:41 PM, Harsh J <ha...@cloudera.com> wrote:

> Lin,
>
> On Tue, Aug 28, 2012 at 9:09 AM, Lin Ma <li...@gmail.com> wrote:
> > Thanks Harsh,
> >
> > A few more comments / thoughts,
> >
> > 1. For the mapper: a mapper normally runs on the same region server that
> > owns the row-key range of its input, for locality reasons (I am not 100%
> > confident that a mapper always runs on the same region server, please
> > feel free to correct me if I am wrong). So it is already local I/O; is
> > there a big benefit in returning 500 rows at a time? Could you show me an
> > example where there is a big benefit?
>
> The data locality of MR tasks when speaking with HBase services is
> slightly different than if it were speaking with HDFS services.
>
> We do try to get MR to schedule RS-local maps to avoid extra network
> transfer of rows, but the TableInputFormat still reads from the
> RegionServer and not the HDFS underneath it (i.e. the RegionServer is
> the one that reads HDFS blocks, not the client task). So a 500-caching
> per RPC next() call still makes sense, since the client isn't exactly
> reading the data directly from the local files or the memstore; it is
> rather asking the RS to read it on its behalf, and it requests data
> from the RS either row by row or in batches.
>
> > 2. For the reducer: could we also use a Scan object, and does it work
> > the same way as in the mapper? I am confused since a reducer normally
> > writes to HBase; could you show me an example where we need to read HBase
> > in a reducer using Scan?
>
> I think (1) should explain this?
>
> > 3. What does RS mean in your reply?
>
> Sorry, by RS I meant the RegionServer.
>
> > 4. For a non-MapReduce job (e.g. when using the HBase GET API directly),
> > is there any similar batch function, provided by HBase or by a 3rd party?
>
> The batching is for the scanner function of HBase's client APIs. For
> Gets or Multi-Gets, since the results may span several different
> regions (and hence different region servers), caching in this manner
> isn't exactly possible, hence there is just a batch-call API for a
> list of requests, as seen at
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
>

Re: client cache for all region server information?

Posted by Lin Ma <li...@gmail.com>.

Thanks Harsh,

A few more comments / thoughts,

1. For the mapper: a mapper normally runs on the same region server that
owns the row-key range of its input, for locality reasons (I am not 100%
confident that a mapper always runs on the same region server, please feel
free to correct me if I am wrong). So it is already local I/O; is there a
big benefit in returning 500 rows at a time? Could you show me an example
where there is a big benefit?
2. For the reducer: could we also use a Scan object, and does it work the
same way as in the mapper? I am confused since a reducer normally writes to
HBase; could you show me an example where we need to read HBase in a
reducer using Scan?
3. What does RS mean in your reply?
4. For a non-MapReduce job (e.g. when using the HBase GET API directly), is
there any similar batch function, provided by HBase or by a 3rd party?

regards,
Lin

On Mon, Aug 27, 2012 at 11:55 PM, Harsh J <ha...@cloudera.com> wrote:

> Not necessarily consecutive, unless the request itself is so. It only
> returns 500 rows that match the user's request.
>
> User's request of a specific row-range and filters are usually
> embedded into the Scan object, sent to the RS. Whatever is accumulated
> as the result of the Scan operation (server-side) is accumulated in
> sizes of 500 rows and returned in one Scanner.next() call from the
> client.
>
> Does this clear it up Lin?
>

Re: client cache for all region server information?

Posted by Harsh J <ha...@cloudera.com>.

Not necessarily consecutive, unless the request itself is so. It only
returns 500 rows that match the user's request.

The user's requested row-range and filters are usually embedded in the
Scan object that is sent to the RS. Whatever the Scan operation matches
(server-side) is accumulated in batches of 500 rows, and each batch is
returned in one Scanner.next() call from the client.
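
To make the effect of the caching value concrete, here is a toy model of
a client-side scanner that buffers up to `caching` matching rows per
simulated round trip. It is only a sketch: `ToyScanner`, `wanted` and
`run` are hypothetical names, the "table" is a list of integers, and no
real HBase API is used (in the real client the corresponding knob is
Scan.setCaching).

```python
from collections import deque


class ToyScanner:
    """Toy scanner: buffers up to `caching` rows per simulated RPC."""

    def __init__(self, table, matches, caching):
        # "Server side": only rows satisfying the scan's range/filter qualify.
        self._matching = [row for row in table if matches(row)]
        self._caching = caching
        self._pos = 0
        self._buffer = deque()
        self.rpc_count = 0

    def _fetch(self):
        # One simulated round trip returns at most `caching` matching rows.
        self.rpc_count += 1
        chunk = self._matching[self._pos:self._pos + self._caching]
        self._pos += len(chunk)
        self._buffer.extend(chunk)

    def __iter__(self):
        return self

    def __next__(self):
        if not self._buffer:
            self._fetch()
            if not self._buffer:
                raise StopIteration
        return self._buffer.popleft()


def wanted(key):
    """Assumed scan condition: even row keys in [100, 1100), i.e. 500 rows."""
    return 100 <= key < 1100 and key % 2 == 0


def run(caching):
    """Scan a 2000-row toy table and report (rows, simulated RPC count)."""
    scanner = ToyScanner(list(range(2000)), wanted, caching)
    rows = list(scanner)
    return rows, scanner.rpc_count


for caching in (1, 500):
    rows, rpcs = run(caching)
    print(f"caching={caching}: {len(rows)} rows in {rpcs} round trips")
```

Both runs return the same 500 rows, and the keys are not consecutive
(100, 102, 104, ...) because only matching rows are counted toward a
batch. In this model the only difference is the number of round trips,
where the last round trip in each run is the empty fetch that ends the
scan.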

Does this clear it up Lin?

On Mon, Aug 27, 2012 at 8:40 PM, Lin Ma <li...@gmail.com> wrote:
> Hi Harsh,
>
> I read through the document you referred, for the below comment, I am
> confused. Major confusion is, does it mean HBase will transfer consecutive
> 500 rows to client (supposing client mapper want row with row-key 100, Hbase
> will return row-key from 100 to 600 at one time to client, similar to batch
> read?), how to ensure such 500 rows are all desired input for client mapper
> job (e.g. how do HBase know client mapper job wants row-key from 101 to
> 600)?
>
> "Using the default value means that the map-task will make call back to the
> region-server for every record processed. Setting this value to 500, for
> example, will transfer 500 rows at a time to the client to be processed."
>
> regards,
> Lin
>
>



-- 
Harsh J

Re: client cache for all region server information?

Posted by Lin Ma <li...@gmail.com>.

Hi Harsh,

I read through the document you referred to, and I am confused by the
comment quoted below. My main confusion is: does it mean HBase will
transfer 500 consecutive rows to the client (supposing the client mapper
wants the row with row-key 100, HBase will return row-keys 100 to 600 to
the client at one time, similar to a batch read)? And how does it ensure
that all 500 rows are desired input for the client mapper job (e.g. how
does HBase know the client mapper job wants row-keys 101 to 600)?

*"Using the default value means that the map-task will make call back to
the region-server for every record processed. Setting this value to 500,
for example, will transfer 500 rows at a time to the client to be
processed."*

regards,
Lin

On Thu, Aug 23, 2012 at 11:37 PM, Harsh J <ha...@cloudera.com> wrote:

> Hi Lin,
>
> On Thu, Aug 23, 2012 at 7:56 PM, Lin Ma <li...@gmail.com> wrote:
> > Harsh, thanks for the detailed information.
> >
> > Two more comments,
> >
> > 1. I want to confirm my understanding is correct. At the beginning client
> > cache has nothing, when it issue request for a table, if the region
> server
> > location is not known, it will request from root META region to get
> region
> > server information step by step, then cache the region server
> information.
> > If cache already contain the requested region information, it will use
> > directly from cache. In this way, cache grows when cache miss for
> requested
> > region information;
>
> You have it correct now. Region locations are cached only if they are
> not available. And they are cached on need-basis, not all at once.
>
> > 2. "far outweighs the other items it caches (scan results, etc.)", you
> mean
> > GET API of HBase cache results? Sorry I am not aware of this feature
> before.
> > How the results are cached, and whether we can control it (supposing a
> > client is doing random read pattern, we do not want to cache information
> > since each read may be unique row-key access)? Appreciate if you could
> point
> > me to some more detailed information.
>
> Am speaking of Scanner value caching, not Gets exactly. See more about
> Scanner (client) caching at
> http://hbase.apache.org/book.html#perf.hbase.client.caching
>

Re: client cache for all region server information?

Posted by Harsh J <ha...@cloudera.com>.

Hi Lin,

On Thu, Aug 23, 2012 at 7:56 PM, Lin Ma <li...@gmail.com> wrote:
> Harsh, thanks for the detailed information.
>
> Two more comments,
>
> 1. I want to confirm my understanding is correct. At the beginning client
> cache has nothing, when it issue request for a table, if the region server
> location is not known, it will request from root META region to get region
> server information step by step, then cache the region server information.
> If cache already contain the requested region information, it will use
> directly from cache. In this way, cache grows when cache miss for requested
> region information;

You have it correct now. Region locations are looked up and cached only
when they are not already in the cache. And they are cached on a need
basis, not all at once.
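
The lazy behaviour described above can be sketched with a small toy
model. Everything in it is an assumption made for illustration: the
three-region layout standing in for META, the names `ToyLocationCache`,
`meta_lookup`, `locate` and `invalidate`, and the idea that the caller
invalidates an entry after finding it stale. It does not reproduce the
real client's data structures.

```python
from bisect import bisect_right, insort


def meta_lookup(meta, row_key):
    """Authoritative (slow) lookup: scan the META stand-in for row_key."""
    for start, end, server in meta:
        if start <= row_key and (end is None or row_key < end):
            return start, end, server
    raise KeyError(row_key)


class ToyLocationCache:
    """Caches region locations one region at a time, only on a miss."""

    def __init__(self, meta):
        self.meta = meta
        self.starts = []     # sorted start keys of the cached regions
        self.regions = {}    # start key -> (end key, server)
        self.meta_lookups = 0

    def _cached_start(self, row_key):
        # Find the cached region with the greatest start key <= row_key,
        # then check that row_key really falls before that region's end.
        idx = bisect_right(self.starts, row_key) - 1
        if idx < 0:
            return None
        start = self.starts[idx]
        end, _ = self.regions[start]
        return start if end is None or row_key < end else None

    def locate(self, row_key):
        start = self._cached_start(row_key)
        if start is None:
            # Cache miss: ask META and remember just this one region.
            self.meta_lookups += 1
            start, end, server = meta_lookup(self.meta, row_key)
            insort(self.starts, start)
            self.regions[start] = (end, server)
        return self.regions[start][1]

    def invalidate(self, row_key):
        """Forget the cached region covering row_key, if there is one."""
        start = self._cached_start(row_key)
        if start is not None:
            self.starts.remove(start)
            del self.regions[start]


# Hypothetical layout: (start_key, end_key, server); None means "no end".
meta = [("", "g", "rs1"), ("g", "p", "rs2"), ("p", None, "rs3")]
cache = ToyLocationCache(meta)
print(cache.locate("apple"), cache.locate("banana"), cache.locate("kiwi"))
print("META lookups:", cache.meta_lookups, "cached regions:", len(cache.regions))
```

In this run only two of the three regions are ever cached, and the second
lookup in the first region costs no META round trip. If a region moves,
the cached entry stays stale until the caller invalidates it, after which
the next locate() goes back to META.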

> 2. "far outweighs the other items it caches (scan results, etc.)", you mean
> GET API of HBase cache results? Sorry I am not aware of this feature before.
> How the results are cached, and whether we can control it (supposing a
> client is doing random read pattern, we do not want to cache information
> since each read may be unique row-key access)? Appreciate if you could point
> me to some more detailed information.

Am speaking of Scanner value caching, not Gets exactly. See more about
Scanner (client) caching at
http://hbase.apache.org/book.html#perf.hbase.client.caching

> regards,
> Lin
>
>



-- 
Harsh J

Re: client cache for all region server information?

Posted by Lin Ma <li...@gmail.com>.

Harsh, thanks for the detailed information.

Two more comments,

1. I want to confirm my understanding is correct. At the beginning the
client cache has nothing. When the client issues a request for a table and
the region server location is not known, it queries the root META region,
step by step, to get the region server information, then caches it. If the
cache already contains the requested region information, the client uses
it directly from the cache. In this way, the cache grows only on a cache
miss for the requested region information.
2. By "far outweighs the other items it caches (scan results, etc.)", do
you mean the GET API of HBase caches results? Sorry, I was not aware of
this feature before. How are the results cached, and can we control it
(supposing a client has a random read pattern, we do not want to cache
anything, since each read may access a unique row-key)? I would appreciate
it if you could point me to some more detailed information.

regards,
Lin

On Thu, Aug 23, 2012 at 9:35 PM, Harsh J <ha...@cloudera.com> wrote:

> Hi Lin,
>
> On Thu, Aug 23, 2012 at 4:31 PM, Lin Ma <li...@gmail.com> wrote:
> > Thank you Abhishek,
> >
> > Two more comments,
> >
> > -- "Client only caches information as needed for its queries and not
> > necessarily for 'all' region servers." -- how did client know which
> region
> > server information is necessary to be cached in current HBase
> > implementation?
>
> What Abhishek meant here is that it caches only the needed table's
> rows from META. It also only caches the specific region required for
> the row you're looking up/operating on, AFAICT.
>
> > -- When the client loads region server information for the first time?
> Did
> > client persistent cache information at client side about region server
> > information?
>
> The client loads up regionserver information for a table, when it is
> requested to perform an operation on that table (on a specific row or
> the whole). It does not immediately, upon initialization, cache the
> whole of META's contents.
>
> Your question makes sense though, that it does seem to be such that a
> client *may* use quite a bit of memory space in trying to cache the
> META entries locally, but practically we've not had this cause issues
> for users yet. The amount of memory cached for META far outweighs the
> other items it caches (scan results, etc.). At least I have not seen
> any reports of excessive client memory usage just due to region
> locations of tables being cached.
>
> I think there's more benefits storing/caching it than not doing so,
> and so far we've not needed the extra complexity of persisting the
> cache to a local or non-RAM storage than keeping it in memory.
>
> --
> Harsh J
>

Re: client cache for all region server information?

Posted by Harsh J <ha...@cloudera.com>.

Hi Lin,

On Thu, Aug 23, 2012 at 4:31 PM, Lin Ma <li...@gmail.com> wrote:
> Thank you Abhishek,
>
> Two more comments,
>
> -- "Client only caches information as needed for its queries and not
> necessarily for 'all' region servers." -- how did client know which region
> server information is necessary to be cached in current HBase
> implementation?

What Abhishek meant here is that the client caches only the needed
table's rows from META. It also caches only the specific region required
for the row you're looking up or operating on, AFAICT.

> -- When the client loads region server information for the first time? Did
> client persistent cache information at client side about region server
> information?

The client loads regionserver information for a table when it is asked
to perform an operation on that table (on a specific row or on the whole
table). It does not cache the whole of META's contents immediately upon
initialization.
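
To make the on-demand behaviour concrete, here is a minimal sketch in
Java. It is not the HBase client's actual code: the class LazyRegionCache,
the Region holder, and the metaLookup function are all hypothetical
stand-ins, and row keys are plain Strings for brevity. The idea it shows is
only that the cache starts empty and gains one entry per region the client
actually touches.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical sketch of an on-demand region location cache; not HBase's real client classes. */
class LazyRegionCache {

    /** One cached location: rows in [startKey, endKey) live on server. Empty endKey means no upper bound. */
    static final class Region {
        final String startKey;
        final String endKey;
        final String server;

        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }

        boolean contains(String row) {
            return startKey.compareTo(row) <= 0
                && (endKey.isEmpty() || row.compareTo(endKey) < 0);
        }
    }

    // Cached regions for a single table, sorted by start key.
    private final TreeMap<String, Region> byStartKey = new TreeMap<>();
    // Stand-in for a META read (assumption: given a row, it returns the region holding that row).
    private final Function<String, Region> metaLookup;
    private int metaLookups = 0;

    LazyRegionCache(Function<String, Region> metaLookup) {
        this.metaLookup = metaLookup;
    }

    /** Returns the region for a row, consulting the META stand-in only on a cache miss. */
    Region locate(String row) {
        // The candidate is the cached region with the greatest start key <= row.
        Map.Entry<String, Region> candidate = byStartKey.floorEntry(row);
        if (candidate != null && candidate.getValue().contains(row)) {
            return candidate.getValue();
        }
        metaLookups++;
        Region fetched = metaLookup.apply(row);
        byStartKey.put(fetched.startKey, fetched);
        return fetched;
    }

    int size() { return byStartKey.size(); }

    int metaLookups() { return metaLookups; }
}
```

With a two-region table split at "m", looking up "apple" and then "banana"
costs one lookup against the META stand-in, and the second region is only
cached once a row such as "zebra" is requested.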

Your question does make sense: a client *may* use quite a bit of memory
caching META entries locally, but in practice this has not caused
issues for users yet. The memory used for cached META entries is far
outweighed by the other items the client caches (scan results, etc.).
At least, I have not seen any reports of excessive client memory usage
due solely to cached region locations.

I think there are more benefits to caching it than not, and so far
we've not needed the extra complexity of persisting the cache to local
or other non-RAM storage rather than just keeping it in memory.

-- 
Harsh J

Re: client cache for all region server information?

Posted by Lin Ma <li...@gmail.com>.

Thank you Abhishek,

Two more comments,

-- "Client only caches information as needed for its queries and not
necessarily for 'all' region servers." -- how does the client know which
region server information needs to be cached in the current HBase
implementation?

-- When does the client load region server information for the first time?
Does the client persist the cached region server information on the client
side?

regards,
Lin

On Thu, Aug 23, 2012 at 2:47 PM, Pamecha, Abhishek <ap...@x.com> wrote:

> I think for the refresh case, the client first uses the older region
> server from its cache. It connects to that region server, which responds
> with a failure code. The client then talks to ZooKeeper and then to the
> META region server to find the new region server for that key, and
> reissues the original request to the new region server.
>
> Btw, the client only caches information as needed for its queries and not
> necessarily for 'all' region servers.
>
> Abhishek
>
>
> On Aug 22, 2012, at 23:31, "Lin Ma" <li...@gmail.com> wrote:
>
> > Hello HBase masters,
> >
> > I am wondering whether in current implementation, each client of HBase
> > cache all information of region server, for example, where is region
> > server (physical hosting machine of region server), and also cache
> > row-key range managed by the region server. If so, two more questions,
> >
> > - will there be too much overhead (e.g. memory footprint) of each client?
> > - when such information is downloaded and cached at client side, and when
> > the information is refreshed (does it only triggered by region server
> > change and failure to fetch such information from client -- e.g. when
> > client use cache to access machine A for region B, but find nothing, so
> > the client needs to refresh the information in cache to see which machine
> > owns region B)?
> >
> > regards,
> > Lin
>

Re: client cache for all region server information?

Posted by "Pamecha, Abhishek" <ap...@x.com>.

I think for the refresh case, the client first uses the older region server from its cache. It connects to that region server, which responds with a failure code. The client then talks to ZooKeeper and then to the META region server to find the new region server for that key, and reissues the original request to the new region server.

Btw, the client only caches information as needed for its queries and not necessarily for 'all' region servers.
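
The refresh flow above can be sketched roughly as follows. This is an illustration only, not the real HBase client: StaleLocationRetry, RegionMovedException, MetaLookup and RegionCall are hypothetical names, and the ZooKeeper and META steps are collapsed into a single lookup callback.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of "use cached location, invalidate on failure, look up again, retry". */
class StaleLocationRetry {

    /** Stand-in for the failure a server returns when it no longer hosts the region. */
    static class RegionMovedException extends RuntimeException {}

    /** Stand-in for the ZooKeeper + META lookup: returns the server currently hosting a region. */
    interface MetaLookup { String findServer(String region); }

    /** Stand-in for the actual request sent to a region server. */
    interface RegionCall { String send(String server, String region); }

    private final Map<String, String> cachedServerByRegion = new HashMap<>();
    private final MetaLookup meta;

    StaleLocationRetry(MetaLookup meta) {
        this.meta = meta;
    }

    String call(String region, RegionCall rpc, int maxAttempts) {
        RegionMovedException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Use the cached location if present; otherwise look it up and cache it.
            String server = cachedServerByRegion.computeIfAbsent(region, meta::findServer);
            try {
                return rpc.send(server, region);
            } catch (RegionMovedException e) {
                // The cached entry was stale: drop it so the next attempt looks it up again.
                cachedServerByRegion.remove(region);
                last = e;
            }
        }
        throw new IllegalStateException("request failed after " + maxAttempts + " attempts", last);
    }
}
```

Note that nothing refreshes the cache proactively in this sketch; a stale entry is discovered only when a request against it fails, which matches the flow described above.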

Abhishek


On Aug 22, 2012, at 23:31, "Lin Ma" <li...@gmail.com> wrote:

> Hello HBase masters,
> 
> I am wondering whether in current implementation, each client of HBase
> cache all information of region server, for example, where is region server
> (physical hosting machine of region server), and also cache row-key range
> managed by the region server. If so, two more questions,
> 
> - will there be too much overhead (e.g. memory footprint) of each client?
> - when such information is downloaded and cached at client side, and when
> the information is refreshed (does it only triggered by region server
> change and failure to fetch such information from client -- e.g. when
> client use cache to access machine A for region B, but find nothing, so the
> client needs to refresh the information in cache to see which machine owns
> region B)?
> 
> regards,
> Lin