Posted to user@hbase.apache.org by Akshat Mahajan <am...@brightedge.com> on 2017/02/03 20:27:17 UTC

Hbase Architecture Questions

Hello, all,

We would like advice on our current Hbase configuration and setup.

It differs from the standard advice because we believe our use case is sufficiently unique to justify it, and our attempts to optimise its performance have not had much success. We want to 1) check if our understanding is correct, and 2) receive feedback on how to improve our read/write performance.

Our use-case is as follows:

a) We use Hbase 1.0.0 to store high-volume data (approximately 0.5 to 1 TB), on which we run map-only Hadoop (CDH 5.5) jobs (no reduce component) that do scanning reads; a rough sketch of such a job follows point d) below. This batch collection and processing runs weekly over a period of two to three days with no pauses. We have two clusters, each set up as 1 master with 3 regionserver nodes, that are independent of each other. They are fairly high-end machines in terms of disk, memory and processors.

b) Every table in our Hbase is associated with a unique collection of items, and all tables share the same two column families. After Hadoop runs, we no longer require the tables, so we delete them. Individual rows are never removed; instead, entire tables are removed at a time.

c) Typically, these tables are written to very quickly (anywhere from 100 to 200 requests per second on each regionserver is normal). They are also deleted very frequently. Reads and writes happen concurrently - it is not unusual for simultaneous high read counts and high write counts to occur. In terms of frequency, about 20 tables are created, populated and deleted within the space of an hour. An individual table may be about 1 to 10 GB in size.

d) Finally, all reads are carried out by multiple Hadoop map jobs through the native Hbase interface. All writes, however, are carried out through Hbase REST by Python scripts which collect our data for us. A read for an individual table never coincides with a write to the same table - reads and writes can both happen, but never on the same table at the same time.
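
For concreteness, here is a rough sketch of the kind of map-only scan job described in a) and d) above, written against the HBase 1.x MapReduce helpers. The table name, mapper body and tuning values are placeholders, not our real code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanOnlyJob {

  // Hypothetical mapper: processes each row; no reducer is configured.
  static class RowMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
        throws IOException, InterruptedException {
      // ... per-row processing goes here ...
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-only-job");
    job.setJarByClass(ScanOnlyJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // batch rows per RPC for scan-heavy jobs
    scan.setCacheBlocks(false);  // avoid churning the block cache from MapReduce scans

    // "weekly_table_1" is a placeholder for one of the short-lived tables.
    TableMapReduceUtil.initTableMapperJob(
        "weekly_table_1", scan, RowMapper.class,
        NullWritable.class, NullWritable.class, job);

    job.setNumReduceTasks(0);    // map-only, as described above
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}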

Our current design decisions are as follows:

a) _We have turned major compaction off_.

We are aware this is against recommended advice. Our reasoning for this is that

1) periods of major compaction degrade both our read and write performance heavily (to the point our schedule is delayed beyond tolerance), and
2) all our tables are temporary - we do not intend to keep them around, and disabling/deleting old tables closes their regions entirely, which should have the same net effect as a major compaction processing tombstone markers on rows. Read performance should then theoretically not be impacted - we expect that the RegionServer will never even consult those regions when doing reads, so storefile buildup overall should not be an issue.

This last point is based on prior experience with Cassandra - we have not been able to find an adequate source for it in the Hbase documentation, so we may be incorrect. Currently, we are proceeding with this understanding.

b) _We have made additional efforts to turn off minor compactions as much as possible_.

In particular, our hbase.hstore.compaction.max.size is set to 500 MB, and our hbase.hstore.compactionThreshold (a store-file count, not a byte size) is set to 1000. We do this in order to keep a minor compaction from being promoted to a major compaction - since we cannot prevent that promotion directly, we have tried to stop minor compactions from running at all. (A sketch of scoping these settings to individual tables follows point d) below.)

c) We have tried to make REST more performant by increasing the number of REST threads to about 9000.

This figure is derived from counting the number of connections on REST during periods of high write load.

d) We have turned on bloom filters, use an LRUBlockCache which caches data only on reads, and have set tcpnodelay to true. These were in place before we turned major compaction off.
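
Regarding point b): the compaction-related settings could also be scoped to the short-lived tables themselves rather than the whole cluster. A sketch, assuming the tables are created through the Java admin API of a recent 1.x client (table and family names are placeholders, and the values are only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTemporaryTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {

      // "weekly_table_1", "cf1" and "cf2" are placeholder names.
      HTableDescriptor table = new HTableDescriptor(TableName.valueOf("weekly_table_1"));
      table.addFamily(new HColumnDescriptor("cf1"));
      table.addFamily(new HColumnDescriptor("cf2"));

      // Scope the compaction tuning to this short-lived table instead of the
      // whole cluster; the values are only illustrative.
      table.setConfiguration("hbase.hstore.compaction.max.size",
          String.valueOf(500L * 1024 * 1024));
      table.setConfiguration("hbase.hstore.compactionThreshold", "1000");

      // Or switch compactions off entirely for this table.
      table.setCompactionEnabled(false);

      admin.createTable(table);
    }
  }
}

Setting these per table would leave the cluster-wide defaults intact for anything else running on the same cluster.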

Our performance observations with these settings:

a) We are seeing an impact on both read and write performance that correlates strongly with store file buildup. Our store files number between 500 and 1500 on each RS - the total size on each RegionServer is on the order of 100 to 200 GB at worst.
b) As the number of connections on Hbase REST rises, write performance is impacted. We originally believed this was due to a high frequency of memstore flushes - but increasing the memstore buffer sizes has had no discernible impact on reads or writes. Currently, our callqueue.handler.size is set to 100 - since we experience over 100 requests/second on each RS, we are considering increasing this to about 300 so we can handle more requests concurrently. Is this a good change, or are other changes needed as well?

Unfortunately, we cannot provide raw metrics on the magnitude of read/write performance degradation as we do not have sufficient tracking for them. A rough proxy - we do know our clusters are capable of processing 200 jobs in an hour. This now goes down to as low as 30-50 jobs per hour with minimal changes to the jobs themselves. We wish to be able to get back to our original performance.

For now, in periods of high stress (large jobs or quick reads/writes), we are manually clearing out the hbase folder in HDFS (including store files, WALs, oldWALs and archived files), and resetting our clusters to an empty state. We are aware this is not ideal, and are looking for ways to not have to do this. Our understanding of how Hbase works is probably imperfect, and we would appreciate any advice or feedback in this regard.

Yours,
Akshat Mahajan

Re: Hbase Architecture Questions

Posted by Ted Yu <yu...@gmail.com>.
bq. mapping job that does scanning reads

I assume a proper time range (possibly with Filters) is specified on the
Scan object so that only data necessary for the job is fetched.
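
For example, a minimal sketch of such a restricted Scan (the column family, qualifier, values and time window below are placeholders):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class NarrowScan {

  // Build a Scan restricted to the batch window so the mappers only read the
  // data they actually need ("cf1", "status" and "ready" are placeholders).
  public static Scan weeklyScan() throws IOException {
    Scan scan = new Scan();
    long now = System.currentTimeMillis();
    scan.setTimeRange(now - 3L * 24 * 3600 * 1000, now);  // roughly the 2-3 day batch window
    scan.addFamily(Bytes.toBytes("cf1"));
    scan.setCaching(500);
    scan.setCacheBlocks(false);   // avoid churning the block cache from MapReduce scans

    // Optional, illustrative filter: only rows whose status column equals "ready".
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("cf1"), Bytes.toBytes("status"),
        CompareOp.EQUAL, Bytes.toBytes("ready")));
    return scan;
  }
}

The resulting Scan is what you would hand to TableMapReduceUtil.initTableMapperJob.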

Constantly creating / disabling / dropping tables is hard on the
cluster.

See if you can reuse tables whose data has been scanned. To drop the data
before the next round of writes comes in, you can set an appropriate TTL on the table.
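
A sketch of what that could look like at table-creation time (table and family names are placeholders, and the TTL value is only illustrative):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

public class TtlTableDescriptor {

  // A reusable table whose cells expire on their own, so the table itself
  // never has to be dropped ("reusable_table", "cf1", "cf2" are placeholders).
  public static HTableDescriptor withTtl() {
    HTableDescriptor table = new HTableDescriptor(TableName.valueOf("reusable_table"));
    for (String family : new String[] {"cf1", "cf2"}) {
      HColumnDescriptor cf = new HColumnDescriptor(family);
      cf.setTimeToLive(2 * 24 * 3600);  // TTL in seconds; expired cells stop being
                                        // returned to readers and are physically
                                        // dropped at the next compaction
      table.addFamily(cf);
    }
    return table;
  }
}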

Cheers

On Fri, Feb 3, 2017 at 1:46 PM, Ted Yu <yu...@gmail.com> wrote:

> bq. We use Hbase 1.0.0
>
> 1.0.0 was quite old.
>
> Can you try more recent releases such as 1.3.0 (the hbase-thrift module
> should be more robust) ?
>
> If your nodes have enough memory, have you thought of using bucket cache
> to improve read performance ?
>
> Cheers
>
> On Fri, Feb 3, 2017 at 1:34 PM, Akshat Mahajan <am...@brightedge.com>
> wrote:
>
>> > By "After Hadoop runs", you mean a batch collection/processing job?
>> MapReduce? Hadoop is a collection of distributed processing tools:
>> Filesystem (HDFS) and distributed execution (YARN)...
>>
>> Yes, we mean a batch collection/processing job. We use Yarn and HDFS, and
>> we employ the Hadoop API to run only mappers (we do not need to perform
>> reductions on our data) in Java. But the collection happens through Python,
>> necessitating the use of Thrift, and the actual processing happens through
>> Yarn.
>>
>> > If you are deleting the table in a short period of time, then, yes,
>> disabling major compactions is probably "OK". However, not running major
>> compactions will have read performance implications. While there are
>> tombstones, these represent extra work that your regionservers must perform.
>>
>> Can you please clarify? Our understanding is that a table deletion will
>> close the entire region corresponding to that table across all
>> RegionServers. If that is correct, why should there be a read performance
>> issue? (I'm assuming that closed regions are never accessed again by the
>> regionservers - am I correct?).
>>
>> > It sounds like you could just adopt a bulk-loading approach instead of
>> doing live writes into HBase..
>>
>> This is certainly a possibility, but would require a fairly sizable
>> rewrite for our application.
>>
>> >  I doubt REST is ever going to be a "well-performing" tool. You're
>> likely chasing your tail to try to optimize this. Your time would be better
>> spent using the HBase Java API directly.
>>
>> We are constrained by having to use Python, so we can't use the native
>> API. We switched from Thrift to REST when we found Thrift kept dying under
>> the load we put it under.
>>
>> > Are you taxing your machines' physical resources? What does I/O, CPU
>> and memory utilization look like?
>>
>> Let me get back to you on this front more fully in a followup.
>>
>> Currently providing estimates for our bulkier and more problematic
>> cluster:
>>
>> We are not constrained by memory - our free memory utilisation on all the
>> regionservers is close to 99%, but about 40% of that in each RS is used in
>> caches that will be readily given to any programs that require it by the
>> kernel (as assessed by `free`). On the master node, it is closer to 60%
>> memory utilisation.
>>
>> Our CPU utilisation varies, but under regular operation, user time is at
>> 60 - 40 percent on all regionservers. CPU idle time is very high, usually,
>> about 90%, and CPU system time is about 5%.
>>
>> I/O wait time is very, very low - about 0.23% on average.
>>
>> > Yeah I'm not sure why you are doing this. There's no reason I can think
>> of as to why you would need to do this...
>>
>> Believe it or not, it is the only thing we have found that helps us
>> restore performance temporarily.
>>
>> It's not ideal, and we don't want to keep doing it, though.
>>
>> Akshat
>>
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:elserj@apache.org]
>> Sent: Friday, February 03, 2017 1:05 PM
>> To: user@hbase.apache.org
>> Subject: Re: Hbase Architecture Questions
>>
>>
>>
>> Akshat Mahajan wrote:
>>
>> > b) Every table in our Hbase is associated with a unique collection of
>> items, and all tables exhibit the same column families (2 column families).
>> After Hadoop runs, we no longer require the tables, so we delete them.
>> Individual rows are never removed; instead, entire tables are removed at a
>> time.
>>
>> By "After Hadoop runs", you mean a batch collection/processing job?
>> MapReduce? Hadoop is a collection of distributed processing tools:
>> Filesystem (HDFS) and distributed execution (YARN)...
>>
>> > a) _We have turned major compaction off_.
>> >
>> > We are aware this is against recommended advice. Our reasoning for
>> > this is that
>> >
>> > 1) periods of major compaction degrade both our read and write
>> > performance heavily (to the point our schedule is delayed beyond
>> > tolerance), and
>> > 2) all our tables are temporary - we do not intend to keep them around,
>> and disabling/deleting old tables closes entire regions altogether and
>> should have the same effect as major compaction processing tombstone
>> markers on rows. Read performance should then theoretically not be impacted
>> - we expect that the RegionServer will never even consult that region in
>> doing reads, so storefile buildup overall should not be an issue.
>>
>> If you are deleting the table in a short period of time, then, yes,
>> disabling major compactions is probably "OK".
>>
>> However, not running major compactions will have read performance
>> implications. While there are tombstones, these represent extra work that
>> your regionservers must perform.
>>
>> >
>> > b) _We have made additional efforts to turn off minor compactions as
>> much as possible_.
>> >
>> > In particular, our hbase.hstore.compaction.max.size is set to 500 MB,
>> our hbase.hstore.compactionThreshold is set to 1000 bytes. We do this in
>> order to prevent a minor compaction from becoming a major compaction -
>> since we cannot prevent that, we were forced to try and prevent minor
>> compactions running at all.
>>
>> It sounds like you could just adopt a bulk-loading approach instead of
>> doing live writes into HBase..
>>
>> > c)  We have tried to make REST more performant by improving the number
>> of REST threads to about 9000.
>> >
>> > This figure is derived from counting the number of connections on REST
>> during periods of high write load.
>>
>> I doubt REST is ever going to be a "well-performing" tool. You're likely
>> chasing your tail to try to optimize this. Your time would be better
>> spent using the HBase Java API directly.
>>
>> > d) We have turned on bloom filters, use an LRUBlockCache which caches
>> data only on reads, and have set tcpnodelay to true. These were in place
>> before we turned major compaction off.
>> >
>> > Our observations with these settings in performance:
>> >
>> > a) We are seeing an impact on both read/write performance correlated
>> strongly with store file buildup. Our store files number between 500 to
>> 1500 on each RS - the total size on each RegionServer are on the order 100
>> to 200 GBs at worst.
>> > b) As number of connections on Hbase REST rises, write performance is
>> impacted. We originally believed this was due to high frequency of memstore
>> flushes - but increasing the memstore buffer sizes has had no discernible
>> impact on read/write. Currently, our callqueue.handler.size is set to 100 -
>> since we experience over 100 requests/second on each RS, we are considering
>> increasing this to about 300 so we can handle more requests concurrently.
>> Is this a good change, or are other changes needed as well?
>>
>> Are you taxing your machines' physical resources? What does I/O, CPU and
>> memory utilization look like?
>>
>> > Unfortunately, we cannot provide raw metrics on the magnitude of
>> read/write performance degradation as we do not have sufficient tracking
>> for them. A rough proxy - we do know our clusters are capable of processing
>> 200 jobs in an hour. This now goes down to as low as 30-50 jobs per hour
>> with minimal changes to the jobs themselves. We wish to be able to get back
>> to our original performance.
>> >
>> > For now, in periods of high stress (large jobs or quick reads/writes),
>> we are manually clearing out the hbase folder in HDFS (including store
>> files, WALs, oldWALs and archived files), and resetting our clusters to an
>> empty state. We are aware this is not ideal, and are looking for ways to
>> not have to do this. Our understanding of how Hbase works is probably
>> imperfect, and we would appreciate any advice or feedback in this regard.
>>
>> Yeah I'm not sure why you are doing this. There's no reason I can think
>> of as to why you would need to do this...
>>
>
>

Re: Hbase Architecture Questions

Posted by Ted Yu <yu...@gmail.com>.
bq. We use Hbase 1.0.0

1.0.0 was quite old.

Can you try more recent releases such as 1.3.0 (the hbase-thrift module
should be more robust) ?

If your nodes have enough memory, have you thought of using bucket cache to
improve read performance ?
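
Bucket cache is configured server-side in hbase-site.xml; the properties I have in mind are hbase.bucketcache.ioengine and hbase.bucketcache.size (exact sizing depends on how much off-heap memory you can give the regionserver JVM). A small sketch that just reports what is currently in effect:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BucketCacheCheck {
  public static void main(String[] args) {
    // Reads hbase-site.xml from the classpath and reports the bucket cache
    // settings currently in effect; the cache itself must be configured on
    // the regionservers, not from a client.
    Configuration conf = HBaseConfiguration.create();
    System.out.println("hbase.bucketcache.ioengine = "
        + conf.get("hbase.bucketcache.ioengine", "<unset: LruBlockCache only>"));
    System.out.println("hbase.bucketcache.size = "
        + conf.get("hbase.bucketcache.size", "<unset>"));
    System.out.println("hfile.block.cache.size = "
        + conf.get("hfile.block.cache.size", "<default>"));
  }
}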

Cheers

On Fri, Feb 3, 2017 at 1:34 PM, Akshat Mahajan <am...@brightedge.com>
wrote:

> > By "After Hadoop runs", you mean a batch collection/processing job?
> MapReduce? Hadoop is a collection of distributed processing tools:
> Filesystem (HDFS) and distributed execution (YARN)...
>
> Yes, we mean a batch collection/processing job. We use Yarn and HDFS, and
> we employ the Hadoop API to run only mappers (we do not need to perform
> reductions on our data) in Java. But the collection happens through Python,
> necessitating the use of Thrift, and the actual processing happens through
> Yarn.
>
> > If you are deleting the table in a short period of time, then, yes,
> disabling major compactions is probably "OK". However, not running major
> compactions will have read performance implications. While there are
> tombstones, these represent extra work that your regionservers must perform.
>
> Can you please clarify? Our understanding is that a table deletion will
> close the entire region corresponding to that table across all
> RegionServers. If that is correct, why should there be a read performance
> issue? (I'm assuming that closed regions are never accessed again by the
> regionservers - am I correct?).
>
> > It sounds like you could just adopt a bulk-loading approach instead of
> doing live writes into HBase..
>
> This is certainly a possibility, but would require a fairly sizable
> rewrite for our application.
>
> >  I doubt REST is ever going to be a "well-performing" tool. You're
> likely chasing your tail to try to optimize this. Your time would be better
> spent using the HBase Java API directly.
>
> We are constrained by having to use Python, so we can't use the native
> API. We switched from Thrift to REST when we found Thrift kept dying under
> the load we put it under.
>
> > Are you taxing your machines' physical resources? What does I/O, CPU and
> memory utilization look like?
>
> Let me get back to you on this front more fully in a followup.
>
> Currently providing estimates for our bulkier and more problematic cluster:
>
> We are not constrained by memory - our free memory utilisation on all the
> regionservers is close to 99%, but about 40% of that in each RS is used in
> caches that will be readily given to any programs that require it by the
> kernel (as assessed by `free`). On the master node, it is closer to 60%
> memory utilisation.
>
> Our CPU utilisation varies, but under regular operation, user time is at
> 60 - 40 percent on all regionservers. CPU idle time is very high, usually,
> about 90%, and CPU system time is about 5%.
>
> I/O wait time is very, very low - about 0.23% on average.
>
> > Yeah I'm not sure why you are doing this. There's no reason I can think
> of as to why you would need to do this...
>
> Believe it or not, it is the only thing we have found that helps us
> restore performance temporarily.
>
> It's not ideal, and we don't want to keep doing it, though.
>
> Akshat
>
>
> -----Original Message-----
> From: Josh Elser [mailto:elserj@apache.org]
> Sent: Friday, February 03, 2017 1:05 PM
> To: user@hbase.apache.org
> Subject: Re: Hbase Architecture Questions
>
>
>
> Akshat Mahajan wrote:
>
> > b) Every table in our Hbase is associated with a unique collection of
> items, and all tables exhibit the same column families (2 column families).
> After Hadoop runs, we no longer require the tables, so we delete them.
> Individual rows are never removed; instead, entire tables are removed at a
> time.
>
> By "After Hadoop runs", you mean a batch collection/processing job?
> MapReduce? Hadoop is a collection of distributed processing tools:
> Filesystem (HDFS) and distributed execution (YARN)...
>
> > a) _We have turned major compaction off_.
> >
> > We are aware this is against recommended advice. Our reasoning for
> > this is that
> >
> > 1) periods of major compaction degrade both our read and write
> > performance heavily (to the point our schedule is delayed beyond
> > tolerance), and
> > 2) all our tables are temporary - we do not intend to keep them around,
> and disabling/deleting old tables closes entire regions altogether and
> should have the same effect as major compaction processing tombstone
> markers on rows. Read performance should then theoretically not be impacted
> - we expect that the RegionServer will never even consult that region in
> doing reads, so storefile buildup overall should not be an issue.
>
> If you are deleting the table in a short period of time, then, yes,
> disabling major compactions is probably "OK".
>
> However, not running major compactions will have read performance
> implications. While there are tombstones, these represent extra work that
> your regionservers must perform.
>
> >
> > b) _We have made additional efforts to turn off minor compactions as
> much as possible_.
> >
> > In particular, our hbase.hstore.compaction.max.size is set to 500 MB,
> our hbase.hstore.compactionThreshold is set to 1000 bytes. We do this in
> order to prevent a minor compaction from becoming a major compaction -
> since we cannot prevent that, we were forced to try and prevent minor
> compactions running at all.
>
> It sounds like you could just adopt a bulk-loading approach instead of
> doing live writes into HBase..
>
> > c)  We have tried to make REST more performant by improving the number
> of REST threads to about 9000.
> >
> > This figure is derived from counting the number of connections on REST
> during periods of high write load.
>
> I doubt REST is ever going to be a "well-performing" tool. You're likely
> chasing your tail to try to optimize this. Your time would be better
> spent using the HBase Java API directly.
>
> > d) We have turned on bloom filters, use an LRUBlockCache which caches
> data only on reads, and have set tcpnodelay to true. These were in place
> before we turned major compaction off.
> >
> > Our observations with these settings in performance:
> >
> > a) We are seeing an impact on both read/write performance correlated
> strongly with store file buildup. Our store files number between 500 to
> 1500 on each RS - the total size on each RegionServer are on the order 100
> to 200 GBs at worst.
> > b) As number of connections on Hbase REST rises, write performance is
> impacted. We originally believed this was due to high frequency of memstore
> flushes - but increasing the memstore buffer sizes has had no discernible
> impact on read/write. Currently, our callqueue.handler.size is set to 100 -
> since we experience over 100 requests/second on each RS, we are considering
> increasing this to about 300 so we can handle more requests concurrently.
> Is this a good change, or are other changes needed as well?
>
> Are you taxing your machines' physical resources? What does I/O, CPU and
> memory utilization look like?
>
> > Unfortunately, we cannot provide raw metrics on the magnitude of
> read/write performance degradation as we do not have sufficient tracking
> for them. A rough proxy - we do know our clusters are capable of processing
> 200 jobs in an hour. This now goes down to as low as 30-50 jobs per hour
> with minimal changes to the jobs themselves. We wish to be able to get back
> to our original performance.
> >
> > For now, in periods of high stress (large jobs or quick reads/writes),
> we are manually clearing out the hbase folder in HDFS (including store
> files, WALs, oldWALs and archived files), and resetting our clusters to an
> empty state. We are aware this is not ideal, and are looking for ways to
> not have to do this. Our understanding of how Hbase works is probably
> imperfect, and we would appreciate any advice or feedback in this regard.
>
> Yeah I'm not sure why you are doing this. There's no reason I can think
> of as to why you would need to do this...
>

RE: Hbase Architecture Questions

Posted by Akshat Mahajan <am...@brightedge.com>.
> By "After Hadoop runs", you mean a batch collection/processing job? MapReduce? Hadoop is a collection of distributed processing tools: Filesystem (HDFS) and distributed execution (YARN)...

Yes, we mean a batch collection/processing job. We use Yarn and HDFS, and we employ the Hadoop API to run only mappers (we do not need to perform reductions on our data) in Java. But the collection happens through Python, necessitating the use of Thrift, and the actual processing happens through Yarn.

> If you are deleting the table in a short period of time, then, yes, disabling major compactions is probably "OK". However, not running major compactions will have read performance implications. While there are tombstones, these represent extra work that your regionservers must perform.

Can you please clarify? Our understanding is that a table deletion will close the entire region corresponding to that table across all RegionServers. If that is correct, why should there be a read performance issue? (I'm assuming that closed regions are never accessed again by the regionservers - am I correct?).

> It sounds like you could just adopt a bulk-loading approach instead of doing live writes into HBase..

This is certainly a possibility, but would require a fairly sizable rewrite for our application. 

>  I doubt REST is ever going to be a "well-performing" tool. You're likely chasing your tail to try to optimize this. Your time would be better spent using the HBase Java API directly.

We are constrained by having to use Python, so we can't use the native API. We switched from Thrift to REST when we found Thrift kept dying under the load we put it under. 

> Are you taxing your machines' physical resources? What does I/O, CPU and memory utilization look like?

Let me get back to you on this front more fully in a followup. 

Currently providing estimates for our bulkier and more problematic cluster:

We are not constrained by memory - memory utilisation on all the regionservers is close to 99%, but about 40% of that on each RS is kernel page cache that will be readily given back to any program that requires it (as assessed by `free`). On the master node, memory utilisation is closer to 60%.

Our CPU utilisation varies, but under regular operation, user time is at 40 to 60 percent on all regionservers. CPU idle time is very high, usually about 90%, and CPU system time is about 5%.

I/O wait time is very, very low - about 0.23% on average.

> Yeah I'm not sure why you are doing this. There's no reason I can think of as to why you would need to do this...

Believe it or not, it is the only thing we have found that helps us restore performance temporarily. 

It's not ideal, and we don't want to keep doing it, though.

Akshat


-----Original Message-----
From: Josh Elser [mailto:elserj@apache.org] 
Sent: Friday, February 03, 2017 1:05 PM
To: user@hbase.apache.org
Subject: Re: Hbase Architecture Questions



Akshat Mahajan wrote:

> b) Every table in our Hbase is associated with a unique collection of items, and all tables exhibit the same column families (2 column families). After Hadoop runs, we no longer require the tables, so we delete them. Individual rows are never removed; instead, entire tables are removed at a time.

By "After Hadoop runs", you mean a batch collection/processing job? 
MapReduce? Hadoop is a collection of distributed processing tools: 
Filesystem (HDFS) and distributed execution (YARN)...

> a) _We have turned major compaction off_.
>
> We are aware this is against recommended advice. Our reasoning for 
> this is that
>
> 1) periods of major compaction degrade both our read and write 
> performance heavily (to the point our schedule is delayed beyond 
> tolerance), and
> 2) all our tables are temporary - we do not intend to keep them around, and disabling/deleting old tables closes entire regions altogether and should have the same effect as major compaction processing tombstone markers on rows. Read performance should then theoretically not be impacted - we expect that the RegionServer will never even consult that region in doing reads, so storefile buildup overall should not be an issue.

If you are deleting the table in a short period of time, then, yes, disabling major compactions is probably "OK".

However, not running major compactions will have read performance implications. While there are tombstones, these represent extra work that your regionservers must perform.

>
> b) _We have made additional efforts to turn off minor compactions as much as possible_.
>
> In particular, our hbase.hstore.compaction.max.size is set to 500 MB, our hbase.hstore.compactionThreshold is set to 1000 bytes. We do this in order to prevent a minor compaction from becoming a major compaction - since we cannot prevent that, we were forced to try and prevent minor compactions running at all.

It sounds like you could just adopt a bulk-loading approach instead of 
doing live writes into HBase..

> c)  We have tried to make REST more performant by improving the number of REST threads to about 9000.
>
> This figure is derived from counting the number of connections on REST during periods of high write load.

I doubt REST is ever going to be a "well-performing" tool. You're likely 
chasing your tail to try to optimize this. Your time would be better 
spent using the HBase Java API directly.

> d) We have turned on bloom filters, use an LRUBlockCache which caches data only on reads, and have set tcpnodelay to true. These were in place before we turned major compaction off.
>
> Our observations with these settings in performance:
>
> a) We are seeing an impact on both read/write performance correlated strongly with store file buildup. Our store files number between 500 to 1500 on each RS - the total size on each RegionServer are on the order 100 to 200 GBs at worst.
> b) As number of connections on Hbase REST rises, write performance is impacted. We originally believed this was due to high frequency of memstore flushes - but increasing the memstore buffer sizes has had no discernible impact on read/write. Currently, our callqueue.handler.size is set to 100 - since we experience over 100 requests/second on each RS, we are considering increasing this to about 300 so we can handle more requests concurrently. Is this a good change, or are other changes needed as well?

Are you taxing your machines' physical resources? What does I/O, CPU and 
memory utilization look like?

> Unfortunately, we cannot provide raw metrics on the magnitude of read/write performance degradation as we do not have sufficient tracking for them. A rough proxy - we do know our clusters are capable of processing 200 jobs in an hour. This now goes down to as low as 30-50 jobs per hour with minimal changes to the jobs themselves. We wish to be able to get back to our original performance.
>
> For now, in periods of high stress (large jobs or quick reads/writes), we are manually clearing out the hbase folder in HDFS (including store files, WALs, oldWALs and archived files), and resetting our clusters to an empty state. We are aware this is not ideal, and are looking for ways to not have to do this. Our understanding of how Hbase works is probably imperfect, and we would appreciate any advice or feedback in this regard.

Yeah I'm not sure why you are doing this. There's no reason I can think 
of as to why you would need to do this...

Re: Hbase Architecture Questions

Posted by Josh Elser <el...@apache.org>.

Akshat Mahajan wrote:

> b) Every table in our Hbase is associated with a unique collection of items, and all tables exhibit the same column families (2 column families). After Hadoop runs, we no longer require the tables, so we delete them. Individual rows are never removed; instead, entire tables are removed at a time.

By "After Hadoop runs", you mean a batch collection/processing job? 
MapReduce? Hadoop is a collection of distributed processing tools: 
Filesystem (HDFS) and distributed execution (YARN)...

> a) _We have turned major compaction off_.
>
> We are aware this is against recommended advice. Our reasoning for this is that
>
> 1) periods of major compaction degrade both our read and write performance heavily (to the point our schedule is delayed beyond tolerance), and
> 2) all our tables are temporary - we do not intend to keep them around, and disabling/deleting old tables closes entire regions altogether and should have the same effect as major compaction processing tombstone markers on rows. Read performance should then theoretically not be impacted - we expect that the RegionServer will never even consult that region in doing reads, so storefile buildup overall should not be an issue.

If you are deleting the table in a short period of time, then, yes, 
disabling major compactions is probably "OK".

However, not running major compactions will have read performance 
implications. While there are tombstones, these represent extra work 
that your regionservers must perform.

>
> b) _We have made additional efforts to turn off minor compactions as much as possible_.
>
> In particular, our hbase.hstore.compaction.max.size is set to 500 MB, our hbase.hstore.compactionThreshold is set to 1000 bytes. We do this in order to prevent a minor compaction from becoming a major compaction - since we cannot prevent that, we were forced to try and prevent minor compactions running at all.

It sounds like you could just adopt a bulk-loading approach instead of 
doing live writes into HBase..
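
Roughly, such a pipeline prepares HFiles with a MapReduce job and then hands them to the regionservers, bypassing the memstore/WAL write path entirely. A sketch against the 1.x API (table name, staging directory and job wiring are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepare {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("weekly_table_1");  // placeholder table
    Path hfileDir = new Path("/tmp/hfiles-weekly_table_1");     // placeholder staging dir

    Job job = Job.getInstance(conf, "bulk-load-prepare");
    job.setJarByClass(BulkLoadPrepare.class);
    // ... set an input format plus a mapper that emits
    //     (ImmutableBytesWritable rowKey, Put) pairs for the table ...
    FileOutputFormat.setOutputPath(job, hfileDir);

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(tableName);
         RegionLocator locator = connection.getRegionLocator(tableName)) {
      // Wires in the reducer, partitioner and output format so the job writes
      // HFiles that line up with the table's current region boundaries.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }
    }
    // Then hand the finished HFiles to the regionservers (no memstore/WAL involved):
    //   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
    //       /tmp/hfiles-weekly_table_1 weekly_table_1
  }
}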

> c)  We have tried to make REST more performant by improving the number of REST threads to about 9000.
>
> This figure is derived from counting the number of connections on REST during periods of high write load.

I doubt REST is ever going to be a "well-performing" tool. You're likely 
chasing your tail to try to optimize this. Your time would be better 
spent using the HBase Java API directly.
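
For comparison, a plain write path through the Java client could look something like the sketch below (table, family and row contents are placeholders). BufferedMutator batches puts client-side rather than issuing one RPC per write:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NativeWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         // BufferedMutator buffers puts client-side and flushes them in batches.
         BufferedMutator mutator =
             connection.getBufferedMutator(TableName.valueOf("weekly_table_1"))) {

      List<Put> batch = new ArrayList<>();
      for (int i = 0; i < 1000; i++) {               // placeholder rows
        Put put = new Put(Bytes.toBytes("row-" + i));
        put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("value"), Bytes.toBytes("v" + i));
        batch.add(put);
      }
      mutator.mutate(batch);
      mutator.flush();   // push anything still buffered
    }
  }
}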

> d) We have turned on bloom filters, use an LRUBlockCache which caches data only on reads, and have set tcpnodelay to true. These were in place before we turned major compaction off.
>
> Our observations with these settings in performance:
>
> a) We are seeing an impact on both read/write performance correlated strongly with store file buildup. Our store files number between 500 to 1500 on each RS - the total size on each RegionServer are on the order 100 to 200 GBs at worst.
> b) As number of connections on Hbase REST rises, write performance is impacted. We originally believed this was due to high frequency of memstore flushes - but increasing the memstore buffer sizes has had no discernible impact on read/write. Currently, our callqueue.handler.size is set to 100 - since we experience over 100 requests/second on each RS, we are considering increasing this to about 300 so we can handle more requests concurrently. Is this a good change, or are other changes needed as well?

Are you taxing your machines' physical resources? What does I/O, CPU and 
memory utilization look like?

> Unfortunately, we cannot provide raw metrics on the magnitude of read/write performance degradation as we do not have sufficient tracking for them. A rough proxy - we do know our clusters are capable of processing 200 jobs in an hour. This now goes down to as low as 30-50 jobs per hour with minimal changes to the jobs themselves. We wish to be able to get back to our original performance.
>
> For now, in periods of high stress (large jobs or quick reads/writes), we are manually clearing out the hbase folder in HDFS (including store files, WALs, oldWALs and archived files), and resetting our clusters to an empty state. We are aware this is not ideal, and are looking for ways to not have to do this. Our understanding of how Hbase works is probably imperfect, and we would appreciate any advice or feedback in this regard.

Yeah I'm not sure why you are doing this. There's no reason I can think 
of as to why you would need to do this...