Posted to user@phoenix.apache.org by "Perko, Ralph J" <Ra...@pnnl.gov> on 2015/09/01 23:26:53 UTC

help diagnosing issue

Hi, I have run into an issue several times now and could really use some help diagnosing the problem.

Environment:
phoenix 4.4
hbase 0.98
34 node cluster
Tables are defined with 40 salt buckets
We are continuously loading large bz2 CSV files into Phoenix via Pig.
The data volume is in the hundreds of TBs per month.

The process runs well for a few weeks, but as the regions split and the number of regions per table gets into the hundreds, we begin to get “RegionTooBusy” exceptions from the Phoenix write code when the Pig jobs run.

Something else I have noticed is that the number of requests across the regions becomes really unbalanced.  While the number of regions is around 40, 80, or 120, the number of requests per region (via the HBase master site) is pretty well balanced.  But as the number gets into the 200s, many of the regions have 0 requests while the other regions have hundreds of millions of requests.

If I drop the tables and start over the issue goes away.  But we are approaching a production deadline and this is no longer an option.

The cluster is on a closed network, so sending log files is not possible, although I can send scanned images of logs and answer specific questions.

Can you please help me diagnose this issue?

Thanks!
Ralph


Re: help diagnosing issue

Posted by Vladimir Rodionov <vl...@gmail.com>.
You have too many issues with your cluster configuration and sizing.

1. Hundreds of TBs across 34 nodes is too much. What is your projected data
set size per region server (RS)?
2. Uncontrolled splits and major compactions are a direct path to disaster.
3. Splits are bad by themselves - you have to try to avoid them by
pre-splitting the table in advance.

What you have to do:

1. Disable automatic splits by setting a very large MAX_FILESIZE
(hbase.hregion.max.filesize). Run a periodic script (you will have to create
it yourself) during off-peak hours to split regions one by one (and do not
forget to check for running major compactions first); see the sketch after
this list.

2. Disable periodic major compactions.  Run a periodic script (again, you
will have to create it yourself) during off-peak hours to major compact the
regions of your choice.

hbase.hregion.majorcompaction=0

3. To avoid unexpectedly large compactions during peak hours, limit the
maximum size of files picked for automatic compaction by setting

hbase.hstore.compaction.max.size to, say, 500MB-1GB, but it is up to you.

This will guarantee that only small compactions will be running
automatically, but you will have only two choices: small compactions and
full major compactions.

4. Enable off-peak compaction (hbase.offpeak.start.hour and
hbase.offpeak.end.hour) to pick up files which were not eligible for
compaction during peak hours.

5. The best way to reduce compaction pressure is to pre-split the table in
advance and disable splits completely.

All these config changes can be made for your table(s) only.
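
For items 1 and 2, here is a minimal sketch of the kind of off-peak script
meant above, written against the HBase 0.98 HBaseAdmin API. It is not from
the original message: the table name, the sleep interval, and the choice of
split vs. major compact per run are placeholders you would replace with your
own scheduling and safety checks.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Off-peak maintenance sketch: split (or major compact) the regions of one
// table, one region at a time. Intended to be run from cron during off-peak hours.
public class OffPeakMaintenance {
    public static void main(String[] args) throws Exception {
        String table = args.length > 0 ? args[0] : "TABLE1";   // placeholder table name
        Configuration conf = HBaseConfiguration.create();       // reads hbase-site.xml from the classpath
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            List<HRegionInfo> regions = admin.getTableRegions(TableName.valueOf(table));
            for (HRegionInfo region : regions) {
                // Split one region; HBase picks the midpoint split key. In practice you
                // would only split regions above some size threshold, and skip regions
                // that are still major compacting.
                admin.split(region.getRegionName());

                // For the major-compaction pass (item 2), compact instead of split:
                // admin.majorCompact(region.getRegionName());

                // Be gentle: wait between regions so flushes and compactions can drain.
                Thread.sleep(60000L);
            }
        } finally {
            admin.close();
        }
    }
}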
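
And a minimal sketch of applying the settings at the table level rather than
in hbase-site.xml, again against the 0.98 API. The concrete values (a 1 TB
MAX_FILESIZE, a 1 GB compaction size cap) are illustrative assumptions, not
recommendations from the thread, and disabling the table briefly interrupts
writes, so schedule it during a pause in loading.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Apply the split/compaction settings per table instead of cluster-wide.
public class ApplyTableSettings {
    public static void main(String[] args) throws Exception {
        TableName table = TableName.valueOf(args.length > 0 ? args[0] : "TABLE1"); // placeholder
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            HTableDescriptor desc = admin.getTableDescriptor(table);
            // Item 1: effectively disable automatic splits with a huge max file size (example: 1 TB).
            desc.setMaxFileSize(1024L * 1024 * 1024 * 1024);
            // Item 2: disable periodic major compactions for this table only.
            desc.setConfiguration("hbase.hregion.majorcompaction", "0");
            // Item 3: cap the file size picked for automatic compaction (example: 1 GB).
            desc.setConfiguration("hbase.hstore.compaction.max.size",
                    String.valueOf(1024L * 1024 * 1024));
            // modifyTable generally expects the table to be disabled unless online
            // schema updates are enabled on the cluster.
            admin.disableTable(table);
            admin.modifyTable(table, desc);
            admin.enableTable(table);
        } finally {
            admin.close();
        }
    }
}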

-Vlad

On Fri, Sep 4, 2015 at 7:23 AM, Perko, Ralph J <Ra...@pnnl.gov> wrote:

> Thank you for the response.  Here is an example of how our tables are
> generally defined:
>
> CONSTRAINT pkey PRIMARY KEY (fsd, s, sa, da,dp,cg,p)
> )
> TTL='5616000',KEEP_DELETED_CELLS='false',IMMUTABLE_ROWS=true,COMPRESSION='S
> NAPPY',SALT_BUCKETS=40,MAX_FILESIZE='10000000000',SPLIT_POLICY='org.apache.
> hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy';
>
> CREATE INDEX IF NOT EXISTS table1_site_idx ON table1(s)
> TTL='5616000',KEEP_DELETED_CELLS='false',COMPRESSION='SNAPPY',MAX_FILESIZE
> ='10000000000',SPLIT_POLICY='org.apache.hadoop.hbase.regionserver.ConstantS
> izeRegionSplitPolicy';
>
>
>
>
>
> To your questions,
>
> 1) I believe it is compactions cannot keep up with load.  We've had
> compactions go wrong and had to delete tables. Lots of compactions going
> on.
> 2) I can split to 256 but we have 5 tables (similar data volumes) with 5
> global indexes each.  That's 25 * 256 = 6400 regions across 34 nodes by
> default, is that ok?  Should I make the indexes with less buckets?
>
> Regarding stats - they are not disabled.  I thought those were
> automatically connected on compactions?
>
> I will get the stack trace
>
> On 9/1/15, 3:47 PM, "Vladimir Rodionov" <vl...@gmail.com> wrote:
>
> >OK, from beginning
> >
> >1. RegionTooBusy is thrown when Memstore size exceeds region flush size X
> >flush multiplier. THIS is a sign of a great imbalance on a write path -
> >some regions are much hotter than other or .... compaction can not keep up
> >with load , you hit blocking store count and flushes get disabled (as well
> >as writes) for 90 sec by default. Choose one - what is your case?
> >
> >2. Your region load is unbalanced because default region split  algorithm
> >does not do its job well - try to presplit (salt) to more than 40 buckets,
> >can you do 256?
> >
> >-Vlad
> >
> >On Tue, Sep 1, 2015 at 3:29 PM, Samarth Jain <sa...@apache.org> wrote:
> >
> >> Ralph,
> >>
> >> Couple of questions.
> >>
> >> Do you have phoenix stats enabled?
> >>
> >> Can you send us a stacktrace of RegionTooBusy exception? Looking at
> >>HBase
> >> code it is thrown in a few places. Would be good to check where the
> >> resource crunch is occurring at.
> >>
> >>
> >>
> >> On Tue, Sep 1, 2015 at 2:26 PM, Perko, Ralph J <Ra...@pnnl.gov>
> >> wrote:
> >>
> >>> Hi I have run into an issue several times now and could really use some
> >>> help diagnosing the problem.
> >>>
> >>> Environment:
> >>> phoenix 4.4
> >>> hbase 0.98
> >>> 34 node cluster
> >>> Tables are defined with 40 salt buckets
> >>> We are continuously loading large, bz2, csv files into Phoenix via Pig.
> >>> The data is in the hundred of TB's per month
> >>>
> >>> The process runs well for a few weeks but as the regions split and the
> >>> number of regions gets into the hundreds per table we begin to get
> >>> “RegionTooBusy” exceptions around Phoenix write code when the Pig jobs
> >>>run.
> >>>
> >>> Something else I have noticed is the number of requests on the regions
> >>> becomes really unbalanced.  While the number of regions is around 40,
> >>>80,
> >>> 120 the number of requests per region (via the hbase master site) is
> >>>pretty
> >>> well balanced.  But as the number gets into the 200's many of the
> >>>regions
> >>> have 0 requests while the other regions have hundreds of millions of
> >>> requests.
> >>>
> >>> If I drop the tables and start over the issue goes away.  But we are
> >>> approaching a production deadline and this is no longer an option.
> >>>
> >>> The cluster is on a closed network so sending log files is not possible
> >>> although I can send scanned images of logs and answer specific
> >>>questions
>
>

Re: help diagnosing issue

Posted by "Perko, Ralph J" <Ra...@pnnl.gov>.
Thank you for the response.  Here is an example of how our tables are
generally defined:

CONSTRAINT pkey PRIMARY KEY (fsd, s, sa, da, dp, cg, p)
)
TTL='5616000', KEEP_DELETED_CELLS='false', IMMUTABLE_ROWS=true,
COMPRESSION='SNAPPY', SALT_BUCKETS=40, MAX_FILESIZE='10000000000',
SPLIT_POLICY='org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy';

CREATE INDEX IF NOT EXISTS table1_site_idx ON table1(s)
TTL='5616000', KEEP_DELETED_CELLS='false', COMPRESSION='SNAPPY',
MAX_FILESIZE='10000000000',
SPLIT_POLICY='org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy';





To your questions,

1) I believe it is the case where compactions cannot keep up with the load.
We've had compactions go wrong and had to delete tables. There are lots of
compactions going on.
2) I can split to 256, but we have 5 tables (with similar data volumes) with
5 global indexes each.  That's 25 * 256 = 6400 regions across 34 nodes by
default - is that ok?  Should I create the indexes with fewer buckets?

Regarding stats - they are not disabled.  I thought those were
automatically collected on compactions?

I will get the stack trace

On 9/1/15, 3:47 PM, "Vladimir Rodionov" <vl...@gmail.com> wrote:

>OK, from beginning
>
>1. RegionTooBusy is thrown when Memstore size exceeds region flush size X
>flush multiplier. THIS is a sign of a great imbalance on a write path -
>some regions are much hotter than other or .... compaction can not keep up
>with load , you hit blocking store count and flushes get disabled (as well
>as writes) for 90 sec by default. Choose one - what is your case?
>
>2. Your region load is unbalanced because default region split  algorithm
>does not do its job well - try to presplit (salt) to more than 40 buckets,
>can you do 256?
>
>-Vlad
>
>On Tue, Sep 1, 2015 at 3:29 PM, Samarth Jain <sa...@apache.org> wrote:
>
>> Ralph,
>>
>> Couple of questions.
>>
>> Do you have phoenix stats enabled?
>>
>> Can you send us a stacktrace of RegionTooBusy exception? Looking at
>>HBase
>> code it is thrown in a few places. Would be good to check where the
>> resource crunch is occurring at.
>>
>>
>>
>> On Tue, Sep 1, 2015 at 2:26 PM, Perko, Ralph J <Ra...@pnnl.gov>
>> wrote:
>>
>>> Hi I have run into an issue several times now and could really use some
>>> help diagnosing the problem.
>>>
>>> Environment:
>>> phoenix 4.4
>>> hbase 0.98
>>> 34 node cluster
>>> Tables are defined with 40 salt buckets
>>> We are continuously loading large, bz2, csv files into Phoenix via Pig.
> >>> The data is in the hundred of TB's per month
>>>
>>> The process runs well for a few weeks but as the regions split and the
>>> number of regions gets into the hundreds per table we begin to get
> >>> “RegionTooBusy” exceptions around Phoenix write code when the Pig jobs
>>>run.
>>>
>>> Something else I have noticed is the number of requests on the regions
>>> becomes really unbalanced.  While the number of regions is around 40,
>>>80,
>>> 120 the number of requests per region (via the hbase master site) is
>>>pretty
> >>> well balanced.  But as the number gets into the 200's many of the
>>>regions
>>> have 0 requests while the other regions have hundreds of millions of
>>> requests.
>>>
>>> If I drop the tables and start over the issue goes away.  But we are
>>> approaching a production deadline and this is no longer an option.
>>>
>>> The cluster is on a closed network so sending log files is not possible
>>> although I can send scanned images of logs and answer specific
>>>questions


Re: help diagnosing issue

Posted by Vladimir Rodionov <vl...@gmail.com>.
OK, from beginning

1. RegionTooBusy is thrown when the memstore size exceeds the region flush
size X the flush multiplier. THIS is a sign of a great imbalance on the write
path - either some regions are much hotter than others, or compaction cannot
keep up with the load, you hit the blocking store file count, and flushes
(as well as writes) get blocked for 90 sec by default. Choose one - which is your case?

2. Your region load is unbalanced because the default region split algorithm
does not do its job well - try to pre-split (salt) into more than 40 buckets;
can you do 256?
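
One way to see which of these two cases applies, and to quantify the request
imbalance described above, is to dump per-region memstore size, store file
count, and request count from the cluster status. A minimal sketch against
the 0.98 client API follows (not from the original thread); it prints every
region, so filter the output for your table names.

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionLoad;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Dump per-region memstore size, store file count and request count so hot
// regions (imbalance) and regions piling up store files (compaction falling
// behind) stand out.
public class RegionLoadReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            ClusterStatus status = admin.getClusterStatus();
            for (ServerName server : status.getServers()) {
                Map<byte[], RegionLoad> loads = status.getLoad(server).getRegionsLoad();
                for (RegionLoad rl : loads.values()) {
                    System.out.printf("%s region=%s storefiles=%d memstoreMB=%d requests=%d%n",
                            server.getHostname(),
                            Bytes.toStringBinary(rl.getName()),
                            rl.getStorefiles(),
                            rl.getMemStoreSizeMB(),
                            rl.getRequestsCount());
                }
            }
        } finally {
            admin.close();
        }
    }
}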

-Vlad

On Tue, Sep 1, 2015 at 3:29 PM, Samarth Jain <sa...@apache.org> wrote:

> Ralph,
>
> Couple of questions.
>
> Do you have phoenix stats enabled?
>
> Can you send us a stacktrace of RegionTooBusy exception? Looking at HBase
> code it is thrown in a few places. Would be good to check where the
> resource crunch is occurring at.
>
>
>
> On Tue, Sep 1, 2015 at 2:26 PM, Perko, Ralph J <Ra...@pnnl.gov>
> wrote:
>
>> Hi I have run into an issue several times now and could really use some
>> help diagnosing the problem.
>>
>> Environment:
>> phoenix 4.4
>> hbase 0.98
>> 34 node cluster
>> Tables are defined with 40 salt buckets
>> We are continuously loading large, bz2, csv files into Phoenix via Pig.
>> The data is in the hundred of TB’s per month
>>
>> The process runs well for a few weeks but as the regions split and the
>> number of regions gets into the hundreds per table we begin to get
>> “RegionTooBusy” exceptions around Phoenix write code when the Pig jobs run.
>>
>> Something else I have noticed is the number of requests on the regions
>> becomes really unbalanced.  While the number of regions is around 40, 80,
>> 120 the number of requests per region (via the hbase master site) is pretty
>> well balanced.  But as the number gets into the 200’s many of the regions
>> have 0 requests while the other regions have hundreds of millions of
>> requests.
>>
>> If I drop the tables and start over the issue goes away.  But we are
>> approaching a production deadline and this is no longer an option.
>>
>> The cluster is on a closed network so sending log files is not possible
>> although I can send scanned images of logs and answer specific questions.
>>
>> Can you please help me diagnose this issue.
>>
>> Thanks!
>> Ralph
>>
>>
>

Re: help diagnosing issue

Posted by Samarth Jain <sa...@apache.org>.
Ralph,

A couple of questions.

Do you have Phoenix stats enabled?

Can you send us a stack trace of the RegionTooBusy exception? Looking at the
HBase code, it is thrown in a few places. It would be good to check where the
resource crunch is occurring.
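
On the stats question, one quick way to check whether guideposts are actually
being written is to query the SYSTEM.STATS table over the Phoenix JDBC driver.
This is a sketch, not something asked for in the thread; it assumes the
Phoenix 4.x SYSTEM.STATS schema (PHYSICAL_NAME column) and a reachable
ZooKeeper quorum, both of which are placeholders here. Stats rows appear after
a major compaction or an UPDATE STATISTICS run.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Count guidepost rows per physical table in SYSTEM.STATS; empty output for a
// heavily written table suggests stats are not being collected.
public class CheckPhoenixStats {
    public static void main(String[] args) throws Exception {
        String quorum = args.length > 0 ? args[0] : "zk-host1,zk-host2,zk-host3"; // placeholder quorum
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:" + quorum);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT PHYSICAL_NAME, COUNT(*) AS GUIDEPOSTS " +
                     "FROM SYSTEM.STATS GROUP BY PHYSICAL_NAME")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}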



On Tue, Sep 1, 2015 at 2:26 PM, Perko, Ralph J <Ra...@pnnl.gov> wrote:

> Hi I have run into an issue several times now and could really use some
> help diagnosing the problem.
>
> Environment:
> phoenix 4.4
> hbase 0.98
> 34 node cluster
> Tables are defined with 40 salt buckets
> We are continuously loading large, bz2, csv files into Phoenix via Pig.
> The data is in the hundred of TB’s per month
>
> The process runs well for a few weeks but as the regions split and the
> number of regions gets into the hundreds per table we begin to get
> “RegionTooBusy” exceptions around Phoenix write code when the Pig jobs run.
>
> Something else I have noticed is the number of requests on the regions
> becomes really unbalanced.  While the number of regions is around 40, 80,
> 120 the number of requests per region (via the hbase master site) is pretty
> well balanced.  But as the number gets into the 200’s many of the regions
> have 0 requests while the other regions have hundreds of millions of
> requests.
>
> If I drop the tables and start over the issue goes away.  But we are
> approaching a production deadline and this is no longer an option.
>
> The cluster is on a closed network so sending log files is not possible
> although I can send scanned images of logs and answer specific questions.
>
> Can you please help me diagnose this issue.
>
> Thanks!
> Ralph
>
>