Posted to dev@hbase.apache.org by Lars George <la...@gmail.com> on 2017/02/04 10:54:36 UTC

Canary Test Tool and write sniffing

Hi,

Looking at the Canary tool, it tries to ensure that all canary test
table regions are spread across all region servers. If that is not the
case, it calls:

if (numberOfCoveredServers < numberOfServers) {
  admin.balancer();
}

I doubt this will help with the StochasticLoadBalancer, which is known
to consider per-table balancing as only one of many factors. In practice,
the SLB will most likely _not_ distribute the canary regions
sufficiently, leaving a gap in the check. Switching on the per-table
option is discouraged, so that the SLB can do its thing.
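
To make the concern concrete, here is a minimal sketch of the coverage check in plain Java. It is not the Canary's actual code: the class name, the method names, and the region-to-server map are hypothetical stand-ins, and no HBase API is called. It only shows that a server hosting no canary region is never probed, whatever the balancer was asked to do.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for the coverage check; not HBase code. */
public class CanaryCoverageSketch {

    /** Counts the distinct servers hosting at least one canary region. */
    public static int coveredServers(Map<String, String> regionToServer) {
        Set<String> servers = new HashSet<>(regionToServer.values());
        return servers.size();
    }

    /** Returns the servers that host no canary region and so are never probed. */
    public static Set<String> uncoveredServers(Map<String, String> regionToServer,
                                               List<String> allServers) {
        Set<String> missing = new TreeSet<>(allServers);
        missing.removeAll(regionToServer.values());
        return missing;
    }

    public static void main(String[] args) {
        // Assumed example cluster: three servers, three canary regions,
        // two of which landed on the same server.
        List<String> servers = List.of("rs1", "rs2", "rs3");
        Map<String, String> assignment = Map.of(
            "canary-region-a", "rs1",
            "canary-region-b", "rs1",
            "canary-region-c", "rs2");

        int covered = coveredServers(assignment);
        if (covered < servers.size()) {
            // This is the point where the snippet above calls admin.balancer().
            System.out.println("covered " + covered + " of " + servers.size()
                + "; not probed: " + uncoveredServers(assignment, servers));
        }
    }
}
```

With the assumed assignment, rs3 is reported as not probed. A balancer run only closes that gap if it actually moves a canary region onto rs3.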

Just pointing it out for vetting.

Lars

Re: Canary Test Tool and write sniffing

Posted by Lars George <la...@gmail.com>.
Please keep in mind we are talking about two issues here:

1) The short default interval time, and
2) the issue that the canary table regions might not be on all servers.

Has anyone here tried write sniffing on a current cluster with the
SLB and seen it work?

Best,
Lars


On Mon, Feb 6, 2017 at 10:38 PM, Enis Söztutar <en...@apache.org> wrote:
> Open an issue?
> Enis

Re: Canary Test Tool and write sniffing

Posted by Enis Söztutar <en...@apache.org>.
Open an issue?
Enis

On Mon, Feb 6, 2017 at 9:39 AM, Stack <st...@duboce.net> wrote:

> On Sun, Feb 5, 2017 at 2:25 AM, Lars George <la...@gmail.com> wrote:
>
> > The next example is wrong too, claiming to show 60 secs, while it
> > shows 600 secs (the default value as well).
> >
> > The question is still, what is a good value for intervals? Anyone here
> > that uses the Canary that would like to chime in?
> >
> >
> I was hanging out with a user whose mid-sized cluster was running the
> Canary with defaults; the regionserver carrying meta was at 100% CPU because
> of all the requests from the Canary doing repeated full-table Scans.
>
> 6 seconds is too short. Seems like a typo that should be 60 seconds. It is
> not as though the Canary is going to do anything about it if it finds
> something wrong.
>
> S

Re: Canary Test Tool and write sniffing

Posted by Stack <st...@duboce.net>.
On Sun, Feb 5, 2017 at 2:25 AM, Lars George <la...@gmail.com> wrote:

> The next example is wrong too, claiming to show 60 secs, while it
> shows 600 secs (the default value as well).
>
> The question is still, what is a good value for intervals? Anyone here
> that uses the Canary that would like to chime in?
>
>
I was hanging out with a user whose mid-sized cluster was running the
Canary with defaults; the regionserver carrying meta was at 100% CPU because
of all the requests from the Canary doing repeated full-table Scans.

6 seconds is too short. Seems like a typo that should be 60 seconds. It is
not as though the Canary is going to do anything about it if it finds
something wrong.

S





Re: Canary Test Tool and write sniffing

Posted by Lars George <la...@gmail.com>.
The next example is wrong too, claiming to show 60 secs, while it
shows 600 secs (the default value as well).

The question is still, what is a good value for intervals? Anyone here
that uses the Canary that would like to chime in?

On Sat, Feb 4, 2017 at 5:40 PM, Ted Yu <yu...@gmail.com> wrote:
> A brief search on HBASE-4393 didn't reveal why the interval was shortened.
>
> If you read the first paragraph of:
> http://hbase.apache.org/book.html#_run_canary_test_as_daemon_mode
>
> possibly the reasoning was that canary would exit upon seeing some error
> (the first time).
>
> BTW, there is a mismatch in the description of this command (5 seconds
> vs. 50000 milliseconds):
>
> ${HBASE_HOME}/bin/hbase canary -daemon -interval 50000 -f false

Re: Canary Test Tool and write sniffing

Posted by Ted Yu <yu...@gmail.com>.
A brief search on HBASE-4393 didn't reveal why the interval was shortened.

If you read the first paragraph of:
http://hbase.apache.org/book.html#_run_canary_test_as_daemon_mode

possibly the reasoning was that canary would exit upon seeing some error
(the first time).

BTW, there is a mismatch in the description of this command (5 seconds
vs. 50000 milliseconds):

${HBASE_HOME}/bin/hbase canary -daemon -interval 50000 -f false


On Sat, Feb 4, 2017 at 8:21 AM, Lars George <la...@gmail.com> wrote:

> Oh right, Ted. An earlier patch attached to the JIRA had 60 secs, the
> last one has 6 secs. Am I reading this right? It hands 6000 to the
> Thread.sleep() call, which takes millisecs. So that makes 6 secs
> between checks, which seems super short, no? I might just be dull here.

Re: Canary Test Tool and write sniffing

Posted by Lars George <la...@gmail.com>.
Oh right, Ted. An earlier patch attached to the JIRA had 60 secs, the
last one has 6 secs. Am I reading this right? It hands 6000 to the
Thread.sleep() call, which takes millisecs. So that makes 6 secs
between checks, which seems super short, no? I might just be dull here.
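
The unit arithmetic can be checked with a small standalone sketch. Only the 6000 value and the fact that Thread.sleep() takes milliseconds come from this thread; the class and method names are made up for illustration, and the passes-per-hour figure is an upper bound that ignores how long a pass itself takes.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative only; not part of the Canary. */
public class CanaryIntervalSketch {

    // Value quoted in this thread; Thread.sleep() interprets it as milliseconds.
    static final long DEFAULT_INTERVAL = 6000;

    /** Converts a Thread.sleep() argument to whole seconds. */
    public static long toSeconds(long sleepMillis) {
        return TimeUnit.MILLISECONDS.toSeconds(sleepMillis);
    }

    /** Upper bound on passes per hour if each pass took no time at all. */
    public static long passesPerHour(long sleepMillis) {
        return TimeUnit.HOURS.toMillis(1) / sleepMillis;
    }

    public static void main(String[] args) {
        // Compare the current default with the 60 secs of the earlier patch.
        long[] candidates = { DEFAULT_INTERVAL, TimeUnit.SECONDS.toMillis(60) };
        for (long millis : candidates) {
            System.out.println(millis + " ms = " + toSeconds(millis)
                + " s, at most " + passesPerHour(millis) + " passes/hour");
        }
    }
}
```

So 6000 ms allows up to 600 passes per hour, against 60 for a 60-second interval, which is a tenfold difference in load on the cluster.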

On Sat, Feb 4, 2017 at 5:00 PM, Ted Yu <yu...@gmail.com> wrote:
> For the default interval, if you were looking at:
>
>   private static final long DEFAULT_INTERVAL = 6000;
>
> The above was from:
>
>     HBASE-4393 Implement a canary monitoring program
>
> which was integrated on Tue Apr 24 07:20:16 2012
>
> FYI

Re: Canary Test Tool and write sniffing

Posted by Ted Yu <yu...@gmail.com>.
For the default interval, if you were looking at:

  private static final long DEFAULT_INTERVAL = 6000;

The above was from:

    HBASE-4393 Implement a canary monitoring program

which was integrated on Tue Apr 24 07:20:16 2012

FYI

On Sat, Feb 4, 2017 at 4:06 AM, Lars George <la...@gmail.com> wrote:

> Also, the default interval used to be 60 secs, but is now 6 secs. Does
> that make sense? Seems awfully short for a default, assuming you have
> many regions or servers.

Re: Canary Test Tool and write sniffing

Posted by Lars George <la...@gmail.com>.
Also, the default interval used to be 60 secs, but is now 6 secs. Does
that make sense? Seems awfully short for a default, assuming you have
many regions or servers.

On Sat, Feb 4, 2017 at 11:54 AM, Lars George <la...@gmail.com> wrote:
> Hi,
>
> Looking at the Canary tool, it tries to ensure that all canary test
> table regions are spread across all region servers. If that is not the
> case, it calls:
>
> if (numberOfCoveredServers < numberOfServers) {
>   admin.balancer();
> }
>
> I doubt this will help with the StochasticLoadBalancer, which is known
> to consider per-table balancing as only one of many factors. In practice,
> the SLB will most likely _not_ distribute the canary regions
> sufficiently, leaving a gap in the check. Switching on the per-table
> option is discouraged, so that the SLB can do its thing.
>
> Just pointing it out for vetting.
>
> Lars