Posted to user@hbase.apache.org by Josh Williams <jw...@endpoint.com> on 2014/09/18 00:21:57 UTC

Performance oddity between AWS instance sizes

Hi, everyone.  Here's a strange one, at least to me.

I'm doing some performance profiling, and as a rudimentary test I've
been using YCSB to drive HBase (originally 0.98.3, recently updated to
0.98.6.)  The problem happens on a few different instance sizes, but
this is probably the closest comparison...
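
(For context, each run drives a stock YCSB workload against the usual
YCSB "usertable"; the invocation is roughly along these lines, with the
column family, record/operation counts, and thread count just
placeholders:)

  # load phase, then the mixed read/update run phase (workloada is 50/50 read/update)
  bin/ycsb load hbase -P workloads/workloada -p columnfamily=f1 -p recordcount=1000000 -s
  bin/ycsb run hbase -P workloads/workloada -p columnfamily=f1 -p operationcount=1000000 -threads 16 -s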

On m3.2xlarge instances, works as expected.
On c3.2xlarge instances, HBase barely responds at all during workloads
that involve read activity, falling silent for ~62 second intervals,
with the YCSB throughput output resembling:

 0 sec: 0 operations;
 2 sec: 918 operations; 459 current ops/sec; [UPDATE AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
 4 sec: 918 operations; 0 current ops/sec;
 6 sec: 918 operations; 0 current ops/sec;
<snip>
 62 sec: 918 operations; 0 current ops/sec;
 64 sec: 5302 operations; 2192 current ops/sec; [UPDATE AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
 66 sec: 5302 operations; 0 current ops/sec;
 68 sec: 5302 operations; 0 current ops/sec;
(And so on...)

While that happens there's almost no activity on either side: the CPUs
and disks are idle, with no iowait at all.

There isn't much that jumps out at me when digging through the Hadoop
and HBase logs, except that those 62-second intervals are often (but
not always) associated with ClosedChannelExceptions in the regionserver
logs.  But I believe that's just HBase finding that a TCP connection it
wants to reply on had been closed.

As far as I've seen, this happens every time on this or any of the
larger c3-class instances, surprisingly.  The m3 instance sizes all seem
to work fine.  These are built from a custom AMI that has HBase and
everything installed, and launched via a script, so the instance type
should be the only difference between them.

Anyone seen anything like this?  Any pointers as to what I could look at
to help diagnose this odd problem?  Could there be something I'm
overlooking in the logs?

Thanks!

-- Josh



Re: Performance oddity between AWS instance sizes

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hi,

The oddity in this thread is that there is no mention of metrics (sorry if
I missed them being mentioned!).  For example, that 1GB heap makes me think
a graph showing JVM heap memory pool sizes/utilization and GC counts/times
would quickly tell us/you if you are simply not giving the JVM enough
memory and are making the JVM GC too much...
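
Even without a monitoring product, something quick and dirty against the
region server JVM would answer that; for example (the process lookup and
log path are just illustrative):

  # sample heap occupancy and GC time every 5000 ms
  jstat -gcutil $(pgrep -f HRegionServer) 5000

  # or turn on GC logging via hbase-env.sh and look for long/frequent pauses
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/hbase-gc.log"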

If it helps, SPM <http://sematext.com/spm/> has good HBase / JVM / server
monitoring, although I recently learned we really need to update it for
HBase 0.98+ because almost all metrics seem to have changed.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Sep 18, 2014 at 6:02 PM, Andrew Purtell <ap...@apache.org> wrote:

> 1 GB heap is nowhere near enough to run if you're trying to test something
> real (or approximate it with YCSB). Try 4 or 8, anything up to 31 GB,
> use case dependent. >= 32 GB gives away compressed OOPs and maybe GC
> issues.
>
> Also, I recently redid the HBase YCSB client in a modern way for >=
> 0.98. See https://github.com/apurtell/YCSB/tree/new_hbase_client . It
> performs in an IMHO more useful fashion than the previous for what
> YCSB is intended, but might need some tuning (haven't tried it on a
> cluster of significant size). One difference you should see is we
> won't back up for 30-60 seconds after a bunch of threads flush fat 12+
> MB write buffers.
>
> On Thu, Sep 18, 2014 at 2:31 PM, Josh Williams <jw...@endpoint.com>
> wrote:
> > Ted,
> >
> > Stack trace, that's definitely a good idea.  Here's one jstack snapshot
> > from the region server while there's no apparent activity going on:
> > https://gist.github.com/joshwilliams/4950c1d92382ea7f3160
> >
> > If it's helpful, this is the YCSB side of the equation right around the
> > same time:
> > https://gist.github.com/joshwilliams/6fa3623088af9d1446a3
> >
> >
> > And Gary,
> >
> > As far as the memory configuration, that's a good question.  Looks like
> > HBASE_HEAPSIZE isn't set, which I now see has a default of 1GB.  There
> > isn't any swap configured, and 12G of the memory on the instance is
> > going to file cache, so there's definitely room to spare.
> >
> > Maybe it'd help if I gave it more room by setting HBASE_HEAPSIZE.
> > Couldn't hurt to try that now...
> >
> > What's strange is running on m3.xlarge, which also has 15G of RAM but
> > fewer CPU cores, it runs fine.
> >
> > Thanks to you both for the insight!
> >
> > -- Josh
> >
> >
> >
> > On Thu, 2014-09-18 at 11:42 -0700, Gary Helmling wrote:
> >> What do you have HBASE_HEAPSIZE set to in hbase-env.sh?  Is it
> >> possible that you're overcommitting memory and the instance is
> >> swapping?  Just a shot in the dark, but I see that the m3.2xlarge
> >> instance has 30G of memory vs. 15G for c3.2xlarge.
> >>
> >> On Wed, Sep 17, 2014 at 3:28 PM, Ted Yu <yu...@gmail.com> wrote:
> >> > bq. there's almost no activity on either side
> >> >
> >> > During this period, can you capture stack trace for the region server
> and
> >> > pastebin the stack ?
> >> >
> >> > Cheers
> >> >
> >> > On Wed, Sep 17, 2014 at 3:21 PM, Josh Williams <
> jwilliams@endpoint.com>
> >> > wrote:
> >> >
> >> >> Hi, everyone.  Here's a strange one, at least to me.
> >> >>
> >> >> I'm doing some performance profiling, and as a rudimentary test I've
> >> >> been using YCSB to drive HBase (originally 0.98.3, recently updated
> to
> >> >> 0.98.6.)  The problem happens on a few different instance sizes, but
> >> >> this is probably the closest comparison...
> >> >>
> >> >> On m3.2xlarge instances, works as expected.
> >> >> On c3.2xlarge instances, HBase barely responds at all during
> workloads
> >> >> that involve read activity, falling silent for ~62 second intervals,
> >> >> with the YCSB throughput output resembling:
> >> >>
> >> >>  0 sec: 0 operations;
> >> >>  2 sec: 918 operations; 459 current ops/sec; [UPDATE
> >> >> AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
> >> >>  4 sec: 918 operations; 0 current ops/sec;
> >> >>  6 sec: 918 operations; 0 current ops/sec;
> >> >> <snip>
> >> >>  62 sec: 918 operations; 0 current ops/sec;
> >> >>  64 sec: 5302 operations; 2192 current ops/sec; [UPDATE
> >> >> AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
> >> >>  66 sec: 5302 operations; 0 current ops/sec;
> >> >>  68 sec: 5302 operations; 0 current ops/sec;
> >> >> (And so on...)
> >> >>
> >> >> While that happens there's almost no activity on either side, the
> CPU's
> >> >> and disks are idle, no iowait at all.
> >> >>
> >> >> There isn't much that jumps out at me when digging through the Hadoop
> >> >> and HBase logs, except that those 62-second intervals are often (but
> >> >> not always) associated with ClosedChannelExceptions in the
> regionserver
> >> >> logs.  But I believe that's just HBase finding that a TCP connection
> it
> >> >> wants to reply on had been closed.
> >> >>
> >> >> As far as I've seen this happens every time on this or any of the
> larger
> >> >> c3 class of instances, surprisingly.  The m3 instance class sizes all
> >> >> seem to work fine.  These are built with a custom AMI that has HBase
> and
> >> >> all installed, and run via a script, so the different instance type
> >> >> should be the only difference between them.
> >> >>
> >> >> Anyone seen anything like this?  Any pointers as to what I could
> look at
> >> >> to help diagnose this odd problem?  Could there be something I'm
> >> >> overlooking in the logs?
> >> >>
> >> >> Thanks!
> >> >>
> >> >> -- Josh
> >> >>
> >> >>
> >> >>
> >
> >
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)
>

Re: Performance oddity between AWS instance sizes

Posted by Andrew Purtell <an...@gmail.com>.
FWIW, I pushed a fix for that NPE


On Fri, Sep 19, 2014 at 9:13 AM, Andrew Purtell
<an...@gmail.com> wrote:
> Thanks for trying the new client out. Shame about that NPE, I'll look into it.
>
>
>
>> On Sep 18, 2014, at 8:43 PM, Josh Williams <jw...@endpoint.com> wrote:
>>
>> Hi Andrew,
>>
>> I'll definitely bump up the heap on subsequent tests -- thanks for the
>> tip.  It was increased to 8 GB, but that didn't make any difference for
>> the older YCSB.
>>
>> Using your YCSB branch with the updated HBase client definitely makes a
>> difference, however, showing consistent throughput for a little while.
>> After a little bit of time, so far under about 5 minutes in the few
>> times I ran it, it'll hit a NullPointerException[1] ... but it
>> definitely seems to point more at a problem in the older YCSB.
>>
>> [1] https://gist.github.com/joshwilliams/0570a3095ad6417ca74f
>>
>> Thanks for your help,

Re: Performance oddity between AWS instance sizes

Posted by Andrew Purtell <an...@gmail.com>.
Thanks for trying the new client out. Shame about that NPE; I'll look into it.


> On Sep 18, 2014, at 8:43 PM, Josh Williams <jw...@endpoint.com> wrote:
> 
> Hi Andrew,
> 
> I'll definitely bump up the heap on subsequent tests -- thanks for the
> tip.  It was increased to 8 GB, but that didn't make any difference for
> the older YCSB.
> 
> Using your YCSB branch with the updated HBase client definitely makes a
> difference, however, showing consistent throughput for a little while.
> After a little bit of time, so far under about 5 minutes in the few
> times I ran it, it'll hit a NullPointerException[1] ... but it
> definitely seems to point more at a problem in the older YCSB.
> 
> [1] https://gist.github.com/joshwilliams/0570a3095ad6417ca74f
> 
> Thanks for your help,
> 
> -- Josh
> 
> 
>> On Thu, 2014-09-18 at 15:02 -0700, Andrew Purtell wrote:
>> 1 GB heap is nowhere near enough to run if you're trying to test something
>> real (or approximate it with YCSB). Try 4 or 8, anything up to 31 GB,
>> use case dependent. >= 32 GB gives away compressed OOPs and maybe GC
>> issues.
>> 
>> Also, I recently redid the HBase YCSB client in a modern way for >=
>> 0.98. See https://github.com/apurtell/YCSB/tree/new_hbase_client . It
>> performs in an IMHO more useful fashion than the previous for what
>> YCSB is intended, but might need some tuning (haven't tried it on a
>> cluster of significant size). One difference you should see is we
>> won't back up for 30-60 seconds after a bunch of threads flush fat 12+
>> MB write buffers.
>> 
>>> On Thu, Sep 18, 2014 at 2:31 PM, Josh Williams <jw...@endpoint.com> wrote:
>>> Ted,
>>> 
>>> Stack trace, that's definitely a good idea.  Here's one jstack snapshot
>>> from the region server while there's no apparent activity going on:
>>> https://gist.github.com/joshwilliams/4950c1d92382ea7f3160
>>> 
>>> If it's helpful, this is the YCSB side of the equation right around the
>>> same time:
>>> https://gist.github.com/joshwilliams/6fa3623088af9d1446a3
>>> 
>>> 
>>> And Gary,
>>> 
>>> As far as the memory configuration, that's a good question.  Looks like
>>> HBASE_HEAPSIZE isn't set, which I now see has a default of 1GB.  There
>>> isn't any swap configured, and 12G of the memory on the instance is
>>> going to file cache, so there's definitely room to spare.
>>> 
>>> Maybe it'd help if I gave it more room by setting HBASE_HEAPSIZE.
>>> Couldn't hurt to try that now...
>>> 
>>> What's strange is running on m3.xlarge, which also has 15G of RAM but
>>> fewer CPU cores, it runs fine.
>>> 
>>> Thanks to you both for the insight!
>>> 
>>> -- Josh
>>> 
>>> 
>>> 
>>>> On Thu, 2014-09-18 at 11:42 -0700, Gary Helmling wrote:
>>>> What do you have HBASE_HEAPSIZE set to in hbase-env.sh?  Is it
>>>> possible that you're overcommitting memory and the instance is
>>>> swapping?  Just a shot in the dark, but I see that the m3.2xlarge
>>>> instance has 30G of memory vs. 15G for c3.2xlarge.
>>>> 
>>>>> On Wed, Sep 17, 2014 at 3:28 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>> bq. there's almost no activity on either side
>>>>> 
>>>>> During this period, can you capture stack trace for the region server and
>>>>> pastebin the stack ?
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> On Wed, Sep 17, 2014 at 3:21 PM, Josh Williams <jw...@endpoint.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi, everyone.  Here's a strange one, at least to me.
>>>>>> 
>>>>>> I'm doing some performance profiling, and as a rudimentary test I've
>>>>>> been using YCSB to drive HBase (originally 0.98.3, recently updated to
>>>>>> 0.98.6.)  The problem happens on a few different instance sizes, but
>>>>>> this is probably the closest comparison...
>>>>>> 
>>>>>> On m3.2xlarge instances, works as expected.
>>>>>> On c3.2xlarge instances, HBase barely responds at all during workloads
>>>>>> that involve read activity, falling silent for ~62 second intervals,
>>>>>> with the YCSB throughput output resembling:
>>>>>> 
>>>>>> 0 sec: 0 operations;
>>>>>> 2 sec: 918 operations; 459 current ops/sec; [UPDATE
>>>>>> AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
>>>>>> 4 sec: 918 operations; 0 current ops/sec;
>>>>>> 6 sec: 918 operations; 0 current ops/sec;
>>>>>> <snip>
>>>>>> 62 sec: 918 operations; 0 current ops/sec;
>>>>>> 64 sec: 5302 operations; 2192 current ops/sec; [UPDATE
>>>>>> AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
>>>>>> 66 sec: 5302 operations; 0 current ops/sec;
>>>>>> 68 sec: 5302 operations; 0 current ops/sec;
>>>>>> (And so on...)
>>>>>> 
>>>>>> While that happens there's almost no activity on either side, the CPU's
>>>>>> and disks are idle, no iowait at all.
>>>>>> 
>>>>>> There isn't much that jumps out at me when digging through the Hadoop
>>>>>> and HBase logs, except that those 62-second intervals are often (but
>>>>>> not always) associated with ClosedChannelExceptions in the regionserver
>>>>>> logs.  But I believe that's just HBase finding that a TCP connection it
>>>>>> wants to reply on had been closed.
>>>>>> 
>>>>>> As far as I've seen this happens every time on this or any of the larger
>>>>>> c3 class of instances, surprisingly.  The m3 instance class sizes all
>>>>>> seem to work fine.  These are built with a custom AMI that has HBase and
>>>>>> all installed, and run via a script, so the different instance type
>>>>>> should be the only difference between them.
>>>>>> 
>>>>>> Anyone seen anything like this?  Any pointers as to what I could look at
>>>>>> to help diagnose this odd problem?  Could there be something I'm
>>>>>> overlooking in the logs?
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> -- Josh
> 
> 

Re: Performance oddity between AWS instance sizes

Posted by Josh Williams <jw...@endpoint.com>.
Hi Andrew,

I'll definitely bump up the heap on subsequent tests -- thanks for the
tip.  It was increased to 8 GB, but that didn't make any difference for
the older YCSB.

Using your YCSB branch with the updated HBase client definitely makes a
difference, however, showing consistent throughput for a while.  After a
little time (so far under about 5 minutes in the few runs I've done) it
hits a NullPointerException[1] ... but that does seem to point more at a
problem in the older YCSB.

[1] https://gist.github.com/joshwilliams/0570a3095ad6417ca74f

Thanks for your help,

-- Josh


On Thu, 2014-09-18 at 15:02 -0700, Andrew Purtell wrote:
> 1 GB heap is nowhere near enough to run if you're trying to test something
> real (or approximate it with YCSB). Try 4 or 8, anything up to 31 GB,
> use case dependent. >= 32 GB gives away compressed OOPs and maybe GC
> issues.
> 
> Also, I recently redid the HBase YCSB client in a modern way for >=
> 0.98. See https://github.com/apurtell/YCSB/tree/new_hbase_client . It
> performs in an IMHO more useful fashion than the previous for what
> YCSB is intended, but might need some tuning (haven't tried it on a
> cluster of significant size). One difference you should see is we
> won't back up for 30-60 seconds after a bunch of threads flush fat 12+
> MB write buffers.
> 
> On Thu, Sep 18, 2014 at 2:31 PM, Josh Williams <jw...@endpoint.com> wrote:
> > Ted,
> >
> > Stack trace, that's definitely a good idea.  Here's one jstack snapshot
> > from the region server while there's no apparent activity going on:
> > https://gist.github.com/joshwilliams/4950c1d92382ea7f3160
> >
> > If it's helpful, this is the YCSB side of the equation right around the
> > same time:
> > https://gist.github.com/joshwilliams/6fa3623088af9d1446a3
> >
> >
> > And Gary,
> >
> > As far as the memory configuration, that's a good question.  Looks like
> > HBASE_HEAPSIZE isn't set, which I now see has a default of 1GB.  There
> > isn't any swap configured, and 12G of the memory on the instance is
> > going to file cache, so there's definitely room to spare.
> >
> > Maybe it'd help if I gave it more room by setting HBASE_HEAPSIZE.
> > Couldn't hurt to try that now...
> >
> > What's strange is running on m3.xlarge, which also has 15G of RAM but
> > fewer CPU cores, it runs fine.
> >
> > Thanks to you both for the insight!
> >
> > -- Josh
> >
> >
> >
> > On Thu, 2014-09-18 at 11:42 -0700, Gary Helmling wrote:
> >> What do you have HBASE_HEAPSIZE set to in hbase-env.sh?  Is it
> >> possible that you're overcommitting memory and the instance is
> >> swapping?  Just a shot in the dark, but I see that the m3.2xlarge
> >> instance has 30G of memory vs. 15G for c3.2xlarge.
> >>
> >> On Wed, Sep 17, 2014 at 3:28 PM, Ted Yu <yu...@gmail.com> wrote:
> >> > bq. there's almost no activity on either side
> >> >
> >> > During this period, can you capture stack trace for the region server and
> >> > pastebin the stack ?
> >> >
> >> > Cheers
> >> >
> >> > On Wed, Sep 17, 2014 at 3:21 PM, Josh Williams <jw...@endpoint.com>
> >> > wrote:
> >> >
> >> >> Hi, everyone.  Here's a strange one, at least to me.
> >> >>
> >> >> I'm doing some performance profiling, and as a rudimentary test I've
> >> >> been using YCSB to drive HBase (originally 0.98.3, recently updated to
> >> >> 0.98.6.)  The problem happens on a few different instance sizes, but
> >> >> this is probably the closest comparison...
> >> >>
> >> >> On m3.2xlarge instances, works as expected.
> >> >> On c3.2xlarge instances, HBase barely responds at all during workloads
> >> >> that involve read activity, falling silent for ~62 second intervals,
> >> >> with the YCSB throughput output resembling:
> >> >>
> >> >>  0 sec: 0 operations;
> >> >>  2 sec: 918 operations; 459 current ops/sec; [UPDATE
> >> >> AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
> >> >>  4 sec: 918 operations; 0 current ops/sec;
> >> >>  6 sec: 918 operations; 0 current ops/sec;
> >> >> <snip>
> >> >>  62 sec: 918 operations; 0 current ops/sec;
> >> >>  64 sec: 5302 operations; 2192 current ops/sec; [UPDATE
> >> >> AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
> >> >>  66 sec: 5302 operations; 0 current ops/sec;
> >> >>  68 sec: 5302 operations; 0 current ops/sec;
> >> >> (And so on...)
> >> >>
> >> >> While that happens there's almost no activity on either side, the CPU's
> >> >> and disks are idle, no iowait at all.
> >> >>
> >> >> There isn't much that jumps out at me when digging through the Hadoop
> >> >> and HBase logs, except that those 62-second intervals are often (but
> >> >> not always) associated with ClosedChannelExceptions in the regionserver
> >> >> logs.  But I believe that's just HBase finding that a TCP connection it
> >> >> wants to reply on had been closed.
> >> >>
> >> >> As far as I've seen this happens every time on this or any of the larger
> >> >> c3 class of instances, surprisingly.  The m3 instance class sizes all
> >> >> seem to work fine.  These are built with a custom AMI that has HBase and
> >> >> all installed, and run via a script, so the different instance type
> >> >> should be the only difference between them.
> >> >>
> >> >> Anyone seen anything like this?  Any pointers as to what I could look at
> >> >> to help diagnose this odd problem?  Could there be something I'm
> >> >> overlooking in the logs?
> >> >>
> >> >> Thanks!
> >> >>
> >> >> -- Josh
> >> >>
> >> >>
> >> >>
> >
> >
> 
> 
> 



Re: Performance oddity between AWS instance sizes

Posted by Andrew Purtell <ap...@apache.org>.
A 1 GB heap is nowhere near enough if you're trying to test something
real (or approximate it with YCSB). Try 4 or 8, anything up to 31 GB,
depending on the use case. >= 32 GB gives away compressed OOPs and maybe
brings GC issues.
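
For example, in conf/hbase-env.sh (the value is in MB, so this is
roughly an 8 GB heap; purely illustrative, size it to your workload):

  # give the region server an ~8 GB heap instead of the 1 GB default
  export HBASE_HEAPSIZE=8192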

Also, I recently redid the HBase YCSB client in a modern way for >=
0.98. See https://github.com/apurtell/YCSB/tree/new_hbase_client . It
performs, IMHO, in a more useful fashion than the previous one for what
YCSB is intended to do, but might need some tuning (I haven't tried it
on a cluster of significant size). One difference you should see is that
we won't back up for 30-60 seconds after a bunch of threads flush fat
12+ MB write buffers.
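
If you want to give it a spin, it's a standard Maven build; roughly:

  git clone https://github.com/apurtell/YCSB.git
  cd YCSB
  git checkout new_hbase_client
  mvn clean package -DskipTests   # build the bindings, skipping tests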

On Thu, Sep 18, 2014 at 2:31 PM, Josh Williams <jw...@endpoint.com> wrote:
> Ted,
>
> Stack trace, that's definitely a good idea.  Here's one jstack snapshot
> from the region server while there's no apparent activity going on:
> https://gist.github.com/joshwilliams/4950c1d92382ea7f3160
>
> If it's helpful, this is the YCSB side of the equation right around the
> same time:
> https://gist.github.com/joshwilliams/6fa3623088af9d1446a3
>
>
> And Gary,
>
> As far as the memory configuration, that's a good question.  Looks like
> HBASE_HEAPSIZE isn't set, which I now see has a default of 1GB.  There
> isn't any swap configured, and 12G of the memory on the instance is
> going to file cache, so there's definitely room to spare.
>
> Maybe it'd help if I gave it more room by setting HBASE_HEAPSIZE.
> Couldn't hurt to try that now...
>
> What's strange is running on m3.xlarge, which also has 15G of RAM but
> fewer CPU cores, it runs fine.
>
> Thanks to you both for the insight!
>
> -- Josh
>
>
>
> On Thu, 2014-09-18 at 11:42 -0700, Gary Helmling wrote:
>> What do you have HBASE_HEAPSIZE set to in hbase-env.sh?  Is it
>> possible that you're overcommitting memory and the instance is
>> swapping?  Just a shot in the dark, but I see that the m3.2xlarge
>> instance has 30G of memory vs. 15G for c3.2xlarge.
>>
>> On Wed, Sep 17, 2014 at 3:28 PM, Ted Yu <yu...@gmail.com> wrote:
>> > bq. there's almost no activity on either side
>> >
>> > During this period, can you capture stack trace for the region server and
>> > pastebin the stack ?
>> >
>> > Cheers
>> >
>> > On Wed, Sep 17, 2014 at 3:21 PM, Josh Williams <jw...@endpoint.com>
>> > wrote:
>> >
>> >> Hi, everyone.  Here's a strange one, at least to me.
>> >>
>> >> I'm doing some performance profiling, and as a rudimentary test I've
>> >> been using YCSB to drive HBase (originally 0.98.3, recently updated to
>> >> 0.98.6.)  The problem happens on a few different instance sizes, but
>> >> this is probably the closest comparison...
>> >>
>> >> On m3.2xlarge instances, works as expected.
>> >> On c3.2xlarge instances, HBase barely responds at all during workloads
>> >> that involve read activity, falling silent for ~62 second intervals,
>> >> with the YCSB throughput output resembling:
>> >>
>> >>  0 sec: 0 operations;
>> >>  2 sec: 918 operations; 459 current ops/sec; [UPDATE
>> >> AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
>> >>  4 sec: 918 operations; 0 current ops/sec;
>> >>  6 sec: 918 operations; 0 current ops/sec;
>> >> <snip>
>> >>  62 sec: 918 operations; 0 current ops/sec;
>> >>  64 sec: 5302 operations; 2192 current ops/sec; [UPDATE
>> >> AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
>> >>  66 sec: 5302 operations; 0 current ops/sec;
>> >>  68 sec: 5302 operations; 0 current ops/sec;
>> >> (And so on...)
>> >>
>> >> While that happens there's almost no activity on either side, the CPU's
>> >> and disks are idle, no iowait at all.
>> >>
>> >> There isn't much that jumps out at me when digging through the Hadoop
>> >> and HBase logs, except that those 62-second intervals are often (but
>> >> not always) associated with ClosedChannelExceptions in the regionserver
>> >> logs.  But I believe that's just HBase finding that a TCP connection it
>> >> wants to reply on had been closed.
>> >>
>> >> As far as I've seen this happens every time on this or any of the larger
>> >> c3 class of instances, surprisingly.  The m3 instance class sizes all
>> >> seem to work fine.  These are built with a custom AMI that has HBase and
>> >> all installed, and run via a script, so the different instance type
>> >> should be the only difference between them.
>> >>
>> >> Anyone seen anything like this?  Any pointers as to what I could look at
>> >> to help diagnose this odd problem?  Could there be something I'm
>> >> overlooking in the logs?
>> >>
>> >> Thanks!
>> >>
>> >> -- Josh
>> >>
>> >>
>> >>
>
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: Performance oddity between AWS instance sizes

Posted by Josh Williams <jw...@endpoint.com>.
Ted,

Stack trace, that's definitely a good idea.  Here's one jstack snapshot
from the region server while there's no apparent activity going on:
https://gist.github.com/joshwilliams/4950c1d92382ea7f3160

If it's helpful, this is the YCSB side of the equation right around the
same time:
https://gist.github.com/joshwilliams/6fa3623088af9d1446a3


And Gary,

As far as the memory configuration, that's a good question.  Looks like
HBASE_HEAPSIZE isn't set, which I now see has a default of 1GB.  There
isn't any swap configured, and 12G of the memory on the instance is
going to file cache, so there's definitely room to spare.

Maybe it'd help if I gave it more room by setting HBASE_HEAPSIZE.
Couldn't hurt to try that now...
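
(For what it's worth, this is how I'm sanity-checking what the region
server actually picked up; nothing fancy:)

  # confirm the -Xmx on the running region server process
  ps aux | grep [H]RegionServer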

What's strange is that on m3.xlarge, which also has 15G of RAM but
fewer CPU cores, it runs fine.

Thanks to you both for the insight!

-- Josh



On Thu, 2014-09-18 at 11:42 -0700, Gary Helmling wrote:
> What do you have HBASE_HEAPSIZE set to in hbase-env.sh?  Is it
> possible that you're overcommitting memory and the instance is
> swapping?  Just a shot in the dark, but I see that the m3.2xlarge
> instance has 30G of memory vs. 15G for c3.2xlarge.
> 
> On Wed, Sep 17, 2014 at 3:28 PM, Ted Yu <yu...@gmail.com> wrote:
> > bq. there's almost no activity on either side
> >
> > During this period, can you capture stack trace for the region server and
> > pastebin the stack ?
> >
> > Cheers
> >
> > On Wed, Sep 17, 2014 at 3:21 PM, Josh Williams <jw...@endpoint.com>
> > wrote:
> >
> >> Hi, everyone.  Here's a strange one, at least to me.
> >>
> >> I'm doing some performance profiling, and as a rudimentary test I've
> >> been using YCSB to drive HBase (originally 0.98.3, recently updated to
> >> 0.98.6.)  The problem happens on a few different instance sizes, but
> >> this is probably the closest comparison...
> >>
> >> On m3.2xlarge instances, works as expected.
> >> On c3.2xlarge instances, HBase barely responds at all during workloads
> >> that involve read activity, falling silent for ~62 second intervals,
> >> with the YCSB throughput output resembling:
> >>
> >>  0 sec: 0 operations;
> >>  2 sec: 918 operations; 459 current ops/sec; [UPDATE
> >> AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
> >>  4 sec: 918 operations; 0 current ops/sec;
> >>  6 sec: 918 operations; 0 current ops/sec;
> >> <snip>
> >>  62 sec: 918 operations; 0 current ops/sec;
> >>  64 sec: 5302 operations; 2192 current ops/sec; [UPDATE
> >> AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
> >>  66 sec: 5302 operations; 0 current ops/sec;
> >>  68 sec: 5302 operations; 0 current ops/sec;
> >> (And so on...)
> >>
> >> While that happens there's almost no activity on either side, the CPU's
> >> and disks are idle, no iowait at all.
> >>
> >> There isn't much that jumps out at me when digging through the Hadoop
> >> and HBase logs, except that those 62-second intervals are often (but
> >> not always) associated with ClosedChannelExceptions in the regionserver
> >> logs.  But I believe that's just HBase finding that a TCP connection it
> >> wants to reply on had been closed.
> >>
> >> As far as I've seen this happens every time on this or any of the larger
> >> c3 class of instances, surprisingly.  The m3 instance class sizes all
> >> seem to work fine.  These are built with a custom AMI that has HBase and
> >> all installed, and run via a script, so the different instance type
> >> should be the only difference between them.
> >>
> >> Anyone seen anything like this?  Any pointers as to what I could look at
> >> to help diagnose this odd problem?  Could there be something I'm
> >> overlooking in the logs?
> >>
> >> Thanks!
> >>
> >> -- Josh
> >>
> >>
> >>



Re: Performance oddity between AWS instance sizes

Posted by Gary Helmling <gh...@gmail.com>.
What do you have HBASE_HEAPSIZE set to in hbase-env.sh?  Is it
possible that you're overcommitting memory and the instance is
swapping?  Just a shot in the dark, but I see that the m3.2xlarge
instance has 30G of memory vs. 15G for c3.2xlarge.
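
An easy way to rule swapping in or out while one of those stalls is
happening (nothing HBase-specific, just the usual tools):

  free -m        # any swap configured / in use?
  vmstat 1 10    # watch the si/so columns for swap activity during a stall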

On Wed, Sep 17, 2014 at 3:28 PM, Ted Yu <yu...@gmail.com> wrote:
> bq. there's almost no activity on either side
>
> During this period, can you capture stack trace for the region server and
> pastebin the stack ?
>
> Cheers
>
> On Wed, Sep 17, 2014 at 3:21 PM, Josh Williams <jw...@endpoint.com>
> wrote:
>
>> Hi, everyone.  Here's a strange one, at least to me.
>>
>> I'm doing some performance profiling, and as a rudimentary test I've
>> been using YCSB to drive HBase (originally 0.98.3, recently updated to
>> 0.98.6.)  The problem happens on a few different instance sizes, but
>> this is probably the closest comparison...
>>
>> On m3.2xlarge instances, works as expected.
>> On c3.2xlarge instances, HBase barely responds at all during workloads
>> that involve read activity, falling silent for ~62 second intervals,
>> with the YCSB throughput output resembling:
>>
>>  0 sec: 0 operations;
>>  2 sec: 918 operations; 459 current ops/sec; [UPDATE
>> AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
>>  4 sec: 918 operations; 0 current ops/sec;
>>  6 sec: 918 operations; 0 current ops/sec;
>> <snip>
>>  62 sec: 918 operations; 0 current ops/sec;
>>  64 sec: 5302 operations; 2192 current ops/sec; [UPDATE
>> AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
>>  66 sec: 5302 operations; 0 current ops/sec;
>>  68 sec: 5302 operations; 0 current ops/sec;
>> (And so on...)
>>
>> While that happens there's almost no activity on either side, the CPU's
>> and disks are idle, no iowait at all.
>>
>> There isn't much that jumps out at me when digging through the Hadoop
>> and HBase logs, except that those 62-second intervals are often (but
>> not always) associated with ClosedChannelExceptions in the regionserver
>> logs.  But I believe that's just HBase finding that a TCP connection it
>> wants to reply on had been closed.
>>
>> As far as I've seen this happens every time on this or any of the larger
>> c3 class of instances, surprisingly.  The m3 instance class sizes all
>> seem to work fine.  These are built with a custom AMI that has HBase and
>> all installed, and run via a script, so the different instance type
>> should be the only difference between them.
>>
>> Anyone seen anything like this?  Any pointers as to what I could look at
>> to help diagnose this odd problem?  Could there be something I'm
>> overlooking in the logs?
>>
>> Thanks!
>>
>> -- Josh
>>
>>
>>

Re: Performance oddity between AWS instance sizes

Posted by Ted Yu <yu...@gmail.com>.
bq. there's almost no activity on either side

During this period, can you capture a stack trace for the region server
and pastebin it?
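
Something along these lines should do it (the pid lookup is just one way
to find the region server process):

  jps | grep HRegionServer                  # find the region server pid
  jstack <pid> > regionserver-jstack.txt    # dump the stack to a file to pastebin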

Cheers

On Wed, Sep 17, 2014 at 3:21 PM, Josh Williams <jw...@endpoint.com>
wrote:

> Hi, everyone.  Here's a strange one, at least to me.
>
> I'm doing some performance profiling, and as a rudimentary test I've
> been using YCSB to drive HBase (originally 0.98.3, recently updated to
> 0.98.6.)  The problem happens on a few different instance sizes, but
> this is probably the closest comparison...
>
> On m3.2xlarge instances, works as expected.
> On c3.2xlarge instances, HBase barely responds at all during workloads
> that involve read activity, falling silent for ~62 second intervals,
> with the YCSB throughput output resembling:
>
>  0 sec: 0 operations;
>  2 sec: 918 operations; 459 current ops/sec; [UPDATE
> AverageLatency(us)=1252778.39] [READ AverageLatency(us)=1034496.26]
>  4 sec: 918 operations; 0 current ops/sec;
>  6 sec: 918 operations; 0 current ops/sec;
> <snip>
>  62 sec: 918 operations; 0 current ops/sec;
>  64 sec: 5302 operations; 2192 current ops/sec; [UPDATE
> AverageLatency(us)=7715321.77] [READ AverageLatency(us)=7117905.56]
>  66 sec: 5302 operations; 0 current ops/sec;
>  68 sec: 5302 operations; 0 current ops/sec;
> (And so on...)
>
> While that happens there's almost no activity on either side, the CPU's
> and disks are idle, no iowait at all.
>
> There isn't much that jumps out at me when digging through the Hadoop
> and HBase logs, except that those 62-second intervals are often (but
> not always) associated with ClosedChannelExceptions in the regionserver
> logs.  But I believe that's just HBase finding that a TCP connection it
> wants to reply on had been closed.
>
> As far as I've seen this happens every time on this or any of the larger
> c3 class of instances, surprisingly.  The m3 instance class sizes all
> seem to work fine.  These are built with a custom AMI that has HBase and
> all installed, and run via a script, so the different instance type
> should be the only difference between them.
>
> Anyone seen anything like this?  Any pointers as to what I could look at
> to help diagnose this odd problem?  Could there be something I'm
> overlooking in the logs?
>
> Thanks!
>
> -- Josh
>
>
>