
Posted to user@hbase.apache.org by Patrick Schless <pa...@gmail.com> on 2013/12/13 21:07:12 UTC

3-Hour Periodic Network/CPU/Disk/Latency Spikes

CDH4.1.2
HBase 0.92.1
HDFS 2.0.0


Every 3 hours, our production HBase cluster does something that causes all
the data nodes to have a sustained spike in CPU/network/disk. The spike
lasts about 30 mins, and during this time the cluster has greatly increased
latencies for our typical application usage.

I can't find anything in our application that would have such a periodic
and significant behavior. Is there anything that HBase/HDFS might be doing
on its own that would cause this? We're on the default schedule for major
compactions, but I thought that was daily.

Any ideas what could be causing this?

Thanks,

Patrick

Re: 3-Hour Periodic Network/CPU/Disk/Latency Spikes

Posted by Patrick Schless <pa...@gmail.com>.

Thanks for the tips. I'll experiment with this over the coming week and try
to get a script that won't hurt our performance too badly. I imagine most
people do this at off-peak times, but we don't have any, so we'll have to
figure out how to spread out the load as much as possible.


On Fri, Dec 13, 2013 at 6:53 PM, Vladimir Rodionov
<vr...@carrieriq.com> wrote:

> The HBase API lets you compact a table's regions independently.
> This can be a cron job, a script, or a client application that connects to
> the HBase cluster, selects regions, and triggers compaction,
> but you have to write this piece of software yourself.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: Patrick Schless [patrick.schless@gmail.com]
> Sent: Friday, December 13, 2013 4:36 PM
> To: user
> Subject: Re: 3-Hour Periodic Network/CPU/Disk/Latency Spikes
>
> Ah, sorry about the attachment (didn't realize they weren't allowed).
> Here's the picture I was trying to attach:
> http://www.plainlystated.com/hbase_major_compactions.png
>
> It sounds like you're right, Vladimir, about the compaction storm, though I
> don't understand why it's about every three hours instead of every day. In
> the book [1] I see the suggestion that they be managed manually. I don't
> see, however, any advice on what to do after turning auto-compaction off.
> Are there best practices around scheduling and monitoring the process?
>
> Thanks,
> Patrick
>
> [1]
>
> http://hbase.apache.org/book/important_configurations.html#managed.compactions
>
>
>
> Confidentiality Notice:  The information contained in this message,
> including any attachments hereto, may be confidential and is intended to be
> read only by the individual or entity to whom this message is addressed. If
> the reader of this message is not the intended recipient or an agent or
> designee of the intended recipient, please note that any review, use,
> disclosure or distribution of this message or its attachments, in any form,
> is strictly prohibited.  If you have received this message in error, please
> immediately notify the sender and/or Notifications@carrieriq.com and
> delete or destroy any copy of this message and its attachments.
>

RE: 3-Hour Periodic Network/CPU/Disk/Latency Spikes

Posted by Vladimir Rodionov <vr...@carrieriq.com>.

The HBase API lets you compact a table's regions independently.
This can be a cron job, a script, or a client application that connects to the HBase cluster, selects regions, and triggers compaction,
but you have to write this piece of software yourself.
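
A minimal sketch of what such a script could look like, assuming Python on a
client machine and a list of region names that has already been collected (the
names below are hypothetical). It spreads one major compaction per region
evenly over a time window and prints HBase shell major_compact commands; how
those commands get executed at the planned times (cron, a wrapper around the
HBase shell, a Java client) is left open.

```python
from datetime import datetime, timedelta


def stagger_compactions(regions, start, window):
    """Assign each region a start time, evenly spaced across `window`.

    Returns a list of (datetime, region_name) pairs in input order.
    """
    if not regions:
        return []
    step = window / len(regions)  # time between consecutive compactions
    return [(start + i * step, region) for i, region in enumerate(regions)]


def to_shell_commands(schedule):
    """Render each scheduled region as an HBase shell major_compact command."""
    return [f"major_compact '{region}'" for _, region in schedule]


if __name__ == "__main__":
    # Hypothetical region names; real ones would come from the cluster.
    regions = ["region-a", "region-b", "region-c", "region-d"]
    plan = stagger_compactions(regions, datetime(2013, 12, 14), timedelta(hours=24))
    for (when, _), command in zip(plan, to_shell_commands(plan)):
        print(when.isoformat(), command)
```

With no off-peak period available, widening the window (or compacting only a
subset of regions per day) keeps fewer regions compacting at once, at the cost
of each region being compacted less often.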

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Patrick Schless [patrick.schless@gmail.com]
Sent: Friday, December 13, 2013 4:36 PM
To: user
Subject: Re: 3-Hour Periodic Network/CPU/Disk/Latency Spikes

Ah, sorry about the attachment (didn't realize they weren't allowed).
Here's the picture I was trying to attach:
http://www.plainlystated.com/hbase_major_compactions.png

It sounds like you're right, Vladimir, about the compaction storm, though I
don't understand why it's about every three hours instead of every day. In
the book [1] I see the suggestion that they be managed manually. I don't
see, however, any advice on what to do after turning auto-compaction off.
Are there best practices around scheduling and monitoring the process?

Thanks,
Patrick

[1]
http://hbase.apache.org/book/important_configurations.html#managed.compactions




Re: 3-Hour Periodic Network/CPU/Disk/Latency Spikes

Posted by Patrick Schless <pa...@gmail.com>.

Ah, sorry about the attachment (didn't realize they weren't allowed).
Here's the picture I was trying to attach:
http://www.plainlystated.com/hbase_major_compactions.png

It sounds like you're right, Vladimir, about the compaction storm, though I
don't understand why it's about every three hours instead of every day. In
the book [1] I see the suggestion that they be managed manually. I don't
see, however, any advice on what to do after turning auto-compaction off.
Are there best practices around scheduling and monitoring the process?

Thanks,
Patrick

[1]
http://hbase.apache.org/book/important_configurations.html#managed.compactions


On Fri, Dec 13, 2013 at 6:07 PM, Vladimir Rodionov
<vr...@carrieriq.com> wrote:

> You forgot to mention that it won't go through because the Apache mail
> server blocks attachments.
> What Patrick is observing is called a compaction storm. The best approach
> (as of 0.92.x) is to disable automatic compactions
> and manage them manually (see the HBase book for how to do this).
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
> From: Ted Yu [yuzhihong@gmail.com]
> Sent: Friday, December 13, 2013 3:33 PM
> To: user@hbase.apache.org
> Cc: user
> Subject: Re: 3-Hour Periodic Network/CPU/Disk/Latency Spikes
>
> Patrick:
> Attachment didn't go through.
>
> Cheers
>
> On Dec 13, 2013, at 3:18 PM, Patrick Schless <pa...@gmail.com>
> wrote:
>
> > Very interesting, I think we may be on to something. I grabbed all the
> > timestamps for major compactions completing and put them on a graph (see
> > attached). Each horizontal line is an individual server, and the dots are
> > when compactions complete. Each server clearly has a cluster of compactions
> > about every 3 hours, and several of the servers are aligned such that they
> > are compacting at the same time.
> >
> > Should we be managing these compactions ourselves? Would it make more
> > sense to have them less frequently (but presumably more expensive), or
> > closer together?
> >
> > Thanks,
> > Patrick
> >
> >
> > On Fri, Dec 13, 2013 at 2:19 PM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:
> >> Have you taken a look at the logs on the RegionServers during the period?
> >>
> >> One possibility is compactions happening organically.  If you were
> >> sustaining a certain level of writes most of the time, I could maybe see
> >> that every 3 hours enough store files build up to require compactions.
> >>
> >> There's nothing else automated in HDFS or HBase that I could see causing
> >> this.
> >>
> >> On Fri, Dec 13, 2013 at 3:07 PM, Patrick Schless
> >> <pa...@gmail.com> wrote:
> >>
> >> > CDH4.1.2
> >> > HBase 0.92.1
> >> > HDFS 2.0.0
> >> >
> >> >
> >> > Every 3 hours, our production HBase cluster does something that causes all
> >> > the data nodes to have a sustained spike in CPU/network/disk. The spike
> >> > lasts about 30 mins, and during this time the cluster has greatly increased
> >> > latencies for our typical application usage.
> >> >
> >> > I can't find anything in our application that would have such a periodic
> >> > and significant behavior. Is there anything that HBase/HDFS might be doing
> >> > on its own that would cause this? We're on the default schedule for major
> >> > compactions, but I thought that was daily.
> >> >
> >> > Any ideas what could be causing this?
> >> >
> >> > Thanks,
> >> >
> >> > Patrick
> >> >
> >
>
>

RE: 3-Hour Periodic Network/CPU/Disk/Latency Spikes

Posted by Vladimir Rodionov <vr...@carrieriq.com>.

You forgot to mention that it won't go through because the Apache mail server blocks attachments.
What Patrick is observing is called a compaction storm. The best approach (as of 0.92.x) is to disable automatic compactions
and manage them manually (see the HBase book for how to do this).
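
For reference, a sketch of the setting the HBase book describes for this,
assuming it goes in hbase-site.xml on the region servers. A value of 0 turns
off the time-based major compactions (the default interval in this release
line is 86400000 ms, i.e. one day). Confirm the property against the
documentation for your version before relying on it.

```xml
<!-- hbase-site.xml: disable periodic (time-based) major compactions.
     Major compactions then need to be triggered manually or by a script. -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>
```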

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Ted Yu [yuzhihong@gmail.com]
Sent: Friday, December 13, 2013 3:33 PM
To: user@hbase.apache.org
Cc: user
Subject: Re: 3-Hour Periodic Network/CPU/Disk/Latency Spikes

Patrick:
Attachment didn't go through.

Cheers

On Dec 13, 2013, at 3:18 PM, Patrick Schless <pa...@gmail.com> wrote:

> Very interesting, I think we may be on to something. I grabbed all the timestamps for major compactions completing and put them on a graph (see attached). Each horizontal line is an individual server, and the dots are when compactions complete. Each server clearly has a cluster of compactions about every 3 hours, and several of the servers are aligned such that they are compacting at the same time.
>
> Should we be managing these compactions ourselves? Would it make more sense to have them less frequently (but presumably more expensive), or closer together?
>
> Thanks,
> Patrick
>
>
> On Fri, Dec 13, 2013 at 2:19 PM, Bryan Beaudreault <bb...@hubspot.com> wrote:
>> Have you taken a look at the logs on the RegionServers during the period?
>>
>> One possibility is compactions happening organically.  If you were
>> sustaining a certain level of writes most of the time, I could maybe see
>> that every 3 hours enough store files build up to require compactions.
>>
>> There's nothing else automated in HDFS or HBase that I could see causing
>> this.
>>
>> On Fri, Dec 13, 2013 at 3:07 PM, Patrick Schless
>> <pa...@gmail.com> wrote:
>>
>> > CDH4.1.2
>> > HBase 0.92.1
>> > HDFS 2.0.0
>> >
>> >
>> > Every 3 hours, our production HBase cluster does something that causes all
>> > the data nodes to have a sustained spike in CPU/network/disk. The spike
>> > lasts about 30 mins, and during this time the cluster has greatly increased
>> > latencies for our typical application usage.
>> >
>> > I can't find anything in our application that would have such a periodic
>> > and significant behavior. Is there anything that HBase/HDFS might be doing
>> > on its own that would cause this? We're on the default schedule for major
>> > compactions, but I thought that was daily.
>> >
>> > Any ideas what could be causing this?
>> >
>> > Thanks,
>> >
>> > Patrick
>> >
>


Re: 3-Hour Periodic Network/CPU/Disk/Latency Spikes

Posted by Ted Yu <yu...@gmail.com>.

Patrick:
Attachment didn't go through. 

Cheers

On Dec 13, 2013, at 3:18 PM, Patrick Schless <pa...@gmail.com> wrote:

> Very interesting, I think we may be on to something. I grabbed all the timestamps for major compactions completing and put them on a graph (see attached). Each horizontal line is an individual server, and the dots are when compactions complete. Each server clearly has a cluster of compactions about every 3 hours, and several of the servers are aligned such that they are compacting at the same time.
> 
> Should we be managing these compactions ourselves? Would it make more sense to have them less frequently (but presumably more expensive), or closer together?
> 
> Thanks,
> Patrick
> 
> 
> On Fri, Dec 13, 2013 at 2:19 PM, Bryan Beaudreault <bb...@hubspot.com> wrote:
>> Have you taken a look at the logs on the RegionServers during the period?
>> 
>> One possibility is compactions happening organically.  If you were
>> sustaining a certain level of writes most of the time, I could maybe see
>> that every 3 hours enough store files build up to require compactions.
>> 
>> There's nothing else automated in HDFS or HBase that I could see causing
>> this.
>> 
>> On Fri, Dec 13, 2013 at 3:07 PM, Patrick Schless
>> <pa...@gmail.com> wrote:
>> 
>> > CDH4.1.2
>> > HBase 0.92.1
>> > HDFS 2.0.0
>> >
>> >
>> > Every 3 hours, our production HBase cluster does something that causes all
>> > the data nodes to have a sustained spike in CPU/network/disk. The spike
>> > lasts about 30 mins, and during this time the cluster has greatly increased
>> > latencies for our typical application usage.
>> >
>> > I can't find anything in our application that would have such a periodic
>> > and significant behavior. Is there anything that HBase/HDFS might be doing
>> > on its own that would cause this? We're on the default schedule for major
>> > compactions, but I thought that was daily.
>> >
>> > Any ideas what could be causing this?
>> >
>> > Thanks,
>> >
>> > Patrick
>> >
>

Re: 3-Hour Periodic Network/CPU/Disk/Latency Spikes

Posted by Patrick Schless <pa...@gmail.com>.

Very interesting, I think we may be on to something. I grabbed all the
timestamps for major compactions completing and put them on a graph (see
attached). Each horizontal line is an individual server, and the dots are
when compactions complete. Each server clearly has a cluster of compactions
about every 3 hours, and several of the servers are aligned such that they
are compacting at the same time.
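
A rough sketch of how this kind of analysis could be reproduced, assuming
RegionServer log lines that begin with a "YYYY-MM-DD HH:MM:SS" timestamp and
mention "major compaction" when one completes. Both the log shape and the
45-minute gap used to separate bursts are assumptions; adjust them to the
actual logs.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix, e.g. "2013-12-13 15:07:12,123 INFO ..."
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def compaction_times(lines, marker="major compaction"):
    """Return sorted timestamps of lines that mention `marker`."""
    times = []
    for line in lines:
        if marker not in line.lower():
            continue
        match = STAMP.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return sorted(times)


def burst_starts(times, gap=timedelta(minutes=45)):
    """Group sorted timestamps into bursts; a pause longer than `gap` starts a new one."""
    starts = []
    previous = None
    for t in times:
        if previous is None or t - previous > gap:
            starts.append(t)
        previous = t
    return starts


def burst_intervals(starts):
    """Time between the starts of consecutive bursts."""
    return [later - earlier for earlier, later in zip(starts, starts[1:])]
```

Running this per server and comparing the burst start times across servers
shows whether the bursts line up, which is what the graph was meant to show.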

Should we be managing these compactions ourselves? Would it make more sense
to have them less frequently (but presumably more expensive), or closer
together?

Thanks,
Patrick

On Fri, Dec 13, 2013 at 2:19 PM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:

> Have you taken a look at the logs on the RegionServers during the period?
>
> One possibility is compactions happening organically.  If you were
> sustaining a certain level of writes most of the time, I could maybe see
> that every 3 hours enough store files build up to require compactions.
>
> There's nothing else automated in HDFS or HBase that I could see causing
> this.
>
> On Fri, Dec 13, 2013 at 3:07 PM, Patrick Schless
> <pa...@gmail.com> wrote:
>
> > CDH4.1.2
> > HBase 0.92.1
> > HDFS 2.0.0
> >
> >
> > Every 3 hours, our production HBase cluster does something that causes all
> > the data nodes to have a sustained spike in CPU/network/disk. The spike
> > lasts about 30 mins, and during this time the cluster has greatly increased
> > latencies for our typical application usage.
> >
> > I can't find anything in our application that would have such a periodic
> > and significant behavior. Is there anything that HBase/HDFS might be doing
> > on its own that would cause this? We're on the default schedule for major
> > compactions, but I thought that was daily.
> >
> > Any ideas what could be causing this?
> >
> > Thanks,
> >
> > Patrick
> >
>

Re: 3-Hour Periodic Network/CPU/Disk/Latency Spikes

Posted by Bryan Beaudreault <bb...@hubspot.com>.

Have you taken a look at the logs on the RegionServers during the period?

One possibility is compactions happening organically.  If you were
sustaining a certain level of writes most of the time, I could maybe see
that every 3 hours enough store files build up to require compactions.
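
As a back-of-the-envelope illustration of that idea, here is a small sketch
that estimates how long a single store takes to accumulate enough store files
to reach a compaction threshold. It assumes every memstore flush writes one
store file of a fixed size and that flushes are driven only by write volume;
real clusters also flush for other reasons, so treat the result as a rough
order of magnitude. All numbers are hypothetical.

```python
def hours_until_compaction(write_mb_per_hour, flush_size_mb, file_threshold):
    """Estimate hours for one store to accumulate `file_threshold` store files.

    Assumes each flush produces exactly one store file of `flush_size_mb`
    and that the write rate into the store is steady.
    """
    if write_mb_per_hour <= 0 or flush_size_mb <= 0 or file_threshold <= 0:
        raise ValueError("all arguments must be positive")
    flushes_per_hour = write_mb_per_hour / flush_size_mb
    return file_threshold / flushes_per_hour


if __name__ == "__main__":
    # Hypothetical: 128 MB/hour into a store, 128 MB flushes, threshold of 3 files.
    print(hours_until_compaction(128, 128, 3))  # prints 3.0
```

Under these made-up numbers a steady write load would hit the threshold about
every three hours, which is the kind of periodicity described above.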

There's nothing else automated in HDFS or HBase that I could see causing
this.

On Fri, Dec 13, 2013 at 3:07 PM, Patrick Schless
<pa...@gmail.com> wrote:

> CDH4.1.2
> HBase 0.92.1
> HDFS 2.0.0
>
>
> Every 3 hours, our production HBase cluster does something that causes all
> the data nodes to have a sustained spike in CPU/network/disk. The spike
> lasts about 30 mins, and during this time the cluster has greatly increased
> latencies for our typical application usage.
>
> I can't find anything in our application that would have such a periodic
> and significant behavior. Is there anything that HBase/HDFS might be doing
> on its own that would cause this? We're on the default schedule for major
> compactions, but I thought that was daily.
>
> Any ideas what could be causing this?
>
> Thanks,
>
> Patrick
>