Posted to user@hbase.apache.org by Nicolas Thiébaud <ni...@captaindash.com> on 2012/07/02 17:50:54 UTC

Region servers fall after Zookeeper connectivity loss on EC2

Hi,

We have been successfully running a CDH3 HBase cluster on c1.xlarge
instances for over a month, but we recently started hitting what look like
connectivity issues in the cluster. ZooKeeper sessions are expired by the
ZooKeeper server, and the region servers throw a YouAreDeadException before
crashing.

Could this be attributed to GC? Is there anything I can do about it? I
am monitoring the Ganglia metrics but am unsure of their semantics (where
can I find them documented?).

I know that running HBase on EC2 is advised against, but we really need to
get this working.

Thanks,

Nicolas.

ZooKeeper log: http://pastebin.com/bVjrkRSL
RegionServer log: http://pastebin.com/fU81d8hr
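
One way to test the GC hypothesis raised above is to compare JVM pause
lengths with the ZooKeeper session timeout: a stop-the-world pause longer
than the timeout is a common cause of session expiry. The following is a
minimal sketch, assuming the region server JVM was started with the HotSpot
flag -XX:+PrintGCApplicationStoppedTime so that its GC log contains
"Total time for which application threads were stopped" lines. The function
name, the sample lines, and the 40-second threshold are hypothetical; the
real threshold is whatever session timeout the cluster actually negotiated
with ZooKeeper.

```python
import re

# Matches the line HotSpot prints when -XX:+PrintGCApplicationStoppedTime is enabled.
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"([0-9]+(?:\.[0-9]+)?) seconds"
)


def long_pauses(log_lines, threshold_seconds):
    """Return (line_number, pause_seconds) for each pause at or above the threshold."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        match = STOPPED_RE.search(line)
        if match is None:
            continue  # not a stopped-time line
        pause = float(match.group(1))
        if pause >= threshold_seconds:
            hits.append((number, pause))
    return hits


# Hypothetical sample lines, not taken from the pastebin logs in this thread.
sample = [
    "Total time for which application threads were stopped: 0.0123450 seconds",
    "Total time for which application threads were stopped: 45.2000000 seconds",
]

# Hypothetical threshold; replace with the cluster's effective session timeout.
SESSION_TIMEOUT_SECONDS = 40.0

print(long_pauses(sample, SESSION_TIMEOUT_SECONDS))
```

If no pause comes close to the session timeout around the time of the
expiry, GC becomes a less likely explanation, and the network or the host
itself is worth a closer look.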

Re: Region servers fall after Zookeeper connectivity loss on EC2

Posted by Kevin O'dell <ke...@cloudera.com>.
Wasn't there an EC2 outage or am I imagining things?

On Mon, Jul 2, 2012 at 2:30 PM, Norbert Burger <no...@gmail.com> wrote:
> From what I understand, the leap second bug could've hit anytime in the 24
> hours before 23:59:59.  We had it start happening early afternoon Sat on a
> few of our boxes.
>
> Norbert



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera

Re: Region servers fall after Zookeeper connectivity loss on EC2

Posted by Norbert Burger <no...@gmail.com>.
From what I understand, the leap second bug could've hit anytime in the 24
hours before 23:59:59.  We had it start happening early afternoon Sat on a
few of our boxes.

Norbert
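
The timing question in this thread can be turned into a quick check on
incident timestamps. This is a minimal sketch, assuming timestamps are
available as timezone-aware datetimes. The 2012 leap second was inserted at
the end of 30 June 2012 (UTC), and the 24-hour lead window follows the
description above. The function name and the 48-hour trailing window are
hypothetical choices for illustration only.

```python
from datetime import datetime, timedelta, timezone

# The leap second was inserted as 23:59:60 UTC on 30 June 2012. Python's
# datetime cannot represent second 60, so the following midnight is used.
LEAP_BOUNDARY = datetime(2012, 7, 1, 0, 0, 0, tzinfo=timezone.utc)


def in_leap_second_window(event_time, hours_before=24, hours_after=48):
    """Return True if event_time falls inside the suspect window.

    hours_before follows the thread's description; hours_after is a
    hypothetical allowance for hosts that kept misbehaving afterwards.
    """
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    start = LEAP_BOUNDARY - timedelta(hours=hours_before)
    end = LEAP_BOUNDARY + timedelta(hours=hours_after)
    return start <= event_time.astimezone(timezone.utc) <= end


# Example: an incident on Saturday afternoon, 30 June 2012 (UTC).
saturday_afternoon = datetime(2012, 6, 30, 14, 0, tzinfo=timezone.utc)
print(in_leap_second_window(saturday_afternoon))
```

A match only shows that the timing is consistent with the leap second; it
does not rule out GC pauses or network problems on its own.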

On Mon, Jul 2, 2012 at 12:58 PM, Kevin O'dell <ke...@cloudera.com> wrote:

> How recently would you say this is happening?  Did this start last Sat
> around midnight?

Re: Region servers fall after Zookeeper connectivity loss on EC2

Posted by Kevin O'dell <ke...@cloudera.com>.
How recently would you say this is happening?  Did this start last Sat
around midnight?

-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera

Re: Region servers fall after Zookeeper connectivity loss on EC2

Posted by Nicolas Thiébaud <ni...@captaindash.com>.
The issue does appear to be due to the leap second; we are investigating.

Thanks!
