Posted to user@hbase.apache.org by Rohit Kelkar <ro...@gmail.com> on 2014/02/26 19:55:37 UTC

region server dead and datanode block movement error

We are running HBase 0.94.2 on the Hadoop 0.20-append version in production
(yes, we have plans to upgrade Hadoop). It's a 5-node cluster, with a 6th node
running just the namenode and the HMaster.
I am seeing frequent RS YouAreDeadExceptions. Logs here
http://pastebin.com/44aFyYZV
The RS log shows a DFSOutputStream ResponseProcessor exception for block
blk_-6695300470410774365_837638 (java.io.EOFException) at 13:41:00, followed
by a YouAreDeadException at the same time.
I grep'ed for this block in the datanode log (see
http://pastebin.com/2jfwCfcK). At 13:41:00 I see an exception in
receiveBlock for block blk_-6695300470410774365_837638:
java.nio.channels.ClosedByInterruptException.
I have also attached the namenode logs around the block here
http://pastebin.com/9NE9J8s1

Across several RS failure instances I see the following pattern: the
region server's YouAreDeadException is always preceded by the EOFException
and the datanode's ClosedByInterruptException.

Is the error in the movement of the block causing the region server to
report a YouAreDeadException? And of course, how do I solve this?

- R
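One way to check this correlation systematically across incidents is to pull every line mentioning the block id out of the RS and DN logs and sort them by timestamp. A minimal sketch (the helper and the file paths are hypothetical; the timestamp format follows the log excerpts quoted in this thread):

```python
import re
from datetime import datetime

# Timestamp format used by the Hadoop/HBase logs quoted in this thread,
# e.g. "2014-02-21 13:41:00,272 WARN ..."
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}")

def events_for_block(lines, block_id):
    """Return sorted (timestamp, line) pairs for lines mentioning block_id."""
    events = []
    for line in lines:
        if block_id in line and (m := TS_RE.match(line)):
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((ts, line.strip()))
    return sorted(events)

# Usage (paths are placeholders):
# for log in ("regionserver.log", "datanode.log"):
#     with open(log) as f:
#         for ts, line in events_for_block(f, "blk_-6695300470410774365"):
#             print(log, ts, line)
```

Lining up both logs this way makes it easier to see which exception actually comes first across all the failure instances.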

Re: region server dead and datanode block movement error

Posted by Rohit Kelkar <ro...@gmail.com>.
Yes. Under the same conditions (dataset size, etc.) the issue occurred 4 out
of 5 times and brought the region server down with a YouAreDeadException.
That's why I started digging into the DN and NN logs, and I could see the
common pattern mentioned in my first mail.

- R


On Thu, Feb 27, 2014 at 11:09 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> so you might want to get some metrics over time, like using Ganglia or
> anything else. To track memory usage and network availability.
>
> are you often facing this issue? Is it "easy" for you to reproduce it?
>
>
> 2014-02-27 12:05 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:
>
> > Oh yes and forgot to add the ZK process
> > ZK = 5GB
> >
> > Total = 45GB
> >
> >
> > On Thu, Feb 27, 2014 at 11:01 AM, Rohit Kelkar <rohitkelkar@gmail.com
> > >wrote:
> >
> > > Hi Jean-Marc,
> > >
> > > Each node has 48GB RAM
> > > To isolate and debug the RS failure issue, we have switched off all
> other
> > > tools. The only processes running are
> > > - DN = 4GB
> > > - RS = 6GB
> > > - TT = 4GB
> > > - num mappers available on the node = 4 * 4GB = 16GB
> > > - num reducers available on the node = 2 * 4GB = 8GB
> > > - 4 other java processes unrelated to hadoop/hbase = 512MB * 4 = 2GB
> > >
> > > Total = 40GB
> > >
> > >
> > > On Thu, Feb 27, 2014 at 10:42 AM, Jean-Marc Spaggiari <
> > > jean-marc@spaggiari.org> wrote:
> > >
> > >> 2014-02-21 13:36:27,496 WARN org.apache.hadoop.ipc.HBaseServer:
> > >> (responseTooSlow):
> > >> {"processingtimems":41236,"call":"next(-8680499896692404689, 1), rpc
> > >> version=1, client version=29, methodsFingerPrint=54742778","client":"
> > >> 10.0.0.96:46618
> > >>
> > >>
> >
> ","starttimems":1393007746259,"queuetimems":0,"class":"HRegionServer","responsesize":6,"method":"next"}
> > >> 2014-02-21 13:41:00,272 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > >> slept
> > >> 10193644ms instead of 10000000ms, this is likely due to a long garbage
> > >> collecting pause and it's usually bad, see
> > >> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >>
> > >> Your issue is clearly this.
> > >>
> > >> For the swap, it's not because you set swappiness that Linux will not
> > >> swap.
> > >> It will try to not swap, but if it really has to, it will.
> > >>
> > >> How many GB on your server? How many for the DN,for th RS, etc. any TT
> > on
> > >> them? Any other tool? If TT, how many slots? How many GB per slots?
> > >>
> > >> JM
> > >>
> > >>
> > >> 2014-02-27 11:37 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:
> > >>
> > >> > Hi Jean-Marc,
> > >> >
> > >> > I have updated the RS log here (http://pastebin.com/bVDvMvrB) with
> > >> events
> > >> > before 13:41:00. In the log I see a few responseTooSlow warnings at
> > >> > 13:34:00, 13:36:00. Then no activity till 13:41:00.
> > >> > At 13:41:00 there is a Sleeper warning - WARN
> > >> > org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of
> > >> > 10000000ms, this is likely due to a long garbage collecting pause
> and
> > >> it's
> > >> > usually bad, see ...
> > >> > Followed by - INFO org.apache.zookeeper.ClientCnxn: Client session
> > timed
> > >> > out, have not heard from server in 260409ms for sessionid
> > >> > 0x34432befe5417d2, closing socket connection and attempting
> reconnect.
> > >> >
> > >> > Looking at some of the reasons you mentioned -
> > >> > 1. I analyzed the GC logs for this RS. In the last 10 mins before
> the
> > RS
> > >> > went down, the GC times are less than 1 sec. Nothing that will take
> > >> 260409
> > >> > ms as indicated above in the RS log.
> > >> > 2. The RS node has swappiness set to 0
> > >> > 3. So I think I should investigate the possibility of network
> issues.
> > >> Any
> > >> > pointers where I could start?
> > >> >
> > >> > - R
> > >> >
> > >> > On Thu, Feb 27, 2014 at 10:17 AM, Jean-Marc Spaggiari <
> > >> > jean-marc@spaggiari.org> wrote:
> > >> >
> > >> > > Hi Rohit,
> > >> > >
> > >> > > Usually YouAreDeadException is when your RegionServer is to slow.
> It
> > >> gets
> > >> > > kicked out by Master+ZK but then try to join back and get informed
> > it
> > >> has
> > >> > > bene kicked out.
> > >> > >
> > >> > > Reasons:
> > >> > > - Long Gargabe Collection;
> > >> > > - Swapping;
> > >> > > - Network issues (get disconnected, then re-connected);
> > >> > > - etc.
> > >> > >
> > >> > > what do you have before 2014-02-21 13:41:00,308 in the logs?
> > >> > >
> > >> > >
> > >> > > 2014-02-27 11:13 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:
> > >> > >
> > >> > > > Hi, has anybody been facing similar issues?
> > >> > > >
> > >> > > > - R
> > >> > > >
> > >> > > >
> > >> > > > On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <
> > >> rohitkelkar@gmail.com
> > >> > > > >wrote:
> > >> > > >
> > >> > > > > We are running hbase 0.94.2 on hadoop 0.20 append version in
> > >> > production
> > >> > > > > (yes we have plans to upgrade hadoop). Its a 5 node cluster
> and
> > a
> > >> 6th
> > >> > > > node
> > >> > > > > running just the name node and hmaster.
> > >> > > > > I am seeing frequent RS YouAreDeadExceptions. Logs here
> > >> > > > > http://pastebin.com/44aFyYZV
> > >> > > > > The RS log shows a DFSOutputStream ResponseProcessor exception
> > >>  for
> > >> > > block
> > >> > > > > blk_-6695300470410774365_837638 java.io.EOFException at
> 13:41:00
> > >> > > followed
> > >> > > > > by YouAreDeadException at the same time.
> > >> > > > > I grep'ed this block in the Datanode (see log here
> > >> > > > > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception
> > in
> > >> > > > > receiveBlock for block blk_-6695300470410774365_837638
> > >> > > > > java.nio.channels.ClosedByInterruptException.
> > >> > > > > I have also attached the namenode logs around the block here
> > >> > > > > http://pastebin.com/9NE9J8s1
> > >> > > > >
> > >> > > > > Across several RS failure instances I see the following
> pattern
> > -
> > >> the
> > >> > > > > region server YouAreDeadException is always preceeded by the
> > >> > > EOFException
> > >> > > > > and datanode ClosedByInterruptException
> > >> > > > >
> > >> > > > > Is the error in the movement of the block causing the region
> > >> server
> > >> > to
> > >> > > > > report a YouAreDeadException? And of course, how do I solve
> > this?
> > >> > > > >
> > >> > > > > - R
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: region server dead and datanode block movement error

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
So you might want to collect some metrics over time, for example with
Ganglia, to track memory usage and network availability.

Are you often facing this issue? Is it "easy" for you to reproduce?


2014-02-27 12:05 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:

> Oh yes and forgot to add the ZK process
> ZK = 5GB
>
> Total = 45GB
>
>
> On Thu, Feb 27, 2014 at 11:01 AM, Rohit Kelkar <rohitkelkar@gmail.com
> >wrote:
>
> > Hi Jean-Marc,
> >
> > Each node has 48GB RAM
> > To isolate and debug the RS failure issue, we have switched off all other
> > tools. The only processes running are
> > - DN = 4GB
> > - RS = 6GB
> > - TT = 4GB
> > - num mappers available on the node = 4 * 4GB = 16GB
> > - num reducers available on the node = 2 * 4GB = 8GB
> > - 4 other java processes unrelated to hadoop/hbase = 512MB * 4 = 2GB
> >
> > Total = 40GB
> >
> >
> > On Thu, Feb 27, 2014 at 10:42 AM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> >> 2014-02-21 13:36:27,496 WARN org.apache.hadoop.ipc.HBaseServer:
> >> (responseTooSlow):
> >> {"processingtimems":41236,"call":"next(-8680499896692404689, 1), rpc
> >> version=1, client version=29, methodsFingerPrint=54742778","client":"
> >> 10.0.0.96:46618
> >>
> >>
> ","starttimems":1393007746259,"queuetimems":0,"class":"HRegionServer","responsesize":6,"method":"next"}
> >> 2014-02-21 13:41:00,272 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >> slept
> >> 10193644ms instead of 10000000ms, this is likely due to a long garbage
> >> collecting pause and it's usually bad, see
> >> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>
> >> Your issue is clearly this.
> >>
> >> For the swap, it's not because you set swappiness that Linux will not
> >> swap.
> >> It will try to not swap, but if it really has to, it will.
> >>
> >> How many GB on your server? How many for the DN,for th RS, etc. any TT
> on
> >> them? Any other tool? If TT, how many slots? How many GB per slots?
> >>
> >> JM
> >>
> >>
> >> 2014-02-27 11:37 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:
> >>
> >> > Hi Jean-Marc,
> >> >
> >> > I have updated the RS log here (http://pastebin.com/bVDvMvrB) with
> >> events
> >> > before 13:41:00. In the log I see a few responseTooSlow warnings at
> >> > 13:34:00, 13:36:00. Then no activity till 13:41:00.
> >> > At 13:41:00 there is a Sleeper warning - WARN
> >> > org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of
> >> > 10000000ms, this is likely due to a long garbage collecting pause and
> >> it's
> >> > usually bad, see ...
> >> > Followed by - INFO org.apache.zookeeper.ClientCnxn: Client session
> timed
> >> > out, have not heard from server in 260409ms for sessionid
> >> > 0x34432befe5417d2, closing socket connection and attempting reconnect.
> >> >
> >> > Looking at some of the reasons you mentioned -
> >> > 1. I analyzed the GC logs for this RS. In the last 10 mins before the
> RS
> >> > went down, the GC times are less than 1 sec. Nothing that will take
> >> 260409
> >> > ms as indicated above in the RS log.
> >> > 2. The RS node has swappiness set to 0
> >> > 3. So I think I should investigate the possibility of network issues.
> >> Any
> >> > pointers where I could start?
> >> >
> >> > - R
> >> >
> >> > On Thu, Feb 27, 2014 at 10:17 AM, Jean-Marc Spaggiari <
> >> > jean-marc@spaggiari.org> wrote:
> >> >
> >> > > Hi Rohit,
> >> > >
> >> > > Usually YouAreDeadException is when your RegionServer is to slow. It
> >> gets
> >> > > kicked out by Master+ZK but then try to join back and get informed
> it
> >> has
> >> > > bene kicked out.
> >> > >
> >> > > Reasons:
> >> > > - Long Gargabe Collection;
> >> > > - Swapping;
> >> > > - Network issues (get disconnected, then re-connected);
> >> > > - etc.
> >> > >
> >> > > what do you have before 2014-02-21 13:41:00,308 in the logs?
> >> > >
> >> > >
> >> > > 2014-02-27 11:13 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:
> >> > >
> >> > > > Hi, has anybody been facing similar issues?
> >> > > >
> >> > > > - R
> >> > > >
> >> > > >
> >> > > > On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <
> >> rohitkelkar@gmail.com
> >> > > > >wrote:
> >> > > >
> >> > > > > We are running hbase 0.94.2 on hadoop 0.20 append version in
> >> > production
> >> > > > > (yes we have plans to upgrade hadoop). Its a 5 node cluster and
> a
> >> 6th
> >> > > > node
> >> > > > > running just the name node and hmaster.
> >> > > > > I am seeing frequent RS YouAreDeadExceptions. Logs here
> >> > > > > http://pastebin.com/44aFyYZV
> >> > > > > The RS log shows a DFSOutputStream ResponseProcessor exception
> >>  for
> >> > > block
> >> > > > > blk_-6695300470410774365_837638 java.io.EOFException at 13:41:00
> >> > > followed
> >> > > > > by YouAreDeadException at the same time.
> >> > > > > I grep'ed this block in the Datanode (see log here
> >> > > > > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception
> in
> >> > > > > receiveBlock for block blk_-6695300470410774365_837638
> >> > > > > java.nio.channels.ClosedByInterruptException.
> >> > > > > I have also attached the namenode logs around the block here
> >> > > > > http://pastebin.com/9NE9J8s1
> >> > > > >
> >> > > > > Across several RS failure instances I see the following pattern
> -
> >> the
> >> > > > > region server YouAreDeadException is always preceeded by the
> >> > > EOFException
> >> > > > > and datanode ClosedByInterruptException
> >> > > > >
> >> > > > > Is the error in the movement of the block causing the region
> >> server
> >> > to
> >> > > > > report a YouAreDeadException? And of course, how do I solve
> this?
> >> > > > >
> >> > > > > - R
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: region server dead and datanode block movement error

Posted by Rohit Kelkar <ro...@gmail.com>.
Oh yes, and I forgot to add the ZK process:
ZK = 5GB

Total = 45GB
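Tallying the per-node budget from this thread shows why this total matters: the listed processes commit 45GB of the 48GB node, leaving only ~3GB of headroom for the OS and page cache. A back-of-the-envelope sketch (numbers taken from the thread):

```python
# Numbers are taken from this thread; the point is how little of the
# 48GB node is left for the OS, page cache, and everything else.
node_ram_gb = 48
processes_gb = {
    "DN": 4,
    "RS": 6,
    "TT": 4,
    "mappers (4 x 4GB)": 16,
    "reducers (2 x 4GB)": 8,
    "other java (4 x 512MB)": 2,
    "ZK": 5,
}
committed = sum(processes_gb.values())
headroom = node_ram_gb - committed
print(f"committed = {committed}GB, headroom = {headroom}GB")
# prints: committed = 45GB, headroom = 3GB
```

With so little headroom, memory pressure when the mapper/reducer slots are full could plausibly push the box into swap even with swappiness set to 0.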


On Thu, Feb 27, 2014 at 11:01 AM, Rohit Kelkar <ro...@gmail.com>wrote:

> Hi Jean-Marc,
>
> Each node has 48GB RAM
> To isolate and debug the RS failure issue, we have switched off all other
> tools. The only processes running are
> - DN = 4GB
> - RS = 6GB
> - TT = 4GB
> - num mappers available on the node = 4 * 4GB = 16GB
> - num reducers available on the node = 2 * 4GB = 8GB
> - 4 other java processes unrelated to hadoop/hbase = 512MB * 4 = 2GB
>
> Total = 40GB
>
>
> On Thu, Feb 27, 2014 at 10:42 AM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
>> 2014-02-21 13:36:27,496 WARN org.apache.hadoop.ipc.HBaseServer:
>> (responseTooSlow):
>> {"processingtimems":41236,"call":"next(-8680499896692404689, 1), rpc
>> version=1, client version=29, methodsFingerPrint=54742778","client":"
>> 10.0.0.96:46618
>>
>> ","starttimems":1393007746259,"queuetimems":0,"class":"HRegionServer","responsesize":6,"method":"next"}
>> 2014-02-21 13:41:00,272 WARN org.apache.hadoop.hbase.util.Sleeper: We
>> slept
>> 10193644ms instead of 10000000ms, this is likely due to a long garbage
>> collecting pause and it's usually bad, see
>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>
>> Your issue is clearly this.
>>
>> For the swap, it's not because you set swappiness that Linux will not
>> swap.
>> It will try to not swap, but if it really has to, it will.
>>
>> How many GB on your server? How many for the DN,for th RS, etc. any TT on
>> them? Any other tool? If TT, how many slots? How many GB per slots?
>>
>> JM
>>
>>
>> 2014-02-27 11:37 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:
>>
>> > Hi Jean-Marc,
>> >
>> > I have updated the RS log here (http://pastebin.com/bVDvMvrB) with
>> events
>> > before 13:41:00. In the log I see a few responseTooSlow warnings at
>> > 13:34:00, 13:36:00. Then no activity till 13:41:00.
>> > At 13:41:00 there is a Sleeper warning - WARN
>> > org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of
>> > 10000000ms, this is likely due to a long garbage collecting pause and
>> it's
>> > usually bad, see ...
>> > Followed by - INFO org.apache.zookeeper.ClientCnxn: Client session timed
>> > out, have not heard from server in 260409ms for sessionid
>> > 0x34432befe5417d2, closing socket connection and attempting reconnect.
>> >
>> > Looking at some of the reasons you mentioned -
>> > 1. I analyzed the GC logs for this RS. In the last 10 mins before the RS
>> > went down, the GC times are less than 1 sec. Nothing that will take
>> 260409
>> > ms as indicated above in the RS log.
>> > 2. The RS node has swappiness set to 0
>> > 3. So I think I should investigate the possibility of network issues.
>> Any
>> > pointers where I could start?
>> >
>> > - R
>> >
>> > On Thu, Feb 27, 2014 at 10:17 AM, Jean-Marc Spaggiari <
>> > jean-marc@spaggiari.org> wrote:
>> >
>> > > Hi Rohit,
>> > >
>> > > Usually YouAreDeadException is when your RegionServer is to slow. It
>> gets
>> > > kicked out by Master+ZK but then try to join back and get informed it
>> has
>> > > bene kicked out.
>> > >
>> > > Reasons:
>> > > - Long Gargabe Collection;
>> > > - Swapping;
>> > > - Network issues (get disconnected, then re-connected);
>> > > - etc.
>> > >
>> > > what do you have before 2014-02-21 13:41:00,308 in the logs?
>> > >
>> > >
>> > > 2014-02-27 11:13 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:
>> > >
>> > > > Hi, has anybody been facing similar issues?
>> > > >
>> > > > - R
>> > > >
>> > > >
>> > > > On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <
>> rohitkelkar@gmail.com
>> > > > >wrote:
>> > > >
>> > > > > We are running hbase 0.94.2 on hadoop 0.20 append version in
>> > production
>> > > > > (yes we have plans to upgrade hadoop). Its a 5 node cluster and a
>> 6th
>> > > > node
>> > > > > running just the name node and hmaster.
>> > > > > I am seeing frequent RS YouAreDeadExceptions. Logs here
>> > > > > http://pastebin.com/44aFyYZV
>> > > > > The RS log shows a DFSOutputStream ResponseProcessor exception
>>  for
>> > > block
>> > > > > blk_-6695300470410774365_837638 java.io.EOFException at 13:41:00
>> > > followed
>> > > > > by YouAreDeadException at the same time.
>> > > > > I grep'ed this block in the Datanode (see log here
>> > > > > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception in
>> > > > > receiveBlock for block blk_-6695300470410774365_837638
>> > > > > java.nio.channels.ClosedByInterruptException.
>> > > > > I have also attached the namenode logs around the block here
>> > > > > http://pastebin.com/9NE9J8s1
>> > > > >
>> > > > > Across several RS failure instances I see the following pattern -
>> the
>> > > > > region server YouAreDeadException is always preceeded by the
>> > > EOFException
>> > > > > and datanode ClosedByInterruptException
>> > > > >
>> > > > > Is the error in the movement of the block causing the region
>> server
>> > to
>> > > > > report a YouAreDeadException? And of course, how do I solve this?
>> > > > >
>> > > > > - R
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: region server dead and datanode block movement error

Posted by Rohit Kelkar <ro...@gmail.com>.
Hi Jean-Marc,

Each node has 48GB of RAM.
To isolate and debug the RS failure issue, we have switched off all other
tools. The only processes running are:
- DN = 4GB
- RS = 6GB
- TT = 4GB
- num mappers available on the node = 4 * 4GB = 16GB
- num reducers available on the node = 2 * 4GB = 8GB
- 4 other java processes unrelated to hadoop/hbase = 512MB * 4 = 2GB

Total = 40GB


On Thu, Feb 27, 2014 at 10:42 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> 2014-02-21 13:36:27,496 WARN org.apache.hadoop.ipc.HBaseServer:
> (responseTooSlow):
> {"processingtimems":41236,"call":"next(-8680499896692404689, 1), rpc
> version=1, client version=29, methodsFingerPrint=54742778","client":"
> 10.0.0.96:46618
>
> ","starttimems":1393007746259,"queuetimems":0,"class":"HRegionServer","responsesize":6,"method":"next"}
> 2014-02-21 13:41:00,272 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 10193644ms instead of 10000000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>
> Your issue is clearly this.
>
> For the swap, it's not because you set swappiness that Linux will not swap.
> It will try to not swap, but if it really has to, it will.
>
> How many GB on your server? How many for the DN,for th RS, etc. any TT on
> them? Any other tool? If TT, how many slots? How many GB per slots?
>
> JM
>
>
> 2014-02-27 11:37 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:
>
> > Hi Jean-Marc,
> >
> > I have updated the RS log here (http://pastebin.com/bVDvMvrB) with
> events
> > before 13:41:00. In the log I see a few responseTooSlow warnings at
> > 13:34:00, 13:36:00. Then no activity till 13:41:00.
> > At 13:41:00 there is a Sleeper warning - WARN
> > org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of
> > 10000000ms, this is likely due to a long garbage collecting pause and
> it's
> > usually bad, see ...
> > Followed by - INFO org.apache.zookeeper.ClientCnxn: Client session timed
> > out, have not heard from server in 260409ms for sessionid
> > 0x34432befe5417d2, closing socket connection and attempting reconnect.
> >
> > Looking at some of the reasons you mentioned -
> > 1. I analyzed the GC logs for this RS. In the last 10 mins before the RS
> > went down, the GC times are less than 1 sec. Nothing that will take
> 260409
> > ms as indicated above in the RS log.
> > 2. The RS node has swappiness set to 0
> > 3. So I think I should investigate the possibility of network issues. Any
> > pointers where I could start?
> >
> > - R
> >
> > On Thu, Feb 27, 2014 at 10:17 AM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> > > Hi Rohit,
> > >
> > > Usually YouAreDeadException is when your RegionServer is to slow. It
> gets
> > > kicked out by Master+ZK but then try to join back and get informed it
> has
> > > bene kicked out.
> > >
> > > Reasons:
> > > - Long Gargabe Collection;
> > > - Swapping;
> > > - Network issues (get disconnected, then re-connected);
> > > - etc.
> > >
> > > what do you have before 2014-02-21 13:41:00,308 in the logs?
> > >
> > >
> > > 2014-02-27 11:13 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:
> > >
> > > > Hi, has anybody been facing similar issues?
> > > >
> > > > - R
> > > >
> > > >
> > > > On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <
> rohitkelkar@gmail.com
> > > > >wrote:
> > > >
> > > > > We are running hbase 0.94.2 on hadoop 0.20 append version in
> > production
> > > > > (yes we have plans to upgrade hadoop). Its a 5 node cluster and a
> 6th
> > > > node
> > > > > running just the name node and hmaster.
> > > > > I am seeing frequent RS YouAreDeadExceptions. Logs here
> > > > > http://pastebin.com/44aFyYZV
> > > > > The RS log shows a DFSOutputStream ResponseProcessor exception  for
> > > block
> > > > > blk_-6695300470410774365_837638 java.io.EOFException at 13:41:00
> > > followed
> > > > > by YouAreDeadException at the same time.
> > > > > I grep'ed this block in the Datanode (see log here
> > > > > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception in
> > > > > receiveBlock for block blk_-6695300470410774365_837638
> > > > > java.nio.channels.ClosedByInterruptException.
> > > > > I have also attached the namenode logs around the block here
> > > > > http://pastebin.com/9NE9J8s1
> > > > >
> > > > > Across several RS failure instances I see the following pattern -
> the
> > > > > region server YouAreDeadException is always preceeded by the
> > > EOFException
> > > > > and datanode ClosedByInterruptException
> > > > >
> > > > > Is the error in the movement of the block causing the region server
> > to
> > > > > report a YouAreDeadException? And of course, how do I solve this?
> > > > >
> > > > > - R
> > > > >
> > > >
> > >
> >
>

Re: region server dead and datanode block movement error

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
2014-02-21 13:36:27,496 WARN org.apache.hadoop.ipc.HBaseServer:
(responseTooSlow):
{"processingtimems":41236,"call":"next(-8680499896692404689, 1), rpc
version=1, client version=29, methodsFingerPrint=54742778","client":"
10.0.0.96:46618
","starttimems":1393007746259,"queuetimems":0,"class":"HRegionServer","responsesize":6,"method":"next"}
2014-02-21 13:41:00,272 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
10193644ms instead of 10000000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

Your issue is clearly this.

Regarding swap: it's not because you set swappiness to 0 that Linux will
never swap. It will try not to swap, but if it really has to, it will.

How many GB of RAM on your servers? How many for the DN, for the RS, etc.?
Any TT on them? Any other tools? If TT, how many slots, and how many GB per
slot?

JM
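The swap caveat above can be verified directly: the pswpin/pswpout counters in /proc/vmstat only ever increase, so sampling them twice shows whether the box actually swapped in between, regardless of the swappiness setting. A minimal sketch (the helper is hypothetical; the field names are standard Linux):

```python
def swap_counters(vmstat_text):
    """Pull the swap-in/swap-out page counters out of /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        key, _, value = line.partition(" ")
        if key in ("pswpin", "pswpout"):
            counters[key] = int(value)
    return counters

# Sample twice, some minutes apart; any increase means real swap traffic:
# with open("/proc/vmstat") as f:
#     print(swap_counters(f.read()))
```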


2014-02-27 11:37 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:

> Hi Jean-Marc,
>
> I have updated the RS log here (http://pastebin.com/bVDvMvrB) with events
> before 13:41:00. In the log I see a few responseTooSlow warnings at
> 13:34:00, 13:36:00. Then no activity till 13:41:00.
> At 13:41:00 there is a Sleeper warning - WARN
> org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of
> 10000000ms, this is likely due to a long garbage collecting pause and it's
> usually bad, see ...
> Followed by - INFO org.apache.zookeeper.ClientCnxn: Client session timed
> out, have not heard from server in 260409ms for sessionid
> 0x34432befe5417d2, closing socket connection and attempting reconnect.
>
> Looking at some of the reasons you mentioned -
> 1. I analyzed the GC logs for this RS. In the last 10 mins before the RS
> went down, the GC times are less than 1 sec. Nothing that will take 260409
> ms as indicated above in the RS log.
> 2. The RS node has swappiness set to 0
> 3. So I think I should investigate the possibility of network issues. Any
> pointers where I could start?
>
> - R
>
> On Thu, Feb 27, 2014 at 10:17 AM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
> > Hi Rohit,
> >
> > Usually YouAreDeadException is when your RegionServer is to slow. It gets
> > kicked out by Master+ZK but then try to join back and get informed it has
> > bene kicked out.
> >
> > Reasons:
> > - Long Gargabe Collection;
> > - Swapping;
> > - Network issues (get disconnected, then re-connected);
> > - etc.
> >
> > what do you have before 2014-02-21 13:41:00,308 in the logs?
> >
> >
> > 2014-02-27 11:13 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:
> >
> > > Hi, has anybody been facing similar issues?
> > >
> > > - R
> > >
> > >
> > > On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <rohitkelkar@gmail.com
> > > >wrote:
> > >
> > > > We are running hbase 0.94.2 on hadoop 0.20 append version in
> production
> > > > (yes we have plans to upgrade hadoop). Its a 5 node cluster and a 6th
> > > node
> > > > running just the name node and hmaster.
> > > > I am seeing frequent RS YouAreDeadExceptions. Logs here
> > > > http://pastebin.com/44aFyYZV
> > > > The RS log shows a DFSOutputStream ResponseProcessor exception  for
> > block
> > > > blk_-6695300470410774365_837638 java.io.EOFException at 13:41:00
> > followed
> > > > by YouAreDeadException at the same time.
> > > > I grep'ed this block in the Datanode (see log here
> > > > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception in
> > > > receiveBlock for block blk_-6695300470410774365_837638
> > > > java.nio.channels.ClosedByInterruptException.
> > > > I have also attached the namenode logs around the block here
> > > > http://pastebin.com/9NE9J8s1
> > > >
> > > > Across several RS failure instances I see the following pattern - the
> > > > region server YouAreDeadException is always preceeded by the
> > EOFException
> > > > and datanode ClosedByInterruptException
> > > >
> > > > Is the error in the movement of the block causing the region server
> to
> > > > report a YouAreDeadException? And of course, how do I solve this?
> > > >
> > > > - R
> > > >
> > >
> >
>

Re: region server dead and datanode block movement error

Posted by Rohit Kelkar <ro...@gmail.com>.
Hi Jean-Marc,

I have updated the RS log here (http://pastebin.com/bVDvMvrB) with events
before 13:41:00. In the log I see a few responseTooSlow warnings at
13:34:00 and 13:36:00, then no activity till 13:41:00.
At 13:41:00 there is a Sleeper warning - WARN
org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of
10000000ms, this is likely due to a long garbage collecting pause and it's
usually bad, see ...
Followed by - INFO org.apache.zookeeper.ClientCnxn: Client session timed
out, have not heard from server in 260409ms for sessionid
0x34432befe5417d2, closing socket connection and attempting reconnect.

Looking at some of the reasons you mentioned -
1. I analyzed the GC logs for this RS. In the last 10 mins before the RS
went down, the GC times are less than 1 sec. Nothing that will take 260409
ms as indicated above in the RS log.
2. The RS node has swappiness set to 0
3. So I think I should investigate the possibility of network issues. Any
pointers on where I could start?

- R
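For point 3 above, a crude starting place is to probe the ZooKeeper client port from the RS node at short intervals and log how long each TCP connect takes; gaps or spikes around the time the session expires would point at the network rather than GC. A sketch (the host name is a placeholder; 2181 is the default ZK client port):

```python
import socket
import time

def probe(host, port, timeout=5.0):
    """Return the TCP connect time in seconds, or None if it failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

# Run in a loop on the RS node and timestamp each result, e.g.:
# while True:
#     print(time.strftime("%H:%M:%S"), probe("zk-host", 2181))
#     time.sleep(1)
```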

On Thu, Feb 27, 2014 at 10:17 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Rohit,
>
> Usually YouAreDeadException is when your RegionServer is to slow. It gets
> kicked out by Master+ZK but then try to join back and get informed it has
> bene kicked out.
>
> Reasons:
> - Long Gargabe Collection;
> - Swapping;
> - Network issues (get disconnected, then re-connected);
> - etc.
>
> what do you have before 2014-02-21 13:41:00,308 in the logs?
>
>
> 2014-02-27 11:13 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:
>
> > Hi, has anybody been facing similar issues?
> >
> > - R
> >
> >
> > On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <rohitkelkar@gmail.com
> > >wrote:
> >
> > > We are running hbase 0.94.2 on hadoop 0.20 append version in production
> > > (yes we have plans to upgrade hadoop). Its a 5 node cluster and a 6th
> > node
> > > running just the name node and hmaster.
> > > I am seeing frequent RS YouAreDeadExceptions. Logs here
> > > http://pastebin.com/44aFyYZV
> > > The RS log shows a DFSOutputStream ResponseProcessor exception  for
> block
> > > blk_-6695300470410774365_837638 java.io.EOFException at 13:41:00
> followed
> > > by YouAreDeadException at the same time.
> > > I grep'ed this block in the Datanode (see log here
> > > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception in
> > > receiveBlock for block blk_-6695300470410774365_837638
> > > java.nio.channels.ClosedByInterruptException.
> > > I have also attached the namenode logs around the block here
> > > http://pastebin.com/9NE9J8s1
> > >
> > > Across several RS failure instances I see the following pattern - the
> > > region server YouAreDeadException is always preceeded by the
> EOFException
> > > and datanode ClosedByInterruptException
> > >
> > > Is the error in the movement of the block causing the region server to
> > > report a YouAreDeadException? And of course, how do I solve this?
> > >
> > > - R
> > >
> >
>

Re: region server dead and datanode block movement error

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Rohit,

Usually a YouAreDeadException occurs when your RegionServer is too slow. It
gets kicked out by the Master+ZK, then tries to rejoin and is informed that
it has already been kicked out.

Reasons:
- Long garbage collection;
- Swapping;
- Network issues (it gets disconnected, then re-connected);
- etc.

What do you have before 2014-02-21 13:41:00,308 in the logs?


2014-02-27 11:13 GMT-05:00 Rohit Kelkar <ro...@gmail.com>:

> Hi, has anybody been facing similar issues?
>
> - R
>
>
> On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <rohitkelkar@gmail.com
> >wrote:
>
> > We are running hbase 0.94.2 on hadoop 0.20 append version in production
> > (yes we have plans to upgrade hadoop). Its a 5 node cluster and a 6th
> node
> > running just the name node and hmaster.
> > I am seeing frequent RS YouAreDeadExceptions. Logs here
> > http://pastebin.com/44aFyYZV
> > The RS log shows a DFSOutputStream ResponseProcessor exception  for block
> > blk_-6695300470410774365_837638 java.io.EOFException at 13:41:00 followed
> > by YouAreDeadException at the same time.
> > I grep'ed this block in the Datanode (see log here
> > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception in
> > receiveBlock for block blk_-6695300470410774365_837638
> > java.nio.channels.ClosedByInterruptException.
> > I have also attached the namenode logs around the block here
> > http://pastebin.com/9NE9J8s1
> >
> > Across several RS failure instances I see the following pattern - the
> > region server YouAreDeadException is always preceeded by the EOFException
> > and datanode ClosedByInterruptException
> >
> > Is the error in the movement of the block causing the region server to
> > report a YouAreDeadException? And of course, how do I solve this?
> >
> > - R
> >
>

Re: region server dead and datanode block movement error

Posted by Rohit Kelkar <ro...@gmail.com>.
Hi, has anybody been facing similar issues?

- R


On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <ro...@gmail.com>wrote:

> We are running hbase 0.94.2 on hadoop 0.20 append version in production
> (yes we have plans to upgrade hadoop). Its a 5 node cluster and a 6th node
> running just the name node and hmaster.
> I am seeing frequent RS YouAreDeadExceptions. Logs here
> http://pastebin.com/44aFyYZV
> The RS log shows a DFSOutputStream ResponseProcessor exception  for block
> blk_-6695300470410774365_837638 java.io.EOFException at 13:41:00 followed
> by YouAreDeadException at the same time.
> I grep'ed this block in the Datanode (see log here
> http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception in
> receiveBlock for block blk_-6695300470410774365_837638
> java.nio.channels.ClosedByInterruptException.
> I have also attached the namenode logs around the block here
> http://pastebin.com/9NE9J8s1
>
> Across several RS failure instances I see the following pattern - the
> region server YouAreDeadException is always preceeded by the EOFException
> and datanode ClosedByInterruptException
>
> Is the error in the movement of the block causing the region server to
> report a YouAreDeadException? And of course, how do I solve this?
>
> - R
>