Posted to user@hbase.apache.org by Pankaj kr <pa...@huawei.com> on 2016/03/23 07:40:50 UTC

Region server getting aborted in every one or two days

Hi,

In our production environment, the RS is aborting every one or two days with the following exception.

2016-03-16 13:57:07,975 | FATAL | MemStoreFlusher.0 | ABORTING region server xyz-vm8,24502,1458034278600: Replay of WAL required. Forcing server shutdown | org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2055)
org.apache.hadoop.hbase.DroppedSnapshotException: region: TB_WEBLOGIN_201603,060,1457916997964.06e204d3bc262b72820aa195fec23513.
                at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2423)
                at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2128)
                at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2090)
                at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1983)
                at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1909)
                at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:509)
                at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:470)
                at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:74)
                at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
                at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException
              at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
                at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
                at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:635)
                at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
                at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
                at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
                at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
                ... 1 more

I don't see any error info on the HDFS side at that point in time.
Has anyone faced this issue?
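
For context, a minimal sketch of what this exception means in isolation, assuming only the plain Hadoop FileSystem API (not HBase internals) and a hypothetical file path: the ClosedChannelException is raised from DFSOutputStream.checkClosed(), i.e. the WAL output stream had already been closed by the time the sync ran.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClosedStreamSyncSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical file; stands in for a WAL under /hbase/WALs/...
        FSDataOutputStream out = fs.create(new Path("/tmp/wal-sketch"), true);
        out.write("edit".getBytes("UTF-8"));

        out.close();       // stream closed elsewhere, e.g. as part of a log roll

        try {
          out.hflush();    // a late sync against the already-closed stream
        } catch (Exception e) {
          // Against HDFS this surfaces as java.nio.channels.ClosedChannelException
          // thrown from DFSOutputStream.checkClosed(), as in the stack trace above.
          System.out.println("sync failed: " + e);
        }
      }
    }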

HBase version is 0.98.6.

Regards,
Pankaj

Re: Region server getting aborted in every one or two days

Posted by Anoop John <an...@gmail.com>.
So it seems like the issue also comes up just after a log roll (?). So
we no longer have the old WAL file, and yet that write op still tries to
write to the old file?  From the WAL file path name you can confirm this.
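
A minimal sketch of that suspicion, assuming nothing about FSHLog internals (the class and variable names here are made up for illustration; the real roll/sync coordination in FSHLog is more involved): a roller swaps in a new writer and closes the old one while a sync still holds a reference to the old writer, so its hflush() lands on a closed stream.

    import java.util.concurrent.atomic.AtomicReference;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RollVsSyncSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        AtomicReference<FSDataOutputStream> writer =
            new AtomicReference<FSDataOutputStream>(fs.create(new Path("/tmp/wal-old"), true));

        // "Sync" side: grabs the current writer, then is delayed before syncing.
        FSDataOutputStream grabbed = writer.get();

        // "Roller" side: replaces the writer and closes the old one (the log roll).
        FSDataOutputStream old = writer.getAndSet(fs.create(new Path("/tmp/wal-new"), true));
        old.close();

        try {
          // The delayed sync now runs against the stale, already-closed writer.
          grabbed.hflush();
        } catch (Exception e) {
          // Against HDFS: java.nio.channels.ClosedChannelException, as in the logs above.
          System.out.println("sync on old writer failed: " + e);
        }
      }
    }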

-Anoop-

On Wed, Mar 23, 2016 at 6:14 PM, Pankaj kr <pa...@huawei.com> wrote:
> Thanks, Anoop, for replying.
>
> No explicit close op happened on the WAL file (this log was rolled a few seconds before). As per the HDFS log, there is no close call for this WAL file.
>
>
> The same issue happened again on 19th March.
>
> Here the WAL was rolled just before the issue happened:
> 2016-03-19 05:38:07,153 | INFO  | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | Rolled WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337083824 with entries=6508, filesize=61.03 MB; new WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337087136 | org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:972)
>
> And a few seconds later, during a sync op:
> 2016-03-19 05:38:10,075 | ERROR | sync.1 | Error syncing, request close of wal  | org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> java.nio.channels.ClosedChannelException
>         at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
>         at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
>         at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:545)
>         at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
>         at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>         at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-03-19 05:38:10,076 | INFO  | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | Rolled WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337087136 with entries=6383, filesize=61.51 MB; new WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337090049 | org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:972)
> 2016-03-19 05:38:10,087 | FATAL | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | ABORTING region server RS-HOSTNAME,21302,1458301420876: IOE in log roller | org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2055)
> java.nio.channels.ClosedChannelException
>         at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
>         at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
>         at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:545)
>         at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
>         at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>         at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
>         at java.lang.Thread.run(Thread.java:745)
> 2016-03-19 05:38:10,088 | FATAL | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionObserver, org.apache.hadoop.hbase.JMXListener, org.apache.hadoop.hbase.index.coprocessor.wal.IndexWALObserver] | org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2063)
>
> Here also, there are no error details in the DN/NN logs.
>
> I am still checking this and will update if there are any findings.
>
> Regards,
> Pankaj
>
> -----Original Message-----
> From: Anoop John [mailto:anoop.hbase@gmail.com]
> Sent: Wednesday, March 23, 2016 3:50 PM
> To: user@hbase.apache.org
> Subject: Re: Region server getting aborted in every one or two days
>
> At the same time, did any explicit close op happen on the WAL file?  Any log rolling?  Can you check the logs to find out?  Maybe also check the HDFS logs for close calls on the WAL file.
>
> -Anoop-
>
> On Wed, Mar 23, 2016 at 12:10 PM, Pankaj kr <pa...@huawei.com> wrote:
>> Hi,
>>
>> In our production environment, the RS is aborting every one or two days with the following exception.
>>
>> 2016-03-16 13:57:07,975 | FATAL | MemStoreFlusher.0 | ABORTING region
>> server xyz-vm8,24502,1458034278600: Replay of WAL required. Forcing
>> server shutdown |
>> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer
>> .java:2055)
>> org.apache.hadoop.hbase.DroppedSnapshotException: region: TB_WEBLOGIN_201603,060,1457916997964.06e204d3bc262b72820aa195fec23513.
>>                 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2423)
>>                 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2128)
>>                 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2090)
>>                 at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1983)
>>                 at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1909)
>>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:509)
>>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:470)
>>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:74)
>>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
>>                 at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.nio.channels.ClosedChannelException
>>               at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
>>                 at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
>>                 at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:635)
>>                 at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
>>                 at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>>                 at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
>>                 at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
>>                 ... 1 more
>>
>> I don't see any error info on the HDFS side at that point in time.
>> Has anyone faced this issue?
>>
>> HBase version is 0.98.6.
>>
>> Regards,
>> Pankaj

RE: Region server getting aborted in every one or two days

Posted by Pankaj kr <pa...@huawei.com>.
Thanks, Anoop, for replying.

No explicit close op happened on the WAL file (this log was rolled a few seconds before). As per the HDFS log, there is no close call for this WAL file.


The same issue happened again on 19th March.

Here the WAL was rolled just before the issue happened:
2016-03-19 05:38:07,153 | INFO  | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | Rolled WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337083824 with entries=6508, filesize=61.03 MB; new WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337087136 | org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:972)

And a few seconds later, during a sync op:
2016-03-19 05:38:10,075 | ERROR | sync.1 | Error syncing, request close of wal  | org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
java.nio.channels.ClosedChannelException
	at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
	at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
	at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:545)
	at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
	at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
	at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
	at java.lang.Thread.run(Thread.java:745)
2016-03-19 05:38:10,076 | INFO  | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | Rolled WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337087136 with entries=6383, filesize=61.51 MB; new WAL /hbase/WALs/RS-HOSTNAME,21302,1458301420876/RS-HOSTNAME%2C21302%2C1458301420876.default.1458337090049 | org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:972)
2016-03-19 05:38:10,087 | FATAL | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | ABORTING region server RS-HOSTNAME,21302,1458301420876: IOE in log roller | org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2055)
java.nio.channels.ClosedChannelException
	at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
	at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
	at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:545)
	at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
	at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
	at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
	at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
	at java.lang.Thread.run(Thread.java:745)
2016-03-19 05:38:10,088 | FATAL | regionserver/RS-HOSTNAME/RS-IP:21302.logRoller | RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionObserver, org.apache.hadoop.hbase.JMXListener, org.apache.hadoop.hbase.index.coprocessor.wal.IndexWALObserver] | org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2063)

Here also, there are no error details in the DN/NN logs.

I am still checking this and will update if there are any findings.
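
One thing that may help while checking: pull the roll, sync-failure, and abort events out of the RS log and line them up by timestamp, so each failure can be matched to the roll (and WAL file name) that immediately preceded it. A small sketch; the match strings come from the log lines above, and the default log path is just a placeholder:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class WalRollTimeline {
      public static void main(String[] args) throws IOException {
        // Placeholder path; pass the real RS log as the first argument.
        String log = args.length > 0 ? args[0] : "/var/log/hbase/regionserver.log";
        try (Stream<String> lines = Files.lines(Paths.get(log))) {
          lines.filter(l -> l.contains("Rolled WAL")
                         || l.contains("Error syncing")
                         || l.contains("ABORTING region server"))
               // Each line starts with its timestamp, so printing in file order gives
               // the roll -> sync failure -> abort timeline shown above.
               .forEach(System.out::println);
        }
      }
    }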

Regards,
Pankaj

-----Original Message-----
From: Anoop John [mailto:anoop.hbase@gmail.com] 
Sent: Wednesday, March 23, 2016 3:50 PM
To: user@hbase.apache.org
Subject: Re: Region server getting aborted in every one or two days

At the same time, did any explicit close op happen on the WAL file?  Any log rolling?  Can you check the logs to find out?  Maybe also check the HDFS logs for close calls on the WAL file.

-Anoop-

On Wed, Mar 23, 2016 at 12:10 PM, Pankaj kr <pa...@huawei.com> wrote:
> Hi,
>
> In our production environment, the RS is aborting every one or two days with the following exception.
>
> 2016-03-16 13:57:07,975 | FATAL | MemStoreFlusher.0 | ABORTING region 
> server xyz-vm8,24502,1458034278600: Replay of WAL required. Forcing 
> server shutdown | 
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer
> .java:2055)
> org.apache.hadoop.hbase.DroppedSnapshotException: region: TB_WEBLOGIN_201603,060,1457916997964.06e204d3bc262b72820aa195fec23513.
>                 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2423)
>                 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2128)
>                 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2090)
>                 at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1983)
>                 at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1909)
>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:509)
>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:470)
>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:74)
>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
>                 at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedChannelException
>               at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
>                 at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
>                 at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:635)
>                 at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
>                 at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>                 at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
>                 at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
>                 ... 1 more
>
> I don't see any error info on the HDFS side at that point in time.
> Has anyone faced this issue?
>
> HBase version is 0.98.6.
>
> Regards,
> Pankaj

Re: Region server getting aborted in every one or two days

Posted by Heng Chen <he...@gmail.com>.
Was your DN responding slowly at that time?

2016-03-23 15:50 GMT+08:00 Anoop John <an...@gmail.com>:

> At the same time, did any explicit close op happen on the WAL file?  Any
> log rolling?  Can you check the logs to find out?  Maybe also check the
> HDFS logs for close calls on the WAL file.
>
> -Anoop-
>
> On Wed, Mar 23, 2016 at 12:10 PM, Pankaj kr <pa...@huawei.com> wrote:
> > Hi,
> >
> > In our production environment, the RS is aborting every one or two
> days with the following exception.
> >
> > 2016-03-16 13:57:07,975 | FATAL | MemStoreFlusher.0 | ABORTING region
> server xyz-vm8,24502,1458034278600: Replay of WAL required. Forcing server
> shutdown |
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2055)
> > org.apache.hadoop.hbase.DroppedSnapshotException: region:
> TB_WEBLOGIN_201603,060,1457916997964.06e204d3bc262b72820aa195fec23513.
> >                 at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2423)
> >                 at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2128)
> >                 at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2090)
> >                 at
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1983)
> >                 at
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1909)
> >                 at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:509)
> >                 at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:470)
> >                 at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:74)
> >                 at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
> >                 at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.nio.channels.ClosedChannelException
> >               at
> org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
> >                 at
> org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
> >                 at
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:635)
> >                 at
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
> >                 at
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
> >                 at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
> >                 at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
> >                 ... 1 more
> >
> > I don't see any error info on the HDFS side at that point in time.
> > Has anyone faced this issue?
> >
> > HBase version is 0.98.6.
> >
> > Regards,
> > Pankaj
>

Re: Region server getting aborted in every one or two days

Posted by Anoop John <an...@gmail.com>.
At the same time, did any explicit close op happen on the WAL file?  Any
log rolling?  Can you check the logs to find out?  Maybe also check the
HDFS logs for close calls on the WAL file.

-Anoop-

On Wed, Mar 23, 2016 at 12:10 PM, Pankaj kr <pa...@huawei.com> wrote:
> Hi,
>
> In our production environment, the RS is aborting every one or two days with the following exception.
>
> 2016-03-16 13:57:07,975 | FATAL | MemStoreFlusher.0 | ABORTING region server xyz-vm8,24502,1458034278600: Replay of WAL required. Forcing server shutdown | org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2055)
> org.apache.hadoop.hbase.DroppedSnapshotException: region: TB_WEBLOGIN_201603,060,1457916997964.06e204d3bc262b72820aa195fec23513.
>                 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2423)
>                 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2128)
>                 at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2090)
>                 at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1983)
>                 at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1909)
>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:509)
>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:470)
>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:74)
>                 at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
>                 at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedChannelException
>               at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
>                 at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
>                 at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:635)
>                 at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
>                 at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>                 at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
>                 at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
>                 ... 1 more
>
> I don't see any error info on the HDFS side at that point in time.
> Has anyone faced this issue?
>
> HBase version is 0.98.6.
>
> Regards,
> Pankaj

Re: Region server getting aborted in every one or two days

Posted by Michal Medvecky <me...@pexe.so>.
Did you try running hbase hbck?

http://hbase.apache.org/0.94/book/hbck.in.depth.html

Michal

On Wed, Mar 23, 2016 at 7:40 AM, Pankaj kr <pa...@huawei.com> wrote:

> Hi,
>
> In our production environment, the RS is aborting every one or two
> days with the following exception.
>
> 2016-03-16 13:57:07,975 | FATAL | MemStoreFlusher.0 | ABORTING region
> server xyz-vm8,24502,1458034278600: Replay of WAL required. Forcing server
> shutdown |
> org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2055)
> org.apache.hadoop.hbase.DroppedSnapshotException: region:
> TB_WEBLOGIN_201603,060,1457916997964.06e204d3bc262b72820aa195fec23513.
>                 at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2423)
>                 at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2128)
>                 at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2090)
>                 at
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1983)
>                 at
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1909)
>                 at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:509)
>                 at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:470)
>                 at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:74)
>                 at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
>                 at java.lang.Thread.run(Thread.java:745)
> Caused by: java.nio.channels.ClosedChannelException
>               at
> org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:208)
>                 at
> org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:142)
>                 at
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:635)
>                 at
> org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:490)
>                 at
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
>                 at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:190)
>                 at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1342)
>                 ... 1 more
>
> I don't see any error info on the HDFS side at that point in time.
> Has anyone faced this issue?
>
> HBase version is 0.98.6.
>
> Regards,
> Pankaj
>