Posted to user@hbase.apache.org by Pedro Gandola <pe...@gmail.com> on 2016/02/16 00:07:39 UTC

NPE on ProtobufLogWriter.sync()

Hi Guys,

One of my region servers got into a state where it was unable to start and
the cluster was not receiving traffic for some time:

*(master log)*

> 2016-02-15 22:04:33,186 ERROR
> [PriorityRpcServer.handler=4,queue=0,port=16000] master.MasterRpcServices:
> Region server hbase-rs9.localdomain,16020,1455560991134 reported a fatal
> error:
> ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE in
> log roller
> Cause:
> java.io.IOException: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more
> 2016-02-15 22:05:45,678 WARN
> [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> master.CatalogJanitor: Failed scan of catalog table
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> attempts=351, exceptions:
> Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
> callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta' at
> region=hbase:meta,,1.1588230740,
> hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:195)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
> at
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
> at
> org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
> at
> org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
> at
> org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
> at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
> at
> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
> at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
> at
> org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:168)
> at
> org.apache.hadoop.hbase.master.CatalogJanitor.getMergedRegionsAndSplitParents(CatalogJanitor.java:120)
> at
> org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:221)
> at
> org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:102)
> at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.SocketTimeoutException: callTimeout=60000,
> callDuration=68222: row '' on table 'hbase:meta' at
> region=hbase:meta,,1.1588230740,
> hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
> at
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
> ... 3 more
> Caused by:
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException:
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server
> hbase-rs9.localdomain,16020,1455560991134 aborting
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:980)
> at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2185)
> at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> at java.lang.Thread.run(Thread.java:745) at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
> at
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:325)
> at
> org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:380)
> at
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:199)
> at
> org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:346)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:320)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:126)
> ... 4 more


*(regionserver log)*

> 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, request
> close of wal
> java.io.IOException: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more
> 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234 ms,
> current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> DatanodeInfoWithStorage[10.5.2.95:50010,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> DatanodeInfoWithStorage[10.5.2.57:50010
> ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> 2016-02-15 22:04:33,043 FATAL
> [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: ABORTING region server
> hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> java.io.IOException: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more
> 2016-02-15 22:04:33,043 FATAL
> [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: RegionServer abort: loaded coprocessors are:
> [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> org.apache.phoenix.coprocessor.SequenceRegionObserver,
> org.apache.phoenix.coprocessor.ScanRegionObserver,
> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> org.apache.phoenix.hbase.index.Indexer,
> org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> 2016-02-15 22:04:33,136 INFO
>  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: Dump of metrics as JSON on abort: {
>   "beans" : [ {
>     "name" : "java.lang:type=Memory",
>     "modelerType" : "sun.management.MemoryImpl",
>     "Verbose" : true,
>     "HeapMemoryUsage" : {
>       "committed" : 15032385536,
>       "init" : 15032385536,
>       "max" : 15032385536,
>       "used" : 3732843792
>     },
>     "ObjectPendingFinalizationCount" : 0,
>     "NonHeapMemoryUsage" : {
>       "committed" : 104660992,
>       "init" : 2555904,
>       "max" : -1,
>       "used" : 103018984
>     },
>     "ObjectName" : "java.lang:type=Memory"
>   } ],
>   "beans" : [ {
>     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
>     "modelerType" : "RegionServer,sub=IPC",
>     "tag.Context" : "regionserver",
>     "tag.Hostname" : "hbase-rs9"
>   } ],
>   "beans" : [ {
>     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
>     "modelerType" : "RegionServer,sub=Replication",
>     "tag.Context" : "regionserver",
>     "tag.Hostname" : "hbase-rs9"
>   } ],
>   "beans" : [ {
>     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
>     "modelerType" : "RegionServer,sub=Server",
>     "tag.Context" : "regionserver",
>     "tag.Hostname" : "hbase-rs9"
>   } ]
> }
> 2016-02-15 22:04:33,194 INFO
>  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: STOPPED: IOE in log roller
> 2016-02-15 22:04:33,194 INFO
>  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.LogRoller: LogRoller exiting.
> 2016-02-15 22:04:33,195 INFO  [regionserver/hbase-rs9.localdomain/
> 10.5.2.169:16020] regionserver.SplitLogWorker: Sending interrupt to stop
> the worker thread
> 2016-02-15 22:04:33,197 INFO  [regionserver/hbase-rs9.localdomain/
> 10.5.2.169:16020] regionserver.HRegionServer: Stopping infoServer
> 2016-02-15 22:04:33,197 INFO  [SplitLogWorker-hbase-rs9:16020]
> regionserver.SplitLogWorker: SplitLogWorker interrupted. Exiting.


Any idea what happened? Today I moved some regions around to balance the
cluster, ran a major compaction after that, and added more threads to run
small and large compactions. Could any of this be related?

I see that in the current branch the class *ProtobufLogWriter* (line 176)
already contains a try..catch:

> try {
>   return this.output.getPos();
> } catch (NullPointerException npe) {
>   // Concurrent close...
>   throw new IOException(npe);
> }
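
For context, here is a minimal, self-contained sketch of that pattern
(hypothetical names, not the actual HBase source): the position read may hit
an NPE when another thread closes the stream underneath it, and the NPE is
rethrown as a checked IOException so it surfaces like any other sync failure.

import java.io.IOException;

// Minimal sketch (hypothetical names, not the HBase source): a writer whose
// position read tolerates a concurrent close by another thread.
class SketchWriter {

  interface PositionedStream {
    long getPos();
  }

  private volatile PositionedStream output = () -> 123L;

  long getLength() throws IOException {
    try {
      // 'output' may have been nulled by a concurrent close (e.g. a log
      // roll), in which case this dereference throws NullPointerException.
      return this.output.getPos();
    } catch (NullPointerException npe) {
      // Concurrent close... rethrow as a checked IOException instead of
      // letting the raw NPE escape the sync thread.
      throw new IOException(npe);
    }
  }

  void close() {
    this.output = null; // simulates the stream going away under a reader
  }

  public static void main(String[] args) {
    SketchWriter w = new SketchWriter();
    w.close(); // pretend another thread closed the writer first
    try {
      w.getLength();
    } catch (IOException expected) {
      System.out.println("sync would fail with: " + expected);
    }
  }
}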


But it would be nice to understand the root cause of this error and whether
there is any misconfiguration on my side.

*Version: *HBase 1.1.2

Thank you
Cheers
Pedro

Re: NPE on ProtobufLogWriter.sync()

Posted by Pedro Gandola <pe...@gmail.com>.
Hi St.Ack

I attached the log in my previous response.

Thanks
Cheers

On Tue, Feb 16, 2016 at 11:51 PM, Stack <st...@duboce.net> wrote:

> On Tue, Feb 16, 2016 at 1:32 AM, Pedro Gandola <pe...@gmail.com>
> wrote:
>
> > Just missed one of your questions.
> >
> > Were you trying to restart it and it wouldn't come up? It kept doing the
> > > NPE on each restart?
> > > Or it happened once and killed the regionserver?
> >
> >
> > I received an alarm for the RS and I went there I saw the cluster not
> > processing data and I tried to restart the RS. Everything went to normal
> > but the RS crashed again then I decided to remove it completely until
> know
> > more about the problem. Now looking into the restarting code I see that
> the
> > problem might be other.
> >
> > *(log.out) *
> >
> > > /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214:
> 20748
> > > Killed                  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase
> > > --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT}
> 2>&1
> > > cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or
> > > directory
> >
> >
> > *(log after restart)*
> >
> > > 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> > >
> >
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:java.io.tmpdir=/tmp
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:java.compiler=<NA>
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:os.name=Linux
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:os.arch=amd64
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:os.version=3.10.0-327.3.1.el7.x86_64
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:user.name=hbase
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:user.home=/home/hbase
> > > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > environment:user.dir=/home/hbase
> > > 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > > client connection,
> > >
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > > sessionTimeout=30000
> > > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > > client connection,
> > >
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > > sessionTimeout=30000
> > > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > 2016-02-15 22:21:23,754 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Opening socket connection to server zookeeper2.localdomain/
> > 10.5.1.105:2181.
> > > Will not attempt to authenticate using SASL (unknown error)
> > > 2016-02-15 22:21:23,755 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Opening socket connection to server zookeeper2.localdomain/
> > 10.5.1.105:2181.
> > > Will not attempt to authenticate using SASL (unknown error)
> > > 2016-02-15 22:21:23,767 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Socket connection established to zookeeper2.localdomain/
> 10.5.1.105:2181,
> > > initiating session
> > > 2016-02-15 22:21:23,767 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Socket connection established to zookeeper2.localdomain/
> 10.5.1.105:2181,
> > > initiating session
> > > 2016-02-15 22:21:23,777 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Session establishment complete on server zookeeper2.localdomain/
> > > 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout =
> > 30000
> > > 2016-02-15 22:21:23,777 INFO
> > >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > > Session establishment complete on server zookeeper2.localdomain/
> > > 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout =
> > 30000
> > >
> >
> > I don't see any exception during the restart but for some reason the
> > process got killed and I don't see any SIGTERM or OOM on linux journal.
> >
> >
> Is that the last of the log? Just like that?
> St.Ack
>
>
> > Cheers
> > Pedro
> >
> >
> > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <pe...@gmail.com>
> > wrote:
> >
> > > Hi St.Ack,
> > >
> > > Thank you for your help. I have attached ~5 min worth of data before
> the
> > > crash.
> > >
> > > I restarted the cluster to update some configurations after that I
> moved
> > > some regions around to balance the cluster and just to ensure the data
> > > locality I ran a major compaction. After that I connected my App and 2
> > > hours later  this happened. So by the time of the crash I was not
> > > performing any operation over the cluster it was running normally.
> > >
> > > Thank you,
> > > Cheers
> > > Pedro
> > >
> > > On Tue, Feb 16, 2016 at 2:01 AM, Stack <st...@duboce.net> wrote:
> > >
> > >> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <
> pedro.gandola@gmail.com
> > >
> > >> wrote:
> > >>
> > >> > Hi Guys,
> > >> >
> > >> > One of my region servers got into a state where it was unable to
> start
> > >> and
> > >> > the cluster was not receiving traffic for some time:
> > >>
> > >>
> > >>
> > >> Were you trying to restart it and it wouldn't come up? It kept doing
> the
> > >> NPE on each restart?
> > >>
> > >> Or it happened once and killed the regionserver?
> > >>
> > >>
> > >>
> > >> *(master log)*
> > >> >
> > >> > > 2016-02-15 22:04:33,186 ERROR
> > >> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
> > >> > master.MasterRpcServices:
> > >> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a
> > >> fatal
> > >> > > error:
> > >> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134:
> > IOE
> > >> in
> > >> > > log roller
> > >> > > Cause:
> > >> > > java.io.IOException: java.lang.NullPointerException
> > >> > > at
> > >> > >
> > >> >
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > >> > > at
> > >> > >
> > >> >
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > >> > > at java.lang.Thread.run(Thread.java:745)
> > >> > > Caused by: java.lang.NullPointerException
> > >> > > at
> > >> > >
> > >> >
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > >> > > ... 2 more 2016-02-15 22:05:45,678 WARN
> > >> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> > >> > > master.CatalogJanitor: Failed scan of catalog table
> > >> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > after
> > >> > > attempts=351, exceptions:
> > >> > > Mon Feb 15 22:05:45 UTC 2016, null,
> java.net.SocketTimeoutException:
> > >> > > callTimeout=60000, callDuration=68222: row '' on table
> 'hbase:meta'
> > at
> > >> > > region=hbase:meta,,1.1588230740,
> > >> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> > >> > >
> > >> >
> > >>
> >
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > >> > > at
> > >> >
> > >>
> > >> You running with read replicas enabled?
> > >>
> > >>
> > >> ....
> > >>
> > >>
> > >>
> > >> *(regionserver log)*
> > >>
> > >>
> > >> Anything before this log message about rolling the WAL?
> > >>
> > >>
> > >> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing,
> > >> request
> > >> > close of wal
> > >> > java.io.IOException: java.lang.NullPointerException
> > >> > at
> > >> >
> > >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > >> > at
> > >> >
> > >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > >> > at java.lang.Thread.run(Thread.java:745)
> > >> > Caused by: java.lang.NullPointerException
> > >> > at
> > >> >
> > >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > >> > ... 2 more
> > >> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost:
> 234
> > >> ms,
> > >> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
> > >> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> > >> > DatanodeInfoWithStorage[10.5.2.95:50010
> > >> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> > >> > DatanodeInfoWithStorage[10.5.2.57:50010
> > >> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> > >> > 2016-02-15 22:04:33,043 FATAL
> > >> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > >> > regionserver.HRegionServer: ABORTING region server
> > >> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> > >> > java.io.IOException: java.lang.NullPointerException
> > >> > at
> > >> >
> > >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > >> > at
> > >> >
> > >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > >> > at java.lang.Thread.run(Thread.java:745)
> > >> > Caused by: java.lang.NullPointerException
> > >> > at
> > >> >
> > >>
> > >>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > >> > ... 2 more
> > >> > 2016-02-15 22:04:33,043 FATAL
> > >> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > >> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors
> > are:
> > >> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> > >> >
> > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> > >> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
> > >> > org.apache.phoenix.coprocessor.ScanRegionObserver,
> > >> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> > >> > org.apache.phoenix.hbase.index.Indexer,
> > >> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> > >> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> > >> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> > >> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> > >> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > >> > 2016-02-15 22:04:33,136 INFO
> > >> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > >> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> > >> >   "beans" : [ {
> > >> >     "name" : "java.lang:type=Memory",
> > >> >     "modelerType" : "sun.management.MemoryImpl",
> > >> >     "Verbose" : true,
> > >> >     "HeapMemoryUsage" : {
> > >> >       "committed" : 15032385536,
> > >> >       "init" : 15032385536,
> > >> >       "max" : 15032385536,
> > >> >       "used" : 3732843792
> > >> >     },
> > >> >     "ObjectPendingFinalizationCount" : 0,
> > >> >     "NonHeapMemoryUsage" : {
> > >> >       "committed" : 104660992,
> > >> >       "init" : 2555904,
> > >> >       "max" : -1,
> > >> >       "used" : 103018984
> > >> >     },
> > >> >     "ObjectName" : "java.lang:type=Memory"
> > >> >   } ],
> > >> >   "beans" : [ {
> > >> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> > >> >     "modelerType" : "RegionServer,sub=IPC",
> > >> >     "tag.Context" : "regionserver",
> > >> >     "tag.Hostname" : "hbase-rs9"
> > >> >   } ],
> > >> >   "beans" : [ {
> > >> >     "name" :
> "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> > >> >     "modelerType" : "RegionServer,sub=Replication",
> > >> >     "tag.Context" : "regionserver",
> > >> >     "tag.Hostname" : "hbase-rs9"
> > >> >   } ],
> > >> >   "beans" : [ {
> > >> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
> > >> >     "modelerType" : "RegionServer,sub=Server",
> > >> >     "tag.Context" : "regionserver",
> > >> >     "tag.Hostname" : "hbase-rs9"
> > >> >   } ]
> > >> > }
> > >> > 2016-02-15 22:04:33,194 INFO
> > >> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > >> > regionserver.HRegionServer: STOPPED: IOE in log roller
> > >>
> > >> Looks like we were in log roller, closing the WAL to put up a new one
> > but
> > >> the a sync was trying to run at same time but stream undo under it....
> > >> NPE... which caused server abort.
> > >>
> > >> ...
> > >>
> > >>
> > >> Any idea what happened? Today, I moved some regions around to balance
> > the
> > >> cluster and I ran a major compaction after that and I added more
> threads
> > >> to
> > >> run small and large compactions, could this be related?
> > >>
> > >>
> > >>
> > >> Yeah, pass more of the regionserver log from before the NPE... to see
> > what
> > >> was going on at the time.  I don't think the above changes related.
> > >>
> > >>
> > >>
> > >> I see that in the current branch the class *ProtobufLogWriter:176*
> > already
> > >> contains try..catch
> > >>
> > >> try {
> > >> >   return this.output.getPos();
> > >> > } catch (NullPointerException npe) {
> > >> >   // Concurrent close...
> > >> >   throw new IOException(npe);
> > >> > }
> > >>
> > >>
> > >> Yeah. Looks like the thought is an NPE could happen here because the
> > >> stream
> > >> is being closed out from under us by another thread (the syncers are
> off
> > >> on
> > >> their own just worried about syncing... This is just to get the NPE up
> > out
> > >> of the syncer thread and up into the general WAL subsystem.).  It
> looks
> > >> like this caused the regionserver abort. I would think it should just
> > >> provoke the WAL roll, not an abort. Put up more RS log please.
> > >>
> > >>
> > >>
> > >> > But I would be nice to understand the root cause of this error and
> if
> > >> there
> > >> > is any misconfiguration from my side.
> > >> >
> > >> >
> > >> Not your config I'd say (if your config brought it on, its a bug).
> > >>
> > >>
> > >>
> > >> > *Version: *HBase 1.1.2
> > >> >
> > >> >
> > >>
> > >> St.Ack
> > >>
> > >> P.S. It looks like a gentleman had similar issue end of last year:
> > >>
> > >>
> >
> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
> > >> No response then.
> > >>
> > >>
> > >>
> > >>
> > >> > Thank you
> > >> > Cheers
> > >> > Pedro
> > >> >
> > >>
> > >
> > >
> >
>

Re: NPE on ProtobufLogWriter.sync()

Posted by Stack <st...@duboce.net>.
On Tue, Feb 16, 2016 at 1:32 AM, Pedro Gandola <pe...@gmail.com>
wrote:

> Just missed one of your questions.
>
> Were you trying to restart it and it wouldn't come up? It kept doing the
> > NPE on each restart?
> > Or it happened once and killed the regionserver?
>
>
> I received an alarm for the RS, and when I went there I saw the cluster was
> not processing data, so I tried to restart the RS. Everything went back to
> normal, but the RS crashed again, so I decided to remove it completely until
> I know more about the problem. Now, looking into the restart output, I see
> that the problem might be something else.
>
> *(log.out) *
>
> > /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748
> > Killed                  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase
> > --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1
> > cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or
> > directory
>
>
> *(log after restart)*
>
> > 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> >
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:java.io.tmpdir=/tmp
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:java.compiler=<NA>
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:os.name=Linux
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:os.arch=amd64
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:os.version=3.10.0-327.3.1.el7.x86_64
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:user.name=hbase
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:user.home=/home/hbase
> > 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > environment:user.dir=/home/hbase
> > 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > client connection,
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > sessionTimeout=30000
> > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > client connection,
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > sessionTimeout=30000
> > watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > 2016-02-15 22:21:23,754 INFO
> >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > Opening socket connection to server zookeeper2.localdomain/
> 10.5.1.105:2181.
> > Will not attempt to authenticate using SASL (unknown error)
> > 2016-02-15 22:21:23,755 INFO
> >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > Opening socket connection to server zookeeper2.localdomain/
> 10.5.1.105:2181.
> > Will not attempt to authenticate using SASL (unknown error)
> > 2016-02-15 22:21:23,767 INFO
> >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
> > initiating session
> > 2016-02-15 22:21:23,767 INFO
> >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
> > initiating session
> > 2016-02-15 22:21:23,777 INFO
> >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > Session establishment complete on server zookeeper2.localdomain/
> > 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout =
> 30000
> > 2016-02-15 22:21:23,777 INFO
> >  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > Session establishment complete on server zookeeper2.localdomain/
> > 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout =
> 30000
> >
>
> I don't see any exceptions during the restart, but for some reason the
> process got killed, and I don't see any SIGTERM or OOM in the Linux journal.
>
>
Is that the last of the log? Just like that?
St.Ack


> Cheers
> Pedro
>
>
> On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <pe...@gmail.com>
> wrote:
>
> > Hi St.Ack,
> >
> > Thank you for your help. I have attached ~5 min worth of data before the
> > crash.
> >
> > I restarted the cluster to update some configurations; after that I moved
> > some regions around to balance the cluster, and just to ensure data
> > locality I ran a major compaction. After that I connected my app, and 2
> > hours later this happened. So at the time of the crash I was not
> > performing any operation on the cluster; it was running normally.
> >
> > Thank you,
> > Cheers
> > Pedro
> >
> > On Tue, Feb 16, 2016 at 2:01 AM, Stack <st...@duboce.net> wrote:
> >
> >> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <pedro.gandola@gmail.com
> >
> >> wrote:
> >>
> >> > Hi Guys,
> >> >
> >> > One of my region servers got into a state where it was unable to start
> >> and
> >> > the cluster was not receiving traffic for some time:
> >>
> >>
> >>
> >> Were you trying to restart it and it wouldn't come up? It kept doing the
> >> NPE on each restart?
> >>
> >> Or it happened once and killed the regionserver?
> >>
> >>
> >>
> >> *(master log)*
> >> >
> >> > > 2016-02-15 22:04:33,186 ERROR
> >> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
> >> > master.MasterRpcServices:
> >> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a
> >> fatal
> >> > > error:
> >> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134:
> IOE
> >> in
> >> > > log roller
> >> > > Cause:
> >> > > java.io.IOException: java.lang.NullPointerException
> >> > > at
> >> > >
> >> >
> >>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> >> > > at
> >> > >
> >> >
> >>
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> >> > > at java.lang.Thread.run(Thread.java:745)
> >> > > Caused by: java.lang.NullPointerException
> >> > > at
> >> > >
> >> >
> >>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> >> > > ... 2 more 2016-02-15 22:05:45,678 WARN
> >> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> >> > > master.CatalogJanitor: Failed scan of catalog table
> >> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> after
> >> > > attempts=351, exceptions:
> >> > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
> >> > > callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta'
> at
> >> > > region=hbase:meta,,1.1588230740,
> >> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> >> > >
> >> >
> >>
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> >> > > at
> >> >
> >>
> >> You running with read replicas enabled?
> >>
> >>
> >> ....
> >>
> >>
> >>
> >> *(regionserver log)*
> >>
> >>
> >> Anything before this log message about rolling the WAL?
> >>
> >>
> >> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing,
> >> request
> >> > close of wal
> >> > java.io.IOException: java.lang.NullPointerException
> >> > at
> >> >
> >>
> >>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> >> > at
> >> >
> >>
> >>
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> >> > at java.lang.Thread.run(Thread.java:745)
> >> > Caused by: java.lang.NullPointerException
> >> > at
> >> >
> >>
> >>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> >> > ... 2 more
> >> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234
> >> ms,
> >> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
> >> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> >> > DatanodeInfoWithStorage[10.5.2.95:50010
> >> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> >> > DatanodeInfoWithStorage[10.5.2.57:50010
> >> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> >> > 2016-02-15 22:04:33,043 FATAL
> >> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> >> > regionserver.HRegionServer: ABORTING region server
> >> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> >> > java.io.IOException: java.lang.NullPointerException
> >> > at
> >> >
> >>
> >>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> >> > at
> >> >
> >>
> >>
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> >> > at java.lang.Thread.run(Thread.java:745)
> >> > Caused by: java.lang.NullPointerException
> >> > at
> >> >
> >>
> >>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> >> > ... 2 more
> >> > 2016-02-15 22:04:33,043 FATAL
> >> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> >> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors
> are:
> >> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> >> >
> org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> >> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
> >> > org.apache.phoenix.coprocessor.ScanRegionObserver,
> >> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> >> > org.apache.phoenix.hbase.index.Indexer,
> >> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> >> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> >> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> >> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> >> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> >> > 2016-02-15 22:04:33,136 INFO
> >> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> >> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> >> >   "beans" : [ {
> >> >     "name" : "java.lang:type=Memory",
> >> >     "modelerType" : "sun.management.MemoryImpl",
> >> >     "Verbose" : true,
> >> >     "HeapMemoryUsage" : {
> >> >       "committed" : 15032385536,
> >> >       "init" : 15032385536,
> >> >       "max" : 15032385536,
> >> >       "used" : 3732843792
> >> >     },
> >> >     "ObjectPendingFinalizationCount" : 0,
> >> >     "NonHeapMemoryUsage" : {
> >> >       "committed" : 104660992,
> >> >       "init" : 2555904,
> >> >       "max" : -1,
> >> >       "used" : 103018984
> >> >     },
> >> >     "ObjectName" : "java.lang:type=Memory"
> >> >   } ],
> >> >   "beans" : [ {
> >> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> >> >     "modelerType" : "RegionServer,sub=IPC",
> >> >     "tag.Context" : "regionserver",
> >> >     "tag.Hostname" : "hbase-rs9"
> >> >   } ],
> >> >   "beans" : [ {
> >> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> >> >     "modelerType" : "RegionServer,sub=Replication",
> >> >     "tag.Context" : "regionserver",
> >> >     "tag.Hostname" : "hbase-rs9"
> >> >   } ],
> >> >   "beans" : [ {
> >> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
> >> >     "modelerType" : "RegionServer,sub=Server",
> >> >     "tag.Context" : "regionserver",
> >> >     "tag.Hostname" : "hbase-rs9"
> >> >   } ]
> >> > }
> >> > 2016-02-15 22:04:33,194 INFO
> >> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> >> > regionserver.HRegionServer: STOPPED: IOE in log roller
> >>
> >> Looks like we were in the log roller, closing the WAL to put up a new
> >> one, but a sync was trying to run at the same time and the stream was
> >> closed out from under it.... NPE... which caused the server abort.
> >>
> >> ...
> >>
> >>
> >> Any idea what happened? Today, I moved some regions around to balance
> the
> >> cluster and I ran a major compaction after that and I added more threads
> >> to
> >> run small and large compactions, could this be related?
> >>
> >>
> >>
> >> Yeah, pass more of the regionserver log from before the NPE... to see
> >> what was going on at the time. I don't think the changes above are
> >> related.
> >>
> >>
> >>
> >> I see that in the current branch the class *ProtobufLogWriter:176*
> already
> >> contains try..catch
> >>
> >> try {
> >> >   return this.output.getPos();
> >> > } catch (NullPointerException npe) {
> >> >   // Concurrent close...
> >> >   throw new IOException(npe);
> >> > }
> >>
> >>
> >> Yeah. Looks like the thought is that an NPE could happen here because the
> >> stream is being closed out from under us by another thread (the syncers
> >> are off on their own, just worried about syncing... this is just to get
> >> the NPE up out of the syncer thread and into the general WAL subsystem).
> >> It looks like this caused the regionserver abort. I would think it should
> >> just provoke a WAL roll, not an abort. Put up more RS log please.
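
A minimal sketch, with invented names (not the actual FSHLog code), of the
handling described above: the sync thread receives the NPE already wrapped as
an IOException and reports it so the WAL can be rolled, rather than letting
the error abort the region server.

import java.io.IOException;

// Invented names, not FSHLog: illustrates "a sync failure should trigger a
// WAL roll, not a region server abort".
class SyncFailureSketch {

  interface Wal {
    void sync() throws IOException; // may fail if the writer was closed underneath
    void requestRoll();             // ask the WAL subsystem to put up a new log file
  }

  static void syncOnce(Wal wal) {
    try {
      wal.sync();
    } catch (IOException ioe) {
      // The NPE from the concurrently closed stream arrives here wrapped as
      // an IOException; the expected reaction is to roll the WAL, not abort.
      wal.requestRoll();
    }
  }

  public static void main(String[] args) {
    Wal wal = new Wal() {
      @Override public void sync() throws IOException {
        throw new IOException(new NullPointerException("stream closed underneath"));
      }
      @Override public void requestRoll() {
        System.out.println("rolling the WAL instead of aborting");
      }
    };
    syncOnce(wal);
  }
}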
> >>
> >>
> >>
> >> > But I would be nice to understand the root cause of this error and if
> >> there
> >> > is any misconfiguration from my side.
> >> >
> >> >
> >> Not your config I'd say (if your config brought it on, it's a bug).
> >>
> >>
> >>
> >> > *Version: *HBase 1.1.2
> >> >
> >> >
> >>
> >> St.Ack
> >>
> >> P.S. It looks like a gentleman had a similar issue at the end of last year:
> >>
> >>
> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
> >> No response then.
> >>
> >>
> >>
> >>
> >> > Thank you
> >> > Cheers
> >> > Pedro
> >> >
> >>
> >
> >
>

Re: NPE on ProtobufLogWriter.sync()

Posted by Pedro Gandola <pe...@gmail.com>.
Hi St.Ack,

Yes, I do have HA NN. I didn't fail it over, and it was not necessary to
clear the standby state or run fsck. However, I'm a bit concerned about
those logs.
The logs that I sent to Enis show that only the standby NN was having
issues; the active NN was normal and I didn't find errors at that time.

No crashes since then; all region servers are operating normally.
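
To Enis's point about the NN configuration in hbase-site.xml (quoted below),
here is a small illustrative check, assuming the standard Hadoop/HBase client
APIs, that hbase.rootdir resolves through the HA nameservice (the one the
regionserver logs call nameservice-prod) rather than a single NameNode host:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Illustrative only: verify that hbase.rootdir points at the HA nameservice
// (e.g. hdfs://nameservice-prod/apps/hbase/data) so the HDFS client can fail
// over to whichever NameNode is currently active.
public class RootDirCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // loads hbase-site.xml from the classpath
    String rootDir = conf.get("hbase.rootdir");
    Path root = new Path(rootDir);
    FileSystem fs = root.getFileSystem(conf);         // resolves the nameservice via hdfs-site.xml
    System.out.println("hbase.rootdir = " + rootDir);
    System.out.println("reachable     = " + fs.exists(root));
  }
}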

Thanks
Cheers

On Wed, Feb 17, 2016 at 6:43 AM, Stack <st...@duboce.net> wrote:

> What @Enis said, your NN seems to be in standby mode. You have HA NN? You
> failed over? You need to clear standby state? Run fsck?
> St.Ack
>
> On Tue, Feb 16, 2016 at 6:32 PM, Pedro Gandola <pe...@gmail.com>
> wrote:
>
> > Hi Enis,
> >
> > At the time the problem happened, the active NN log did not have any
> > errors or exceptions; however, the standby node has the following:
> >
> > 2016-02-16 14:21:51,039 INFO ha.EditLogTailer
> > > (EditLogTailer.java:doTailEdits(238)) - Loaded 80 edits starting from
> > txid
> > > 4916546
> > > 2016-02-16 14:22:37,633 WARN namenode.FSNamesystem
> > > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> > > returned error: Operation category READ is not supported in state
> standby
> > > 2016-02-16 14:22:37,635 WARN namenode.FSNamesystem
> > > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> > > returned error: Operation category READ is not supported in state
> standby
> > > 2016-02-16 14:22:37,766 INFO ipc.Server (Server.java:run(2151)) - IPC
> > > Server handler 196 on 8020, call
> > > org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from
> > > 10.5.2.11:54684 Call#22619 Retry#1:
> > > org.apache.hadoop.ipc.StandbyException: *Operation category WRITE is
> not
> > > supported in state standby*
> > > 2016-02-16 14:22:37,783 INFO ipc.Server (Server.java:run(2151)) - IPC
> > > Server handler 161 on 8020, call
> > > org.apache.hadoop.hdfs.protocol.ClientProtocol.getAdditionalDatanode
> from
> > > 10.5.2.11:54684 Call#22620 Retry#0:
> > > org.apache.hadoop.ipc.StandbyException: Operation category READ is not
> > > supported in state standby
> >
> >
> > But the standby node log is full of messages like this:
> >
> > 2016-02-17 02:18:44,688 WARN namenode.FSNamesystem
> > > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> > > returned error: Operation category READ is not supported in state
> standby
> >
> >
> > Is this normal, or shall I run fsck? I'm trusting Ambari, which says
> > that HDFS is totally OK :).
> >
> > Thanks
> > Cheers
> > Pedro
> >
> > On Tue, Feb 16, 2016 at 11:10 PM, Enis Söztutar <en...@gmail.com>
> > wrote:
> >
> > > > Operation category WRITE is not supported in state standby
> > > This looks to be coming from a NN that is in Standby state (or safe
> > > mode?). Did you check whether the underlying HDFS is healthy? Is this an
> > > HA configuration, and does hbase-site.xml contain the correct NN
> > > configuration?
> > >
> > > Enis
> > >
> > >
> > > On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola <
> pedro.gandola@gmail.com>
> > > wrote:
> > >
> > > > Just lost another RegionServer... The cluster was very stable until
> > > > yesterday; 2 region servers in less than 24 hours suggests something
> > > > might be wrong in my configuration.
> > > >
> > > > Any insights?
> > > >
> > > > Thank you
> > > > Pedro
> > > >
> > > > *(rs log)*
> > > >
> > > > > 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block
> > > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > > > hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for
> > block
> > > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> > > > > java.io.IOException: Bad response ERROR for block
> > > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from
> > > datanode
> > > > > DatanodeInfoWithStorage[10.5.1.117:50010
> > > > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> > > > > 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod
> ]
> > > > > retry.RetryInvocationHandler: Exception while invoking renewLease
> of
> > > > class
> > > > > ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/
> > > > > 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over
> > after
> > > > > sleeping for 1261ms.
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> > > > > Operation category WRITE is not supported in state standby
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > > > at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590)
> > > > > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > > at com.sun.proxy.$Proxy17.renewLease(Unknown Source)
> > > > > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > > at
> > > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > > > at com.sun.proxy.$Proxy18.renewLease(Unknown Source)
> > > > > at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892)
> > > > > at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
> > > > > at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
> > > > > at
> > org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
> > > > > at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
> > > > > at java.lang.Thread.run(Thread.java:745)
> > > > > 2016-02-16 14:22:37,753 WARN  [DataStreamer for file
> > > > >
> > > >
> > >
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > > > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > > > hdfs.DFSClient: Error Recovery for block
> > > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 in
> > pipeline
> > > > > DatanodeInfoWithStorage[10.5.2.11:50010
> > > > ,DS-92794e46-b126-4aa8-bfd9-a8867cac42e4,DISK],
> > > > > DatanodeInfoWithStorage[10.5.1.117:50010
> > > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK],
> > > > > DatanodeInfoWithStorage[10.5.2.57:50010
> > > > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]:
> > > > > bad datanode DatanodeInfoWithStorage[10.5.1.117:50010
> > > > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > > > 2016-02-16 14:22:37,769 INFO  [DataStreamer for file
> > > > >
> > > >
> > >
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > > > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > > > retry.RetryInvocationHandler: Exception while invoking
> > > > > getAdditionalDatanode of class ClientNamenodeProtocolTranslatorPB
> > over
> > > > > hbase-ms5.localdomain/10.5.2.248:8020. Trying to fail over
> > > immediately.
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> > > > > Operation category READ is not supported in state standby
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3326)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > > > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> > > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> > > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > > at
> > > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > > > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> > > > > 2016-02-16 14:22:37,781 WARN  [DataStreamer for file
> > > > >
> > > >
> > >
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > > > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > > > hdfs.DFSClient: DataStreamer Exception
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > > > > No lease on
> > > > >
> > > >
> > >
> >
> /apps/hbase/data/oldWALs/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > > > (inode 641732): File is not open for writing. Holder
> > > > > DFSClient_NONMAPREDUCE_-127840820_1 does not have any open files.
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3445)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3347)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > > > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> > > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> > > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > > at
> > > > >
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > > at
> > > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > > > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> > > > > 2016-02-16 14:22:38,134 INFO
> > > > >  [main-SendThread(zookeeper3.localdomain:2181)]
> zookeeper.ClientCnxn:
> > > > > Opening socket connection to server zookeeper3.localdomain/
> > > > 10.5.2.166:2181.
> > > > > Will not attempt to authenticate using SASL (unknown error)
> > > > > 2016-02-16 14:22:38,137 INFO
> > > > >  [main-SendThread(zookeeper3.localdomain:2181)]
> zookeeper.ClientCnxn:
> > > > > Socket connection established to zookeeper3.localdomain/
> > > 10.5.2.166:2181,
> > > > > initiating session
> > > > > 2016-02-16 14:22:38,144 INFO
> > > > >  [main-SendThread(zookeeper3.localdomain:2181)]
> zookeeper.ClientCnxn:
> > > > > Unable to reconnect to ZooKeeper service, session 0x25278d348e90c8d
> > has
> > > > > expired, closing socket connection
> > > > > 2016-02-16 14:22:38,144 FATAL [main-EventThread]
> > > > > regionserver.HRegionServer: ABORTING region server
> > > > > hbase-rs6.localdomain,16020,1455560986600:
> > > > > regionserver:16020-0x25278d348e90c8d,
> > > > >
> > > >
> > >
> >
> quorum=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181,
> > > > > baseZNode=/hbase regionserver:16020-0x25278d348e90c8d received
> > expired
> > > > from
> > > > > ZooKeeper, aborting
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > KeeperErrorCode = Session expired
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:606)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
> > > > > at
> > > > >
> > > >
> > >
> >
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > > > > at
> > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > > >
> > > >
> > > > On Tue, Feb 16, 2016 at 9:32 AM, Pedro Gandola <
> > pedro.gandola@gmail.com>
> > > > wrote:
> > > >
> > > > > Just missed one of your questions.
> > > > >
> > > > > Were you trying to restart it and it wouldn't come up? It kept
> doing
> > > the
> > > > >> NPE on each restart?
> > > > >> Or it happened once and killed the regionserver?
> > > > >
> > > > >
> > > > > I received an alarm for the RS, and when I went there I saw the
> > > > > cluster not processing data, so I tried to restart the RS. Everything
> > > > > went back to normal, but the RS crashed again, so I decided to remove
> > > > > it completely until I know more about the problem. Now, looking into
> > > > > the restart code, I see that the problem might be something else.
> > > > >
> > > > > *(log.out) *
> > > > >
> > > > >> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214:
> > > 20748
> > > > >> Killed                  nice -n $HBASE_NICENESS
> > > "$HBASE_HOME"/bin/hbase
> > > > >> --config "${HBASE_CONF_DIR}" $command "$@" start >>
> ${HBASE_LOGOUT}
> > > 2>&1
> > > > >> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file
> > or
> > > > >> directory
> > > > >
> > > > >
> > > > > *(log after restart)*
> > > > >
> > > > >> 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >>
> > > >
> > >
> >
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:java.io.tmpdir=/tmp
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:java.compiler=<NA>
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:os.name=Linux
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:os.arch=amd64
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:os.version=3.10.0-327.3.1.el7.x86_64
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:user.name=hbase
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:user.home=/home/hbase
> > > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > > >> environment:user.dir=/home/hbase
> > > > >> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper:
> Initiating
> > > > >> client connection,
> > > > >>
> > > >
> > >
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > > > >> sessionTimeout=30000
> > > > >> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > > >> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper:
> Initiating
> > > > >> client connection,
> > > > >>
> > > >
> > >
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > > > >> sessionTimeout=30000
> > > > >> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > > >> 2016-02-15 22:21:23,754 INFO
> > > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> > zookeeper.ClientCnxn:
> > > > >> Opening socket connection to server zookeeper2.localdomain/
> > > > >> 10.5.1.105:2181. Will not attempt to authenticate using SASL
> > (unknown
> > > > >> error)
> > > > >> 2016-02-15 22:21:23,755 INFO
> > > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> > zookeeper.ClientCnxn:
> > > > >> Opening socket connection to server zookeeper2.localdomain/
> > > > >> 10.5.1.105:2181. Will not attempt to authenticate using SASL
> > (unknown
> > > > >> error)
> > > > >> 2016-02-15 22:21:23,767 INFO
> > > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> > zookeeper.ClientCnxn:
> > > > >> Socket connection established to zookeeper2.localdomain/
> > > 10.5.1.105:2181
> > > > ,
> > > > >> initiating session
> > > > >> 2016-02-15 22:21:23,767 INFO
> > > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> > zookeeper.ClientCnxn:
> > > > >> Socket connection established to zookeeper2.localdomain/
> > > 10.5.1.105:2181
> > > > ,
> > > > >> initiating session
> > > > >> 2016-02-15 22:21:23,777 INFO
> > > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> > zookeeper.ClientCnxn:
> > > > >> Session establishment complete on server zookeeper2.localdomain/
> > > > >> 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated
> timeout
> > =
> > > > >> 30000
> > > > >> 2016-02-15 22:21:23,777 INFO
> > > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> > zookeeper.ClientCnxn:
> > > > >> Session establishment complete on server zookeeper2.localdomain/
> > > > >> 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated
> timeout
> > =
> > > > >> 30000
> > > > >>
> > > > >
> > > > > I don't see any exceptions during the restart, but for some reason
> > > > > the process got killed, and I don't see any SIGTERM or OOM in the
> > > > > Linux journal.
> > > > >
> > > > > Cheers
> > > > > Pedro
> > > > >
> > > > >
> > > > > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <
> > > pedro.gandola@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Hi St.Ack,
> > > > >>
> > > > >> Thank you for your help. I have attached ~5 min worth of data
> before
> > > the
> > > > >> crash.
> > > > >>
> > > > >> I restarted the cluster to update some configurations; after that I
> > > > >> moved some regions around to balance the cluster, and just to ensure
> > > > >> data locality I ran a major compaction. After that I connected my
> > > > >> app, and 2 hours later this happened. So at the time of the crash I
> > > > >> was not performing any operation on the cluster; it was running
> > > > >> normally.
> > > > >>
> > > > >> Thank you,
> > > > >> Cheers
> > > > >> Pedro
> > > > >>
> > > > >> On Tue, Feb 16, 2016 at 2:01 AM, Stack <st...@duboce.net> wrote:
> > > > >>
> > > > >>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <
> > > > pedro.gandola@gmail.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>> > Hi Guys,
> > > > >>> >
> > > > >>> > One of my region servers got into a state where it was unable
> to
> > > > start
> > > > >>> and
> > > > >>> > the cluster was not receiving traffic for some time:
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> Were you trying to restart it and it wouldn't come up? It kept
> > doing
> > > > the
> > > > >>> NPE on each restart?
> > > > >>>
> > > > >>> Or it happened once and killed the regionserver?
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> *(master log)*
> > > > >>> >
> > > > >>> > > 2016-02-15 22:04:33,186 ERROR
> > > > >>> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
> > > > >>> > master.MasterRpcServices:
> > > > >>> > > Region server hbase-rs9.localdomain,16020,1455560991134
> > reported
> > > a
> > > > >>> fatal
> > > > >>> > > error:
> > > > >>> > > ABORTING region server
> > hbase-rs9.localdomain,16020,1455560991134:
> > > > >>> IOE in
> > > > >>> > > log roller
> > > > >>> > > Cause:
> > > > >>> > > java.io.IOException: java.lang.NullPointerException
> > > > >>> > > at
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > > >>> > > at
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > > >>> > > at java.lang.Thread.run(Thread.java:745)
> > > > >>> > > Caused by: java.lang.NullPointerException
> > > > >>> > > at
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > > >>> > > ... 2 more 2016-02-15 22:05:45,678 WARN
> > > > >>> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> > > > >>> > > master.CatalogJanitor: Failed scan of catalog table
> > > > >>> > > org.apache.hadoop.hbase.client.RetriesExhaustedException:
> > Failed
> > > > >>> after
> > > > >>> > > attempts=351, exceptions:
> > > > >>> > > Mon Feb 15 22:05:45 UTC 2016, null,
> > > > java.net.SocketTimeoutException:
> > > > >>> > > callTimeout=60000, callDuration=68222: row '' on table
> > > 'hbase:meta'
> > > > >>> at
> > > > >>> > > region=hbase:meta,,1.1588230740,
> > > > >>> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0
> at
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > > > >>> > > at
> > > > >>> >
> > > > >>>
> > > > >>> You running with read replicas enabled?
> > > > >>>
> > > > >>>
> > > > >>> ....
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> *(regionserver log)*
> > > > >>>
> > > > >>>
> > > > >>> Anything before this log message about rolling the WAL?
> > > > >>>
> > > > >>>
> > > > >>> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error
> syncing,
> > > > >>> request
> > > > >>> > close of wal
> > > > >>> > java.io.IOException: java.lang.NullPointerException
> > > > >>> > at
> > > > >>> >
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > > >>> > at
> > > > >>> >
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > > >>> > at java.lang.Thread.run(Thread.java:745)
> > > > >>> > Caused by: java.lang.NullPointerException
> > > > >>> > at
> > > > >>> >
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > > >>> > ... 2 more
> > > > >>> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync
> > cost:
> > > > 234
> > > > >>> ms,
> > > > >>> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
> > > > >>> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> > > > >>> > DatanodeInfoWithStorage[10.5.2.95:50010
> > > > >>> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> > > > >>> > DatanodeInfoWithStorage[10.5.2.57:50010
> > > > >>> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> > > > >>> > 2016-02-15 22:04:33,043 FATAL
> > > > >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020
> .logRoller]
> > > > >>> > regionserver.HRegionServer: ABORTING region server
> > > > >>> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> > > > >>> > java.io.IOException: java.lang.NullPointerException
> > > > >>> > at
> > > > >>> >
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > > >>> > at
> > > > >>> >
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > > >>> > at java.lang.Thread.run(Thread.java:745)
> > > > >>> > Caused by: java.lang.NullPointerException
> > > > >>> > at
> > > > >>> >
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > > >>> > ... 2 more
> > > > >>> > 2016-02-15 22:04:33,043 FATAL
> > > > >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020
> .logRoller]
> > > > >>> > regionserver.HRegionServer: RegionServer abort: loaded
> > coprocessors
> > > > >>> are:
> > > > >>> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> > > > >>> >
> > > >
> org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> > > > >>> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
> > > > >>> > org.apache.phoenix.coprocessor.ScanRegionObserver,
> > > > >>> >
> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> > > > >>> > org.apache.phoenix.hbase.index.Indexer,
> > > > >>> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> > > > >>> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> > > > >>> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> > > > >>> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> > > > >>> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > > > >>> > 2016-02-15 22:04:33,136 INFO
> > > > >>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020
> .logRoller]
> > > > >>> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> > > > >>> >   "beans" : [ {
> > > > >>> >     "name" : "java.lang:type=Memory",
> > > > >>> >     "modelerType" : "sun.management.MemoryImpl",
> > > > >>> >     "Verbose" : true,
> > > > >>> >     "HeapMemoryUsage" : {
> > > > >>> >       "committed" : 15032385536,
> > > > >>> >       "init" : 15032385536,
> > > > >>> >       "max" : 15032385536,
> > > > >>> >       "used" : 3732843792
> > > > >>> >     },
> > > > >>> >     "ObjectPendingFinalizationCount" : 0,
> > > > >>> >     "NonHeapMemoryUsage" : {
> > > > >>> >       "committed" : 104660992,
> > > > >>> >       "init" : 2555904,
> > > > >>> >       "max" : -1,
> > > > >>> >       "used" : 103018984
> > > > >>> >     },
> > > > >>> >     "ObjectName" : "java.lang:type=Memory"
> > > > >>> >   } ],
> > > > >>> >   "beans" : [ {
> > > > >>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> > > > >>> >     "modelerType" : "RegionServer,sub=IPC",
> > > > >>> >     "tag.Context" : "regionserver",
> > > > >>> >     "tag.Hostname" : "hbase-rs9"
> > > > >>> >   } ],
> > > > >>> >   "beans" : [ {
> > > > >>> >     "name" :
> > > > "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> > > > >>> >     "modelerType" : "RegionServer,sub=Replication",
> > > > >>> >     "tag.Context" : "regionserver",
> > > > >>> >     "tag.Hostname" : "hbase-rs9"
> > > > >>> >   } ],
> > > > >>> >   "beans" : [ {
> > > > >>> >     "name" :
> "Hadoop:service=HBase,name=RegionServer,sub=Server",
> > > > >>> >     "modelerType" : "RegionServer,sub=Server",
> > > > >>> >     "tag.Context" : "regionserver",
> > > > >>> >     "tag.Hostname" : "hbase-rs9"
> > > > >>> >   } ]
> > > > >>> > }
> > > > >>> > 2016-02-15 22:04:33,194 INFO
> > > > >>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020
> .logRoller]
> > > > >>> > regionserver.HRegionServer: STOPPED: IOE in log roller
> > > > >>>
> > > > >>> Looks like we were in the log roller, closing the WAL to put up a
> > > > >>> new one, but a sync was trying to run at the same time and the
> > > > >>> stream was closed out from under it.... NPE... which caused the
> > > > >>> server abort.
> > > > >>>
> > > > >>> ...
> > > > >>>
> > > > >>>
> > > > >>> Any idea what happened? Today I moved some regions around to
> > > > >>> balance the cluster, ran a major compaction after that, and added
> > > > >>> more threads to run small and large compactions; could this be
> > > > >>> related?
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> Yeah, pass more of the regionserver log from before the NPE... to
> > > > >>> see what was going on at the time. I don't think the above changes
> > > > >>> are related.
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> I see that in the current branch the class
> *ProtobufLogWriter:176*
> > > > >>> already
> > > > >>> contains try..catch
> > > > >>>
> > > > >>> try {
> > > > >>> >   return this.output.getPos();
> > > > >>> > } catch (NullPointerException npe) {
> > > > >>> >   // Concurrent close...
> > > > >>> >   throw new IOException(npe);
> > > > >>> > }
> > > > >>>
> > > > >>>
> > > > >>> Yeah. Looks like the thought is an NPE could happen here because
> > the
> > > > >>> stream
> > > > >>> is being closed out from under us by another thread (the syncers
> > are
> > > > off
> > > > >>> on
> > > > >>> their own just worried about syncing... This is just to get the
> NPE
> > > up
> > > > >>> out
> > > > >>> of the syncer thread and up into the general WAL subsystem.).  It
> > > looks
> > > > >>> like this caused the regionserver abort. I would think it should
> > just
> > > > >>> provoke the WAL roll, not an abort. Put up more RS log please.
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> > But it would be nice to understand the root cause of this error
> > > > >>> > and whether there is any misconfiguration on my side.
> > > > >>> >
> > > > >>> >
> > > > >>> Not your config, I'd say (if your config brought it on, it's a
> > > > >>> bug).
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> > *Version: *HBase 1.1.2
> > > > >>> >
> > > > >>> >
> > > > >>>
> > > > >>> St.Ack
> > > > >>>
> > > > >>> P.S. It looks like a gentleman had a similar issue at the end of
> > > > >>> last year:
> > > > >>>
> > > > >>>
> > > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
> > > > >>> No response then.
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> > Thank you
> > > > >>> > Cheers
> > > > >>> > Pedro
> > > > >>> >
> > > > >>>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: NPE on ProtobufLogWriter.sync()

Posted by Stack <st...@duboce.net>.
What @Enis said: your NN seems to be in standby mode. Do you have an HA NN?
Did you fail over? Do you need to clear the standby state? Run fsck?
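
For example, something along these lines should show which NN is currently
active and whether the HBase data directory is clean (nn1/nn2 are just
placeholders for whatever service ids your hdfs-site.xml defines for
nameservice-prod):

  hdfs haadmin -getServiceState nn1
  hdfs haadmin -getServiceState nn2
  hdfs fsck /apps/hbase/data
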
St.Ack

On Tue, Feb 16, 2016 at 6:32 PM, Pedro Gandola <pe...@gmail.com>
wrote:

> Hi Enis,
>
> At the time the problem happened, the active NN log does not show any errors
> or exceptions; however, the standby node has the following:
>
> 2016-02-16 14:21:51,039 INFO ha.EditLogTailer
> > (EditLogTailer.java:doTailEdits(238)) - Loaded 80 edits starting from
> txid
> > 4916546
> > 2016-02-16 14:22:37,633 WARN namenode.FSNamesystem
> > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> > returned error: Operation category READ is not supported in state standby
> > 2016-02-16 14:22:37,635 WARN namenode.FSNamesystem
> > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> > returned error: Operation category READ is not supported in state standby
> > 2016-02-16 14:22:37,766 INFO ipc.Server (Server.java:run(2151)) - IPC
> > Server handler 196 on 8020, call
> > org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from
> > 10.5.2.11:54684 Call#22619 Retry#1:
> > org.apache.hadoop.ipc.StandbyException: *Operation category WRITE is not
> > supported in state standby*
> > 2016-02-16 14:22:37,783 INFO ipc.Server (Server.java:run(2151)) - IPC
> > Server handler 161 on 8020, call
> > org.apache.hadoop.hdfs.protocol.ClientProtocol.getAdditionalDatanode from
> > 10.5.2.11:54684 Call#22620 Retry#0:
> > org.apache.hadoop.ipc.StandbyException: Operation category READ is not
> > supported in state standby
>
>
> But the standby node log is full of messages like this:
>
> 2016-02-17 02:18:44,688 WARN namenode.FSNamesystem
> > (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> > returned error: Operation category READ is not supported in state standby
>
>
> Is this normal, or shall I run fsck? I'm trusting Ambari, which says that
> HDFS is totally OK :).
>
> Thanks
> Cheers
> Pedro
>
> On Tue, Feb 16, 2016 at 11:10 PM, Enis Söztutar <en...@gmail.com>
> wrote:
>
> > > Operation category WRITE is not supported in state standby
> > This looks to be coming from a NN that is in Standby state (or safe mode?).
> > Did you check whether the underlying HDFS is healthy? Is this an HA
> > configuration, and does hbase-site.xml contain the correct NN configuration?
> >
> > Enis
> >
> >
> > On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola <pe...@gmail.com>
> > wrote:
> >
> > > Just lost another RegionServer... The cluster was very stable until
> > > yesterday: 2 region servers in less than 24 hours, so something might be
> > > wrong in my configuration.
> > >
> > > Any insights?
> > >
> > > Thank you
> > > Pedro
> > >
> > > *(rs log)*
> > >
> > > > 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block
> > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > > hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for
> block
> > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> > > > java.io.IOException: Bad response ERROR for block
> > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from
> > datanode
> > > > DatanodeInfoWithStorage[10.5.1.117:50010
> > > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> > > > 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod]
> > > > retry.RetryInvocationHandler: Exception while invoking renewLease of
> > > class
> > > > ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/
> > > > 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over
> after
> > > > sleeping for 1261ms.
> > > >
> > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> > > > Operation category WRITE is not supported in state standby
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > > at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590)
> > > > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > > > at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > at com.sun.proxy.$Proxy17.renewLease(Unknown Source)
> > > > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > > > at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > at
> > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > > at com.sun.proxy.$Proxy18.renewLease(Unknown Source)
> > > > at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892)
> > > > at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
> > > > at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
> > > > at
> org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
> > > > at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
> > > > at java.lang.Thread.run(Thread.java:745)
> > > > 2016-02-16 14:22:37,753 WARN  [DataStreamer for file
> > > >
> > >
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > > hdfs.DFSClient: Error Recovery for block
> > > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 in
> pipeline
> > > > DatanodeInfoWithStorage[10.5.2.11:50010
> > > ,DS-92794e46-b126-4aa8-bfd9-a8867cac42e4,DISK],
> > > > DatanodeInfoWithStorage[10.5.1.117:50010
> > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK],
> > > > DatanodeInfoWithStorage[10.5.2.57:50010
> > > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]:
> > > > bad datanode DatanodeInfoWithStorage[10.5.1.117:50010
> > > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > > 2016-02-16 14:22:37,769 INFO  [DataStreamer for file
> > > >
> > >
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > > retry.RetryInvocationHandler: Exception while invoking
> > > > getAdditionalDatanode of class ClientNamenodeProtocolTranslatorPB
> over
> > > > hbase-ms5.localdomain/10.5.2.248:8020. Trying to fail over
> > immediately.
> > > >
> > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> > > > Operation category READ is not supported in state standby
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3326)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > at
> > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> > > > 2016-02-16 14:22:37,781 WARN  [DataStreamer for file
> > > >
> > >
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > > hdfs.DFSClient: DataStreamer Exception
> > > >
> > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > > > No lease on
> > > >
> > >
> >
> /apps/hbase/data/oldWALs/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > > (inode 641732): File is not open for writing. Holder
> > > > DFSClient_NONMAPREDUCE_-127840820_1 does not have any open files.
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3445)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3347)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > > at java.security.AccessController.doPrivileged(Native Method)
> > > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > > at
> > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> > > > 2016-02-16 14:22:38,134 INFO
> > > >  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> > > > Opening socket connection to server zookeeper3.localdomain/
> > > 10.5.2.166:2181.
> > > > Will not attempt to authenticate using SASL (unknown error)
> > > > 2016-02-16 14:22:38,137 INFO
> > > >  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> > > > Socket connection established to zookeeper3.localdomain/
> > 10.5.2.166:2181,
> > > > initiating session
> > > > 2016-02-16 14:22:38,144 INFO
> > > >  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> > > > Unable to reconnect to ZooKeeper service, session 0x25278d348e90c8d
> has
> > > > expired, closing socket connection
> > > > 2016-02-16 14:22:38,144 FATAL [main-EventThread]
> > > > regionserver.HRegionServer: ABORTING region server
> > > > hbase-rs6.localdomain,16020,1455560986600:
> > > > regionserver:16020-0x25278d348e90c8d,
> > > >
> > >
> >
> quorum=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181,
> > > > baseZNode=/hbase regionserver:16020-0x25278d348e90c8d received
> expired
> > > from
> > > > ZooKeeper, aborting
> > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > KeeperErrorCode = Session expired
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:606)
> > > > at
> > > >
> > >
> >
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
> > > > at
> > > >
> > >
> >
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > > > at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > >
> > >
> > > On Tue, Feb 16, 2016 at 9:32 AM, Pedro Gandola <
> pedro.gandola@gmail.com>
> > > wrote:
> > >
> > > > Just missed one of your questions.
> > > >
> > > > Were you trying to restart it and it wouldn't come up? It kept doing
> > the
> > > >> NPE on each restart?
> > > >> Or it happened once and killed the regionserver?
> > > >
> > > >
> > > > I received an alarm for the RS, and when I went there I saw the cluster
> > > > not processing data, so I tried to restart the RS. Everything went back
> > > > to normal, but the RS crashed again, so I decided to remove it
> > > > completely until I know more about the problem. Now, looking into the
> > > > restart code, I see that the problem might be something else.
> > > >
> > > > *(log.out) *
> > > >
> > > >> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214:
> > 20748
> > > >> Killed                  nice -n $HBASE_NICENESS
> > "$HBASE_HOME"/bin/hbase
> > > >> --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT}
> > 2>&1
> > > >> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file
> or
> > > >> directory
> > > >
> > > >
> > > > *(log after restart)*
> > > >
> > > >> 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> > > >>
> > >
> >
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > >> environment:java.io.tmpdir=/tmp
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > >> environment:java.compiler=<NA>
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > >> environment:os.name=Linux
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > >> environment:os.arch=amd64
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > >> environment:os.version=3.10.0-327.3.1.el7.x86_64
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > >> environment:user.name=hbase
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > >> environment:user.home=/home/hbase
> > > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > > >> environment:user.dir=/home/hbase
> > > >> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > > >> client connection,
> > > >>
> > >
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > > >> sessionTimeout=30000
> > > >> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > >> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > > >> client connection,
> > > >>
> > >
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > > >> sessionTimeout=30000
> > > >> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > > >> 2016-02-15 22:21:23,754 INFO
> > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> zookeeper.ClientCnxn:
> > > >> Opening socket connection to server zookeeper2.localdomain/
> > > >> 10.5.1.105:2181. Will not attempt to authenticate using SASL
> (unknown
> > > >> error)
> > > >> 2016-02-15 22:21:23,755 INFO
> > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> zookeeper.ClientCnxn:
> > > >> Opening socket connection to server zookeeper2.localdomain/
> > > >> 10.5.1.105:2181. Will not attempt to authenticate using SASL
> (unknown
> > > >> error)
> > > >> 2016-02-15 22:21:23,767 INFO
> > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> zookeeper.ClientCnxn:
> > > >> Socket connection established to zookeeper2.localdomain/
> > 10.5.1.105:2181
> > > ,
> > > >> initiating session
> > > >> 2016-02-15 22:21:23,767 INFO
> > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> zookeeper.ClientCnxn:
> > > >> Socket connection established to zookeeper2.localdomain/
> > 10.5.1.105:2181
> > > ,
> > > >> initiating session
> > > >> 2016-02-15 22:21:23,777 INFO
> > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> zookeeper.ClientCnxn:
> > > >> Session establishment complete on server zookeeper2.localdomain/
> > > >> 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout
> =
> > > >> 30000
> > > >> 2016-02-15 22:21:23,777 INFO
> > > >>  [main-SendThread(zookeeper2.localdomain:2181)]
> zookeeper.ClientCnxn:
> > > >> Session establishment complete on server zookeeper2.localdomain/
> > > >> 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout
> =
> > > >> 30000
> > > >>
> > > >
> > > > I don't see any exceptions during the restart, but for some reason the
> > > > process got killed, and I don't see any SIGTERM or OOM in the Linux
> > > > journal.
> > > >
> > > > Cheers
> > > > Pedro
> > > >
> > > >
> > > > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <
> > pedro.gandola@gmail.com>
> > > > wrote:
> > > >
> > > >> Hi St.Ack,
> > > >>
> > > >> Thank you for your help. I have attached ~5 min worth of data before
> > the
> > > >> crash.
> > > >>
> > > >> I restarted the cluster to update some configurations; after that I
> > > >> moved some regions around to balance the cluster, and just to ensure
> > > >> data locality I ran a major compaction. After that I connected my app,
> > > >> and 2 hours later this happened. So at the time of the crash I was not
> > > >> performing any operation on the cluster; it was running normally.
> > > >>
> > > >> Thank you,
> > > >> Cheers
> > > >> Pedro
> > > >>
> > > >> On Tue, Feb 16, 2016 at 2:01 AM, Stack <st...@duboce.net> wrote:
> > > >>
> > > >>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <
> > > pedro.gandola@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>> > Hi Guys,
> > > >>> >
> > > >>> > One of my region servers got into a state where it was unable to
> > > start
> > > >>> and
> > > >>> > the cluster was not receiving traffic for some time:
> > > >>>
> > > >>>
> > > >>>
> > > >>> Were you trying to restart it and it wouldn't come up? It kept
> doing
> > > the
> > > >>> NPE on each restart?
> > > >>>
> > > >>> Or it happened once and killed the regionserver?
> > > >>>
> > > >>>
> > > >>>
> > > >>> *(master log)*
> > > >>> >
> > > >>> > > 2016-02-15 22:04:33,186 ERROR
> > > >>> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
> > > >>> > master.MasterRpcServices:
> > > >>> > > Region server hbase-rs9.localdomain,16020,1455560991134
> reported
> > a
> > > >>> fatal
> > > >>> > > error:
> > > >>> > > ABORTING region server
> hbase-rs9.localdomain,16020,1455560991134:
> > > >>> IOE in
> > > >>> > > log roller
> > > >>> > > Cause:
> > > >>> > > java.io.IOException: java.lang.NullPointerException
> > > >>> > > at
> > > >>> > >
> > > >>> >
> > > >>>
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > >>> > > at
> > > >>> > >
> > > >>> >
> > > >>>
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > >>> > > at java.lang.Thread.run(Thread.java:745)
> > > >>> > > Caused by: java.lang.NullPointerException
> > > >>> > > at
> > > >>> > >
> > > >>> >
> > > >>>
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > >>> > > ... 2 more 2016-02-15 22:05:45,678 WARN
> > > >>> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> > > >>> > > master.CatalogJanitor: Failed scan of catalog table
> > > >>> > > org.apache.hadoop.hbase.client.RetriesExhaustedException:
> Failed
> > > >>> after
> > > >>> > > attempts=351, exceptions:
> > > >>> > > Mon Feb 15 22:05:45 UTC 2016, null,
> > > java.net.SocketTimeoutException:
> > > >>> > > callTimeout=60000, callDuration=68222: row '' on table
> > 'hbase:meta'
> > > >>> at
> > > >>> > > region=hbase:meta,,1.1588230740,
> > > >>> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> > > >>> > >
> > > >>> >
> > > >>>
> > >
> >
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > > >>> > > at
> > > >>> >
> > > >>>
> > > >>> You running with read replicas enabled?
> > > >>>
> > > >>>
> > > >>> ....
> > > >>>
> > > >>>
> > > >>>
> > > >>> *(regionserver log)*
> > > >>>
> > > >>>
> > > >>> Anything before this log message about rolling the WAL?
> > > >>>
> > > >>>
> > > >>> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing,
> > > >>> request
> > > >>> > close of wal
> > > >>> > java.io.IOException: java.lang.NullPointerException
> > > >>> > at
> > > >>> >
> > > >>>
> > > >>>
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > >>> > at
> > > >>> >
> > > >>>
> > > >>>
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > >>> > at java.lang.Thread.run(Thread.java:745)
> > > >>> > Caused by: java.lang.NullPointerException
> > > >>> > at
> > > >>> >
> > > >>>
> > > >>>
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > >>> > ... 2 more
> > > >>> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync
> cost:
> > > 234
> > > >>> ms,
> > > >>> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
> > > >>> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> > > >>> > DatanodeInfoWithStorage[10.5.2.95:50010
> > > >>> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> > > >>> > DatanodeInfoWithStorage[10.5.2.57:50010
> > > >>> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> > > >>> > 2016-02-15 22:04:33,043 FATAL
> > > >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > > >>> > regionserver.HRegionServer: ABORTING region server
> > > >>> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> > > >>> > java.io.IOException: java.lang.NullPointerException
> > > >>> > at
> > > >>> >
> > > >>>
> > > >>>
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > >>> > at
> > > >>> >
> > > >>>
> > > >>>
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > >>> > at java.lang.Thread.run(Thread.java:745)
> > > >>> > Caused by: java.lang.NullPointerException
> > > >>> > at
> > > >>> >
> > > >>>
> > > >>>
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > >>> > ... 2 more
> > > >>> > 2016-02-15 22:04:33,043 FATAL
> > > >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > > >>> > regionserver.HRegionServer: RegionServer abort: loaded
> coprocessors
> > > >>> are:
> > > >>> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> > > >>> >
> > > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> > > >>> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
> > > >>> > org.apache.phoenix.coprocessor.ScanRegionObserver,
> > > >>> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> > > >>> > org.apache.phoenix.hbase.index.Indexer,
> > > >>> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> > > >>> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> > > >>> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> > > >>> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> > > >>> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > > >>> > 2016-02-15 22:04:33,136 INFO
> > > >>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > > >>> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> > > >>> >   "beans" : [ {
> > > >>> >     "name" : "java.lang:type=Memory",
> > > >>> >     "modelerType" : "sun.management.MemoryImpl",
> > > >>> >     "Verbose" : true,
> > > >>> >     "HeapMemoryUsage" : {
> > > >>> >       "committed" : 15032385536,
> > > >>> >       "init" : 15032385536,
> > > >>> >       "max" : 15032385536,
> > > >>> >       "used" : 3732843792
> > > >>> >     },
> > > >>> >     "ObjectPendingFinalizationCount" : 0,
> > > >>> >     "NonHeapMemoryUsage" : {
> > > >>> >       "committed" : 104660992,
> > > >>> >       "init" : 2555904,
> > > >>> >       "max" : -1,
> > > >>> >       "used" : 103018984
> > > >>> >     },
> > > >>> >     "ObjectName" : "java.lang:type=Memory"
> > > >>> >   } ],
> > > >>> >   "beans" : [ {
> > > >>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> > > >>> >     "modelerType" : "RegionServer,sub=IPC",
> > > >>> >     "tag.Context" : "regionserver",
> > > >>> >     "tag.Hostname" : "hbase-rs9"
> > > >>> >   } ],
> > > >>> >   "beans" : [ {
> > > >>> >     "name" :
> > > "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> > > >>> >     "modelerType" : "RegionServer,sub=Replication",
> > > >>> >     "tag.Context" : "regionserver",
> > > >>> >     "tag.Hostname" : "hbase-rs9"
> > > >>> >   } ],
> > > >>> >   "beans" : [ {
> > > >>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
> > > >>> >     "modelerType" : "RegionServer,sub=Server",
> > > >>> >     "tag.Context" : "regionserver",
> > > >>> >     "tag.Hostname" : "hbase-rs9"
> > > >>> >   } ]
> > > >>> > }
> > > >>> > 2016-02-15 22:04:33,194 INFO
> > > >>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > > >>> > regionserver.HRegionServer: STOPPED: IOE in log roller
> > > >>>
> > > >>> Looks like we were in the log roller, closing the WAL to put up a new
> > > >>> one, but a sync was trying to run at the same time and the stream was
> > > >>> closed out from under it.... NPE... which caused the server abort.
> > > >>>
> > > >>> ...
> > > >>>
> > > >>>
> > > >>> Any idea what happened? Today I moved some regions around to balance
> > > >>> the cluster, ran a major compaction after that, and added more
> > > >>> threads to run small and large compactions; could this be related?
> > > >>>
> > > >>>
> > > >>>
> > > >>> Yeah, pass more of the regionserver log from before the NPE... to see
> > > >>> what was going on at the time. I don't think the above changes are
> > > >>> related.
> > > >>>
> > > >>>
> > > >>>
> > > >>> I see that in the current branch the class *ProtobufLogWriter:176*
> > > >>> already
> > > >>> contains try..catch
> > > >>>
> > > >>> try {
> > > >>> >   return this.output.getPos();
> > > >>> > } catch (NullPointerException npe) {
> > > >>> >   // Concurrent close...
> > > >>> >   throw new IOException(npe);
> > > >>> > }
> > > >>>
> > > >>>
> > > >>> Yeah. Looks like the thought is an NPE could happen here because
> the
> > > >>> stream
> > > >>> is being closed out from under us by another thread (the syncers
> are
> > > off
> > > >>> on
> > > >>> their own just worried about syncing... This is just to get the NPE
> > up
> > > >>> out
> > > >>> of the syncer thread and up into the general WAL subsystem.).  It
> > looks
> > > >>> like this caused the regionserver abort. I would think it should
> just
> > > >>> provoke the WAL roll, not an abort. Put up more RS log please.
> > > >>>
> > > >>>
> > > >>>
> > > >>> > But it would be nice to understand the root cause of this error and
> > > >>> > whether there is any misconfiguration on my side.
> > > >>> >
> > > >>> >
> > > >>> Not your config, I'd say (if your config brought it on, it's a bug).
> > > >>>
> > > >>>
> > > >>>
> > > >>> > *Version: *HBase 1.1.2
> > > >>> >
> > > >>> >
> > > >>>
> > > >>> St.Ack
> > > >>>
> > > >>> P.S. It looks like a gentleman had a similar issue at the end of
> > > >>> last year:
> > > >>>
> > > >>>
> > >
> >
> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
> > > >>> No response then.
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> > Thank you
> > > >>> > Cheers
> > > >>> > Pedro
> > > >>> >
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: NPE on ProtobufLogWriter.sync()

Posted by Pedro Gandola <pe...@gmail.com>.
Hi Enis,

At the time the problem happened, the active NN log does not show any errors
or exceptions; however, the standby node has the following:

2016-02-16 14:21:51,039 INFO ha.EditLogTailer
> (EditLogTailer.java:doTailEdits(238)) - Loaded 80 edits starting from txid
> 4916546
> 2016-02-16 14:22:37,633 WARN namenode.FSNamesystem
> (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> returned error: Operation category READ is not supported in state standby
> 2016-02-16 14:22:37,635 WARN namenode.FSNamesystem
> (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> returned error: Operation category READ is not supported in state standby
> 2016-02-16 14:22:37,766 INFO ipc.Server (Server.java:run(2151)) - IPC
> Server handler 196 on 8020, call
> org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from
> 10.5.2.11:54684 Call#22619 Retry#1:
> org.apache.hadoop.ipc.StandbyException: *Operation category WRITE is not
> supported in state standby*
> 2016-02-16 14:22:37,783 INFO ipc.Server (Server.java:run(2151)) - IPC
> Server handler 161 on 8020, call
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getAdditionalDatanode from
> 10.5.2.11:54684 Call#22620 Retry#0:
> org.apache.hadoop.ipc.StandbyException: Operation category READ is not
> supported in state standby


But the standby node log is full of messages like this:

2016-02-17 02:18:44,688 WARN namenode.FSNamesystem
> (FSNamesystem.java:getCorruptFiles(7263)) - Get corrupt file blocks
> returned error: Operation category READ is not supported in state standby


Is this normal, or shall I run an fsck? I'm trusting Ambari, which says that
HDFS is totally OK :).
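
For what it's worth, the repeated getCorruptFiles warnings on a standby NN are
usually just a monitoring/metrics check hitting the standby, which rejects
READ-category operations by design, so on their own they don't indicate
corruption. If you want a quick programmatic check for corrupt blocks before
deciding on a full fsck, something like the sketch below should work; the
nameservice name and path are taken from the logs in this thread, and the HA
client configuration (core-site.xml/hdfs-site.xml) is assumed to be on the
classpath:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Sketch only: list files with corrupt blocks under the HBase root. Going
// through the HA nameservice URI routes the call to the active NN, so it will
// not hit the standby's "not supported in state standby" rejection.
public class CorruptBlockCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://nameservice-prod"), conf);
    RemoteIterator<Path> corrupt = fs.listCorruptFileBlocks(new Path("/apps/hbase/data"));
    boolean any = false;
    while (corrupt.hasNext()) {
      any = true;
      System.out.println("File with corrupt blocks: " + corrupt.next());
    }
    if (!any) {
      System.out.println("No corrupt blocks reported under /apps/hbase/data");
    }
  }
}

If that comes back empty (and an fsck of the HBase root agrees), the standby
warnings are most likely just noise.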

Thanks
Cheers
Pedro

On Tue, Feb 16, 2016 at 11:10 PM, Enis Söztutar <en...@gmail.com> wrote:

> > Operation category WRITE is not supported in state standby
> This looks to be coming from a NN that is in Standby state (or safe mode?).
> Did you check whether the underlying HDFS is healthy? Is this an HA
> configuration, and does hbase-site.xml contain the correct NN configuration?
>
> Enis
>
>
> On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola <pe...@gmail.com>
> wrote:
>
> > Just lost another RegionServer... The cluster was very stable until
> > yesterday; 2 region servers in less than 24 hours suggests something might
> > be wrong in my configuration.
> >
> > Any insights?
> >
> > Thank you
> > Pedro
> >
> > *(rs log)*
> >
> > > 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block
> > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
> > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> > > java.io.IOException: Bad response ERROR for block
> > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from
> datanode
> > > DatanodeInfoWithStorage[10.5.1.117:50010
> > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > at
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> > > 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod]
> > > retry.RetryInvocationHandler: Exception while invoking renewLease of
> > class
> > > ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/
> > > 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after
> > > sleeping for 1261ms.
> > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> > > Operation category WRITE is not supported in state standby
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > at
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > at java.security.AccessController.doPrivileged(Native Method)
> > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > at
> > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > at
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590)
> > > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > > at
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > at
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > at
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > at com.sun.proxy.$Proxy17.renewLease(Unknown Source)
> > > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > > at
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > at
> org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > at com.sun.proxy.$Proxy18.renewLease(Unknown Source)
> > > at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892)
> > > at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
> > > at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
> > > at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
> > > at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
> > > at java.lang.Thread.run(Thread.java:745)
> > > 2016-02-16 14:22:37,753 WARN  [DataStreamer for file
> > >
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > hdfs.DFSClient: Error Recovery for block
> > > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 in pipeline
> > > DatanodeInfoWithStorage[10.5.2.11:50010
> > ,DS-92794e46-b126-4aa8-bfd9-a8867cac42e4,DISK],
> > > DatanodeInfoWithStorage[10.5.1.117:50010
> > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK],
> > > DatanodeInfoWithStorage[10.5.2.57:50010
> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]:
> > > bad datanode DatanodeInfoWithStorage[10.5.1.117:50010
> > > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > > 2016-02-16 14:22:37,769 INFO  [DataStreamer for file
> > >
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > retry.RetryInvocationHandler: Exception while invoking
> > > getAdditionalDatanode of class ClientNamenodeProtocolTranslatorPB over
> > > hbase-ms5.localdomain/10.5.2.248:8020. Trying to fail over
> immediately.
> > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> > > Operation category READ is not supported in state standby
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3326)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > at
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > at java.security.AccessController.doPrivileged(Native Method)
> > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > at
> > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > at
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > at
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > at
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > at
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > at
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > at
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > at
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > at
> org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> > > 2016-02-16 14:22:37,781 WARN  [DataStreamer for file
> > >
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > > hdfs.DFSClient: DataStreamer Exception
> > >
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > > No lease on
> > >
> >
> /apps/hbase/data/oldWALs/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > > (inode 641732): File is not open for writing. Holder
> > > DFSClient_NONMAPREDUCE_-127840820_1 does not have any open files.
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3445)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3347)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > at
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > > at java.security.AccessController.doPrivileged(Native Method)
> > > at javax.security.auth.Subject.doAs(Subject.java:422)
> > > at
> > >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > > at
> > >
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > at
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > at
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > at
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > at
> > >
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > at
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > at
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > at java.lang.reflect.Method.invoke(Method.java:497)
> > > at
> org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> > > at
> > >
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> > > 2016-02-16 14:22:38,134 INFO
> > >  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> > > Opening socket connection to server zookeeper3.localdomain/
> > 10.5.2.166:2181.
> > > Will not attempt to authenticate using SASL (unknown error)
> > > 2016-02-16 14:22:38,137 INFO
> > >  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> > > Socket connection established to zookeeper3.localdomain/
> 10.5.2.166:2181,
> > > initiating session
> > > 2016-02-16 14:22:38,144 INFO
> > >  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> > > Unable to reconnect to ZooKeeper service, session 0x25278d348e90c8d has
> > > expired, closing socket connection
> > > 2016-02-16 14:22:38,144 FATAL [main-EventThread]
> > > regionserver.HRegionServer: ABORTING region server
> > > hbase-rs6.localdomain,16020,1455560986600:
> > > regionserver:16020-0x25278d348e90c8d,
> > >
> >
> quorum=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181,
> > > baseZNode=/hbase regionserver:16020-0x25278d348e90c8d received expired
> > from
> > > ZooKeeper, aborting
> > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > KeeperErrorCode = Session expired
> > > at
> > >
> >
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:606)
> > > at
> > >
> >
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
> > > at
> > >
> >
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> >
> >
> > On Tue, Feb 16, 2016 at 9:32 AM, Pedro Gandola <pe...@gmail.com>
> > wrote:
> >
> > > Just missed one of your questions.
> > >
> > > Were you trying to restart it and it wouldn't come up? It kept doing
> the
> > >> NPE on each restart?
> > >> Or it happened once and killed the regionserver?
> > >
> > >
> > > I received an alarm for the RS, and when I went there I saw the cluster
> > > not processing data, so I tried to restart the RS. Everything went back
> > > to normal, but the RS crashed again, so I decided to remove it completely
> > > until I know more about the problem. Now, looking at what happened during
> > > the restart, I see that the problem might be something else.
> > >
> > > *(log.out) *
> > >
> > >> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214:
> 20748
> > >> Killed                  nice -n $HBASE_NICENESS
> "$HBASE_HOME"/bin/hbase
> > >> --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT}
> 2>&1
> > >> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or
> > >> directory
> > >
> > >
> > > *(log after restart)*
> > >
> > >> 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> > >>
> >
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > >> environment:java.io.tmpdir=/tmp
> > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > >> environment:java.compiler=<NA>
> > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > >> environment:os.name=Linux
> > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > >> environment:os.arch=amd64
> > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > >> environment:os.version=3.10.0-327.3.1.el7.x86_64
> > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > >> environment:user.name=hbase
> > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > >> environment:user.home=/home/hbase
> > >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> > >> environment:user.dir=/home/hbase
> > >> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > >> client connection,
> > >>
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > >> sessionTimeout=30000
> > >> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > >> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> > >> client connection,
> > >>
> >
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> > >> sessionTimeout=30000
> > >> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> > >> 2016-02-15 22:21:23,754 INFO
> > >>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > >> Opening socket connection to server zookeeper2.localdomain/
> > >> 10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown
> > >> error)
> > >> 2016-02-15 22:21:23,755 INFO
> > >>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > >> Opening socket connection to server zookeeper2.localdomain/
> > >> 10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown
> > >> error)
> > >> 2016-02-15 22:21:23,767 INFO
> > >>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > >> Socket connection established to zookeeper2.localdomain/
> 10.5.1.105:2181
> > ,
> > >> initiating session
> > >> 2016-02-15 22:21:23,767 INFO
> > >>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > >> Socket connection established to zookeeper2.localdomain/
> 10.5.1.105:2181
> > ,
> > >> initiating session
> > >> 2016-02-15 22:21:23,777 INFO
> > >>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > >> Session establishment complete on server zookeeper2.localdomain/
> > >> 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout =
> > >> 30000
> > >> 2016-02-15 22:21:23,777 INFO
> > >>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> > >> Session establishment complete on server zookeeper2.localdomain/
> > >> 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout =
> > >> 30000
> > >>
> > >
> > > I don't see any exception during the restart, but for some reason the
> > > process got killed, and I don't see any SIGTERM or OOM in the Linux journal.
> > >
> > > Cheers
> > > Pedro
> > >
> > >
> > > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <
> pedro.gandola@gmail.com>
> > > wrote:
> > >
> > >> Hi St.Ack,
> > >>
> > >> Thank you for your help. I have attached ~5 min worth of data before
> the
> > >> crash.
> > >>
> > >> I restarted the cluster to update some configurations; after that I
> > >> moved some regions around to balance the cluster and, just to ensure
> > >> data locality, I ran a major compaction. After that I connected my app,
> > >> and 2 hours later this happened. So by the time of the crash I was not
> > >> performing any operation on the cluster; it was running normally.
> > >>
> > >> Thank you,
> > >> Cheers
> > >> Pedro
> > >>
> > >> On Tue, Feb 16, 2016 at 2:01 AM, Stack <st...@duboce.net> wrote:
> > >>
> > >>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <
> > pedro.gandola@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > Hi Guys,
> > >>> >
> > >>> > One of my region servers got into a state where it was unable to
> > start
> > >>> and
> > >>> > the cluster was not receiving traffic for some time:
> > >>>
> > >>>
> > >>>
> > >>> Were you trying to restart it and it wouldn't come up? It kept doing
> > the
> > >>> NPE on each restart?
> > >>>
> > >>> Or it happened once and killed the regionserver?
> > >>>
> > >>>
> > >>>
> > >>> *(master log)*
> > >>> >
> > >>> > > 2016-02-15 22:04:33,186 ERROR
> > >>> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
> > >>> > master.MasterRpcServices:
> > >>> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported
> a
> > >>> fatal
> > >>> > > error:
> > >>> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134:
> > >>> IOE in
> > >>> > > log roller
> > >>> > > Cause:
> > >>> > > java.io.IOException: java.lang.NullPointerException
> > >>> > > at
> > >>> > >
> > >>> >
> > >>>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > >>> > > at
> > >>> > >
> > >>> >
> > >>>
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > >>> > > at java.lang.Thread.run(Thread.java:745)
> > >>> > > Caused by: java.lang.NullPointerException
> > >>> > > at
> > >>> > >
> > >>> >
> > >>>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > >>> > > ... 2 more 2016-02-15 22:05:45,678 WARN
> > >>> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> > >>> > > master.CatalogJanitor: Failed scan of catalog table
> > >>> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> > >>> after
> > >>> > > attempts=351, exceptions:
> > >>> > > Mon Feb 15 22:05:45 UTC 2016, null,
> > java.net.SocketTimeoutException:
> > >>> > > callTimeout=60000, callDuration=68222: row '' on table
> 'hbase:meta'
> > >>> at
> > >>> > > region=hbase:meta,,1.1588230740,
> > >>> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> > >>> > >
> > >>> >
> > >>>
> >
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > >>> > > at
> > >>> >
> > >>>
> > >>> You running with read replicas enabled?
> > >>>
> > >>>
> > >>> ....
> > >>>
> > >>>
> > >>>
> > >>> *(regionserver log)*
> > >>>
> > >>>
> > >>> Anything before this log message about rolling the WAL?
> > >>>
> > >>>
> > >>> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing,
> > >>> request
> > >>> > close of wal
> > >>> > java.io.IOException: java.lang.NullPointerException
> > >>> > at
> > >>> >
> > >>>
> > >>>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > >>> > at
> > >>> >
> > >>>
> > >>>
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > >>> > at java.lang.Thread.run(Thread.java:745)
> > >>> > Caused by: java.lang.NullPointerException
> > >>> > at
> > >>> >
> > >>>
> > >>>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > >>> > ... 2 more
> > >>> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost:
> > 234
> > >>> ms,
> > >>> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
> > >>> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> > >>> > DatanodeInfoWithStorage[10.5.2.95:50010
> > >>> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> > >>> > DatanodeInfoWithStorage[10.5.2.57:50010
> > >>> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> > >>> > 2016-02-15 22:04:33,043 FATAL
> > >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > >>> > regionserver.HRegionServer: ABORTING region server
> > >>> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> > >>> > java.io.IOException: java.lang.NullPointerException
> > >>> > at
> > >>> >
> > >>>
> > >>>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > >>> > at
> > >>> >
> > >>>
> > >>>
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > >>> > at java.lang.Thread.run(Thread.java:745)
> > >>> > Caused by: java.lang.NullPointerException
> > >>> > at
> > >>> >
> > >>>
> > >>>
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > >>> > ... 2 more
> > >>> > 2016-02-15 22:04:33,043 FATAL
> > >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > >>> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors
> > >>> are:
> > >>> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> > >>> >
> > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> > >>> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
> > >>> > org.apache.phoenix.coprocessor.ScanRegionObserver,
> > >>> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> > >>> > org.apache.phoenix.hbase.index.Indexer,
> > >>> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> > >>> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> > >>> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> > >>> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> > >>> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > >>> > 2016-02-15 22:04:33,136 INFO
> > >>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > >>> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> > >>> >   "beans" : [ {
> > >>> >     "name" : "java.lang:type=Memory",
> > >>> >     "modelerType" : "sun.management.MemoryImpl",
> > >>> >     "Verbose" : true,
> > >>> >     "HeapMemoryUsage" : {
> > >>> >       "committed" : 15032385536,
> > >>> >       "init" : 15032385536,
> > >>> >       "max" : 15032385536,
> > >>> >       "used" : 3732843792
> > >>> >     },
> > >>> >     "ObjectPendingFinalizationCount" : 0,
> > >>> >     "NonHeapMemoryUsage" : {
> > >>> >       "committed" : 104660992,
> > >>> >       "init" : 2555904,
> > >>> >       "max" : -1,
> > >>> >       "used" : 103018984
> > >>> >     },
> > >>> >     "ObjectName" : "java.lang:type=Memory"
> > >>> >   } ],
> > >>> >   "beans" : [ {
> > >>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> > >>> >     "modelerType" : "RegionServer,sub=IPC",
> > >>> >     "tag.Context" : "regionserver",
> > >>> >     "tag.Hostname" : "hbase-rs9"
> > >>> >   } ],
> > >>> >   "beans" : [ {
> > >>> >     "name" :
> > "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> > >>> >     "modelerType" : "RegionServer,sub=Replication",
> > >>> >     "tag.Context" : "regionserver",
> > >>> >     "tag.Hostname" : "hbase-rs9"
> > >>> >   } ],
> > >>> >   "beans" : [ {
> > >>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
> > >>> >     "modelerType" : "RegionServer,sub=Server",
> > >>> >     "tag.Context" : "regionserver",
> > >>> >     "tag.Hostname" : "hbase-rs9"
> > >>> >   } ]
> > >>> > }
> > >>> > 2016-02-15 22:04:33,194 INFO
> > >>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > >>> > regionserver.HRegionServer: STOPPED: IOE in log roller
> > >>>
> > >>> Looks like we were in the log roller, closing the WAL to put up a new
> > >>> one, but a sync was trying to run at the same time and the stream was
> > >>> pulled out from under it.... NPE... which caused the server abort.
> > >>>
> > >>> ...
> > >>>
> > >>>
> > >>> Any idea what happened? Today, I moved some regions around to balance
> > >>> the cluster, ran a major compaction after that, and added more threads
> > >>> to run small and large compactions; could this be related?
> > >>>
> > >>>
> > >>>
> > >>> Yeah, pass more of the regionserver log from before the NPE... to see
> > >>> what was going on at the time.  I don't think the above changes are
> > >>> related.
> > >>>
> > >>>
> > >>>
> > >>> I see that in the current branch the class *ProtobufLogWriter:176*
> > >>> already
> > >>> contains try..catch
> > >>>
> > >>> try {
> > >>> >   return this.output.getPos();
> > >>> > } catch (NullPointerException npe) {
> > >>> >   // Concurrent close...
> > >>> >   throw new IOException(npe);
> > >>> > }
> > >>>
> > >>>
> > >>> Yeah. Looks like the thought is an NPE could happen here because the
> > >>> stream is being closed out from under us by another thread (the syncers
> > >>> are off on their own, just worried about syncing). This is just to get
> > >>> the NPE up out of the syncer thread and up into the general WAL
> > >>> subsystem.  It looks like this caused the regionserver abort. I would
> > >>> think it should just provoke the WAL roll, not an abort. Put up more RS
> > >>> log please.
> > >>>
> > >>>
> > >>>
> > >>> > But it would be nice to understand the root cause of this error and
> > >>> > if there is any misconfiguration from my side.
> > >>> >
> > >>> >
> > >>> Not your config I'd say (if your config brought it on, it's a bug).
> > >>>
> > >>>
> > >>>
> > >>> > *Version: *HBase 1.1.2
> > >>> >
> > >>> >
> > >>>
> > >>> St.Ack
> > >>>
> > >>> P.S. It looks like a gentleman had a similar issue at the end of last year:
> > >>>
> > >>>
> >
> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
> > >>> No response then.
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> > Thank you
> > >>> > Cheers
> > >>> > Pedro
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>

Re: NPE on ProtobufLogWriter.sync()

Posted by Enis Söztutar <en...@gmail.com>.
> Operation category WRITE is not supported in state standby
This looks to be coming from a NN that is in Standby state (or safe mode?).
Did you check whether the underlying HDFS is healthy? Is this an HA
configuration, and does hbase-site.xml contain the correct NN configuration?
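
For reference, a quick way to see which NN configuration the client side is
actually picking up is to dump the relevant keys from the HBase configuration.
In an HA setup, hbase.rootdir should point at the nameservice (here apparently
hdfs://nameservice-prod/apps/hbase/data, judging from the logs), not at a
single NameNode host. A minimal sketch, assuming the HBase and HDFS client
configs are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Sketch only: print the keys that matter for HA NameNode resolution so they
// can be compared against the cluster's hdfs-site.xml / hbase-site.xml.
public class PrintHaConfig {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    String[] keys = {
        "hbase.rootdir",                 // should use the nameservice URI in HA setups
        "fs.defaultFS",                  // likewise
        "dfs.nameservices",              // e.g. nameservice-prod
        "dfs.ha.namenodes.nameservice-prod",
        "dfs.client.failover.proxy.provider.nameservice-prod"
    };
    for (String key : keys) {
      System.out.println(key + " = " + conf.get(key));
    }
  }
}

If any of those come back null, or point at a single NameNode host instead of
the nameservice, that would explain the client ending up on the standby NN.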

Enis


On Tue, Feb 16, 2016 at 6:51 AM, Pedro Gandola <pe...@gmail.com>
wrote:

> Just lost another RegionServer... The cluster was very stable until
> yesterday; 2 region servers in less than 24 hours suggests something might
> be wrong in my configuration.
>
> Any insights?
>
> Thank you
> Pedro
>
> *(rs log)*
>
> > 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block
> > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
> > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> > java.io.IOException: Bad response ERROR for block
> > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode
> > DatanodeInfoWithStorage[10.5.1.117:50010
> > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > at
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> > 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod]
> > retry.RetryInvocationHandler: Exception while invoking renewLease of
> class
> > ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/
> > 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after
> > sleeping for 1261ms.
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> > Operation category WRITE is not supported in state standby
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992)
> > at
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
> > at
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > at
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:422)
> > at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > at
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
> > at
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590)
> > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:497)
> > at
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > at
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > at com.sun.proxy.$Proxy17.renewLease(Unknown Source)
> > at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:497)
> > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > at com.sun.proxy.$Proxy18.renewLease(Unknown Source)
> > at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892)
> > at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
> > at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
> > at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
> > at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
> > at java.lang.Thread.run(Thread.java:745)
> > 2016-02-16 14:22:37,753 WARN  [DataStreamer for file
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > hdfs.DFSClient: Error Recovery for block
> > BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 in pipeline
> > DatanodeInfoWithStorage[10.5.2.11:50010
> ,DS-92794e46-b126-4aa8-bfd9-a8867cac42e4,DISK],
> > DatanodeInfoWithStorage[10.5.1.117:50010
> ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK],
> > DatanodeInfoWithStorage[10.5.2.57:50010
> ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]:
> > bad datanode DatanodeInfoWithStorage[10.5.1.117:50010
> > ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> > 2016-02-16 14:22:37,769 INFO  [DataStreamer for file
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > retry.RetryInvocationHandler: Exception while invoking
> > getAdditionalDatanode of class ClientNamenodeProtocolTranslatorPB over
> > hbase-ms5.localdomain/10.5.2.248:8020. Trying to fail over immediately.
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> > Operation category READ is not supported in state standby
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3326)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> > at
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> > at
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > at
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:422)
> > at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > at
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> > at
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:497)
> > at
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > at
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:497)
> > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> > at
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> > at
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> > at
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> > at
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> > 2016-02-16 14:22:37,781 WARN  [DataStreamer for file
> >
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> > hdfs.DFSClient: DataStreamer Exception
> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> > No lease on
> >
> /apps/hbase/data/oldWALs/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> > (inode 641732): File is not open for writing. Holder
> > DFSClient_NONMAPREDUCE_-127840820_1 does not have any open files.
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3445)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3347)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> > at
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> > at
> >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > at
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:422)
> > at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> > at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> > at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> > at
> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> > at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> > at
> >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:497)
> > at
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > at
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:497)
> > at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> > at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> > at
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> > at
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> > at
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> > at
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> > 2016-02-16 14:22:38,134 INFO
> >  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> > Opening socket connection to server zookeeper3.localdomain/
> 10.5.2.166:2181.
> > Will not attempt to authenticate using SASL (unknown error)
> > 2016-02-16 14:22:38,137 INFO
> >  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> > Socket connection established to zookeeper3.localdomain/10.5.2.166:2181,
> > initiating session
> > 2016-02-16 14:22:38,144 INFO
> >  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> > Unable to reconnect to ZooKeeper service, session 0x25278d348e90c8d has
> > expired, closing socket connection
> > 2016-02-16 14:22:38,144 FATAL [main-EventThread]
> > regionserver.HRegionServer: ABORTING region server
> > hbase-rs6.localdomain,16020,1455560986600:
> > regionserver:16020-0x25278d348e90c8d,
> >
> quorum=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181,
> > baseZNode=/hbase regionserver:16020-0x25278d348e90c8d received expired
> from
> > ZooKeeper, aborting
> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > KeeperErrorCode = Session expired
> > at
> >
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:606)
> > at
> >
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
> > at
> >
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>
>
> On Tue, Feb 16, 2016 at 9:32 AM, Pedro Gandola <pe...@gmail.com>
> wrote:
>
> > Just missed one of your questions.
> >
> > Were you trying to restart it and it wouldn't come up? It kept doing the
> >> NPE on each restart?
> >> Or it happened once and killed the regionserver?
> >
> >
> > I received an alarm for the RS, and when I went there I saw the cluster
> > not processing data, so I tried to restart the RS. Everything went back to
> > normal, but the RS crashed again, so I decided to remove it completely
> > until I know more about the problem. Now, looking at what happened during
> > the restart, I see that the problem might be something else.
> >
> > *(log.out) *
> >
> >> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748
> >> Killed                  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase
> >> --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1
> >> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or
> >> directory
> >
> >
> > *(log after restart)*
> >
> >> 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> >>
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> >> environment:java.io.tmpdir=/tmp
> >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> >> environment:java.compiler=<NA>
> >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> >> environment:os.name=Linux
> >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> >> environment:os.arch=amd64
> >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> >> environment:os.version=3.10.0-327.3.1.el7.x86_64
> >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> >> environment:user.name=hbase
> >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> >> environment:user.home=/home/hbase
> >> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> >> environment:user.dir=/home/hbase
> >> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> >> client connection,
> >>
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> >> sessionTimeout=30000
> >> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> >> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> >> client connection,
> >>
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> >> sessionTimeout=30000
> >> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> >> 2016-02-15 22:21:23,754 INFO
> >>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> >> Opening socket connection to server zookeeper2.localdomain/
> >> 10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown
> >> error)
> >> 2016-02-15 22:21:23,755 INFO
> >>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> >> Opening socket connection to server zookeeper2.localdomain/
> >> 10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown
> >> error)
> >> 2016-02-15 22:21:23,767 INFO
> >>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> >> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181
> ,
> >> initiating session
> >> 2016-02-15 22:21:23,767 INFO
> >>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> >> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181
> ,
> >> initiating session
> >> 2016-02-15 22:21:23,777 INFO
> >>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> >> Session establishment complete on server zookeeper2.localdomain/
> >> 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout =
> >> 30000
> >> 2016-02-15 22:21:23,777 INFO
> >>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> >> Session establishment complete on server zookeeper2.localdomain/
> >> 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout =
> >> 30000
> >>
> >
> > I don't see any exception during the restart, but for some reason the
> > process got killed, and I don't see any SIGTERM or OOM in the Linux journal.
> >
> > Cheers
> > Pedro
> >
> >
> > On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <pe...@gmail.com>
> > wrote:
> >
> >> Hi St.Ack,
> >>
> >> Thank you for your help. I have attached ~5 min worth of data before the
> >> crash.
> >>
> >> I restarted the cluster to update some configurations; after that I moved
> >> some regions around to balance the cluster and, just to ensure data
> >> locality, I ran a major compaction. After that I connected my app, and 2
> >> hours later this happened. So by the time of the crash I was not
> >> performing any operation on the cluster; it was running normally.
> >>
> >> Thank you,
> >> Cheers
> >> Pedro
> >>
> >> On Tue, Feb 16, 2016 at 2:01 AM, Stack <st...@duboce.net> wrote:
> >>
> >>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <
> pedro.gandola@gmail.com>
> >>> wrote:
> >>>
> >>> > Hi Guys,
> >>> >
> >>> > One of my region servers got into a state where it was unable to
> start
> >>> and
> >>> > the cluster was not receiving traffic for some time:
> >>>
> >>>
> >>>
> >>> Were you trying to restart it and it wouldn't come up? It kept doing
> the
> >>> NPE on each restart?
> >>>
> >>> Or it happened once and killed the regionserver?
> >>>
> >>>
> >>>
> >>> *(master log)*
> >>> >
> >>> > > 2016-02-15 22:04:33,186 ERROR
> >>> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
> >>> > master.MasterRpcServices:
> >>> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a
> >>> fatal
> >>> > > error:
> >>> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134:
> >>> IOE in
> >>> > > log roller
> >>> > > Cause:
> >>> > > java.io.IOException: java.lang.NullPointerException
> >>> > > at
> >>> > >
> >>> >
> >>>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> >>> > > at
> >>> > >
> >>> >
> >>>
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> >>> > > at java.lang.Thread.run(Thread.java:745)
> >>> > > Caused by: java.lang.NullPointerException
> >>> > > at
> >>> > >
> >>> >
> >>>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> >>> > > ... 2 more 2016-02-15 22:05:45,678 WARN
> >>> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> >>> > > master.CatalogJanitor: Failed scan of catalog table
> >>> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
> >>> after
> >>> > > attempts=351, exceptions:
> >>> > > Mon Feb 15 22:05:45 UTC 2016, null,
> java.net.SocketTimeoutException:
> >>> > > callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta'
> >>> at
> >>> > > region=hbase:meta,,1.1588230740,
> >>> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> >>> > >
> >>> >
> >>>
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> >>> > > at
> >>> >
> >>>
> >>> You running with read replicas enabled?
> >>>
> >>>
> >>> ....
> >>>
> >>>
> >>>
> >>> *(regionserver log)*
> >>>
> >>>
> >>> Anything before this log message about rolling the WAL?
> >>>
> >>>
> >>> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing,
> >>> request
> >>> > close of wal
> >>> > java.io.IOException: java.lang.NullPointerException
> >>> > at
> >>> >
> >>>
> >>>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> >>> > at
> >>> >
> >>>
> >>>
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> >>> > at java.lang.Thread.run(Thread.java:745)
> >>> > Caused by: java.lang.NullPointerException
> >>> > at
> >>> >
> >>>
> >>>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> >>> > ... 2 more
> >>> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost:
> 234
> >>> ms,
> >>> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
> >>> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> >>> > DatanodeInfoWithStorage[10.5.2.95:50010
> >>> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> >>> > DatanodeInfoWithStorage[10.5.2.57:50010
> >>> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> >>> > 2016-02-15 22:04:33,043 FATAL
> >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> >>> > regionserver.HRegionServer: ABORTING region server
> >>> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> >>> > java.io.IOException: java.lang.NullPointerException
> >>> > at
> >>> >
> >>>
> >>>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> >>> > at
> >>> >
> >>>
> >>>
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> >>> > at java.lang.Thread.run(Thread.java:745)
> >>> > Caused by: java.lang.NullPointerException
> >>> > at
> >>> >
> >>>
> >>>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> >>> > ... 2 more
> >>> > 2016-02-15 22:04:33,043 FATAL
> >>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> >>> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors
> >>> are:
> >>> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> >>> >
> org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> >>> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
> >>> > org.apache.phoenix.coprocessor.ScanRegionObserver,
> >>> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> >>> > org.apache.phoenix.hbase.index.Indexer,
> >>> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> >>> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> >>> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> >>> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> >>> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> >>> > 2016-02-15 22:04:33,136 INFO
> >>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> >>> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> >>> >   "beans" : [ {
> >>> >     "name" : "java.lang:type=Memory",
> >>> >     "modelerType" : "sun.management.MemoryImpl",
> >>> >     "Verbose" : true,
> >>> >     "HeapMemoryUsage" : {
> >>> >       "committed" : 15032385536,
> >>> >       "init" : 15032385536,
> >>> >       "max" : 15032385536,
> >>> >       "used" : 3732843792
> >>> >     },
> >>> >     "ObjectPendingFinalizationCount" : 0,
> >>> >     "NonHeapMemoryUsage" : {
> >>> >       "committed" : 104660992,
> >>> >       "init" : 2555904,
> >>> >       "max" : -1,
> >>> >       "used" : 103018984
> >>> >     },
> >>> >     "ObjectName" : "java.lang:type=Memory"
> >>> >   } ],
> >>> >   "beans" : [ {
> >>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> >>> >     "modelerType" : "RegionServer,sub=IPC",
> >>> >     "tag.Context" : "regionserver",
> >>> >     "tag.Hostname" : "hbase-rs9"
> >>> >   } ],
> >>> >   "beans" : [ {
> >>> >     "name" :
> "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> >>> >     "modelerType" : "RegionServer,sub=Replication",
> >>> >     "tag.Context" : "regionserver",
> >>> >     "tag.Hostname" : "hbase-rs9"
> >>> >   } ],
> >>> >   "beans" : [ {
> >>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
> >>> >     "modelerType" : "RegionServer,sub=Server",
> >>> >     "tag.Context" : "regionserver",
> >>> >     "tag.Hostname" : "hbase-rs9"
> >>> >   } ]
> >>> > }
> >>> > 2016-02-15 22:04:33,194 INFO
> >>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> >>> > regionserver.HRegionServer: STOPPED: IOE in log roller
> >>>
> >>> Looks like we were in log roller, closing the WAL to put up a new one
> but
> >>> the a sync was trying to run at same time but stream undo under it....
> >>> NPE... which caused server abort.
> >>>
> >>> ...
> >>>
> >>>
> >>> Any idea what happened? Today, I moved some regions around to balance
> the
> >>> cluster and I ran a major compaction after that and I added more
> threads
> >>> to
> >>> run small and large compactions, could this be related?
> >>>
> >>>
> >>>
> >>> Yeah, pass more of the regionserver log from before the NPE... to see
> >>> what
> >>> was going on at the time.  I don't think the above changes related.
> >>>
> >>>
> >>>
> >>> I see that in the current branch the class *ProtobufLogWriter:176*
> >>> already
> >>> contains try..catch
> >>>
> >>> try {
> >>> >   return this.output.getPos();
> >>> > } catch (NullPointerException npe) {
> >>> >   // Concurrent close...
> >>> >   throw new IOException(npe);
> >>> > }
> >>>
> >>>
> >>> Yeah. Looks like the thought is an NPE could happen here because the
> >>> stream
> >>> is being closed out from under us by another thread (the syncers are
> off
> >>> on
> >>> their own just worried about syncing... This is just to get the NPE up
> >>> out
> >>> of the syncer thread and up into the general WAL subsystem.).  It looks
> >>> like this caused the regionserver abort. I would think it should just
> >>> provoke the WAL roll, not an abort. Put up more RS log please.
> >>>
> >>>
> >>>
> >>> > But I would be nice to understand the root cause of this error and if
> >>> there
> >>> > is any misconfiguration from my side.
> >>> >
> >>> >
> >>> Not your config I'd say (if your config brought it on, its a bug).
> >>>
> >>>
> >>>
> >>> > *Version: *HBase 1.1.2
> >>> >
> >>> >
> >>>
> >>> St.Ack
> >>>
> >>> P.S. It looks like a gentleman had similar issue end of last year:
> >>>
> >>>
> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
> >>> No response then.
> >>>
> >>>
> >>>
> >>>
> >>> > Thank you
> >>> > Cheers
> >>> > Pedro
> >>> >
> >>>
> >>
> >>
> >
>

Re: NPE on ProtobufLogWriter.sync()

Posted by Pedro Gandola <pe...@gmail.com>.
Just lost another RegionServer... The cluster was very stable until
yesterday, but with 2 region servers lost in less than 24 hours, something
might be wrong in my configuration.

Any insights?

Thank you
Pedro

*(rs log)*

> 2016-02-16 14:22:37,749 WARN  [ResponseProcessor for block
> BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block
> BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937
> java.io.IOException: Bad response ERROR for block
> BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 from datanode
> DatanodeInfoWithStorage[10.5.1.117:50010
> ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:786)
> 2016-02-16 14:22:37,751 INFO  [LeaseRenewer:hbase@nameservice-prod]
> retry.RetryInvocationHandler: Exception while invoking renewLease of class
> ClientNamenodeProtocolTranslatorPB over hbase-ms5.localdomain/
> 10.5.2.248:8020 after 1 fail over attempts. Trying to fail over after
> sleeping for 1261ms.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category WRITE is not supported in state standby
> at
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renewLease(FSNamesystem.java:4457)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.renewLease(NameNodeRpcServer.java:992)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.renewLease(ClientNamenodeProtocolServerSideTranslatorPB.java:652)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy16.renewLease(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:590)
> at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.renewLease(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> at com.sun.proxy.$Proxy18.renewLease(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:892)
> at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
> at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
> at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
> at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
> at java.lang.Thread.run(Thread.java:745)
> 2016-02-16 14:22:37,753 WARN  [DataStreamer for file
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> hdfs.DFSClient: Error Recovery for block
> BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937 in pipeline
> DatanodeInfoWithStorage[10.5.2.11:50010,DS-92794e46-b126-4aa8-bfd9-a8867cac42e4,DISK],
> DatanodeInfoWithStorage[10.5.1.117:50010,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK],
> DatanodeInfoWithStorage[10.5.2.57:50010,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]:
> bad datanode DatanodeInfoWithStorage[10.5.1.117:50010
> ,DS-f1076321-e61d-4554-98aa-0801d64040ae,DISK]
> 2016-02-16 14:22:37,769 INFO  [DataStreamer for file
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> retry.RetryInvocationHandler: Exception while invoking
> getAdditionalDatanode of class ClientNamenodeProtocolTranslatorPB over
> hbase-ms5.localdomain/10.5.2.248:8020. Trying to fail over immediately.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category READ is not supported in state standby
> at
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1872)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1306)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3326)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> 2016-02-16 14:22:37,781 WARN  [DataStreamer for file
> /apps/hbase/data/WALs/hbase-rs6.localdomain,16020,1455560986600/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> block BP-568427890-10.5.1.235-1453722567252:blk_1074371748_630937]
> hdfs.DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /apps/hbase/data/oldWALs/hbase-rs6.localdomain%2C16020%2C1455560986600.default.1455632424508
> (inode 641732): File is not open for writing. Holder
> DFSClient_NONMAPREDUCE_-127840820_1 does not have any open files.
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3445)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3347)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:759)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> at org.apache.hadoop.ipc.Client.call(Client.java:1427)
> at org.apache.hadoop.ipc.Client.call(Client.java:1358)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:443)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
> at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1010)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1165)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
> 2016-02-16 14:22:38,134 INFO
>  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> Opening socket connection to server zookeeper3.localdomain/10.5.2.166:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2016-02-16 14:22:38,137 INFO
>  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> Socket connection established to zookeeper3.localdomain/10.5.2.166:2181,
> initiating session
> 2016-02-16 14:22:38,144 INFO
>  [main-SendThread(zookeeper3.localdomain:2181)] zookeeper.ClientCnxn:
> Unable to reconnect to ZooKeeper service, session 0x25278d348e90c8d has
> expired, closing socket connection
> 2016-02-16 14:22:38,144 FATAL [main-EventThread]
> regionserver.HRegionServer: ABORTING region server
> hbase-rs6.localdomain,16020,1455560986600:
> regionserver:16020-0x25278d348e90c8d,
> quorum=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181,
> baseZNode=/hbase regionserver:16020-0x25278d348e90c8d received expired from
> ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:606)
> at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
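
The tail of that log is the key part: by the time the client reconnected to
zookeeper3, the session had already expired, and an expired ZooKeeper session
cannot be resumed on the same handle, which is why the regionserver treats it
as fatal and aborts. As a rough illustration of that watcher pattern (plain
ZooKeeper client API only; the class name is invented, and the quorum string
and 30000 ms timeout simply mirror the values visible in the logs):

import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Invented name; a sketch of the standard ZooKeeper watcher pattern only.
public class SessionExpiryWatcher implements Watcher {
  private volatile ZooKeeper zk;

  public void connect() throws IOException {
    // Quorum and session timeout copied from the log above, for illustration.
    this.zk = new ZooKeeper(
        "zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181",
        30000, this);
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getState() == Event.KeeperState.Expired) {
      // The session is gone for good: the only options are to build a brand
      // new ZooKeeper handle or to shut down. HBase aborts the regionserver,
      // which is the FATAL line in the log above.
      System.err.println("ZooKeeper session expired: " + event);
    }
  }
}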


On Tue, Feb 16, 2016 at 9:32 AM, Pedro Gandola <pe...@gmail.com>
wrote:

> Just missed one of your questions.
>
> Were you trying to restart it and it wouldn't come up? It kept doing the
>> NPE on each restart?
>> Or it happened once and killed the regionserver?
>
>
> I received an alarm for the RS and I went there I saw the cluster not
> processing data and I tried to restart the RS. Everything went to normal
> but the RS crashed again then I decided to remove it completely until know
> more about the problem. Now looking into the restarting code I see that the
> problem might be other.
>
> *(log.out) *
>
>> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748
>> Killed                  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase
>> --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1
>> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or
>> directory
>
>
> *(log after restart)*
>
>> 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:java.io.tmpdir=/tmp
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:java.compiler=<NA>
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:os.name=Linux
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:os.arch=amd64
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:os.version=3.10.0-327.3.1.el7.x86_64
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:user.name=hbase
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:user.home=/home/hbase
>> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
>> environment:user.dir=/home/hbase
>> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
>> client connection,
>> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
>> sessionTimeout=30000
>> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
>> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
>> client connection,
>> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
>> sessionTimeout=30000
>> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
>> 2016-02-15 22:21:23,754 INFO
>>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
>> Opening socket connection to server zookeeper2.localdomain/
>> 10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown
>> error)
>> 2016-02-15 22:21:23,755 INFO
>>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
>> Opening socket connection to server zookeeper2.localdomain/
>> 10.5.1.105:2181. Will not attempt to authenticate using SASL (unknown
>> error)
>> 2016-02-15 22:21:23,767 INFO
>>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
>> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
>> initiating session
>> 2016-02-15 22:21:23,767 INFO
>>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
>> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
>> initiating session
>> 2016-02-15 22:21:23,777 INFO
>>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
>> Session establishment complete on server zookeeper2.localdomain/
>> 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout =
>> 30000
>> 2016-02-15 22:21:23,777 INFO
>>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
>> Session establishment complete on server zookeeper2.localdomain/
>> 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout =
>> 30000
>>
>
> I don't see any exception during the restart but for some reason the
> process got killed and I don't see any SIGTERM or OOM on linux journal.
>
> Cheers
> Pedro
>
>
> On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <pe...@gmail.com>
> wrote:
>
>> Hi St.Ack,
>>
>> Thank you for your help. I have attached ~5 min worth of data before the
>> crash.
>>
>> I restarted the cluster to update some configurations after that I moved
>> some regions around to balance the cluster and just to ensure the data
>> locality I ran a major compaction. After that I connected my App and 2
>> hours later  this happened. So by the time of the crash I was not
>> performing any operation over the cluster it was running normally.
>>
>> Thank you,
>> Cheers
>> Pedro
>>
>> On Tue, Feb 16, 2016 at 2:01 AM, Stack <st...@duboce.net> wrote:
>>
>>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <pe...@gmail.com>
>>> wrote:
>>>
>>> > Hi Guys,
>>> >
>>> > One of my region servers got into a state where it was unable to start
>>> and
>>> > the cluster was not receiving traffic for some time:
>>>
>>>
>>>
>>> Were you trying to restart it and it wouldn't come up? It kept doing the
>>> NPE on each restart?
>>>
>>> Or it happened once and killed the regionserver?
>>>
>>>
>>>
>>> *(master log)*
>>> >
>>> > > 2016-02-15 22:04:33,186 ERROR
>>> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
>>> > master.MasterRpcServices:
>>> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a
>>> fatal
>>> > > error:
>>> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134:
>>> IOE in
>>> > > log roller
>>> > > Cause:
>>> > > java.io.IOException: java.lang.NullPointerException
>>> > > at
>>> > >
>>> >
>>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
>>> > > at
>>> > >
>>> >
>>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
>>> > > at java.lang.Thread.run(Thread.java:745)
>>> > > Caused by: java.lang.NullPointerException
>>> > > at
>>> > >
>>> >
>>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
>>> > > ... 2 more 2016-02-15 22:05:45,678 WARN
>>> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
>>> > > master.CatalogJanitor: Failed scan of catalog table
>>> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed
>>> after
>>> > > attempts=351, exceptions:
>>> > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
>>> > > callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta'
>>> at
>>> > > region=hbase:meta,,1.1588230740,
>>> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
>>> > >
>>> >
>>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
>>> > > at
>>> >
>>>
>>> You running with read replicas enabled?
>>>
>>>
>>> ....
>>>
>>>
>>>
>>> *(regionserver log)*
>>>
>>>
>>> Anything before this log message about rolling the WAL?
>>>
>>>
>>> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing,
>>> request
>>> > close of wal
>>> > java.io.IOException: java.lang.NullPointerException
>>> > at
>>> >
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
>>> > at
>>> >
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
>>> > at java.lang.Thread.run(Thread.java:745)
>>> > Caused by: java.lang.NullPointerException
>>> > at
>>> >
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
>>> > ... 2 more
>>> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234
>>> ms,
>>> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
>>> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
>>> > DatanodeInfoWithStorage[10.5.2.95:50010
>>> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
>>> > DatanodeInfoWithStorage[10.5.2.57:50010
>>> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
>>> > 2016-02-15 22:04:33,043 FATAL
>>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>>> > regionserver.HRegionServer: ABORTING region server
>>> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
>>> > java.io.IOException: java.lang.NullPointerException
>>> > at
>>> >
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
>>> > at
>>> >
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
>>> > at java.lang.Thread.run(Thread.java:745)
>>> > Caused by: java.lang.NullPointerException
>>> > at
>>> >
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
>>> > ... 2 more
>>> > 2016-02-15 22:04:33,043 FATAL
>>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>>> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors
>>> are:
>>> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
>>> > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
>>> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
>>> > org.apache.phoenix.coprocessor.ScanRegionObserver,
>>> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
>>> > org.apache.phoenix.hbase.index.Indexer,
>>> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
>>> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
>>> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
>>> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
>>> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
>>> > 2016-02-15 22:04:33,136 INFO
>>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>>> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
>>> >   "beans" : [ {
>>> >     "name" : "java.lang:type=Memory",
>>> >     "modelerType" : "sun.management.MemoryImpl",
>>> >     "Verbose" : true,
>>> >     "HeapMemoryUsage" : {
>>> >       "committed" : 15032385536,
>>> >       "init" : 15032385536,
>>> >       "max" : 15032385536,
>>> >       "used" : 3732843792
>>> >     },
>>> >     "ObjectPendingFinalizationCount" : 0,
>>> >     "NonHeapMemoryUsage" : {
>>> >       "committed" : 104660992,
>>> >       "init" : 2555904,
>>> >       "max" : -1,
>>> >       "used" : 103018984
>>> >     },
>>> >     "ObjectName" : "java.lang:type=Memory"
>>> >   } ],
>>> >   "beans" : [ {
>>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
>>> >     "modelerType" : "RegionServer,sub=IPC",
>>> >     "tag.Context" : "regionserver",
>>> >     "tag.Hostname" : "hbase-rs9"
>>> >   } ],
>>> >   "beans" : [ {
>>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
>>> >     "modelerType" : "RegionServer,sub=Replication",
>>> >     "tag.Context" : "regionserver",
>>> >     "tag.Hostname" : "hbase-rs9"
>>> >   } ],
>>> >   "beans" : [ {
>>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
>>> >     "modelerType" : "RegionServer,sub=Server",
>>> >     "tag.Context" : "regionserver",
>>> >     "tag.Hostname" : "hbase-rs9"
>>> >   } ]
>>> > }
>>> > 2016-02-15 22:04:33,194 INFO
>>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>>> > regionserver.HRegionServer: STOPPED: IOE in log roller
>>>
>>> Looks like we were in log roller, closing the WAL to put up a new one but
>>> the a sync was trying to run at same time but stream undo under it....
>>> NPE... which caused server abort.
>>>
>>> ...
>>>
>>>
>>> Any idea what happened? Today, I moved some regions around to balance the
>>> cluster and I ran a major compaction after that and I added more threads
>>> to
>>> run small and large compactions, could this be related?
>>>
>>>
>>>
>>> Yeah, pass more of the regionserver log from before the NPE... to see
>>> what
>>> was going on at the time.  I don't think the above changes related.
>>>
>>>
>>>
>>> I see that in the current branch the class *ProtobufLogWriter:176*
>>> already
>>> contains try..catch
>>>
>>> try {
>>> >   return this.output.getPos();
>>> > } catch (NullPointerException npe) {
>>> >   // Concurrent close...
>>> >   throw new IOException(npe);
>>> > }
>>>
>>>
>>> Yeah. Looks like the thought is an NPE could happen here because the
>>> stream
>>> is being closed out from under us by another thread (the syncers are off
>>> on
>>> their own just worried about syncing... This is just to get the NPE up
>>> out
>>> of the syncer thread and up into the general WAL subsystem.).  It looks
>>> like this caused the regionserver abort. I would think it should just
>>> provoke the WAL roll, not an abort. Put up more RS log please.
>>>
>>>
>>>
>>> > But I would be nice to understand the root cause of this error and if
>>> there
>>> > is any misconfiguration from my side.
>>> >
>>> >
>>> Not your config I'd say (if your config brought it on, its a bug).
>>>
>>>
>>>
>>> > *Version: *HBase 1.1.2
>>> >
>>> >
>>>
>>> St.Ack
>>>
>>> P.S. It looks like a gentleman had similar issue end of last year:
>>>
>>> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
>>> No response then.
>>>
>>>
>>>
>>>
>>> > Thank you
>>> > Cheers
>>> > Pedro
>>> >
>>>
>>
>>
>

Re: NPE on ProtobufLogWriter.sync()

Posted by Pedro Gandola <pe...@gmail.com>.
Just missed one of your questions.

Were you trying to restart it and it wouldn't come up? It kept doing the
> NPE on each restart?
> Or it happened once and killed the regionserver?


I received an alarm for the RS and, when I went to look, I saw the cluster
was not processing data, so I tried to restart the RS. Everything went back
to normal, but the RS crashed again, so I decided to remove it completely
until I know more about the problem. Now, looking at the restart output, I
see that the problem might be something else.

*(log.out) *

> /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh: line 214: 20748
> Killed                  nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase
> --config "${HBASE_CONF_DIR}" $command "$@" start >> ${HBASE_LOGOUT} 2>&1
> cat: /mnt/var/run/hbase/hbase-hbase-regionserver.pid: No such file or
> directory


*(log after restart)*

> 2016-02-15 22:21:23,692 INFO  [main] zookeeper.ZooKeeper: Client
> environment:java.library.path=:/usr/hdp/2.3.2.0-2950/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.3.2.0-2950/hadoop/lib/native
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:java.io.tmpdir=/tmp
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:java.compiler=<NA>
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:os.name=Linux
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:os.arch=amd64
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:os.version=3.10.0-327.3.1.el7.x86_64
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:user.name=hbase
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:user.home=/home/hbase
> 2016-02-15 22:21:23,693 INFO  [main] zookeeper.ZooKeeper: Client
> environment:user.dir=/home/hbase
> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> client connection,
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> sessionTimeout=30000
> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> 2016-02-15 22:21:23,694 INFO  [main] zookeeper.ZooKeeper: Initiating
> client connection,
> connectString=zookeeper1.localdomain:2181,zookeeper2.localdomain:2181,zookeeper3.localdomain:2181
> sessionTimeout=30000
> watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher@197d671
> 2016-02-15 22:21:23,754 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Opening socket connection to server zookeeper2.localdomain/10.5.1.105:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2016-02-15 22:21:23,755 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Opening socket connection to server zookeeper2.localdomain/10.5.1.105:2181.
> Will not attempt to authenticate using SASL (unknown error)
> 2016-02-15 22:21:23,767 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
> initiating session
> 2016-02-15 22:21:23,767 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Socket connection established to zookeeper2.localdomain/10.5.1.105:2181,
> initiating session
> 2016-02-15 22:21:23,777 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Session establishment complete on server zookeeper2.localdomain/
> 10.5.1.105:2181, sessionid = 0x25278d348e90d5e, negotiated timeout = 30000
> 2016-02-15 22:21:23,777 INFO
>  [main-SendThread(zookeeper2.localdomain:2181)] zookeeper.ClientCnxn:
> Session establishment complete on server zookeeper2.localdomain/
> 10.5.1.105:2181, sessionid = 0x25278d348e90d5f, negotiated timeout = 30000
>

I don't see any exception during the restart, but for some reason the
process got killed, and I don't see any SIGTERM or OOM in the Linux journal.

Cheers
Pedro


On Tue, Feb 16, 2016 at 9:06 AM, Pedro Gandola <pe...@gmail.com>
wrote:

> Hi St.Ack,
>
> Thank you for your help. I have attached ~5 min worth of data before the
> crash.
>
> I restarted the cluster to update some configurations after that I moved
> some regions around to balance the cluster and just to ensure the data
> locality I ran a major compaction. After that I connected my App and 2
> hours later  this happened. So by the time of the crash I was not
> performing any operation over the cluster it was running normally.
>
> Thank you,
> Cheers
> Pedro
>
> On Tue, Feb 16, 2016 at 2:01 AM, Stack <st...@duboce.net> wrote:
>
>> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <pe...@gmail.com>
>> wrote:
>>
>> > Hi Guys,
>> >
>> > One of my region servers got into a state where it was unable to start
>> and
>> > the cluster was not receiving traffic for some time:
>>
>>
>>
>> Were you trying to restart it and it wouldn't come up? It kept doing the
>> NPE on each restart?
>>
>> Or it happened once and killed the regionserver?
>>
>>
>>
>> *(master log)*
>> >
>> > > 2016-02-15 22:04:33,186 ERROR
>> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
>> > master.MasterRpcServices:
>> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a
>> fatal
>> > > error:
>> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE
>> in
>> > > log roller
>> > > Cause:
>> > > java.io.IOException: java.lang.NullPointerException
>> > > at
>> > >
>> >
>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
>> > > at
>> > >
>> >
>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
>> > > at java.lang.Thread.run(Thread.java:745)
>> > > Caused by: java.lang.NullPointerException
>> > > at
>> > >
>> >
>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
>> > > ... 2 more 2016-02-15 22:05:45,678 WARN
>> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
>> > > master.CatalogJanitor: Failed scan of catalog table
>> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
>> > > attempts=351, exceptions:
>> > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
>> > > callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta' at
>> > > region=hbase:meta,,1.1588230740,
>> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
>> > >
>> >
>> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
>> > > at
>> >
>>
>> You running with read replicas enabled?
>>
>>
>> ....
>>
>>
>>
>> *(regionserver log)*
>>
>>
>> Anything before this log message about rolling the WAL?
>>
>>
>> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing,
>> request
>> > close of wal
>> > java.io.IOException: java.lang.NullPointerException
>> > at
>> >
>>
>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
>> > at
>> >
>>
>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
>> > at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.lang.NullPointerException
>> > at
>> >
>>
>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
>> > ... 2 more
>> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234
>> ms,
>> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
>> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
>> > DatanodeInfoWithStorage[10.5.2.95:50010
>> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
>> > DatanodeInfoWithStorage[10.5.2.57:50010
>> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
>> > 2016-02-15 22:04:33,043 FATAL
>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>> > regionserver.HRegionServer: ABORTING region server
>> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
>> > java.io.IOException: java.lang.NullPointerException
>> > at
>> >
>>
>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
>> > at
>> >
>>
>> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
>> > at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.lang.NullPointerException
>> > at
>> >
>>
>> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
>> > ... 2 more
>> > 2016-02-15 22:04:33,043 FATAL
>> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors are:
>> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
>> > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
>> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
>> > org.apache.phoenix.coprocessor.ScanRegionObserver,
>> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
>> > org.apache.phoenix.hbase.index.Indexer,
>> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
>> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
>> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
>> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
>> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
>> > 2016-02-15 22:04:33,136 INFO
>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
>> >   "beans" : [ {
>> >     "name" : "java.lang:type=Memory",
>> >     "modelerType" : "sun.management.MemoryImpl",
>> >     "Verbose" : true,
>> >     "HeapMemoryUsage" : {
>> >       "committed" : 15032385536,
>> >       "init" : 15032385536,
>> >       "max" : 15032385536,
>> >       "used" : 3732843792
>> >     },
>> >     "ObjectPendingFinalizationCount" : 0,
>> >     "NonHeapMemoryUsage" : {
>> >       "committed" : 104660992,
>> >       "init" : 2555904,
>> >       "max" : -1,
>> >       "used" : 103018984
>> >     },
>> >     "ObjectName" : "java.lang:type=Memory"
>> >   } ],
>> >   "beans" : [ {
>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
>> >     "modelerType" : "RegionServer,sub=IPC",
>> >     "tag.Context" : "regionserver",
>> >     "tag.Hostname" : "hbase-rs9"
>> >   } ],
>> >   "beans" : [ {
>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
>> >     "modelerType" : "RegionServer,sub=Replication",
>> >     "tag.Context" : "regionserver",
>> >     "tag.Hostname" : "hbase-rs9"
>> >   } ],
>> >   "beans" : [ {
>> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
>> >     "modelerType" : "RegionServer,sub=Server",
>> >     "tag.Context" : "regionserver",
>> >     "tag.Hostname" : "hbase-rs9"
>> >   } ]
>> > }
>> > 2016-02-15 22:04:33,194 INFO
>> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
>> > regionserver.HRegionServer: STOPPED: IOE in log roller
>>
>> Looks like we were in log roller, closing the WAL to put up a new one but
>> the a sync was trying to run at same time but stream undo under it....
>> NPE... which caused server abort.
>>
>> ...
>>
>>
>> Any idea what happened? Today, I moved some regions around to balance the
>> cluster and I ran a major compaction after that and I added more threads
>> to
>> run small and large compactions, could this be related?
>>
>>
>>
>> Yeah, pass more of the regionserver log from before the NPE... to see what
>> was going on at the time.  I don't think the above changes related.
>>
>>
>>
>> I see that in the current branch the class *ProtobufLogWriter:176* already
>> contains try..catch
>>
>> try {
>> >   return this.output.getPos();
>> > } catch (NullPointerException npe) {
>> >   // Concurrent close...
>> >   throw new IOException(npe);
>> > }
>>
>>
>> Yeah. Looks like the thought is an NPE could happen here because the
>> stream
>> is being closed out from under us by another thread (the syncers are off
>> on
>> their own just worried about syncing... This is just to get the NPE up out
>> of the syncer thread and up into the general WAL subsystem.).  It looks
>> like this caused the regionserver abort. I would think it should just
>> provoke the WAL roll, not an abort. Put up more RS log please.
>>
>>
>>
>> > But I would be nice to understand the root cause of this error and if
>> there
>> > is any misconfiguration from my side.
>> >
>> >
>> Not your config I'd say (if your config brought it on, its a bug).
>>
>>
>>
>> > *Version: *HBase 1.1.2
>> >
>> >
>>
>> St.Ack
>>
>> P.S. It looks like a gentleman had similar issue end of last year:
>>
>> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
>> No response then.
>>
>>
>>
>>
>> > Thank you
>> > Cheers
>> > Pedro
>> >
>>
>
>

Re: NPE on ProtobufLogWriter.sync()

Posted by Pedro Gandola <pe...@gmail.com>.
Hi St.Ack,

Thank you for your help. I have attached ~5 min worth of data before the
crash.

I restarted the cluster to update some configurations; after that I moved
some regions around to balance the cluster, and just to ensure data locality
I ran a major compaction. After that I connected my app, and 2 hours later
this happened. So by the time of the crash I was not performing any
operation on the cluster; it was running normally.

Thank you,
Cheers
Pedro

On Tue, Feb 16, 2016 at 2:01 AM, Stack <st...@duboce.net> wrote:

> On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <pe...@gmail.com>
> wrote:
>
> > Hi Guys,
> >
> > One of my region servers got into a state where it was unable to start
> and
> > the cluster was not receiving traffic for some time:
>
>
>
> Were you trying to restart it and it wouldn't come up? It kept doing the
> NPE on each restart?
>
> Or it happened once and killed the regionserver?
>
>
>
> *(master log)*
> >
> > > 2016-02-15 22:04:33,186 ERROR
> > > [PriorityRpcServer.handler=4,queue=0,port=16000]
> > master.MasterRpcServices:
> > > Region server hbase-rs9.localdomain,16020,1455560991134 reported a
> fatal
> > > error:
> > > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE
> in
> > > log roller
> > > Cause:
> > > java.io.IOException: java.lang.NullPointerException
> > > at
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > > at
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > > at java.lang.Thread.run(Thread.java:745)
> > > Caused by: java.lang.NullPointerException
> > > at
> > >
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > > ... 2 more 2016-02-15 22:05:45,678 WARN
> > > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> > > master.CatalogJanitor: Failed scan of catalog table
> > > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> > > attempts=351, exceptions:
> > > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
> > > callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta' at
> > > region=hbase:meta,,1.1588230740,
> > > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> > >
> >
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > > at
> >
>
> You running with read replicas enabled?
>
>
> ....
>
>
>
> *(regionserver log)*
>
>
> Anything before this log message about rolling the WAL?
>
>
> > 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, request
> > close of wal
> > java.io.IOException: java.lang.NullPointerException
> > at
> >
>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > at
> >
>
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.lang.NullPointerException
> > at
> >
>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > ... 2 more
> > 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234
> ms,
> > current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
> ,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> > DatanodeInfoWithStorage[10.5.2.95:50010
> ,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> > DatanodeInfoWithStorage[10.5.2.57:50010
> > ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> > 2016-02-15 22:04:33,043 FATAL
> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > regionserver.HRegionServer: ABORTING region server
> > hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> > java.io.IOException: java.lang.NullPointerException
> > at
> >
>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > at
> >
>
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.lang.NullPointerException
> > at
> >
>
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > ... 2 more
> > 2016-02-15 22:04:33,043 FATAL
> > [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > regionserver.HRegionServer: RegionServer abort: loaded coprocessors are:
> > [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> > org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> > org.apache.phoenix.coprocessor.SequenceRegionObserver,
> > org.apache.phoenix.coprocessor.ScanRegionObserver,
> > org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> > org.apache.phoenix.hbase.index.Indexer,
> > org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> > org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> > org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> > org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> > org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> > 2016-02-15 22:04:33,136 INFO
> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > regionserver.HRegionServer: Dump of metrics as JSON on abort: {
> >   "beans" : [ {
> >     "name" : "java.lang:type=Memory",
> >     "modelerType" : "sun.management.MemoryImpl",
> >     "Verbose" : true,
> >     "HeapMemoryUsage" : {
> >       "committed" : 15032385536,
> >       "init" : 15032385536,
> >       "max" : 15032385536,
> >       "used" : 3732843792
> >     },
> >     "ObjectPendingFinalizationCount" : 0,
> >     "NonHeapMemoryUsage" : {
> >       "committed" : 104660992,
> >       "init" : 2555904,
> >       "max" : -1,
> >       "used" : 103018984
> >     },
> >     "ObjectName" : "java.lang:type=Memory"
> >   } ],
> >   "beans" : [ {
> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
> >     "modelerType" : "RegionServer,sub=IPC",
> >     "tag.Context" : "regionserver",
> >     "tag.Hostname" : "hbase-rs9"
> >   } ],
> >   "beans" : [ {
> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
> >     "modelerType" : "RegionServer,sub=Replication",
> >     "tag.Context" : "regionserver",
> >     "tag.Hostname" : "hbase-rs9"
> >   } ],
> >   "beans" : [ {
> >     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
> >     "modelerType" : "RegionServer,sub=Server",
> >     "tag.Context" : "regionserver",
> >     "tag.Hostname" : "hbase-rs9"
> >   } ]
> > }
> > 2016-02-15 22:04:33,194 INFO
> >  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> > regionserver.HRegionServer: STOPPED: IOE in log roller
>
> Looks like we were in log roller, closing the WAL to put up a new one but
> the a sync was trying to run at same time but stream undo under it....
> NPE... which caused server abort.
>
> ...
>
>
> Any idea what happened? Today, I moved some regions around to balance the
> cluster and I ran a major compaction after that and I added more threads to
> run small and large compactions, could this be related?
>
>
>
> Yeah, pass more of the regionserver log from before the NPE... to see what
> was going on at the time.  I don't think the above changes related.
>
>
>
> I see that in the current branch the class *ProtobufLogWriter:176* already
> contains try..catch
>
> try {
> >   return this.output.getPos();
> > } catch (NullPointerException npe) {
> >   // Concurrent close...
> >   throw new IOException(npe);
> > }
>
>
> Yeah. Looks like the thought is an NPE could happen here because the stream
> is being closed out from under us by another thread (the syncers are off on
> their own just worried about syncing... This is just to get the NPE up out
> of the syncer thread and up into the general WAL subsystem.).  It looks
> like this caused the regionserver abort. I would think it should just
> provoke the WAL roll, not an abort. Put up more RS log please.
>
>
>
> > But I would be nice to understand the root cause of this error and if
> there
> > is any misconfiguration from my side.
> >
> >
> Not your config I'd say (if your config brought it on, its a bug).
>
>
>
> > *Version: *HBase 1.1.2
> >
> >
>
> St.Ack
>
> P.S. It looks like a gentleman had similar issue end of last year:
>
> http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
> No response then.
>
>
>
>
> > Thank you
> > Cheers
> > Pedro
> >
>

Re: NPE on ProtobufLogWriter.sync()

Posted by Stack <st...@duboce.net>.
On Mon, Feb 15, 2016 at 3:07 PM, Pedro Gandola <pe...@gmail.com>
wrote:

> Hi Guys,
>
> One of my region servers got into a state where it was unable to start and
> the cluster was not receiving traffic for some time:



Were you trying to restart it and it wouldn't come up? It kept doing the
NPE on each restart?

Or it happened once and killed the regionserver?



*(master log)*
>
> > 2016-02-15 22:04:33,186 ERROR
> > [PriorityRpcServer.handler=4,queue=0,port=16000]
> master.MasterRpcServices:
> > Region server hbase-rs9.localdomain,16020,1455560991134 reported a fatal
> > error:
> > ABORTING region server hbase-rs9.localdomain,16020,1455560991134: IOE in
> > log roller
> > Cause:
> > java.io.IOException: java.lang.NullPointerException
> > at
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> > at
> >
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> > at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.lang.NullPointerException
> > at
> >
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> > ... 2 more 2016-02-15 22:05:45,678 WARN
> > [hbase-ms4.localdomain,16000,1455560972387_ChoreService_1]
> > master.CatalogJanitor: Failed scan of catalog table
> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after
> > attempts=351, exceptions:
> > Mon Feb 15 22:05:45 UTC 2016, null, java.net.SocketTimeoutException:
> > callTimeout=60000, callDuration=68222: row '' on table 'hbase:meta' at
> > region=hbase:meta,,1.1588230740,
> > hostname=hbase-rs9.localdomain,16020,1455560991134, seqNum=0 at
> >
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:271)
> > at
>

You running with read replicas enabled?


....



*(regionserver log)*


Anything before this log message about rolling the WAL?


> 2016-02-15 22:04:33,034 ERROR [sync.2] wal.FSHLog: Error syncing, request
> close of wal
> java.io.IOException: java.lang.NullPointerException
> at
>
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
>
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
>
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more
> 2016-02-15 22:04:33,036 INFO  [sync.2] wal.FSHLog: Slow sync cost: 234 ms,
> current pipeline: [DatanodeInfoWithStorage[10.5.2.169:50010
,DS-7b9cfb3b-6c79-4e1b-ac90-19881e568518,DISK],
> DatanodeInfoWithStorage[10.5.2.95:50010
,DS-40db3807-a850-4aeb-8529-d3ea3920e556,DISK],
> DatanodeInfoWithStorage[10.5.2.57:50010
> ,DS-3ddc1541-c052-4cc2-b8a4-65d84fd90cfb,DISK]]
> 2016-02-15 22:04:33,043 FATAL
> [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: ABORTING region server
> hbase-rs9.localdomain,16020,1455560991134: IOE in log roller
> java.io.IOException: java.lang.NullPointerException
> at
>
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:176)
> at
>
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1346)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at
>
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:173)
> ... 2 more
> 2016-02-15 22:04:33,043 FATAL
> [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: RegionServer abort: loaded coprocessors are:
> [org.apache.hadoop.hbase.regionserver.LocalIndexMerger,
> org.apache.hadoop.hbase.regionserver.IndexHalfStoreFileReaderGenerator,
> org.apache.phoenix.coprocessor.SequenceRegionObserver,
> org.apache.phoenix.coprocessor.ScanRegionObserver,
> org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver,
> org.apache.phoenix.hbase.index.Indexer,
> org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver,
> org.apache.hadoop.hbase.regionserver.LocalIndexSplitter,
> org.apache.phoenix.coprocessor.ServerCachingEndpointImpl,
> org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,
> org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
> 2016-02-15 22:04:33,136 INFO
>  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: Dump of metrics as JSON on abort: {
>   "beans" : [ {
>     "name" : "java.lang:type=Memory",
>     "modelerType" : "sun.management.MemoryImpl",
>     "Verbose" : true,
>     "HeapMemoryUsage" : {
>       "committed" : 15032385536,
>       "init" : 15032385536,
>       "max" : 15032385536,
>       "used" : 3732843792
>     },
>     "ObjectPendingFinalizationCount" : 0,
>     "NonHeapMemoryUsage" : {
>       "committed" : 104660992,
>       "init" : 2555904,
>       "max" : -1,
>       "used" : 103018984
>     },
>     "ObjectName" : "java.lang:type=Memory"
>   } ],
>   "beans" : [ {
>     "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
>     "modelerType" : "RegionServer,sub=IPC",
>     "tag.Context" : "regionserver",
>     "tag.Hostname" : "hbase-rs9"
>   } ],
>   "beans" : [ {
>     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
>     "modelerType" : "RegionServer,sub=Replication",
>     "tag.Context" : "regionserver",
>     "tag.Hostname" : "hbase-rs9"
>   } ],
>   "beans" : [ {
>     "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
>     "modelerType" : "RegionServer,sub=Server",
>     "tag.Context" : "regionserver",
>     "tag.Hostname" : "hbase-rs9"
>   } ]
> }
> 2016-02-15 22:04:33,194 INFO
>  [regionserver/hbase-rs9.localdomain/10.5.2.169:16020.logRoller]
> regionserver.HRegionServer: STOPPED: IOE in log roller

Looks like we were in the log roller, closing the WAL to put up a new one,
but a sync was trying to run at the same time and the stream went away out
from under it.... NPE... which caused the server abort.

...


Any idea what happened? Today, I moved some regions around to balance the
cluster and ran a major compaction after that, and I added more threads to
run small and large compactions; could this be related?



Yeah, pass more of the regionserver log from before the NPE... to see what
was going on at the time. I don't think the above changes are related.



I see that in the current branch the class *ProtobufLogWriter:176* already
contains a try..catch:

try {
  return this.output.getPos();
} catch (NullPointerException npe) {
  // Concurrent close...
  throw new IOException(npe);
}


Yeah. Looks like the thought is that an NPE could happen here because the
stream is being closed out from under us by another thread (the syncers are
off on their own, just worried about syncing); the catch is just to get the
NPE up out of the syncer thread and into the general WAL subsystem. It looks
like this is what caused the regionserver abort. I would think it should
just provoke a WAL roll, not an abort. Put up more RS log please.
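
To make that race concrete, here is a minimal stand-in sketch; the class and
field names are invented for illustration and this is not the actual HBase
code. A roller thread clears and closes the stream while a syncer thread is
still using it, and the catch turns the resulting NPE into an IOException so
that the layer above the syncer gets to decide what happens next:

import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.hadoop.fs.FSDataOutputStream;

// Simplified stand-in for the writer/roller/syncer interaction described
// above; names are illustrative, not the real HBase classes.
class RacyWalWriter {
  // The WAL output stream; the roller may clear it while syncers still run.
  private final AtomicReference<FSDataOutputStream> output =
      new AtomicReference<>();

  // Called by a syncer thread. If the roller closed the stream out from
  // under us, the null dereference surfaces as an NPE, which is wrapped in
  // an IOException so the WAL subsystem (not the syncer) decides what to do.
  long sync() throws IOException {
    try {
      FSDataOutputStream out = output.get();
      out.hflush();           // may race with closeForRoll() below
      return out.getPos();
    } catch (NullPointerException npe) {
      throw new IOException("stream closed concurrently", npe);
    }
  }

  // Called by the log-roller thread when putting up a new WAL file.
  void closeForRoll() throws IOException {
    FSDataOutputStream old = output.getAndSet(null);
    if (old != null) {
      old.close();
    }
  }
}

Wrapping the NPE rather than swallowing it keeps the failure visible to the
WAL subsystem; whether that should end in a forced roll or in a full abort is
exactly the question above.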



> But it would be nice to understand the root cause of this error and whether
> there is any misconfiguration on my side.
>
>
Not your config, I'd say (if your config brought it on, it's a bug).



> *Version: *HBase 1.1.2
>
>

St.Ack

P.S. It looks like a gentleman had a similar issue at the end of last year:
http://mail-archives.apache.org/mod_mbox/hbase-user/201510.mbox/%3CCA+nge+rofDTXboLmC0s4BcgWVtyecyJaPZwXp5ZJu-gM4DAVEA@mail.gmail.com%3E
No response then.




> Thank you
> Cheers
> Pedro
>