You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ratis.apache.org by "Tsz Wo Nicholas Sze (JIRA)" <ji...@apache.org> on 2018/07/09 08:26:00 UTC

[jira] [Commented] (RATIS-259) Ratis waits infinitely with no timout when not able to connect to ozone datanode

    [ https://issues.apache.org/jira/browse/RATIS-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536668#comment-16536668 ] 

Tsz Wo Nicholas Sze commented on RATIS-259:
-------------------------------------------

Currently, the ratis client only retries indefinitely (as described in the Raft paper).  I agree that we should have some better retry policies; see also RATIS-11.

> Ratis waits infinitely with no timout when not able to connect to ozone datanode
> --------------------------------------------------------------------------------
>
>                 Key: RATIS-259
>                 URL: https://issues.apache.org/jira/browse/RATIS-259
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Nilotpal Nandi
>            Assignee: Nanda kumar
>            Priority: Critical
>
> In ozone, the pipeline type is Ratis with replication factor as 1.
> Ran the following command ozone command , it hung and did not complete
> --------------------------------------------------------------------------------------------------------
> {noformat}
> [root@ozone-vm bin]# ./ozone oz -getKey /nnvolume1/buckettest1/passwd ./hello1
> Command Failed : {"httpCode":0,"shortMessage":"get key needs a file path to download to","resource":null,"message":"get key needs a file path to download to","requestID":null,"hostName":null}
> [root@ozone-vm bin]# ./ozone oz -getKey /nnvolume1/buckettest1/passwd -file .
> 2018-06-29 05:09:46,865 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Command Failed : {"httpCode":0,"shortMessage":".exists. Download will overwrite an existing file. Aborting.","resource":null,"message":".exists. Download will overwrite an existing file. Aborting.","requestID":null,"hostName":null}
> [root@ozone-vm bin]# ./ozone oz -getKey /nnvolume1/buckettest1/passwd -file ./hello11
> 2018-06-29 05:10:27,661 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2018-06-29 05:10:28,373 INFO conf.ConfUtils: raft.rpc.type = GRPC (default)
> 2018-06-29 05:10:28,406 INFO conf.ConfUtils: raft.grpc.message.size.max = 33554432 (custom)
> 2018-06-29 05:10:28,424 INFO conf.ConfUtils: raft.client.rpc.retryInterval = 300 ms (default)
> 2018-06-29 05:10:28,428 INFO conf.ConfUtils: raft.client.async.outstanding-requests.max = 100 (default)
> 2018-06-29 05:10:28,428 INFO conf.ConfUtils: raft.client.async.scheduler-threads = 3 (default)
> 2018-06-29 05:10:28,595 INFO conf.ConfUtils: raft.grpc.flow.control.window = 1MB (=1048576) (default)
> 2018-06-29 05:10:28,595 INFO conf.ConfUtils: raft.grpc.message.size.max = 33554432 (custom)
> 2018-06-29 05:10:28,884 INFO conf.ConfUtils: raft.client.rpc.request.timeout = 3000 ms (default){noformat}
>  
> jstack :
> ------------
>  
> {noformat}
> [root@ozone-vm logs]# jstack 4983
> 2018-06-29 05:31:00
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.171-b11 mixed mode):
> "Attach Listener" #19 daemon prio=9 os_prio=0 tid=0x00007fe9fc016000 nid=0x1804 waiting on condition [0x0000000000000000]
>  java.lang.Thread.State: RUNNABLE
> "threadDeathWatcher-3-1" #17 daemon prio=1 os_prio=0 tid=0x00007fea191e5800 nid=0x13af waiting on condition [0x00007fea02905000]
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
>  at java.lang.Thread.sleep(Native Method)
>  at org.apache.ratis.shaded.io.netty.util.ThreadDeathWatcher$Watcher.run(ThreadDeathWatcher.java:152)
>  at org.apache.ratis.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
>  at java.lang.Thread.run(Thread.java:748)
> "grpc-default-worker-ELG-1-2" #16 daemon prio=5 os_prio=0 tid=0x00007fea191d4000 nid=0x13ac runnable [0x00007fea085fc000]
>  java.lang.Thread.State: RUNNABLE
>  at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>  at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>  at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>  at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>  - locked <0x00000000e0c372f0> (a org.apache.ratis.shaded.io.netty.channel.nio.SelectedSelectionKeySet)
>  - locked <0x00000000e0c419c8> (a java.util.Collections$UnmodifiableSet)
>  - locked <0x00000000e0c372a8> (a sun.nio.ch.EPollSelectorImpl)
>  at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>  at org.apache.ratis.shaded.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>  at org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:753)
>  at org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:409)
>  at org.apache.ratis.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
>  at org.apache.ratis.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
>  at java.lang.Thread.run(Thread.java:748)
> "grpc-default-executor-0" #15 daemon prio=5 os_prio=0 tid=0x00007fea191d0800 nid=0x13ab waiting on condition [0x00007fea086fd000]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x00000000e0c55a00> (a java.util.concurrent.SynchronousQueue$TransferStack)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>  at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
>  at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
>  at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> "grpc-default-worker-ELG-1-1" #14 daemon prio=5 os_prio=0 tid=0x00007fea191cd800 nid=0x13aa runnable [0x00007fea087fe000]
>  java.lang.Thread.State: RUNNABLE
>  at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>  at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>  at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>  at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>  - locked <0x00000000e0c36d88> (a org.apache.ratis.shaded.io.netty.channel.nio.SelectedSelectionKeySet)
>  - locked <0x00000000e0c3e868> (a java.util.Collections$UnmodifiableSet)
>  - locked <0x00000000e0c36d40> (a sun.nio.ch.EPollSelectorImpl)
>  at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>  at org.apache.ratis.shaded.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>  at org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:753)
>  at org.apache.ratis.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:409)
>  at org.apache.ratis.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
>  at org.apache.ratis.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
>  at java.lang.Thread.run(Thread.java:748)
> "Service Thread" #7 daemon prio=9 os_prio=0 tid=0x00007fea180bd000 nid=0x139f runnable [0x0000000000000000]
>  java.lang.Thread.State: RUNNABLE
> "C1 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007fea180b8800 nid=0x139e waiting on condition [0x0000000000000000]
>  java.lang.Thread.State: RUNNABLE
> "C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007fea180b6000 nid=0x139d waiting on condition [0x0000000000000000]
>  java.lang.Thread.State: RUNNABLE
> "Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007fea180b4800 nid=0x139c runnable [0x0000000000000000]
>  java.lang.Thread.State: RUNNABLE
> "Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007fea18081000 nid=0x139b in Object.wait() [0x00007fea1dacf000]
>  java.lang.Thread.State: WAITING (on object monitor)
>  at java.lang.Object.wait(Native Method)
>  - waiting on <0x00000000e075b328> (a java.lang.ref.ReferenceQueue$Lock)
>  at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
>  - locked <0x00000000e075b328> (a java.lang.ref.ReferenceQueue$Lock)
>  at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
>  at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:212)
> "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007fea1807c800 nid=0x139a in Object.wait() [0x00007fea1dbd0000]
>  java.lang.Thread.State: WAITING (on object monitor)
>  at java.lang.Object.wait(Native Method)
>  - waiting on <0x00000000e075b4e0> (a java.lang.ref.Reference$Lock)
>  at java.lang.Object.wait(Object.java:502)
>  at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
>  - locked <0x00000000e075b4e0> (a java.lang.ref.Reference$Lock)
>  at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
> "main" #1 prio=5 os_prio=0 tid=0x00007fea18010000 nid=0x1398 waiting on condition [0x00007fea216e3000]
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
>  at java.lang.Thread.sleep(Native Method)
>  at java.lang.Thread.sleep(Thread.java:340)
>  at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
>  at org.apache.ratis.util.TimeDuration.sleep(TimeDuration.java:127)
>  at org.apache.ratis.client.impl.RaftClientImpl.sendRequestWithRetry(RaftClientImpl.java:263)
>  at org.apache.ratis.client.impl.RaftClientImpl.send(RaftClientImpl.java:192)
>  at org.apache.ratis.client.impl.RaftClientImpl.sendReadOnly(RaftClientImpl.java:178)
>  at org.apache.hadoop.hdds.scm.XceiverClientRatis.sendRequest(XceiverClientRatis.java:217)
>  at org.apache.hadoop.hdds.scm.XceiverClientRatis.sendCommand(XceiverClientRatis.java:235)
>  at org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.getKey(ContainerProtocolCalls.java:94)
>  at org.apache.hadoop.ozone.client.io.ChunkGroupInputStream.getFromKsmKeyInfo(ChunkGroupInputStream.java:287)
>  at org.apache.hadoop.ozone.client.rpc.RpcClient.getKey(RpcClient.java:495)
>  at org.apache.hadoop.ozone.client.OzoneBucket.readKey(OzoneBucket.java:255)
>  at org.apache.hadoop.ozone.web.ozShell.keys.GetKeyHandler.execute(GetKeyHandler.java:110)
>  at org.apache.hadoop.ozone.web.ozShell.Shell.dispatch(Shell.java:395)
>  at org.apache.hadoop.ozone.web.ozShell.Shell.run(Shell.java:135)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>  at org.apache.hadoop.ozone.web.ozShell.Shell.main(Shell.java:114)
> "VM Thread" os_prio=0 tid=0x00007fea18074800 nid=0x1399 runnable
> "VM Periodic Task Thread" os_prio=0 tid=0x00007fea180c3000 nid=0x13a0 waiting on condition
> JNI global references: 379
> {noformat}
>  
> datanode log :
> ---------------------
>  
> {noformat}
> 2018-06-29 05:17:46,795 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:9820. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-29 05:17:47,051 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: localhost/127.0.0.1:9820
> 2018-06-29 05:17:53,054 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:9820. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-29 05:17:53,563 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool ID needed, but service not yet registered with NN, trace:
> java.lang.Exception
>  at org.apache.hadoop.hdfs.server.datanode.BPOfferService.getBlockPoolId(BPOfferService.java:213)
>  at org.apache.hadoop.hdfs.server.datanode.BPOfferService.getBlockPoolId(BPOfferService.java:224)
>  at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.getActorInfoMap(BPServiceActor.java:175)
>  at org.apache.hadoop.hdfs.server.datanode.DataNode.getBPServiceActorInfo(DataNode.java:3109)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
>  at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
>  at com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:193)
>  at com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:175)
>  at com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:117)
>  at com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:54)
>  at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
>  at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:83)
>  at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:206)
>  at com.sun.jmx.mbeanserver.MBeanSupport.getAttributes(MBeanSupport.java:213)
>  at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttributes(DefaultMBeanServerInterceptor.java:709)
>  at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttributes(JmxMBeanServer.java:705)
>  at org.apache.hadoop.hdfs.server.common.MetricsLoggerTask.run(MetricsLoggerTask.java:91)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2018-06-29 05:17:53,563 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool ID needed, but service not yet registered with NN, trace:
> java.lang.Exception
>  
>  
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)