Posted to issues@ozone.apache.org by "Attila Doroszlai (Jira)" <ji...@apache.org> on 2023/06/27 06:38:00 UTC

[jira] [Comment Edited] (HDDS-8934) SCM throws SingleThreadExecutor exceptions and then shuts down

    [ https://issues.apache.org/jira/browse/HDDS-8934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737471#comment-17737471 ] 

Attila Doroszlai edited comment on HDDS-8934 at 6/27/23 6:37 AM:
-----------------------------------------------------------------

HDDS-6900 fixed {{UndeclaredThrowableException}} for {{TimeoutException}}.  Here, {{ExecutionException}} is thrown, which is also undeclared. The wrapping in {{UndeclaredThrowableException}} can only happen because the call is proxied through {{SCMHAInvocationHandler}}.
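
To illustrate the wrapping, here is a minimal standalone sketch using only the JDK. The names {{StateManager}}, {{updateState}} and {{UndeclaredThrowableDemo}} are hypothetical stand-ins, not Ozone classes; the handler simply throws a checked {{ExecutionException}} that the proxied interface method does not declare.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.ExecutionException;

class UndeclaredThrowableDemo {

  /** Hypothetical proxied interface; the method declares no checked exceptions. */
  interface StateManager {
    void updateState(String id);
  }

  /** Builds a proxy whose handler always throws a checked ExecutionException. */
  static StateManager newProxy() {
    InvocationHandler handler = (proxy, method, args) -> {
      // Simulates a failed future, similar in shape to the stack trace in this issue.
      throw new ExecutionException(new IllegalStateException("server is CLOSING"));
    };
    return (StateManager) Proxy.newProxyInstance(
        StateManager.class.getClassLoader(),
        new Class<?>[] {StateManager.class},
        handler);
  }

  /** Calls the proxy and returns the exception that reaches the caller, if any. */
  static Throwable callAndCapture() {
    try {
      newProxy().updateState("pipeline-1");
      return null;
    } catch (RuntimeException e) {
      // The caller only sees an unchecked wrapper, not the ExecutionException itself.
      return e;
    }
  }

  public static void main(String[] args) {
    Throwable t = callAndCapture();
    System.out.println(t.getClass().getName());
    System.out.println(t.getCause().getClass().getName());
  }
}
```

The caller receives {{java.lang.reflect.UndeclaredThrowableException}} with the {{ExecutionException}} as its cause, which matches the shape of the stack trace in the description.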

Other possible exceptions:
 * {{Exception}} from {{submitScmCertsToRatis}} and {{response.getException}}
 * {{InterruptedException}} from {{submitRequest}}

{code:title=https://github.com/apache/ozone/blob/ecc7d5f17c504ee2df41a6f64cf60b4474505ad0/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMHAInvocationHandler.java#L93-L124}
  private Object invokeRatis(Method method, Object[] args)
      throws Exception {
    LOG.trace("Invoking method {} on target {}", method, ratisHandler);
    // TODO: Add metric here to track time taken by Ratis
    Preconditions.checkNotNull(ratisHandler);
    SCMRatisRequest scmRatisRequest = SCMRatisRequest.of(requestType,
        method.getName(), method.getParameterTypes(), args);

    // Scm Cert DB updates should use RaftClient.
    // As rootCA which is primary SCM only can issues certificates to sub-CA.
    // In case primary is not leader SCM, still sub-ca cert DB updates should go
    // via ratis. So, in this special scenario we use RaftClient.
    final SCMRatisResponse response;
    if (method.getName().equals("storeValidCertificate") &&
        args[args.length - 1].equals(HddsProtos.NodeType.SCM)) {
      response =
          HASecurityUtils.submitScmCertsToRatis(
              ratisHandler.getDivision().getGroup(),
              ratisHandler.getGrpcTlsConfig(),
              scmRatisRequest.encode());

    } else {
      response = ratisHandler.submitRequest(
          scmRatisRequest);
    }

    if (response.isSuccess()) {
      return response.getResult();
    }
    // Should we unwrap and throw proper exception from here?
    throw response.getException();
  }
{code}

I think {{Exception}} should be declared, unless we want to convert to some specific exception (e.g. {{SCMException}}).
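
A rough sketch of the two options, again with hypothetical names and only JDK classes: either the proxied method declares {{Exception}}, in which case the proxy rethrows the handler's exception unchanged, or the handler converts it to one specific declared type. {{IOException}} is used below purely as a stand-in for that specific type; this is not the actual Ozone change.

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.ExecutionException;

class DeclaredExceptionDemo {

  /** Hypothetical interface, option 1: the method declares Exception. */
  interface DeclaresException {
    void updateState(String id) throws Exception;
  }

  /** Hypothetical interface, option 2: the method declares one specific type. */
  interface DeclaresSpecific {
    void updateState(String id) throws IOException;
  }

  /** Simulates a failed future: always throws a checked ExecutionException. */
  static Object failingCall() throws ExecutionException {
    throw new ExecutionException(new IllegalStateException("server is CLOSING"));
  }

  /** Option 1: the exception is declared, so the proxy passes it through as-is. */
  static Throwable callDeclared() {
    InvocationHandler handler = (proxy, method, args) -> failingCall();
    DeclaresException target = (DeclaresException) Proxy.newProxyInstance(
        DeclaresException.class.getClassLoader(),
        new Class<?>[] {DeclaresException.class},
        handler);
    try {
      target.updateState("pipeline-1");
      return null;
    } catch (Exception e) {
      return e;
    }
  }

  /** Option 2: the handler unwraps and converts to the declared, specific type. */
  static Throwable callConverted() {
    InvocationHandler handler = (proxy, method, args) -> {
      try {
        return failingCall();
      } catch (ExecutionException e) {
        // Keep the root cause so the original failure stays visible.
        throw new IOException("Ratis request failed", e.getCause());
      }
    };
    DeclaresSpecific target = (DeclaresSpecific) Proxy.newProxyInstance(
        DeclaresSpecific.class.getClassLoader(),
        new Class<?>[] {DeclaresSpecific.class},
        handler);
    try {
      target.updateState("pipeline-1");
      return null;
    } catch (IOException e) {
      return e;
    }
  }

  public static void main(String[] args) {
    System.out.println(callDeclared().getClass().getName());
    System.out.println(callConverted().getClass().getName());
  }
}
```

In both variants the caller gets a checked exception it can handle, and nothing is wrapped in {{UndeclaredThrowableException}}. The first is the smaller change, the second gives callers a narrower type to catch.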


was (Author: adoroszlai):
HDDS-6900 fixed {{UndeclaredThrowableException}} for {{TimeoutException}}.  Here, {{ExecutionException}} is thrown, which is also undeclared. The wrapping in {{UndeclaredThrowableException}} can only happen because the call is proxied through {{SCMHAInvocationHandler}}.

Other possible exceptions:
 * {{Exception}} from {{submitScmCertsToRatis}} and {{response.getException}}
 * {{InterruptedException}} from {{submitRequest}}

{code:title=https://github.com/apache/ozone/blob/ecc7d5f17c504ee2df41a6f64cf60b4474505ad0/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMHAInvocationHandler.java#L93-L124}
  private Object invokeRatis(Method method, Object[] args)
      throws Exception {
    LOG.trace("Invoking method {} on target {}", method, ratisHandler);
    // TODO: Add metric here to track time taken by Ratis
    Preconditions.checkNotNull(ratisHandler);
    SCMRatisRequest scmRatisRequest = SCMRatisRequest.of(requestType,
        method.getName(), method.getParameterTypes(), args);

    // Scm Cert DB updates should use RaftClient.
    // As rootCA which is primary SCM only can issues certificates to sub-CA.
    // In case primary is not leader SCM, still sub-ca cert DB updates should go
    // via ratis. So, in this special scenario we use RaftClient.
    final SCMRatisResponse response;
    if (method.getName().equals("storeValidCertificate") &&
        args[args.length - 1].equals(HddsProtos.NodeType.SCM)) {
      response =
          HASecurityUtils.submitScmCertsToRatis(
              ratisHandler.getDivision().getGroup(),
              ratisHandler.getGrpcTlsConfig(),
              scmRatisRequest.encode());

    } else {
      response = ratisHandler.submitRequest(
          scmRatisRequest);
    }

    if (response.isSuccess()) {
      return response.getResult();
    }
    // Should we unwrap and throw proper exception from here?
    throw response.getException();
  }
{code}

I think {{Exception}} should be declared, unless we want to convert to some specific exception (e.g. {{IOException}}).

> SCM throws SingleThreadExecutor exceptions and then shuts down
> --------------------------------------------------------------
>
>                 Key: HDDS-8934
>                 URL: https://issues.apache.org/jira/browse/HDDS-8934
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM
>            Reporter: Pratyush Bhatt
>            Priority: Major
>
> SCM throws org.apache.hadoop.hdds.server.events.SingleThreadExecutor exception
> {noformat}
> 2023-06-25 10:37:34,775 [IPC Server listener on 9860] INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 9860
> 2023-06-25 10:37:34,775 [70d60297-a634-4bc4-99c0-0a8092d2c5d0@group-97E573E5D3A6-StateMachineUpdater] INFO org.apache.hadoop.hdds.scm.server.StorageContainerManager: Stopping Storage Container Manager HTTP server.
> 2023-06-25 10:37:34,776 [IPC Server Responder] INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
> 2023-06-25 10:37:34,776 [EventQueue-DeadNodeForDeadNodeHandler] ERROR org.apache.hadoop.hdds.server.events.SingleThreadExecutor: Error on execution message dd5867e2-90ba-44bf-b58e-1163e6fd8058(ozn-lease112-5.ozn-lease112.root.hwx.site/172.27.130.138)
> java.lang.reflect.UndeclaredThrowableException
>         at com.sun.proxy.$Proxy15.updatePipelineState(Unknown Source)
>         at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.closePipeline(PipelineManagerImpl.java:440)
>         at org.apache.hadoop.hdds.scm.node.DeadNodeHandler.lambda$null$0(DeadNodeHandler.java:126)
>         at java.lang.Iterable.forEach(Iterable.java:75)
>         at org.apache.hadoop.hdds.scm.node.DeadNodeHandler.lambda$destroyPipelines$1(DeadNodeHandler.java:124)
>         at java.util.Optional.ifPresent(Optional.java:159)
>         at org.apache.hadoop.hdds.scm.node.DeadNodeHandler.destroyPipelines(DeadNodeHandler.java:123)
>         at org.apache.hadoop.hdds.scm.node.DeadNodeHandler.onMessage(DeadNodeHandler.java:84)
>         at org.apache.hadoop.hdds.scm.node.DeadNodeHandler.onMessage(DeadNodeHandler.java:50)
>         at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:85)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.ExecutionException: org.apache.ratis.protocol.exceptions.ServerNotReadyException: 70d60297-a634-4bc4-99c0-0a8092d2c5d0@group-97E573E5D3A6 is not in [RUNNING]: current state is CLOSING
>         at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>         at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
>         at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:229)
>         at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:115)
>         at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:71)
>         ... 13 more
> Caused by: org.apache.ratis.protocol.exceptions.ServerNotReadyException: 70d60297-a634-4bc4-99c0-0a8092d2c5d0@group-97E573E5D3A6 is not in [RUNNING]: current state is CLOSING
>         at org.apache.ratis.server.impl.RaftServerImpl.lambda$assertLifeCycleState$9(RaftServerImpl.java:749)
>         at org.apache.ratis.util.LifeCycle.assertCurrentState(LifeCycle.java:253)
>         at org.apache.ratis.server.impl.RaftServerImpl.assertLifeCycleState(RaftServerImpl.java:748)
>         at org.apache.ratis.server.impl.RaftServerImpl.submitClientRequestAsync(RaftServerImpl.java:838)
>         at org.apache.ratis.server.impl.RaftServerImpl.lambda$null$12(RaftServerImpl.java:831)
>         at org.apache.ratis.util.JavaUtils.callAsUnchecked(JavaUtils.java:117)
>         at org.apache.ratis.server.impl.RaftServerImpl.lambda$executeSubmitClientRequestAsync$13(RaftServerImpl.java:831)
>         at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>         ... 3 more{noformat}
> Then shuts down:
> {noformat}
> 2023-06-25 10:37:34,807 [70d60297-a634-4bc4-99c0-0a8092d2c5d0@group-97E573E5D3A6-StateMachineUpdater] INFO SCMHATransactionMonitor: SCMHATransactionMonitor Service is not running, skip stop.
> 2023-06-25 10:37:34,807 [Lease Manager-LeaseManager#LeaseMonitor] WARN org.apache.hadoop.ozone.lease.LeaseManager: Lease manager is interrupted. Shutting down...
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>         at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
>         at org.apache.hadoop.ozone.lease.LeaseManager$LeaseMonitor.run(LeaseManager.java:270)
>         at java.lang.Thread.run(Thread.java:748)
> 2023-06-25 10:37:34,808 [70d60297-a634-4bc4-99c0-0a8092d2c5d0@group-97E573E5D3A6-StateMachineUpdater] INFO org.apache.hadoop.hdds.scm.server.StorageContainerManager: Stopping SCM MetadataStore.
> 2023-06-25 10:37:34,852 [70d60297-a634-4bc4-99c0-0a8092d2c5d0@group-97E573E5D3A6-StateMachineUpdater] INFO org.apache.hadoop.hdds.scm.server.StorageContainerManager: Terminating with exit status 0: scm statemachine is closed by ratis, terminate SCM
> 2023-06-25 10:37:34,855 [shutdown-hook-0] INFO org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down StorageContainerManager at ozn-lease112-2.ozn-lease112.root.hwx.site/172.27.214.76
> ************************************************************/
> 2023-06-25 10:37:34,855 [shutdown-hook-0] INFO org.apache.hadoop.hdds.scm.server.StorageContainerManager: Storage Container Manager is not running.
> 2023-06-25 10:37:34,855 [shutdown-hook-0] INFO org.apache.hadoop.hdds.scm.server.StorageContainerManager: Stopping Replication Manager Service.
> 2023-06-25 10:37:34,855 [shutdown-hook-0] INFO org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Replication Monitor Thread is not running.
> 2023-06-25 10:37:34,855 [shutdown-hook-0] INFO org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer: Join RPC server for SCMSecurityProtocolServer.
> 2023-06-25 10:37:34,855 [shutdown-hook-0] INFO org.apache.hadoop.hdds.scm.server.SCMSecurityProtocolServer: Join gRPC server for SCMSecurityProtocolServer.{noformat}


