Posted to issues@ozone.apache.org by "Bharat Viswanadham (Jira)" <ji...@apache.org> on 2021/04/09 04:27:00 UTC

[jira] [Assigned] (HDDS-5078) NPE during secure SCM initialization with HA code updated to an already existing cluster

     [ https://issues.apache.org/jira/browse/HDDS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharat Viswanadham reassigned HDDS-5078:
----------------------------------------

    Assignee: Bharat Viswanadham

> NPE during secure SCM initialization with HA code updated to an already existing cluster
> ----------------------------------------------------------------------------------------
>
>                 Key: HDDS-5078
>                 URL: https://issues.apache.org/jira/browse/HDDS-5078
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: SCM HA
>            Reporter: István Fajth
>            Assignee: Bharat Viswanadham
>            Priority: Blocker
>
> On a Cloudera Manager managed cluster, SCM is always started with the --init option specified, and this behaviour revealed the following null pointer dereference:
> StorageContainerManager#initializeCertificateClient initializes the scmCertificateClient only if scmStorageConfig#checkPrimarySCMIdInitialized() evaluates to true. It evaluates to true only if the VERSION file contains primaryScmNodeId with a value.
> If you upgrade an existing cluster with a single SCM to this code, the VERSION file does not contain a primaryScmNodeId, so the scmCertificateClient remains null.
> Later the initialization code calls the StorageContainerManager#initializeCAnSecurityProtocol method, which at the end creates the securityProtocolServer. For the constructor call, the rootCACert is obtained by calling the scmCertificateClient#getCACertificate method, but this is a null dereference because scmCertificateClient is null.
> A null scmCertificateClient can cause problems later as well, as it is used multiple times unconditionally.
> After working around this particular problem (by simply letting the code create the scmCertificateClient unconditionally), it turned out that in the StorageContainerManager#initializeCAnSecurityProtocol call the scmCertificateServer and the rootCertificateServer instances also remain uninitialized, which causes problems when an SCM client tries to get the root CA certificate from the SCM.
> To me this suggests that initialization of SCM fails after an upgrade on an old cluster. This was working fine before: --init did not reinitialize anything, and SCM startup still worked.
> If I change the Cloudera Manager behaviour so that it does not init the SCM when I start it, I still get the same NPE from the SCM as with --init.
> The exception I get in the SCM log is as follows; the command I issue is a recommission of a DN that was decommissioned before the upgrade.
> {code}
> java.lang.NullPointerException
> 	at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMGetCertResponseProto$Builder.setX509RootCACertificate(SCMSecurityProtocolProtos.java:9026)
> 	at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.getCACertificate(SCMSecurityProtocolServerSideTranslatorPB.java:257)
> 	at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.processRequest(SCMSecurityProtocolServerSideTranslatorPB.java:104)
> 	at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> 	at org.apache.hadoop.hdds.scm.protocol.SCMSecurityProtocolServerSideTranslatorPB.submitRequest(SCMSecurityProtocolServerSideTranslatorPB.java:89)
> 	at org.apache.hadoop.hdds.protocol.proto.SCMSecurityProtocolProtos$SCMSecurityProtocolService$2.callBlockingMethod(SCMSecurityProtocolProtos.java:10537)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:986)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:914)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2887)
> {code}
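
The failure pattern described in the report can be sketched in isolation. The following is a minimal, hypothetical Java model and not the actual Ozone code: ScmInitSketch, CertClient and the helper methods are invented for illustration, and reading the VERSION file as java.util.Properties is an assumption. Only the names primaryScmNodeId and getCACertificate are taken from the report.

```java
import java.util.Optional;
import java.util.Properties;

// Hypothetical model of the reported initialization path; not Ozone source code.
final class ScmInitSketch {

    // Stand-in for the certificate client (assumed shape, single method only).
    interface CertClient {
        String getCACertificate();
    }

    // Mirrors the reported condition: the client is created only when the
    // VERSION file carries a non-empty primaryScmNodeId.
    static CertClient initializeCertificateClient(Properties versionFile) {
        String primary = versionFile.getProperty("primaryScmNodeId");
        if (primary != null && !primary.isEmpty()) {
            return () -> "root-ca-for-" + primary; // placeholder certificate text
        }
        return null; // upgraded single-SCM cluster: the client stays null
    }

    // Unconditional use, as described in the report: throws
    // NullPointerException when the client was never initialized.
    static String rootCaUnguarded(CertClient client) {
        return client.getCACertificate();
    }

    // One possible defensive variant (an assumption, not the project's fix):
    // surface the missing client as an empty Optional instead of an NPE.
    static Optional<String> rootCaGuarded(CertClient client) {
        return Optional.ofNullable(client).map(CertClient::getCACertificate);
    }

    public static void main(String[] args) {
        Properties upgraded = new Properties(); // no primaryScmNodeId entry
        CertClient client = initializeCertificateClient(upgraded);
        System.out.println("guarded lookup present: " + rootCaGuarded(client).isPresent());
        try {
            rootCaUnguarded(client);
        } catch (NullPointerException e) {
            System.out.println("unguarded lookup failed with NullPointerException");
        }
    }
}
```

The guarded variant only makes the missing state visible; per the report, the certificate servers also stay uninitialized on this path, so a real fix would need to address initialization itself rather than just the dereference.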



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
