Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/02/17 13:20:26 UTC

[GitHub] [ozone] adoroszlai opened a new pull request #1931: HDDS-4834. Replication failure in secure environment

adoroszlai opened a new pull request #1931:
URL: https://github.com/apache/ozone/pull/1931


   ## What changes were proposed in this pull request?
   
   `GrpcReplicationClient` currently tries to read the DN certificate using the wrong filename and without a path.  This change makes it get the certificate from `DNCertificateClient`, which already has the logic to find and load it.
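   
   To illustrate why the lookup failed and why delegating helps, here is a minimal, self-contained sketch. It is not the Ozone implementation: `CertificateSource`, `temporarySource` and the file name `sketch-dn-cert.pem` are hypothetical, with `CertificateSource` only standing in for the role that `DNCertificateClient` plays. It assumes the failing code passed a bare file name, which the JVM resolves against the process working directory.
   
   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.nio.charset.StandardCharsets;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.Paths;

   public class CertLookupSketch {

     /** Hypothetical stand-in for a component that knows where the certificate is stored. */
     interface CertificateSource {
       Path certificateFile();
     }

     /** Failing pattern (assumed): a bare name is resolved against the working directory. */
     static boolean readableByBareName(String fileName) {
       return Files.isReadable(Paths.get(fileName));
     }

     /** Pattern after the fix: the caller asks the owner and never builds a path itself. */
     static boolean readableVia(CertificateSource source) {
       return Files.isReadable(source.certificateFile());
     }

     /** Creates a throwaway directory holding a placeholder file (not a real certificate). */
     static CertificateSource temporarySource(String fileName) {
       try {
         Path dir = Files.createTempDirectory("dn-certs");
         Path file = Files.write(dir.resolve(fileName),
             "placeholder, not a real certificate".getBytes(StandardCharsets.UTF_8));
         return () -> file;
       } catch (IOException e) {
         throw new UncheckedIOException(e);
       }
     }

     public static void main(String[] args) {
       CertificateSource source = temporarySource("sketch-dn-cert.pem");
       System.out.println("readable by bare name: " + readableByBareName("sketch-dn-cert.pem"));
       System.out.println("readable via source:   " + readableVia(source));
     }
   }
   ```
   
   The point of the design is that only one component knows the directory and file name, so callers cannot drift out of sync with it.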
   
   https://issues.apache.org/jira/browse/HDDS-4834
   
   ## How was this patch tested?
   
   Reproduced the bug in `ozonesecure` acceptance test:
   
   ```
   extradn_1   | 2021-02-17 08:41:00,481 [ContainerReplicationThread-0] INFO replication.DownloadAndImportReplicator: Starting replication of container 1 from [c61bf116-5ceb-4ed7-a94e-9a156433e2fe{ip: 192.168.240.5, host: ozonesecure_datanode_1.ozonesecure_default, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, b9d535d9-3a6e-44e6-b212-333f73c73c38{ip: 192.168.240.10, host: ozonesecure_datanode_2.ozonesecure_default, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}]
   extradn_1   | 2021-02-17 08:41:00,486 [ContainerReplicationThread-0] ERROR replication.SimpleContainerDownloader: Container 1 download from datanode c61bf116-5ceb-4ed7-a94e-9a156433e2fe{ip: 192.168.240.5, host: ozonesecure_datanode_1.ozonesecure_default, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0} was unsuccessful. Trying the next datanode
   extradn_1   | java.lang.IllegalArgumentException: File does not contain valid certificates: certificate.crt
   extradn_1   | 	at org.apache.ratis.thirdparty.io.netty.handler.ssl.SslContextBuilder.keyManager(SslContextBuilder.java:345)
   extradn_1   | 	at org.apache.ratis.thirdparty.io.netty.handler.ssl.SslContextBuilder.keyManager(SslContextBuilder.java:294)
   extradn_1   | 	at org.apache.hadoop.ozone.container.replication.GrpcReplicationClient.<init>(GrpcReplicationClient.java:81)
   extradn_1   | 	at org.apache.hadoop.ozone.container.replication.SimpleContainerDownloader.downloadContainer(SimpleContainerDownloader.java:135)
   extradn_1   | 	at org.apache.hadoop.ozone.container.replication.SimpleContainerDownloader.getContainerDataFromReplicas(SimpleContainerDownloader.java:88)
   extradn_1   | 	at org.apache.hadoop.ozone.container.replication.DownloadAndImportReplicator.replicate(DownloadAndImportReplicator.java:110)
   extradn_1   | 	at org.apache.hadoop.ozone.container.replication.MeasuredReplicator.replicate(MeasuredReplicator.java:69)
   extradn_1   | 	at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:139)
   ...
   extradn_1   | 2021-02-17 08:41:00,488 [ContainerReplicationThread-0] ERROR replication.ReplicationSupervisor: Container 1 can't be downloaded from any of the datanodes.
   ```
   
   https://github.com/adoroszlai/hadoop-ozone/runs/1918037139
   
   Verified that replication is successful with the fix:
   
   ```
   extradn_1   | 2021-02-17 11:53:44,371 [ContainerReplicationThread-0] INFO replication.DownloadAndImportReplicator: Starting replication of container 1 from [60529ea2-54c0-4882-95fa-930afa556d98{ip: 172.18.0.2, host: ozonesecure_datanode_2.ozonesecure_default, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, f26be9a7-5084-438a-b4cb-5f875036030c{ip: 172.18.0.3, host: ozonesecure_datanode_1.ozonesecure_default, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}]
   extradn_1   | 2021-02-17 11:53:53,815 [grpc-default-executor-2] INFO replication.GrpcReplicationClient: Container 1 is downloaded to /tmp/container-copy/container-1.tar.gz
   extradn_1   | 2021-02-17 11:53:53,818 [ContainerReplicationThread-0] INFO replication.DownloadAndImportReplicator: Container 1 is downloaded with size 214343220, starting to import.
   extradn_1   | 2021-02-17 11:53:56,222 [ContainerReplicationThread-0] INFO replication.DownloadAndImportReplicator: Container 1 is replicated successfully
   extradn_1   | 2021-02-17 11:53:56,222 [ContainerReplicationThread-0] INFO replication.ReplicationSupervisor: Container 1 is replicated.
   ```
   
   https://github.com/adoroszlai/hadoop-ozone/runs/1918041342


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on a change in pull request #1931: HDDS-4834. Replication failure in secure environment

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on a change in pull request #1931:
URL: https://github.com/apache/ozone/pull/1931#discussion_r577813812



##########
File path: hadoop-ozone/dist/src/main/compose/ozonesecure/docker-compose.yaml
##########
@@ -48,6 +48,18 @@ services:
     environment:
       KERBEROS_KEYTABS: dn HTTP
       OZONE_OPTS:
+  extradn:

Review comment:
       Thanks @elek and @sodonnel for the review.  I wasn't happy about adding `extradn`, either.  Thanks for the idea about scaling down then up.  The only issue is that it's harder to state our expectation about where the container should be located if the original and new datanodes have the same name.






[GitHub] [ozone] elek merged pull request #1931: HDDS-4834. Replication failure in secure environment

Posted by GitBox <gi...@apache.org>.
elek merged pull request #1931:
URL: https://github.com/apache/ozone/pull/1931


   




[GitHub] [ozone] elek commented on a change in pull request #1931: HDDS-4834. Replication failure in secure environment

Posted by GitBox <gi...@apache.org>.
elek commented on a change in pull request #1931:
URL: https://github.com/apache/ozone/pull/1931#discussion_r577614598



##########
File path: hadoop-ozone/dist/src/main/compose/ozonesecure/docker-compose.yaml
##########
@@ -48,6 +48,18 @@ services:
     environment:
       KERBEROS_KEYTABS: dn HTTP
       OZONE_OPTS:
+  extradn:

Review comment:
       Thanks for the fix @adoroszlai 
   
   One question here: do we really need this `extradn`? I have mixed feelings, as this docker-compose environment is used not only for testing but also as an example environment. Having this `extradn` may be confusing for users who would like to see a simple secure cluster.
   
   As far as I see, when you do a `docker-compose scale datanode=2` and `docker-compose scale datanode=3`, the original `/data` volume is recreated, so replication could be tested.
   
   Or, as an alternative (though possibly a more complex approach), we can improve the `ClosedContainerReplication` freon test to use certificates...
   






[GitHub] [ozone] elek commented on a change in pull request #1931: HDDS-4834. Replication failure in secure environment

Posted by GitBox <gi...@apache.org>.
elek commented on a change in pull request #1931:
URL: https://github.com/apache/ozone/pull/1931#discussion_r578246724



##########
File path: hadoop-hdds/tools/src/main/java/org/apache/hadoop/hdds/scm/cli/container/InfoSubcommand.java
##########
@@ -52,6 +54,11 @@
   @Spec
   private CommandSpec spec;
 
+  @CommandLine.Option(names = { "--json" },
+      defaultValue = "false",
+      description = "Format output as JSON")
+  private boolean json;
+

Review comment:
       Let me stop here for a moment and celebrate this flag. I really like it. I think it should be the standard for all debug output: we should use either JSON or, for multi-line output, one JSON object per line everywhere.
   
   It's useful not just for the robot test but also for everyday usage: `jq` is very powerful.
   
   :heart:  :heart: 
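   
   As a rough illustration of the idea (a boolean flag that switches between human-readable and single-line JSON output), here is a minimal sketch using only the JDK. The class `JsonFlagSketch`, the method names, and the field names `containerID` and `state` are hypothetical and are not taken from `InfoSubcommand`; real code would more likely use a JSON library than string concatenation.
   
   ```java
   public class JsonFlagSketch {

     /**
      * Wraps a value in double quotes, escaping backslashes and double quotes.
      * Control characters are not handled; this is only enough for simple values.
      */
     static String quote(String value) {
       return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
     }

     /** Renders one record either as a single JSON line or as plain text. */
     static String render(long containerId, String state, boolean json) {
       if (json) {
         return "{\"containerID\":" + containerId + ",\"state\":" + quote(state) + "}";
       }
       return "Container id: " + containerId + System.lineSeparator() + "State: " + state;
     }

     public static void main(String[] args) {
       // Hypothetical flag handling; the real command declares the option via picocli.
       boolean json = args.length > 0 && "--json".equals(args[0]);
       System.out.println(render(1L, "CLOSED", json));
     }
   }
   ```
   
   With one JSON object per line, a single field can be extracted with a filter such as `jq -r .state` instead of parsing free-form text.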






[GitHub] [ozone] sodonnel commented on pull request #1931: HDDS-4834. Replication failure in secure environment

Posted by GitBox <gi...@apache.org>.
sodonnel commented on pull request #1931:
URL: https://github.com/apache/ozone/pull/1931#issuecomment-780725645


   I think these changes look good, but I am not really familiar with this code area, or these robot tests, so it would be good for @elek to check as well.
   
   +1 from me pending CI and @elek checking too.




[GitHub] [ozone] sodonnel commented on a change in pull request #1931: HDDS-4834. Replication failure in secure environment

Posted by GitBox <gi...@apache.org>.
sodonnel commented on a change in pull request #1931:
URL: https://github.com/apache/ozone/pull/1931#discussion_r577716777



##########
File path: hadoop-ozone/dist/src/main/compose/ozonesecure/docker-compose.yaml
##########
@@ -48,6 +48,18 @@ services:
     environment:
       KERBEROS_KEYTABS: dn HTTP
       OZONE_OPTS:
+  extradn:

Review comment:
       The logic changes look good to me, but I agree with @elek that it would be better to avoid the `extradn` if possible.
   
   One other possible idea - start a cluster with 4 DNs. Create a container and identify one of the DNs which has the container. Then decommission the host and wait for decommission to complete. For decommission to complete, replication must happen, otherwise decommission will hang forever.
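   
   A test built on this idea would need to poll until decommission completes, with a timeout so that a replication failure fails the test instead of hanging it. A minimal sketch of such a polling helper follows; `WaitForSketch` and `waitFor` are hypothetical names, and the condition passed in is a placeholder for whatever check reports that decommission has finished.
   
   ```java
   import java.util.function.BooleanSupplier;

   public class WaitForSketch {

     /**
      * Polls the condition until it holds or the timeout elapses.
      * Returns true if the condition became true, false on timeout or interrupt.
      */
     static boolean waitFor(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
       long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
       while (true) {
         if (condition.getAsBoolean()) {
           return true;
         }
         if (System.nanoTime() - deadline >= 0) {
           return false;
         }
         try {
           Thread.sleep(intervalMillis);
         } catch (InterruptedException e) {
           // Preserve the interrupt status and give up waiting.
           Thread.currentThread().interrupt();
           return false;
         }
       }
     }

     public static void main(String[] args) {
       int[] polls = {0};
       // Placeholder condition: becomes true on the third poll.
       boolean done = waitFor(() -> ++polls[0] >= 3, 2000, 10);
       System.out.println("finished: " + done + " after " + polls[0] + " polls");
     }
   }
   ```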



