Posted to common-issues@hadoop.apache.org by "steveloughran (via GitHub)" <gi...@apache.org> on 2023/02/24 14:32:54 UTC

[GitHub] [hadoop] steveloughran commented on a diff in pull request #5391: HADOOP-18629. Hadoop DistCp supports specifying favoredNodes for data copying

steveloughran commented on code in PR #5391:
URL: https://github.com/apache/hadoop/pull/5391#discussion_r1117059919


##########
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSystem.java:
##########
@@ -594,4 +595,49 @@ public void testUpdateRoot() throws Exception {
     assertEquals(srcStatus.getModificationTime(),
         destStatus2.getModificationTime());
   }
+
+  @Test
+  public void testFavoredNodes() throws Exception {
+    final String testRoot = "/testdir";
+    final String testSrc = testRoot + "/" + SRCDAT;
+    final String testDst = testRoot + "/" + DSTDAT;
+
+    String nnUri = FileSystem.getDefaultUri(conf).toString();
+    String rootStr = nnUri + testSrc;
+    String tgtStr = nnUri + testDst;
+
+    FileEntry[] srcFiles = {
+        new FileEntry(SRCDAT, true),
+        new FileEntry(SRCDAT + "/file10", false)
+    };
+
+    DistributedFileSystem fs = (DistributedFileSystem) FileSystem.get(URI.create(nnUri), conf);
+    createFiles(fs, testRoot, srcFiles, -1);
+
+    // sad path
+    assertNotEquals(ToolRunner.run(conf, new DistCp(),

Review Comment:
   the args are the wrong way round for these assertions.
   
   switch to AssertJ assertions and add good descriptions, so if a jenkins run fails we know what went wrong, rather than just what line.
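   For instance, a sketch of the AssertJ shape being asked for (the variable names and description text here are illustrative, not the PR's actual code):

   ```java
   // actual value first, then a description, so a failed jenkins run
   // reports the intent of the check rather than just a line number
   int ret = ToolRunner.run(conf, new DistCp(), args);
   Assertions.assertThat(ret)
       .describedAs("distcp exit code when favoredNodes args are invalid")
       .isNotEqualTo(0);
   ```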



##########
hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm:
##########
@@ -364,6 +364,7 @@ Command Line Options
 | `-direct` | Write directly to destination paths | Useful for avoiding potentially very expensive temporary file rename operations when the destination is an object store |
 | `-useiterator` | Uses single threaded listStatusIterator to build listing | Useful for saving memory at the client side. Using this option will ignore the numListstatusThreads option |
 | `-updateRoot` | Update root directory attributes (eg permissions, ownership ...) | Useful if you need to enforce root directory attributes update when using distcp |
+| `-favoredNodes` | Specify favored nodes (Desired option input format: host1:port1,host2:port2,...) | Useful if you need to specify favored nodes when using distcp |

Review Comment:
   Also: requires the destination to be an HDFS filesystem.



##########
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSystem.java:
##########
@@ -32,6 +32,7 @@
 import java.util.List;
 import java.util.Random;
 
+import org.apache.hadoop.hdfs.server.datanode.DataNode;

Review Comment:
   nit: must go in the right place in the org.apache.hadoop imports.



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -247,6 +276,22 @@ private long copyToFile(Path targetPath, FileSystem targetFS,
         context);
   }
 
+  private InetSocketAddress[] toFavoredNodes(String favoredNodesStr) throws UnknownHostException {
+    List<InetSocketAddress> result = new ArrayList<>();
+    for (String hostAndPort : favoredNodesStr.split(",")) {
+      String[] split = hostAndPort.split(":");

Review Comment:
   log at debug, or maybe even at info.
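   e.g. something like this, using the SLF4J pattern common in Hadoop (the log line itself is illustrative; `hostAndPort` matches the loop variable in the quoted diff):

   ```java
   // record each favored node as it is parsed, for diagnosing bad input
   LOG.debug("Adding favored node {}", hostAndPort);
   ```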



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java:
##########
@@ -239,6 +239,21 @@ public static DistCpOptions parse(String[] args)
       }
     }
 
+    if (command.hasOption(DistCpOptionSwitch.FAVORED_NODES.getSwitch())) {
+      String favoredNodesStr = getVal(command, DistCpOptionSwitch.FAVORED_NODES.getSwitch().trim());
+      if (StringUtils.isEmpty(favoredNodesStr)) {

Review Comment:
   if you pull this out to a `@VisibleForTesting` package-scoped method then unit tests could try to break it with invalid args, e.g.
   * trailing `,`
   * empty string
   * invalid node / valid node with a bad port
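A minimal JDK-only sketch of what such an extracted, testable parser could look like (the class and method names are hypothetical, not the PR's actual code), covering exactly those failure modes:

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

/**
 * Parses "host1:port1,host2:port2,..." into socket addresses.
 * Throws IllegalArgumentException on an empty string, a trailing
 * comma, a missing host or port, or a non-numeric/out-of-range port.
 */
final class FavoredNodesParser {

  static InetSocketAddress[] parse(String favoredNodesStr) {
    if (favoredNodesStr == null || favoredNodesStr.trim().isEmpty()) {
      throw new IllegalArgumentException("favoredNodes must not be empty");
    }
    List<InetSocketAddress> result = new ArrayList<>();
    // split with limit -1 keeps trailing empty tokens, so "h:1," fails loudly
    for (String hostAndPort : favoredNodesStr.split(",", -1)) {
      String[] split = hostAndPort.split(":");
      if (split.length != 2 || split[0].isEmpty()) {
        throw new IllegalArgumentException(
            "Illegal favoredNodes entry: '" + hostAndPort + "'");
      }
      try {
        // InetSocketAddress itself rejects ports outside 0..65535
        result.add(new InetSocketAddress(split[0], Integer.parseInt(split[1])));
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException(
            "Illegal favoredNodes port in: '" + hostAndPort + "'", e);
      }
    }
    return result.toArray(new InetSocketAddress[0]);
  }
}
```

With the parsing isolated like this, each invalid-input case becomes a one-line unit test instead of requiring a full OptionsParser round trip.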



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -247,6 +276,22 @@ private long copyToFile(Path targetPath, FileSystem targetFS,
         context);
   }
 
+  private InetSocketAddress[] toFavoredNodes(String favoredNodesStr) throws UnknownHostException {
+    List<InetSocketAddress> result = new ArrayList<>();
+    for (String hostAndPort : favoredNodesStr.split(",")) {
+      String[] split = hostAndPort.split(":");
+      if (split.length != 2) {
+        throw new IllegalArgumentException("Illegal favoredNodes parameter: " + hostAndPort);

Review Comment:
   prefer `org.apache.hadoop.util.Preconditions` here, e.g.
   ```
   checkArgument(split.length == 2, "Illegal favoredNodes parameter: %s", hostAndPort);
   ```
    
   



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java:
##########
@@ -164,6 +164,8 @@ public final class DistCpOptions {
 
   private final boolean updateRoot;
 
+  private final String favoredNodes;
+

Review Comment:
   nit: add javadocs.



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -247,6 +276,22 @@ private long copyToFile(Path targetPath, FileSystem targetFS,
         context);
   }
 
+  private InetSocketAddress[] toFavoredNodes(String favoredNodesStr) throws UnknownHostException {

Review Comment:
   javadocs to explain what happens, when exceptions are raised.



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -247,6 +276,22 @@ private long copyToFile(Path targetPath, FileSystem targetFS,
         context);
   }
 
+  private InetSocketAddress[] toFavoredNodes(String favoredNodesStr) throws UnknownHostException {
+    List<InetSocketAddress> result = new ArrayList<>();
+    for (String hostAndPort : favoredNodesStr.split(",")) {

Review Comment:
   what happens if an empty string is passed in? it should be an error, right? so add a test



##########
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpOptions.java:
##########
@@ -574,4 +574,15 @@ public void testUpdateRoot() {
         .build();
     Assert.assertTrue(options.shouldUpdateRoot());
   }
+
+  @Test
+  public void testFavoredNodes() {
+    final DistCpOptions options = new DistCpOptions.Builder(
+        Collections.singletonList(
+            new Path("hdfs://localhost:8020/source")),
+        new Path("hdfs://localhost:8020/target/"))
+        .withFavoredNodes("localhost:50010")
+        .build();
+    Assert.assertNotNull(options.getFavoredNodes());

Review Comment:
   prefer AssertJ for new tests, with a message, and ideally verification that the node value went all the way through
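   i.e. something along these lines (description text is illustrative), which also verifies the value rather than just non-nullness:

   ```java
   Assertions.assertThat(options.getFavoredNodes())
       .describedAs("favoredNodes value propagated through DistCpOptions.Builder")
       .isEqualTo("localhost:50010");
   ```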



##########
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java:
##########
@@ -223,9 +233,25 @@ private long copyToFile(Path targetPath, FileSystem targetFS,
       FSDataOutputStream out;
       ChecksumOpt checksumOpt = getChecksumOpt(fileAttributes, sourceChecksum);
       if (!preserveEC || ecPolicy == null) {
-        out = targetFS.create(targetPath, permission,
-            EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE), copyBufferSize,
-            repl, blockSize, context, checksumOpt);
+        if (targetFS instanceof DistributedFileSystem

Review Comment:
   There should be one path if preserving EC or favoredNodes is set, so switch to the HDFS output stream builder, leaving the other path as create().
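   A rough sketch of that single builder-based path, via the `HdfsDataOutputStreamBuilder` returned by `DistributedFileSystem.createFile()`; treat the exact chain of builder calls as an assumption to check against the current API, not working code:

   ```java
   if (targetFS instanceof DistributedFileSystem
       && (ecPolicy != null || favoredNodes != null)) {
     // single builder path covers both EC policy and favored nodes
     HdfsDataOutputStreamBuilder builder =
         ((DistributedFileSystem) targetFS).createFile(targetPath)
             .permission(permission)
             .overwrite(true)
             .bufferSize(copyBufferSize)
             .replication(repl)
             .blockSize(blockSize)
             .progress(context)
             .checksumOpt(checksumOpt);
     if (ecPolicy != null) {
       builder.ecPolicyName(ecPolicy.getName());
     }
     if (favoredNodes != null) {
       builder.favoredNodes(favoredNodes);
     }
     out = builder.build();
   } else {
     // plain path: keep the existing create() call
     out = targetFS.create(targetPath, permission,
         EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE), copyBufferSize,
         repl, blockSize, context, checksumOpt);
   }
   ```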
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: common-issues-help@hadoop.apache.org