You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/02 14:22:15 UTC

[GitHub] [spark] Ngone51 commented on a change in pull request #28967: [SPARK-32149][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service

Ngone51 commented on a change in pull request #28967:
URL: https://github.com/apache/spark/pull/28967#discussion_r449038268



##########
File path: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExecutorDiskUtils.java
##########
@@ -45,34 +31,16 @@ public static File getFile(String[] localDirs, int subDirsPerLocalDir, String fi
     int hash = JavaUtils.nonNegativeHash(filename);
     String localDir = localDirs[hash % localDirs.length];
     int subDirId = (hash / localDirs.length) % subDirsPerLocalDir;
-    return new File(createNormalizedInternedPathname(
-        localDir, String.format("%02x", subDirId), filename));
-  }
-
-  /**
-   * This method is needed to avoid the situation when multiple File instances for the
-   * same pathname "foo/bar" are created, each with a separate copy of the "foo/bar" String.
-   * According to measurements, in some scenarios such duplicate strings may waste a lot
-   * of memory (~ 10% of the heap). To avoid that, we intern the pathname, and before that
-   * we make sure that it's in a normalized form (contains no "//", "///" etc.) Otherwise,
-   * the internal code in java.io.File would normalize it later, creating a new "foo/bar"
-   * String copy. Unfortunately, we cannot just reuse the normalization code that java.io.File
-   * uses, since it is in the package-private class java.io.FileSystem.
-   *
-   * On Windows, separator "\" is used instead of "/".
-   *
-   * "\\" is a legal character in path name on Unix-like OS, but illegal on Windows.
-   */
-  @VisibleForTesting
-  static String createNormalizedInternedPathname(String dir1, String dir2, String fname) {
-    String pathname = dir1 + File.separator + dir2 + File.separator + fname;
-    Matcher m = MULTIPLE_SEPARATORS.matcher(pathname);
-    pathname = m.replaceAll(Matcher.quoteReplacement(File.separator));
-    // A single trailing slash needs to be taken care of separately
-    if (pathname.length() > 1 && pathname.charAt(pathname.length() - 1) == File.separatorChar) {
-      pathname = pathname.substring(0, pathname.length() - 1);
-    }
-    return pathname.intern();
+    final String notNormalizedPath =
+      localDir + File.separator + String.format("%02x", subDirId) + File.separator + filename;
+    // Interning the normalized path as according to measurements, in some scenarios such
+    // duplicate strings may waste a lot of memory (~ 10% of the heap).
+    // Unfortunately, we cannot just call the normalization code that java.io.File
+    // uses, since it is in the package-private class java.io.FileSystem.
+    // So we are creating a File just to get the normalized path back to intern it.
+    // Finally a new File is built and returned with this interned normalized path.
+    final String normalizedInternedPath = new File(notNormalizedPath).getPath().intern();

Review comment:
       The first constructed File uses the un-interned paths, so it may also hit the issue commented above?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org