Posted to reviews@spark.apache.org by countmdm <gi...@git.apache.org> on 2018/05/29 20:16:57 UTC

[GitHub] spark pull request #21456: [SPARK-24356] [CORE] Duplicate strings in File.pa...

GitHub user countmdm opened a pull request:

    https://github.com/apache/spark/pull/21456

    [SPARK-24356] [CORE] Duplicate strings in File.path managed by FileSegmentManagedBuffer

    This patch eliminates duplicate strings that come from the 'path' field of
    java.io.File objects created by FileSegmentManagedBuffer. That is, we want
    to avoid the situation when multiple File instances for the same pathname
    "foo/bar" are created, each with a separate copy of the "foo/bar" String
    instance. In some scenarios such duplicate strings may waste a lot of memory
    (~ 10% of the heap). To avoid that, we intern the pathname with
    String.intern(), and before that we make sure that it's in a normalized
    form (contains no "//", "///", etc.). Otherwise, the code in java.io.File
    would normalize it later, creating a new "foo/bar" String copy.
    Unfortunately, the normalization code that java.io.File uses internally
    is in the package-private class java.io.FileSystem, so we cannot call it
    here directly.
    
    ## What changes were proposed in this pull request?
    
    Added code to ExternalShuffleBlockResolver.getFile() that normalizes and then interns the pathname string before passing it to the File() constructor.
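
    A minimal, self-contained sketch of this approach, for illustration only (it shows the regex-based variant the patch eventually settled on; the class name, sample paths, and main() driver are illustrative, not the actual Spark sources):

    ```
    import java.io.File;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PathInterningSketch {

      // Collapses runs of separators; Unix-style "/" is assumed for simplicity.
      private static final Pattern MULTIPLE_SEPARATORS = Pattern.compile("/{2,}");

      static String createNormalizedInternedPathname(String dir1, String dir2, String fname) {
        String pathname = dir1 + File.separator + dir2 + File.separator + fname;
        // Normalize first, so that java.io.File won't re-normalize (and re-allocate) the
        // string later; then intern, so that all File objects for the same path share a
        // single String instance.
        Matcher m = MULTIPLE_SEPARATORS.matcher(pathname);
        return m.replaceAll("/").intern();
      }

      public static void main(String[] args) {
        File f1 = new File(createNormalizedInternedPathname("/data//spark", "0a/", "shuffle_1_2_0.data"));
        File f2 = new File(createNormalizedInternedPathname("/data/spark/", "0a", "shuffle_1_2_0.data"));
        // On a typical Unix JDK, both File objects now hold the very same String instance.
        System.out.println(f1.getPath() == f2.getPath());  // expected: true
      }
    }
    ```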
    
    ## How was this patch tested?
    
    Added unit test


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/countmdm/spark misha/spark-24356

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21456.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21456
    
----
commit 0df2607dfe79917f89652ec395b5e66f6c40f17b
Author: Misha Dmitriev <mi...@...>
Date:   2018-05-29T18:56:41Z

    [SPARK-24356] [CORE] Duplicate strings in File.path managed by FileSegmentManagedBuffer
    
    This patch eliminates duplicate strings that come from the 'path' field of
    java.io.File objects created by FileSegmentManagedBuffer. That is, we want
    to avoid the situation when multiple File instances for the same pathname
    "foo/bar" are created, each with a separate copy of the "foo/bar" String
    instance. In some scenarios such duplicate strings may waste a lot of memory
    (~ 10% of the heap). To avoid that, we intern the pathname with
    String.intern(), and before that we make sure that it's in a normalized
    form (contains no "//", "///", etc.). Otherwise, the code in java.io.File
    would normalize it later, creating a new "foo/bar" String copy.
    Unfortunately, the normalization code that java.io.File uses internally
    is in the package-private class java.io.FileSystem, so we cannot call it
    here directly.
    
    Added unit test

----


---



[GitHub] spark pull request #21456: [SPARK-24356] [CORE] Duplicate strings in File.pa...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21456


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by Tagar <gi...@git.apache.org>.
Github user Tagar commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    @srowen can this be backported to the 2.2 and 2.3 branches as well? Thank you @countmdm 


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Ok to test


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Hm, I guess I was supposing that, in the context of a file stream, whatever I/O is done with it is going to be much more significant than string manipulation of its path. Would this really be called, say, a million times in a short period? I ran a quick benchmark to sense-check, and it looks like those million replacements take all of a second of CPU time (this is with a precompiled Pattern). It's minor here, but that's nontrivial extra code to get right, read, and maintain, if that's all it's doing. Yes, if it were an extreme hotspot it could be called for.
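
    For reference, a minimal sketch of that kind of sense-check (not the exact benchmark referred to above; the sample path and iteration count are purely illustrative):

    ```
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PathReplaceBenchmark {
      private static final Pattern MULTIPLE_SEPARATORS = Pattern.compile("/{2,}");

      public static void main(String[] args) {
        String sample = "/data//spark///blockmgr-0123/0a//shuffle_1_2_0.data";
        long checksum = 0;  // keeps the JIT from eliding the work
        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
          Matcher m = MULTIPLE_SEPARATORS.matcher(sample);
          checksum += m.replaceAll("/").length();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("1M replacements: " + elapsedMs + " ms (checksum " + checksum + ")");
      }
    }
    ```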


---



[GitHub] spark pull request #21456: [SPARK-24356] [CORE] Duplicate strings in File.pa...

Posted by countmdm <gi...@git.apache.org>.
Github user countmdm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21456#discussion_r192576365
  
    --- Diff: common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolverSuite.java ---
    @@ -135,4 +136,23 @@ public void jsonSerializationOfExecutorRegistration() throws IOException {
           "\"subDirsPerLocalDir\": 7, \"shuffleManager\": " + "\"" + SORT_MANAGER + "\"}";
         assertEquals(shuffleInfo, mapper.readValue(legacyShuffleJson, ExecutorShuffleInfo.class));
       }
    +
    +  @Test
    +  public void testNormalizeAndInternPathname() {
    +    assertPathsMatch("/foo", "bar", "baz", "/foo/bar/baz");
    +    assertPathsMatch("//foo/", "bar/", "//baz", "/foo/bar/baz");
    +    assertPathsMatch("foo", "bar", "baz///", "foo/bar/baz");
    +    assertPathsMatch("/foo/", "/bar//", "/baz", "/foo/bar/baz");
    +    assertPathsMatch("/", "", "", "/");
    +    assertPathsMatch("/", "/", "/", "/");
    +  }
    +
    +  private void assertPathsMatch(String p1, String p2, String p3, String expectedPathname) {
    +    String normPathname =
    +        ExternalShuffleBlockResolver.createNormalizedInternedPathname(p1, p2, p3);
    +    assertEquals(expectedPathname, normPathname);
    +    File file = new File(normPathname);
    +    String returnedPath = file.getPath();
    +    assertTrue(normPathname == returnedPath);
    --- End diff --
    
    I am not sure when I last saw `assertSame` in tests, so I am not sure how many people understand the difference between it and `assertEquals`. I personally think that `==` is better because it's unambiguous. But if you insist, I have no problem replacing it.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    **[Test build #91362 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91362/testReport)** for PR 21456 at commit [`f49512b`](https://github.com/apache/spark/commit/f49512b61a6d5b73f4d5966c557867ee55be0b5a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by countmdm <gi...@git.apache.org>.
Github user countmdm commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Yes.
    
    On Tue, May 29, 2018 at 1:18 PM, UCB AMPLab <no...@github.com>
    wrote:
    
    > Can one of the admins verify this patch?



---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    ok to test


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    **[Test build #91362 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91362/testReport)** for PR 21456 at commit [`f49512b`](https://github.com/apache/spark/commit/f49512b61a6d5b73f4d5966c557867ee55be0b5a).


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    **[Test build #91321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91321/testReport)** for PR 21456 at commit [`88bb478`](https://github.com/apache/spark/commit/88bb4780d20ad952aa1936f4e78a420d9baf0f2c).


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91321/
    Test PASSed.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    OK, I get it: it's not that different forms of the same path are formed, but that they're not canonical to begin with, and so are discarded by the File internals. And this logic isn't the full canonicalization logic reimplemented - it just eliminates successive "/"?


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    I guess I'm just very surprised if this single line is responsible for 10% of objects on the heap - are you sure? Can we otherwise optimize it elsewhere in the code? I am also not sure why this line would produce different paths for the same canonical path, because that's what's driving the addition of all this normalization logic.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by countmdm <gi...@git.apache.org>.
Github user countmdm commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    I confirm that the -XX:+UseStringDeduplication option is available only with G1 GC, and it is off by default. So if we decide to rely on it, I guess we won't be able to enforce it reliably, especially for applications that just use some library code from Spark (by the way, this issue was found in the YARN NodeManager). Another issue with string deduplication in G1 is that it's not aggressive at all. It's done by a single thread that scans the entire heap when the other threads are lightly loaded. In the case we investigated, the number of duplicate strings was high, yet they were relatively short-lived. That is, they survived long enough to create significant pressure on the GC, but I don't think that timeframe would be enough for the deduplication thread to eliminate them.

    In summary, this kind of targeted, explicit string deduplication is not uncommon at all, and it works really well. Usually you just need to add an .intern() call in a few places in the code. What I had to do here is more involved because of the extra problem with java.io.File.
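
    For context, the GC-level alternative discussed above would be enabled with JDK 8-style flags along these lines (illustrative only: spark.executor.extraJavaOptions is just one possible place to set them, and the flags require G1):

    ```
    # String deduplication is G1-only and off by default; the statistics flag just
    # makes its effect visible in the GC logs.
    spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics
    ```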


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    OK, that's good evidence this is causing a lot of garbage. Although ideally we'd figure out just why there are so many of these stream objects, I can see optimizing this as it is. Yes, I understand the intern issue here, but not the need for normalization. It sounded like the strings were already mostly identical?


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #21456: [SPARK-24356] [CORE] Duplicate strings in File.pa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21456#discussion_r192567034
  
    --- Diff: common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolverSuite.java ---
    @@ -135,4 +136,23 @@ public void jsonSerializationOfExecutorRegistration() throws IOException {
           "\"subDirsPerLocalDir\": 7, \"shuffleManager\": " + "\"" + SORT_MANAGER + "\"}";
         assertEquals(shuffleInfo, mapper.readValue(legacyShuffleJson, ExecutorShuffleInfo.class));
       }
    +
    +  @Test
    +  public void testNormalizeAndInternPathname() {
    +    assertPathsMatch("/foo", "bar", "baz", "/foo/bar/baz");
    +    assertPathsMatch("//foo/", "bar/", "//baz", "/foo/bar/baz");
    +    assertPathsMatch("foo", "bar", "baz///", "foo/bar/baz");
    +    assertPathsMatch("/foo/", "/bar//", "/baz", "/foo/bar/baz");
    +    assertPathsMatch("/", "", "", "/");
    +    assertPathsMatch("/", "/", "/", "/");
    +  }
    +
    +  private void assertPathsMatch(String p1, String p2, String p3, String expectedPathname) {
    +    String normPathname =
    +        ExternalShuffleBlockResolver.createNormalizedInternedPathname(p1, p2, p3);
    +    assertEquals(expectedPathname, normPathname);
    +    File file = new File(normPathname);
    +    String returnedPath = file.getPath();
    +    assertTrue(normPathname == returnedPath);
    --- End diff --
    
    Last nit, should this be `assertSame` if you're asserting they're the same string object exactly?


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    OK by me, except those branches are about to cut maintenance releases. I'd wait for now, I suppose, but I don't feel strongly about it if someone wants to take responsibility for merging it now.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Re: normalization -- if I understand correctly, it's not that you know the normalization definitely *does* change the strings for the heap dump you have. It's just to make sure that your change is effective even if normalization were to change things. In practice, I don't think Spark's usage should lead to any de-normalized paths, but I think it's a good precaution.

    Re: so many objects. I don't think it's that surprising, actually. Imagine a shuffle on a large cluster writing to 10k partitions. The shuffle-read side is going to make a lot of simultaneous requests to the same shuffle-write side task -- all that data lives in the same file, just at different offsets.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Merged to master. Probably OK to backport if anyone feels strongly about it, but let's let 2.3.1 and the other imminent releases go out first.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by countmdm <gi...@git.apache.org>.
Github user countmdm commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    @squito wrt the added code for path normalization: exactly as you say. This is just a precaution in case Spark (or even some code above Spark) ends up generating pathnames that contain repeated slashes.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    I believe this would work, if the goal is simply to squash repeated slashes into one:
    
    ```
    private static final Pattern MULTIPLE_SEPARATORS = Pattern.compile(File.separator + "{2,}");
    private static String createNormalizedInternedPathname(String dir1, String dir2, String fname) {
      Matcher m = MULTIPLE_SEPARATORS.matcher(dir1 + File.separator + dir2 + File.separator + fname);
      return m.replaceAll("/").intern();
    }
    ```


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by countmdm <gi...@git.apache.org>.
Github user countmdm commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Just modified the code to use a regexp and pushed the updates.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by countmdm <gi...@git.apache.org>.
Github user countmdm commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    @srowen yes, I am pretty sure that this code generates all these duplicate objects. I've analyzed a heap dump from a real customer, so I cannot publish the entire jxray report, since it may potentially reveal some sensitive data. But I've just attached a screenshot of the part that covers duplicate strings to https://issues.apache.org/jira/browse/SPARK-24356 - hopefully it will give you a better idea of what's going on here. Note that the biggest part of the overhead (10.5%) comes from FileInputStream.path strings. The FileInputStream objects themselves are unreachable (but they apparently stay in memory long enough - if they were GCed immediately, they wouldn't show up in the heap dump). The sample of paths in FileInputStream.path looks identical to the sample of File.path values. This, plus looking at the code, makes me think that the Spark code in question generates a large number of File objects with a small number of unique paths, and then generates an even larger number of FileInputStream objects with the same paths.
    
    I don't think it can be optimized anywhere else, short of accessing and interning paths directly in the File objects using Java reflection. As to why this produces many different copies of identical Strings - consider the following simple code:
    
        String foo = "foo";
        String bar = "bar";
        String s1 = foo + "/" + bar;
        String s2 = foo + "/" + bar;
        System.out.println(s1.equals(s2));
        System.out.println(s1 == s2);
    
    This will print
        true
        false
    
    That is, s1 and s2 have the same contents, but they are two different objects, taking twice the memory.
    
    In the Spark code in question we have the same basic problem: by concatenating identical initial strings, it keeps generating multiple copies of identical "derived" strings. To remove duplicates among these derived strings, we need to apply String.intern(). However, to make sure that the code in java.io.File will not subsequently replace our interned strings with new copies (which it would do if they were not in normalized form), we need to do this normalization ourselves as well.
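
    A small, self-contained illustration of that last point, assuming Unix path semantics (the class name and the literal path are just examples):

    ```
    import java.io.File;

    public class FileNormalizationDemo {
      public static void main(String[] args) {
        String p = "/tmp//foo".intern();
        File f = new File(p);
        System.out.println(f.getPath());       // prints "/tmp/foo": File normalized the path...
        System.out.println(f.getPath() == p);  // prints false: ...into a brand new, non-interned String
      }
    }
    ```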


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91362/
    Test PASSed.


---



[GitHub] spark pull request #21456: [SPARK-24356] [CORE] Duplicate strings in File.pa...

Posted by countmdm <gi...@git.apache.org>.
Github user countmdm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21456#discussion_r192217237
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java ---
    @@ -272,6 +273,57 @@ void close() {
         }
       }
     
    +  /**
    +   * This method is needed to avoid the situation when multiple File instances for the
    +   * same pathname "foo/bar" are created, each with a separate copy of the "foo/bar" String.
    +   * According to measurements, in some scenarios such duplicate strings may waste a lot
    +   * of memory (~ 10% of the heap). To avoid that, we intern the pathname, and before that
    +   * we make sure that it's in a normalized form (contains no "//", "///" etc.) Otherwise,
    +   * the internal code in java.io.File would normalize it later, creating a new "foo/bar"
    +   * String copy. Unfortunately, we cannot just reuse the normalization code that java.io.File
    +   * uses, since it is in the package-private class java.io.FileSystem.
    +   */
    +  private static String createNormalizedInternedPathname(String dir1, String dir2, String fname) {
    +    String pathname = dir1 + File.separator + dir2 + File.separator + fname;
    +    pathname = normalizePathname(pathname);
    +    return pathname.intern();
    +  }
    +
    +  @VisibleForTesting
    +  static String normalizePathname(String pathname) {
    +    int len = pathname.length();
    +    char prevChar = 0;
    +    for (int i = 0; i < len; i++) {
    +      char c = pathname.charAt(i);
    +      if (c == '/' && prevChar == '/') {
    +        return normalizePathnameRemainder(pathname, i - 1);
    +      }
    +      prevChar = c;
    +    }
    +    if (prevChar == '/') return normalizePathnameRemainder(pathname, len - 1);
    --- End diff --
    
    Will change the code to put the last 'return' on a new line with braces (though I personally think that obvious one-line statements like 'return' and 'continue' could be exempt from this rule).

    Regarding replacing this with a regex: being a performance-oriented guy, I know only too well that regex code can be notoriously slow. Probably it's because some general-purpose machinery is used almost regardless of the complexity of the expression. It looks like this code may be called quite intensively, so it seemed prudent to use custom-made code (largely copied from the java.io.File source) that's guaranteed to run fast.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    **[Test build #91321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91321/testReport)** for PR 21456 at commit [`88bb478`](https://github.com/apache/spark/commit/88bb4780d20ad952aa1936f4e78a420d9baf0f2c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    From my understanding, that option is only available with G1 GC, which is not really a good fit for Spark (I forget the exact details, but it has something to do with humongous allocations, which are common given the large byte buffers typical for Spark).


---



[GitHub] spark pull request #21456: [SPARK-24356] [CORE] Duplicate strings in File.pa...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21456#discussion_r192204172
  
    --- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java ---
    @@ -272,6 +273,57 @@ void close() {
         }
       }
     
    +  /**
    +   * This method is needed to avoid the situation when multiple File instances for the
    +   * same pathname "foo/bar" are created, each with a separate copy of the "foo/bar" String.
    +   * According to measurements, in some scenarios such duplicate strings may waste a lot
    +   * of memory (~ 10% of the heap). To avoid that, we intern the pathname, and before that
    +   * we make sure that it's in a normalized form (contains no "//", "///" etc.) Otherwise,
    +   * the internal code in java.io.File would normalize it later, creating a new "foo/bar"
    +   * String copy. Unfortunately, we cannot just reuse the normalization code that java.io.File
    +   * uses, since it is in the package-private class java.io.FileSystem.
    +   */
    +  private static String createNormalizedInternedPathname(String dir1, String dir2, String fname) {
    +    String pathname = dir1 + File.separator + dir2 + File.separator + fname;
    +    pathname = normalizePathname(pathname);
    +    return pathname.intern();
    +  }
    +
    +  @VisibleForTesting
    +  static String normalizePathname(String pathname) {
    +    int len = pathname.length();
    +    char prevChar = 0;
    +    for (int i = 0; i < len; i++) {
    +      char c = pathname.charAt(i);
    +      if (c == '/' && prevChar == '/') {
    +        return normalizePathnameRemainder(pathname, i - 1);
    +      }
    +      prevChar = c;
    +    }
    +    if (prevChar == '/') return normalizePathnameRemainder(pathname, len - 1);
    --- End diff --
    
    Style note: we put the body of all if/while/for statements on a new line with braces.
    
    What about something like just `path.replaceAll("/{2,}", "/")`? does that cover a lot of the logic here? (Or something more efficient with Pattern and Matcher, but equivalent)


---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by countmdm <gi...@git.apache.org>.
Github user countmdm commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    If we don't do normalization ourselves, we may potentially run into the following: 
    
    path = ...  // Produces "foo//bar"
    path = path.intern();  // Ok, no separate copies of "foo//bar" anymore
    File f = new File(path);
    // Internally, code in File() constructor looks at the given path.
    // If path is not in the normalized form, it normalizes it, producing a new string "foo/bar"
    System.out.println(f.getPath());  // Prints "foo/bar"
    
    Since the code inside java.io.File doesn't do any string interning, we would keep interning the "foo//bar" string, but java.io.File would still generate multiple copies of "foo/bar".



---



[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

Posted by countmdm <gi...@git.apache.org>.
Github user countmdm commented on the issue:

    https://github.com/apache/spark/pull/21456
  
    Ok, if you believe this is not a performance problem here, then it's fine with me. To save us some further back-and-forth on this review, could you please share your Pattern/regex code here? I will then change my patch and re-push it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org