Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2018/06/03 04:10:00 UTC

[jira] [Updated] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

     [ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-24356:
------------------------------
    Fix Version/s:     (was: 3.0.0)
                   2.4.0

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> ------------------------------------------------------------------
>
>                 Key: SPARK-24356
>                 URL: https://issues.apache.org/jira/browse/SPARK-24356
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.3.0
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>            Priority: Major
>             Fix For: 2.4.0
>
>         Attachments: SPARK-24356.01.patch, dup-file-strings-details.png
>
>
> I recently analyzed a heap dump of a YARN NodeManager that was suffering from high GC pressure due to high object churn. The analysis was done with the JXRay tool ([www.jxray.com|http://www.jxray.com/]), which checks a heap dump for a number of well-known memory issues. One problem it found in this dump is that 19.5% of memory is wasted on duplicate strings. More than half of these duplicates come from {{FileInputStream.path}} and {{File.path}}. All the {{FileInputStream}} objects that JXRay shows are garbage; it looks like they are used for a very short period and then discarded (whether that is a good pattern is a separate question). The {{File}} instances, however, are traceable to the {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> The values of these {{File.path}} and {{FileInputStream.path}} fields look very similar, so I think the {{FileInputStream}}s are created by the {{FileSegmentManagedBuffer}} code. The {{File}} instances, in turn, likely come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}} in this case, I suggest that the above code create the {{File}} from a complete, normalized pathname that has already been interned. Because the pathname is already normalized, the code inside {{java.io.File}} will not modify the string, so it will keep the interned copy and pass it on to {{FileInputStream}}. Essentially, the current line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)), filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) + File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>  
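The suggestion above can be sketched as a small, self-contained helper. This is only an illustration under stated assumptions, not the patch attached to the issue: the class name {{InternedPathSketch}} and the methods {{normalizedInternedPathname}} and {{getFile}} are hypothetical, and because the JDK's internal file system class is not accessible to application code, the {{fileSystem.normalize(...)}} step is approximated here with a regular expression that collapses repeated separators and drops a single trailing separator.

```java
import java.io.File;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InternedPathSketch {
    // Two or more consecutive separators, e.g. "//" on Unix-like systems.
    private static final Pattern REPEATED_SEPARATORS =
        Pattern.compile("(?:" + Pattern.quote(File.separator) + "){2,}");

    // Hypothetical helper: builds the full pathname, normalizes it by hand
    // (an approximation of what java.io.File does internally), and interns it.
    public static String normalizedInternedPathname(
            String localDir, int subDirId, String filename) {
        String pathname = localDir + File.separator
            + String.format("%02x", subDirId) + File.separator + filename;
        // Collapse runs of separators into a single one.
        pathname = REPEATED_SEPARATORS.matcher(pathname)
            .replaceAll(Matcher.quoteReplacement(File.separator));
        // Drop a single trailing separator, but keep a bare root path intact.
        if (pathname.length() > 1 && pathname.endsWith(File.separator)) {
            pathname = pathname.substring(0, pathname.length() - 1);
        }
        // intern() returns the canonical instance for equal strings.
        return pathname.intern();
    }

    // Hypothetical replacement for the nested new File(new File(...), ...) call.
    public static File getFile(String localDir, int subDirId, String filename) {
        return new File(normalizedInternedPathname(localDir, subDirId, filename));
    }

    public static void main(String[] args) {
        String a = normalizedInternedPathname("/data//yarn/", 10, "shuffle_0_0_0.index");
        String b = normalizedInternedPathname("/data/yarn", 10, "shuffle_0_0_0.index");
        System.out.println(a);
        // Equal pathnames share one String instance after interning.
        System.out.println(a == b);
        System.out.println(getFile("/data/yarn", 10, "shuffle_0_0_0.index").getPath());
    }
}
```

Whether {{java.io.File}} then keeps exactly this interned instance in its {{path}} field depends on the JDK's internal normalization returning an already-normalized string unchanged, which is the behavior the description above relies on; the sketch only guarantees that equal pathnames passed to {{new File(...)}} are the same {{String}} object.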


