You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "andrzej.stankevich@gmail.com (JIRA)" <ji...@apache.org> on 2018/08/09 01:39:00 UTC

[jira] [Created] (SPARK-25062) Clean up BlockLocations in FileStatus objects

andrzej.stankevich@gmail.com created SPARK-25062:
----------------------------------------------------

             Summary: Clean up BlockLocations in FileStatus objects
                 Key: SPARK-25062
                 URL: https://issues.apache.org/jira/browse/SPARK-25062
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.2
            Reporter: andrzej.stankevich@gmail.com


When Spark lists collection of files it does it on a driver or creates tasks to list files depending on number of files. here [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170]

If spark creates tasks to list files each task creates one FileStatus object per file. Before sending  FileStatus to a driver Spark converts FileStatus to SerializableFileStatus. On driver side Spark turns SerializableFileStatus back to FileStatus and it also creates BlockLocation object for each FileStatus using 

 
{code:java}
new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) 
{code}
 

After deserialization on a driver side BlockLocation doesn't have a lot of information that original HDFSBlockLocation had.

 

If Spark does listing on a driver side FileStatus object has HSDFBlockLocation objects and they have a lot of info that Spark doesn't use. Because of this FileStatus objects takes more memory than if it would created on executor side.

 

Later Spark puts all this objects into _SharedInMemoryCache_ cache and that cache takes 2.2x more memory if files were listed on driver side than if were listed on executor side.

 

In our case it takes 125M when we do scan on executors  and 270M when we do it on a driver for about 19K files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org