Posted to issues@spark.apache.org by "andrzej.stankevich@gmail.com (JIRA)" <ji...@apache.org> on 2018/08/09 01:41:00 UTC

[jira] [Updated] (SPARK-25062) Clean up BlockLocations in FileStatus objects

     [ https://issues.apache.org/jira/browse/SPARK-25062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

andrzej.stankevich@gmail.com updated SPARK-25062:
-------------------------------------------------
    Description: 
When Spark lists a collection of files, it either does the listing on the driver or creates tasks to list the files, depending on the number of files: [https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L170]
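
For context, the choice between the two paths is driven by a threshold on the number of input paths. A minimal way to pin the behavior to one side when reproducing this, assuming a spark-shell session where {{spark}} is the active SparkSession:

{code:java}
// Listing more paths than the threshold launches a Spark job that lists
// them in parallel on executors; fewer, and the driver lists them itself.
// The default in Spark 2.2 is 32.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "1")
{code}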

If Spark creates tasks to list the files, each task creates one FileStatus object per file. Before sending the FileStatus objects to the driver, Spark converts each FileStatus to a SerializableFileStatus. On the driver side, Spark turns each SerializableFileStatus back into a FileStatus, and it also creates a BlockLocation object for each FileStatus using

{code:java}
new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length) 
{code}
 

After deserialization on the driver side, the resulting BlockLocation is missing much of the information that the original HdfsBlockLocation carried.
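
Concretely, the round trip looks roughly like the sketch below. This is a simplified rendering, not a verbatim copy of InMemoryFileIndex; the case class fields only approximate the private SerializableFileStatus/SerializableBlockLocation helpers in that file:

{code:java}
import org.apache.hadoop.fs.{BlockLocation, FileStatus, LocatedFileStatus, Path}

// Simplified stand-ins for the private helper classes in
// InMemoryFileIndex.scala (branch-2.2).
case class SerializableBlockLocation(
    names: Array[String], hosts: Array[String], offset: Long, length: Long)

case class SerializableFileStatus(
    path: String, length: Long, isDir: Boolean, blockReplication: Short,
    blockSize: Long, modificationTime: Long,
    blockLocations: Array[SerializableBlockLocation])

def toFileStatus(f: SerializableFileStatus): LocatedFileStatus = {
  // Only names, hosts, offset, and length survive the round trip; whatever
  // else the executor-side HdfsBlockLocation carried is gone at this point.
  val locations = f.blockLocations.map { loc =>
    new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length)
  }
  new LocatedFileStatus(
    new FileStatus(f.length, f.isDir, f.blockReplication, f.blockSize,
      f.modificationTime, new Path(f.path)),
    locations)
}
{code}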

 

If Spark does the listing on the driver side, each FileStatus holds HdfsBlockLocation objects, and those carry a lot of information that Spark never uses. Because of this, the FileStatus objects take more memory than if they had been created on the executor side.
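
A possible clean-up, sketched under the assumption that Spark only ever reads the names, hosts, offset, and length of a block location (the stripLocations helper below is hypothetical, not existing Spark API):

{code:java}
import org.apache.hadoop.fs.{BlockLocation, LocatedFileStatus}

// Hypothetical helper: before caching a driver-side listing, replace each
// rich HdfsBlockLocation with a plain BlockLocation keeping only the four
// fields Spark reads, so both code paths cache objects of the same shape.
def stripLocations(status: LocatedFileStatus): LocatedFileStatus = {
  val slim = status.getBlockLocations.map { loc =>
    new BlockLocation(loc.getNames, loc.getHosts, loc.getOffset, loc.getLength)
  }
  new LocatedFileStatus(status, slim)
}
{code}

The LocatedFileStatus(FileStatus, BlockLocation[]) constructor used here copies the status fields and swaps in the slim locations, so the cached object looks the same to Spark either way.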

 

Later, Spark puts all of these objects into _SharedInMemoryCache_, and that cache takes about 2.2x more memory when the files are listed on the driver side than when they are listed on the executor side.

 

In our case, for about 19K files, the cache takes 125 MB when we do the scan on executors and 270 MB when we do it on the driver.
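
For anyone reproducing the numbers, a rough way to gauge the footprint of a driver-side listing, using Spark's SizeEstimator developer API (the directory below is a placeholder):

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{LocatedFileStatus, Path}
import org.apache.spark.util.SizeEstimator

import scala.collection.mutable.ArrayBuffer

// Placeholder path; point it at the real ~19K-file table to reproduce.
val dir = new Path("hdfs:///tmp/example-table")
val fs = dir.getFileSystem(new Configuration())

// A driver-side listing returns LocatedFileStatus objects whose block
// locations (on HDFS) are the richer HdfsBlockLocations described above.
val statuses = ArrayBuffer.empty[LocatedFileStatus]
val it = fs.listLocatedStatus(dir)
while (it.hasNext) statuses += it.next()

println(s"approx bytes cached: ${SizeEstimator.estimate(statuses.toArray)}")
{code}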



> Clean up BlockLocations in FileStatus objects
> ---------------------------------------------
>
>                 Key: SPARK-25062
>                 URL: https://issues.apache.org/jira/browse/SPARK-25062
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.2
>            Reporter: andrzej.stankevich@gmail.com
>            Priority: Major
>


