Posted to issues@spark.apache.org by "SHU WANG (Jira)" <ji...@apache.org> on 2023/07/02 15:59:00 UTC

[jira] [Updated] (SPARK-44272) Path Inconsistency when Operating statCache within Yarn Client

     [ https://issues.apache.org/jira/browse/SPARK-44272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SHU WANG updated SPARK-44272:
-----------------------------
    Description: 
*addResource* in *ClientDistributedCacheManager* can add a *FileStatus* to *statCache* when it is not yet cached. However, there is a subtle bug in *isPublic*, which is called from the *getVisibility* method: *uri.getPath()* does not retain URI information such as the scheme and host, so the *uri* passed to *checkPermissionOfOther* differs from the original {*}uri{*}.

For example, if uri is "file:/foo.invalid.com:8080/tmp/testing", then 
{code:java}
uri.getPath -> /foo.invalid.com:8080/tmp/testing
uri.toString -> file:/foo.invalid.com:8080/tmp/testing{code}
As a consequence, the lookup in *statCache* misses for the altered *uri*, and we make *twice as many RPC calls* as necessary when the resources are remote. We see nontrivial overhead when checking those resources on our HDFS, especially when HDFS is overloaded.
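
The difference between the two forms can be checked with plain *java.net.URI*. The following is only a minimal sketch: the class name UriPathDemo and the helper pathOnly are hypothetical, and it uses java.net.URI directly rather than Hadoop's Path, so it illustrates the loss of the scheme but is not the Spark code itself.

```java
import java.net.URI;

public class UriPathDemo {
    // Hypothetical helper: rebuilds a URI from the path component only,
    // which is the information left after calling uri.getPath().
    static URI pathOnly(URI full) {
        return URI.create(full.getPath());
    }

    public static void main(String[] args) {
        // Example URI taken from the issue description.
        URI full = URI.create("file:/foo.invalid.com:8080/tmp/testing");
        URI stripped = pathOnly(full);

        System.out.println("full:     " + full);
        System.out.println("stripped: " + stripped);
        // The scheme is gone, so the two URIs no longer compare equal.
        System.out.println("scheme of stripped: " + stripped.getScheme());
        System.out.println("equal: " + full.equals(stripped));
    }
}
```

Because the stripped URI has no scheme, it is a different key from the original for any map keyed by URI.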

 

Ref: related code within *ClientDistributedCacheManager*
{code:java}
def addResource(...) {
  val destStatus = statCache.getOrElse(destPath.toUri(), fs.getFileStatus(destPath))
  val visibility = getVisibility(conf, destPath.toUri(), statCache)
}

private[yarn] def getVisibility() {
  isPublic(conf, uri, statCache)
}

private def isPublic(conf: Configuration, uri: URI, statCache: Map[URI, FileStatus]): Boolean = {
  val current = new Path(uri.getPath()) // Should not use getPath
  checkPermissionOfOther(fs, uri, FsAction.READ, statCache)
}
{code}
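
To make the cost of the key mismatch concrete, here is a small stand-in for the cache, assuming (as the description states) that the second lookup uses a URI derived from getPath. Everything in it is hypothetical: StatCacheDemo, fetchStatus, and lookupTwice are illustration names, the String value replaces FileStatus, and the counter replaces a real fs.getFileStatus RPC.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class StatCacheDemo {
    // Counts simulated remote lookups; stands in for fs.getFileStatus RPCs.
    static int remoteCalls = 0;

    // Hypothetical remote fetch: only records that a call was made.
    static String fetchStatus(URI uri) {
        remoteCalls++;
        return "status-of-" + uri.getPath();
    }

    // Looks up two keys through one cache and returns how many
    // simulated remote calls were needed.
    static int lookupTwice(URI first, URI second) {
        remoteCalls = 0;
        Map<URI, String> statCache = new HashMap<>();
        statCache.computeIfAbsent(first, StatCacheDemo::fetchStatus);
        statCache.computeIfAbsent(second, StatCacheDemo::fetchStatus);
        return remoteCalls;
    }

    public static void main(String[] args) {
        URI full = URI.create("file:/foo.invalid.com:8080/tmp/testing");
        URI stripped = URI.create(full.getPath());

        System.out.println("same key twice:    " + lookupTwice(full, full));
        System.out.println("full then stripped: " + lookupTwice(full, stripped));
    }
}
```

With a consistent key the second lookup is served from the cache; with the stripped key it is not, which matches the doubled calls described above.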
 

> Path Inconsistency when Operating statCache within Yarn Client
> --------------------------------------------------------------
>
>                 Key: SPARK-44272
>                 URL: https://issues.apache.org/jira/browse/SPARK-44272
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 0.9.1, 2.3.0, 3.4.0, 3.5.0
>            Reporter: SHU WANG
>            Priority: Critical