Posted to yarn-issues@hadoop.apache.org by "kevin su (JIRA)" <ji...@apache.org> on 2019/07/04 07:33:00 UTC

[jira] [Commented] (YARN-2663) Race condition in shared cache CleanerTask during deletion of resource

    [ https://issues.apache.org/jira/browse/YARN-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878395#comment-16878395 ] 

kevin su commented on YARN-2663:
--------------------------------

*removeResourceFromCacheFileSystem* already renames the path before deleting it from the SCM cache file system, so the file-system delete itself should already be atomic.

Do we also need a lock around these two operations?
 If yes, I can do it.
{code:java}
  private boolean removeResourceFromCacheFileSystem(Path path)
      throws IOException {
    // rename the directory to make the delete atomic
    Path renamedPath = new Path(path.toString() + RENAMED_SUFFIX);
    if (fs.rename(path, renamedPath)) {
      // the directory can be removed safely now
      // log the original path
      LOG.info("Deleting " + path.toString());
      return fs.delete(renamedPath, true);
    }
    // the rename failed, so someone else already renamed or removed it
    return false;
  }
{code}
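
If a lock is wanted, a minimal sketch could look like this (hypothetical, not an actual patch; *removeResourceAtomically* and *keyLocks* are names I made up, and the SCM's add-notification handler would have to take the same per-key lock for this to actually close the race):
{code:java}
  // assumes imports of java.util.concurrent.ConcurrentHashMap
  // and java.util.concurrent.ConcurrentMap
  private final ConcurrentMap<String, Object> keyLocks =
      new ConcurrentHashMap<>();

  private boolean removeResourceAtomically(String key, Path path)
      throws IOException {
    Object lock = keyLocks.computeIfAbsent(key, k -> new Object());
    synchronized (lock) {
      // remove the resource from scm (checks for appIds as well)
      if (store.removeResource(key)) {
        // still holding the lock, so an add-notification for the same
        // key cannot interleave between the two steps
        return removeResourceFromCacheFileSystem(path);
      }
      return false;
    }
  }
{code}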

> Race condition in shared cache CleanerTask during deletion of resource
> ----------------------------------------------------------------------
>
>                 Key: YARN-2663
>                 URL: https://issues.apache.org/jira/browse/YARN-2663
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Chris Trezzo
>            Priority: Blocker
>
> In CleanerTask, store.removeResource(key) and removeResourceFromCacheFileSystem(path) do not happen atomically.
> Since resources can be uploaded under different file names, the SCM can receive a notification to add a resource between the two operations. Thus we have a scenario where the cleaner service deletes the entry from the SCM, receives a notification from the uploader (adding the entry back into the SCM), and then deletes the file from HDFS, leaving a live SCM entry that points at a deleted file.
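> An illustrative interleaving (not from a log, just to make the ordering concrete):
> {code}
> Cleaner thread                        Uploader / SCM
> --------------                        --------------
> store.removeResource(key)  -> true
>                                       NM uploads the same content under
>                                       a new file name and notifies SCM
>                                       SCM adds an entry for key back
> removeResourceFromCacheFileSystem(path)
>   -> deletes the file that the
>      re-added SCM entry points to
> {code}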
> Cleaner code that deletes resource:
> {code}
>       if (store.isResourceEvictable(key, resource)) {
>         try {
>           /*
>            * TODO: There is a race condition between store.removeResource(key)
>            * and removeResourceFromCacheFileSystem(path) operations because they
>            * do not happen atomically and resources can be uploaded with
>            * different file names by the node managers.
>            */
>           // remove the resource from scm (checks for appIds as well)
>           if (store.removeResource(key)) {
>             // remove the resource from the file system
>             boolean deleted = removeResourceFromCacheFileSystem(path);
>             if (deleted) {
>               resourceStatus = ResourceStatus.DELETED;
>             } else {
>               LOG.error("Failed to remove path from the file system."
>                   + " Skipping this resource: " + path);
>               resourceStatus = ResourceStatus.ERROR;
>             }
>           } else {
>             // we did not delete the resource because it contained application
>             // ids
>             resourceStatus = ResourceStatus.PROCESSED;
>           }
>         } catch (IOException e) {
>           LOG.error(
>               "Failed to remove path from the file system. Skipping this resource: "
>                   + path, e);
>           resourceStatus = ResourceStatus.ERROR;
>         }
>       } else {
>         resourceStatus = ResourceStatus.PROCESSED;
>       }
> {code}
> Uploader code that uploads resource:
> {code}
>       // create the temporary file
>       tempPath = new Path(directoryPath, getTemporaryFileName(actualPath));
>       if (!uploadFile(actualPath, tempPath)) {
>         LOG.warn("Could not copy the file to the shared cache at " + tempPath);
>         return false;
>       }
>       // set the permission so that it is readable but not writable
>       // TODO should I create the file with the right permission so I save the
>       // permission call?
>       fs.setPermission(tempPath, FILE_PERMISSION);
>       // rename it to the final filename
>       Path finalPath = new Path(directoryPath, actualPath.getName());
>       if (!fs.rename(tempPath, finalPath)) {
>         LOG.warn("The file already exists under " + finalPath +
>             ". Ignoring this attempt.");
>         deleteTempFile(tempPath);
>         return false;
>       }
>       // notify the SCM
>       if (!notifySharedCacheManager(checksumVal, actualPath.getName())) {
>         // the shared cache manager rejected the upload (as it is likely
>         // uploaded under a different name)
>         // clean up this file and exit
>         fs.delete(finalPath, false);
>         return false;
>       }
> {code}
> One solution is to have the UploaderService always rename the resource file to the checksum of the resource plus the file extension. With this fix we will never receive a notification for the resource before the delete from the file system has happened, because the rename on the node manager will fail. If the node manager uploads the file after it is deleted from the file system, we are ok and the resource will simply get added back to the SCM once a notification is received.
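> A rough sketch of that rename in the uploader (hypothetical; only the final-name computation changes relative to the code above, and the extension handling here is assumed):
> {code:java}
>       // derive the final name from the checksum so every upload of the
>       // same content targets the same file name
>       String name = actualPath.getName();
>       int dot = name.lastIndexOf('.');
>       String extension = (dot >= 0) ? name.substring(dot) : "";
>       Path finalPath = new Path(directoryPath, checksumVal + extension);
>       if (!fs.rename(tempPath, finalPath)) {
>         // the checksum-named file already exists, so another node
>         // manager has already published this content
>         deleteTempFile(tempPath);
>         return false;
>       }
> {code}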
> The classpath at the MapReduce layer is still usable because we leverage links to preserve the original client file name.
> The downside is that now the shared cache files in HDFS are less readable. This could be mitigated with an added admin command to the SCM that gives a list of filenames associated with a checksum or vice versa.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org