Posted to common-issues@hadoop.apache.org by "Joel Baranick (JIRA)" <ji...@apache.org> on 2017/02/27 23:21:45 UTC

[jira] [Comment Edited] (HADOOP-14124) S3AFileSystem silently deletes "fake" directories when writing a file.

    [ https://issues.apache.org/jira/browse/HADOOP-14124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886734#comment-15886734 ] 

Joel Baranick edited comment on HADOOP-14124 at 2/27/17 11:20 PM:
------------------------------------------------------------------

Hey Steve,

Thanks for the info.  I read the Hadoop FileSystem specification, and it seems like this scenario violates parts of it.

First, the postcondition of the specification for {{FSDataOutputStream create(Path, ...)}} states that the "... updated (valid) FileSystem must contain all the parent directories of the path, as created by mkdirs(parent(p))".  I would contend that in this scenario the opposite happens: writing a file causes existing parent directories to disappear.
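
Below is a minimal sketch of how one might probe that postcondition; the bucket name, paths, and class name are illustrative assumptions, not part of this issue.
{code:java}
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreatePostconditionProbe {
  public static void main(String[] args) throws Exception {
    // "s3a://bucket/" is a placeholder; any FileSystem URI works for this probe.
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());
    Path file = new Path("/job/task/file");
    fs.create(file).close();  // write a zero-byte file

    // Per the postcondition, every ancestor should now exist,
    // exactly as if mkdirs(parent(p)) had been called.
    for (Path p = file.getParent(); p != null; p = p.getParent()) {
      System.out.println(p + " exists? " + fs.exists(p));
    }
  }
}
{code}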

Second, the "Empty (non-root) directory" postcondition of the specification for {{FileSystem.delete(Path P, boolean recursive)}} states that "Deleting an empty directory that is not root will remove the path from the FS and return true.".  While this is occurring, I think that considering the a fake directory empty even if it has another fake directory in it is incorrect.  For example, on debian, the following doesn't work. 
{noformat}
[~]# mkdir job
[~]# cd job
[job]# mkdir task
[job]# cd ..
[~]# rmdir job
rmdir: failed to remove ‘job’: Directory not empty
{noformat}
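
For comparison, here is a hedged sketch of the Hadoop analogue of that shell session (bucket name and paths are assumptions); per the spec, a non-recursive delete of a non-empty directory must raise an IOException or leave the state unchanged:
{code:java}
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NonRecursiveDeleteProbe {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());
    fs.mkdirs(new Path("/job/task"));  // mkdir job; mkdir job/task

    // Like rmdir, deleting the non-empty /job without recursive=true
    // should not succeed.
    try {
      System.out.println("delete returned " + fs.delete(new Path("/job"), false));
    } catch (IOException e) {
      System.out.println("delete refused: " + e.getMessage());
    }
  }
}
{code}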

Additionally, the interaction of AmazonS3Client/CyberDuck with empty directories seems different from what you described.  See the following scenario (a sketch of these probe calls follows the list):
# Open CyberDuck and connect to an S3 Bucket
# Create a folder called {{job}} in CyberDuck
# Right-click the {{job}} folder and open +Info+. Result: _Size = 0B_ and _S3 tab works_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}.  Result: _Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}. Result:
#* _job/_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir;empty^
# Navigate into the {{job}} folder in CyberDuck
# Create a folder called {{task}} in CyberDuck
# Right-click the {{task}} folder and open +Info+.  Result: _Size = 0B_ and _S3 tab works_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}.  Result: _Success_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/task/")}}.  Result: _Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}.  Result:
#* _job/_
#* _job/task/_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir;empty^
#* _s3a://bucket/job/task_ ^dir^
# Upload _file_ into _/job/task_ via CyberDuck
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}.  Result: _Success_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/task/")}}.  Result: _Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}.  Result:
#* _job/_
#* _job/task/_
#* _job/task/file_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}.  Result: 
#* _s3a://bucket/job_ ^dir;empty^
#* _s3a://bucket/job/task_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/task/"))}}.  Result: 
#* _s3a://bucket/job/task_ ^dir;empty^
#* _s3a://bucket/job/task/file_ ^file^
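
Here is a rough sketch of the probe calls used above (AWS SDK for Java v1; the bucket name, credentials setup, and class name are assumptions):
{code:java}
import java.net.URI;

import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListingProbe {
  public static void main(String[] args) throws Exception {
    // Raw S3 view: the folder markers are ordinary zero-byte keys.
    AmazonS3Client s3 = new AmazonS3Client();
    s3.getObjectMetadata("bucket", "job/");  // succeeds while the marker exists
    for (S3ObjectSummary s : s3.listObjects("bucket", "job/").getObjectSummaries()) {
      System.out.println(s.getKey());        // job/, job/task/, job/task/file
    }

    // S3A view of the same keys.
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());
    for (FileStatus st : fs.listStatus(new Path("/job/"))) {
      System.out.println(st.getPath() + (st.isDirectory() ? " [dir]" : " [file]"));
    }
  }
}
{code}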

One thing to note above is the inconsistent results for {{S3AFileSystem.listStatus(...)}}: in some cases a folder is reported as empty, and in others it is not.

At this point, if you delete {{/job/task/file}} in CyberDuck or the AWS Console, the {{/job}} and {{/job/task}} folders continue to exist, and all calls continue to return the same results as before (except that {{/job/task/file}} is excluded from any list results).  If, on the other hand, you had created {{/job/task/file}} via S3AFileSystem, it would have implicitly removed the parent folders it considers "empty".  Then, when {{/job/task/file}} is deleted, the parent "empty" directories are also gone.
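
The divergence is easy to see with a small, hedged sketch (again, the bucket and paths are illustrative): create the file through S3A, then ask raw S3 for the parent marker.
{code:java}
import java.net.URI;

import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.AmazonS3Exception;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MarkerCleanupProbe {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"), new Configuration());
    fs.create(new Path("/job/task/file")).close();  // triggers the fake-directory cleanup

    AmazonS3Client s3 = new AmazonS3Client();
    try {
      s3.getObjectMetadata("bucket", "job/");  // worked before the write...
      System.out.println("job/ marker still present");
    } catch (AmazonS3Exception e) {
      // ...but after it, the marker object is gone (HTTP 404).
      System.out.println("job/ marker gone: HTTP " + e.getStatusCode());
    }
  }
}
{code}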

My last counterpoint to the current Hadoop behavior with regard to S3A is the AWS S3 Console.  It effectively models a filesystem despite being backed by a blobstore: I can create nested folders, upload a file, delete the file, and the nested "empty" folders still exist.  As for the consistency guarantees, EMR solves those, making it behave even more like a true FileSystem.

Thanks!



> S3AFileSystem silently deletes "fake" directories when writing a file.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-14124
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14124
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs, fs/s3
>    Affects Versions: 2.6.0
>            Reporter: Joel Baranick
>              Labels: filesystem, s3
>
> I realize that you guys probably have a good reason for {{S3AFileSystem}} to clean up "fake" folders when a file is written to S3.  That said, the fact that it does this silently feels like a separation-of-concerns issue.  It also leads to weird behavior where calls to {{AmazonS3Client.getObjectMetadata}} for folders work before calling {{S3AFileSystem.create}} but not after.  Also, there seems to be no mention in the javadoc that the {{deleteUnnecessaryFakeDirectories}} method is automatically invoked.  Lastly, it seems like the goal of {{FileSystem}} should be to ensure that code built on top of it is portable to different implementations; this behavior is an example of a case where that can break down.


