Posted to common-issues@hadoop.apache.org by "Joel Baranick (JIRA)" <ji...@apache.org> on 2017/02/27 23:21:45 UTC
[jira] [Comment Edited] (HADOOP-14124) S3AFileSystem silently
deletes "fake" directories when writing a file.
[ https://issues.apache.org/jira/browse/HADOOP-14124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886734#comment-15886734 ]
Joel Baranick edited comment on HADOOP-14124 at 2/27/17 11:20 PM:
------------------------------------------------------------------
Hey Steve,
Thanks for the info. I read the Hadoop Filesystem specification and it seems like this scenario is breaking some of the specification.
First, the postcondition of the specification for {{FSDataOutputStream create(Path, ...)}} states that the "... updated (valid) FileSystem must contains all the parent directories of the path, as created by mkdirs(parent(p))." I would contend that in this scenario, the opposite is happening.
Second, the "Empty (non-root) directory" postcondition of the specification for {{FileSystem.delete(Path p, boolean recursive)}} states that "Deleting an empty directory that is not root will remove the path from the FS and return true." While this holds, I think that considering a fake directory empty even when it contains another fake directory is incorrect. For example, on Debian, the equivalent operation fails:
{noformat}
[~]# mkdir job
[~]# cd job
[job]# mkdir task
[job]# cd ..
[~]# rmdir job
rmdir: failed to remove ‘job’: Directory not empty
{noformat}
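To make the contrast concrete, here is a small self-contained sketch (plain Python, no AWS calls; the key names are made up for illustration) of how emptiness could be decided against an object-store key listing. Under this reading, a prefix that contains another fake-directory marker is not empty, matching the {{rmdir}} semantics above:

```python
# Toy model of an S3-style key listing (hypothetical keys, not real AWS state).
# A "fake" directory is an empty object whose key ends with "/".
keys = {"job/", "job/task/"}

def is_empty_dir(prefix):
    """A directory prefix is empty only if nothing else lives under it,
    including other fake-directory markers."""
    return not any(k != prefix and k.startswith(prefix) for k in keys)

print(is_empty_dir("job/"))       # False: "job/" still contains the marker "job/task/"
print(is_empty_dir("job/task/"))  # True: nothing lives under it
```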
Additionally, the interaction of AmazonS3Client/CyberDuck with empty directories seems different than you described. See the following scenario:
# Open CyberDuck and connect to an S3 Bucket
# Create a folder called {{job}} in CyberDuck
# Right Click on the {{job}} folder and open +Info+. Result: _Size = 0B_ and _S3 tab works_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}}. Result: _Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}. Result:
#* _job/_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}. Result:
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}. Result:
#* _s3a://bucket/job_ ^dir;empty^
# Navigate into the {{job}} folder in CyberDuck
# Create a folder called {{task}} in CyberDuck
# Right Click on the {{task}} folder and open +Info+. Result: _Size = 0B_ and _S3 tab works_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}} Result: _Success_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/task/")}} Result: _Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}. Result:
#* _job/_
#* _job/task/_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}. Result:
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}. Result:
#* _s3a://bucket/job_ ^dir;empty^
#* _s3a://bucket/job/task_ ^dir^
# Upload _file_ into _/job/task_ via CyberDuck
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/")}} Result: _Success_
# Call {{AmazonS3Client.getObjectMetadata("bucket", "job/task/")}} Result: _Success_
# Call {{AmazonS3Client.listObjects("bucket", "job/")}}. Result:
#* _job/_
#* _job/task/_
#* _job/task/file_
# Call {{S3AFileSystem.listStatus(new Path("/"))}}. Result:
#* _s3a://bucket/job_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/"))}}. Result:
#* _s3a://bucket/job_ ^dir;empty^
#* _s3a://bucket/job/task_ ^dir^
# Call {{S3AFileSystem.listStatus(new Path("/job/task/"))}}. Result:
#* _s3a://bucket/job/task_ ^dir;empty^
#* _s3a://bucket/job/task/file_ ^file^
One thing to note above is the inconsistent results from {{S3AFileSystem.listStatus(...)}}: in some cases a folder is reported as empty and in others as not empty. For example, {{/job}} is reported as ^dir;empty^ even while {{/job/task}} is listed beneath it.
At this point, if you delete {{/job/task/file}} in CyberDuck or the AWS Console, the {{/job}} and {{/job/task}} folders continue to exist and all calls continue to return the same results as before (except that {{/job/task/file}} is excluded from any list results). If, on the other hand, you created {{/job/task/file}} via S3AFileSystem, it would implicitly remove the parent folders which it considers "empty". Then, when {{/job/task/file}} is deleted, the parent "empty" directories are also gone.
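To illustrate that lifecycle difference, here is a hedged sketch (plain Python; the function names {{write_file}} and {{delete_file}} are invented, and this only mimics the reported marker-cleanup behavior, not the actual S3AFileSystem internals): writing a file deletes the ancestor directory markers, so deleting the file afterwards leaves no trace of the directories.

```python
# Toy blobstore: a set of keys, where fake directories end with "/".
# Hypothetical model of the reported cleanup behavior, not the real S3A code.
store = {"job/", "job/task/"}

def write_file(key):
    store.add(key)
    # Mimic the cleanup: discard every ancestor fake-directory marker.
    parts = key.split("/")[:-1]
    for i in range(len(parts)):
        store.discard("/".join(parts[:i + 1]) + "/")

def delete_file(key):
    store.discard(key)

write_file("job/task/file")
print(sorted(store))  # only 'job/task/file' remains; the markers are gone
delete_file("job/task/file")
print(sorted(store))  # empty: the "directories" vanished along with the file
```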
My last counterpoint to the current Hadoop behavior with regard to S3A is the AWS S3 Console. It effectively models a filesystem despite being backed by a blobstore: I'm able to create nested folders, upload a file, delete the file, and the nested "empty" folders still exist. As for the consistency guarantees, EMR solves those, making it behave even more like a true FileSystem.
Thanks!
> S3AFileSystem silently deletes "fake" directories when writing a file.
> ----------------------------------------------------------------------
>
> Key: HADOOP-14124
> URL: https://issues.apache.org/jira/browse/HADOOP-14124
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs, fs/s3
> Affects Versions: 2.6.0
> Reporter: Joel Baranick
> Labels: filesystem, s3
>
> I realize that you guys probably have a good reason for {{S3AFileSystem}} to clean up "fake" folders when a file is written to S3. That said, the fact that it silently does this feels like a separation-of-concerns issue. It also leads to weird behavior where calls to {{AmazonS3Client.getObjectMetadata}} for folders work before calling {{S3AFileSystem.create}} but not after. Also, there seems to be no mention in the javadoc that the {{deleteUnnecessaryFakeDirectories}} method is automatically invoked. Lastly, it seems like the goal of {{FileSystem}} should be to ensure that code built on top of it is portable to different implementations. This behavior is an example of a case where that can break down.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org