You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/11/22 13:14:24 UTC

[PR] GH-38618: [C++] S3FileSystem: fix regression in deleting explicitly created sub-directories [arrow]

jorisvandenbossche opened a new pull request, #38845:
URL: https://github.com/apache/arrow/pull/38845

   ### Rationale for this change
   
   See https://github.com/apache/arrow/issues/38618#issuecomment-1821252024 and below for the analysis. When deleting the dir contents, we use a GetFileInfo with recursive FileSelector to list all objects to delete, but when doing that the file paths for directories don't end in a trailing `/`, so for deleting explicitly created directories we need to add the `kSep` here as well to properly delete the object.
   
   ### Are these changes tested?
   
   I tested them manually with an actual S3 bucket. The problem is that MinIO doesn't have the same problem, and so it's not actually tested with the test I added using our MinIO testing setup.
   
   ### Are there any user-facing changes?
   
   Fixes the regression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38618: [C++] S3FileSystem: fix regression in deleting explicitly created sub-directories [arrow]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38845:
URL: https://github.com/apache/arrow/pull/38845#issuecomment-1822750901

   :warning: GitHub issue #38618 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38618: [C++] S3FileSystem: fix regression in deleting explicitly created sub-directories [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #38845:
URL: https://github.com/apache/arrow/pull/38845#discussion_r1415381331


##########
python/pyarrow/tests/test_fs.py:
##########
@@ -760,6 +760,36 @@ def test_delete_dir(fs, pathfn):
         fs.delete_dir(d)
 
 
+def test_delete_dir_with_explicit_subdir(fs, pathfn):
+    # https://github.com/apache/arrow/issues/38618 (regression with S3 failing
+    # to delete directories, depending on whether they were created explicitly)

Review Comment:
   ```suggestion
       # GH-38618: regression with AWS failing to delete directories,
       # depending on whether they were created explicitly. Note that
       # Minio doesn't reproduce the issue, so this test is not a regression
       # test in itself.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38618: [C++] S3FileSystem: fix regression in deleting explicitly created sub-directories [arrow]

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #38845:
URL: https://github.com/apache/arrow/pull/38845#issuecomment-1842638746

   After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit cf80bd1135bbd9cee7c0ae3e6370f93270cba250.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/19365763609) has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38618: [C++] S3FileSystem: fix regression in deleting explicitly created sub-directories [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche merged PR #38845:
URL: https://github.com/apache/arrow/pull/38845


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38618: [C++] S3FileSystem: fix regression in deleting explicitly created sub-directories [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #38845:
URL: https://github.com/apache/arrow/pull/38845#discussion_r1413710622


##########
python/pyarrow/tests/test_fs.py:
##########
@@ -760,6 +760,34 @@ def test_delete_dir(fs, pathfn):
         fs.delete_dir(d)
 
 
+def test_delete_dir_with_explicit_subdir(fs, pathfn):
+    skip_fsspec_s3fs(fs)

Review Comment:
   Also, mention the GH issue and explain what this test is about? Otherwise, it's not obvious why the test is needed at all.



##########
cpp/src/arrow/filesystem/s3fs.cc:
##########
@@ -2408,7 +2408,12 @@ class S3FileSystem::Impl : public std::enable_shared_from_this<S3FileSystem::Imp
                   std::vector<std::string> file_paths;
                   for (const auto& file_info : file_infos) {
                     DCHECK_GT(file_info.path().size(), bucket.size());
-                    file_paths.push_back(file_info.path().substr(bucket.size() + 1));
+                    auto file_path = file_info.path().substr(bucket.size() + 1);
+                    if (file_info.IsDirectory()) {
+                      DCHECK_OK(internal::AssertNoTrailingSlash(file_path));
+                      file_path = file_path + kSep;

Review Comment:
   Can you please add a comment here referring to the GH issue? Otherwise this will easily get lost.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38618: [C++] S3FileSystem: fix regression in deleting explicitly created sub-directories [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #38845:
URL: https://github.com/apache/arrow/pull/38845#discussion_r1402702389


##########
cpp/src/arrow/filesystem/s3fs.cc:
##########
@@ -2408,7 +2408,11 @@ class S3FileSystem::Impl : public std::enable_shared_from_this<S3FileSystem::Imp
                   std::vector<std::string> file_paths;
                   for (const auto& file_info : file_infos) {
                     DCHECK_GT(file_info.path().size(), bucket.size());
-                    file_paths.push_back(file_info.path().substr(bucket.size() + 1));
+                    auto file_path = file_info.path().substr(bucket.size() + 1);
+                    if (file_info.IsDirectory()) {
+                      file_path = file_path + kSep;

Review Comment:
   Can you `DCHECK` here that `file_path` doesn't end with a `/` (kSep)?
   
   It looks like `FileInfo` is normalizing paths to never end with `/` and this `DCHECK` can protect us if that changes in the future.



##########
cpp/src/arrow/filesystem/s3fs.cc:
##########
@@ -2408,7 +2408,11 @@ class S3FileSystem::Impl : public std::enable_shared_from_this<S3FileSystem::Imp
                   std::vector<std::string> file_paths;
                   for (const auto& file_info : file_infos) {
                     DCHECK_GT(file_info.path().size(), bucket.size());
-                    file_paths.push_back(file_info.path().substr(bucket.size() + 1));
+                    auto file_path = file_info.path().substr(bucket.size() + 1);
+                    if (file_info.IsDirectory()) {
+                      file_path = file_path + kSep;
+                    }
+                    file_paths.push_back(file_path);

Review Comment:
   Now that `file_path` lives in a variable, you need to explicitly `std::move` it to avoid a copy when pushing:
   
   ```cpp
   file_paths.push_back(std::move(file_path));
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38618: [C++] S3FileSystem: fix regression in deleting explicitly created sub-directories [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #38845:
URL: https://github.com/apache/arrow/pull/38845#discussion_r1415329842


##########
cpp/src/arrow/filesystem/s3fs.cc:
##########
@@ -2408,7 +2408,16 @@ class S3FileSystem::Impl : public std::enable_shared_from_this<S3FileSystem::Imp
                   std::vector<std::string> file_paths;
                   for (const auto& file_info : file_infos) {
                     DCHECK_GT(file_info.path().size(), bucket.size());
-                    file_paths.push_back(file_info.path().substr(bucket.size() + 1));
+                    auto file_path = file_info.path().substr(bucket.size() + 1);
+                    if (file_info.IsDirectory()) {
+                      // the selector returns FileInfo objects for directories with a
+                      // a path that never ends in a trailing slash, but for S3 the file
+                      // needs to have a trailing slash to recognize it as direcotory
+                      // (https://github.com/apache/arrow/issues/38618)

Review Comment:
   ```suggestion
                         // The selector returns FileInfo objects for directories with a
                         // a path that never ends in a trailing slash, but for AWS the file
                         // needs to have a trailing slash to recognize it as directory
                         // (https://github.com/apache/arrow/issues/38618)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-38618: [C++] S3FileSystem: fix regression in deleting explicitly created sub-directories [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #38845:
URL: https://github.com/apache/arrow/pull/38845#discussion_r1415326712


##########
python/pyarrow/tests/test_fs.py:
##########
@@ -760,6 +760,34 @@ def test_delete_dir(fs, pathfn):
         fs.delete_dir(d)
 
 
+def test_delete_dir_with_explicit_subdir(fs, pathfn):
+    skip_fsspec_s3fs(fs)

Review Comment:
   Added a link+comment, although the test itself doesn't actually catch the regression .. (minio doesn't have the same issue, the test only fails when actually running with AWS)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org