You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/02 20:47:29 UTC

[GitHub] [arrow] bkietz opened a new pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

bkietz opened a new pull request #8101:
URL: https://github.com/apache/arrow/pull/8101


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8101:
URL: https://github.com/apache/arrow/pull/8101#discussion_r483019430



##########
File path: cpp/src/arrow/filesystem/filesystem.h
##########
@@ -387,6 +387,14 @@ Result<std::shared_ptr<FileSystem>> FileSystemFromUriOrPath(
 
 /// @}
 
+/// \brief Copy files from one FileSystem to another
+ARROW_EXPORT
+Status CopyFiles(const std::shared_ptr<FileSystem>& src_fs,

Review comment:
       That sounds ok, assuming having potentially different filesystems in those vectors is not a problem.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #8101:
URL: https://github.com/apache/arrow/pull/8101






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8101:
URL: https://github.com/apache/arrow/pull/8101#issuecomment-685997393


   https://issues.apache.org/jira/browse/ARROW-9868


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8101:
URL: https://github.com/apache/arrow/pull/8101#discussion_r483011163



##########
File path: cpp/src/arrow/filesystem/filesystem.h
##########
@@ -387,6 +387,14 @@ Result<std::shared_ptr<FileSystem>> FileSystemFromUriOrPath(
 
 /// @}
 
+/// \brief Copy files from one FileSystem to another
+ARROW_EXPORT
+Status CopyFiles(const std::shared_ptr<FileSystem>& src_fs,

Review comment:
       Would you then want the signature to be `Status CopyFiles(const std::vector<FileLocator>& sources, const std::vector<FileLocator>& destinations, ...)`?
   
   https://issues.apache.org/jira/browse/ARROW-9909




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #8101:
URL: https://github.com/apache/arrow/pull/8101#discussion_r482477183



##########
File path: r/src/filesystem.cpp
##########
@@ -230,6 +230,17 @@ cpp11::writable::list fs___FileSystemFromUri(const std::string& path) {
   return cpp11::writable::list({"fs"_nm = file_system, "path"_nm = out_path});
 }
 
+// [[arrow::export]]
+void fs___CopyFiles(const std::shared_ptr<FileSystem>& src_fs,
+                    const std::vector<std::string>& src_paths,
+                    const std::shared_ptr<FileSystem>& dest_fs,

Review comment:
       ```suggestion
   void fs___CopyFiles(const std::shared_ptr<fs::FileSystem>& src_fs,
                       const std::vector<std::string>& src_paths,
                       const std::shared_ptr<fs::FileSystem>& dest_fs,
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #8101:
URL: https://github.com/apache/arrow/pull/8101#discussion_r482457407



##########
File path: r/R/filesystem.R
##########
@@ -280,6 +280,21 @@ SubTreeFileSystem$create <- function(base_path, base_fs) {
   shared_ptr(SubTreeFileSystem, xp)
 }
 
+#' Copy files between FileSystems
+#'
+#' @param src_fs The FileSystem from which files will be copied.
+#' @param src_paths The paths of files to be copied.
+#' @param dest_fs The FileSystem into which files will be copied.
+#' @param dest_paths Where the copied files should be placed.
+#' @param chunk_size The maximum size of block to read before flushing
+#' to the destination file. A larger chunk_size will use more memory while
+#' copying but may help accommodate high latency FileSystems.
+#' @param use_threads Whether to copy files in parallel.

Review comment:
       The R package is set up to use a global use-threads setting rather than exposing that in every function signature. So instead of exposing `use_threads` as a function argument, call `options_use_threads()` in L295. See `as.data.frame.RecordBatch` for an example.

##########
File path: cpp/src/arrow/filesystem/filesystem.cc
##########
@@ -439,6 +440,28 @@ Result<std::shared_ptr<io::OutputStream>> SlowFileSystem::OpenAppendStream(
   return base_fs_->OpenAppendStream(path);
 }
 
+Status CopyFiles(const std::shared_ptr<FileSystem>& src_fs,
+                 const std::vector<std::string>& src_paths,
+                 const std::shared_ptr<FileSystem>& dest_fs,
+                 const std::vector<std::string>& dest_paths, int64_t chunk_size,
+                 bool use_threads) {
+  if (src_paths.size() != dest_paths.size()) {
+    return Status::Invalid("Trying to copy ", src_paths.size(), " files into ",
+                           dest_paths.size(), " paths.");
+  }
+
+  return ::arrow::internal::OptionalParallelFor(
+      use_threads, static_cast<int>(src_paths.size()), [&](int i) {
+        ARROW_ASSIGN_OR_RAISE(auto src, src_fs->OpenInputStream(src_paths[i]));
+        auto dest_dir = internal::GetAbstractPathParent(dest_paths[i]).first;
+        if (!dest_dir.empty()) {
+          RETURN_NOT_OK(dest_fs->CreateDir(dest_dir));
+        }
+        ARROW_ASSIGN_OR_RAISE(auto dest, dest_fs->OpenOutputStream(dest_paths[i]));
+        return internal::CopyStream(src, dest, chunk_size);

Review comment:
       Should CopyStream also be parallelized? Looking at its implementation, I wonder whether we could benefit from having separate threads do the Read and the Write




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #8101:
URL: https://github.com/apache/arrow/pull/8101






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #8101:
URL: https://github.com/apache/arrow/pull/8101#discussion_r483056088



##########
File path: cpp/src/arrow/filesystem/filesystem.h
##########
@@ -387,6 +387,14 @@ Result<std::shared_ptr<FileSystem>> FileSystemFromUriOrPath(
 
 /// @}
 
+/// \brief Copy files from one FileSystem to another
+ARROW_EXPORT
+Status CopyFiles(const std::shared_ptr<FileSystem>& src_fs,

Review comment:
       I've wanted some kind of object to contain fs and path (like what you get from FileSystemFromUri) for a while too. Yesterday in fact I was experimenting with using SubTreeFileSystem to hold that, since it initializes with a fs and path. That mostly works as expected (for the operations I was trying at least).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson closed pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
nealrichardson closed pull request #8101:
URL: https://github.com/apache/arrow/pull/8101


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8101:
URL: https://github.com/apache/arrow/pull/8101#discussion_r482921231



##########
File path: cpp/src/arrow/filesystem/filesystem.cc
##########
@@ -439,6 +440,28 @@ Result<std::shared_ptr<io::OutputStream>> SlowFileSystem::OpenAppendStream(
   return base_fs_->OpenAppendStream(path);
 }
 
+Status CopyFiles(const std::shared_ptr<FileSystem>& src_fs,
+                 const std::vector<std::string>& src_paths,
+                 const std::shared_ptr<FileSystem>& dest_fs,
+                 const std::vector<std::string>& dest_paths, int64_t chunk_size,
+                 bool use_threads) {
+  if (src_paths.size() != dest_paths.size()) {
+    return Status::Invalid("Trying to copy ", src_paths.size(), " files into ",
+                           dest_paths.size(), " paths.");
+  }
+
+  return ::arrow::internal::OptionalParallelFor(
+      use_threads, static_cast<int>(src_paths.size()), [&](int i) {
+        ARROW_ASSIGN_OR_RAISE(auto src, src_fs->OpenInputStream(src_paths[i]));
+        auto dest_dir = internal::GetAbstractPathParent(dest_paths[i]).first;
+        if (!dest_dir.empty()) {
+          RETURN_NOT_OK(dest_fs->CreateDir(dest_dir));
+        }
+        ARROW_ASSIGN_OR_RAISE(auto dest, dest_fs->OpenOutputStream(dest_paths[i]));
+        return internal::CopyStream(src, dest, chunk_size);

Review comment:
       Writing is typically asynchronous. A separate write thread may speed up things a little, but I don't think it would make a sizable difference.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8101:
URL: https://github.com/apache/arrow/pull/8101#issuecomment-690775030


   @bkietz is this done?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8101:
URL: https://github.com/apache/arrow/pull/8101#discussion_r482921953



##########
File path: cpp/src/arrow/filesystem/filesystem.h
##########
@@ -387,6 +387,14 @@ Result<std::shared_ptr<FileSystem>> FileSystemFromUriOrPath(
 
 /// @}
 
+/// \brief Copy files from one FileSystem to another
+ARROW_EXPORT
+Status CopyFiles(const std::shared_ptr<FileSystem>& src_fs,

Review comment:
       We may introduce a `FileLocator` struct or something that would be a `filesystem, path` pair. Would also be useful with `FromURI`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #8101: ARROW-9868: [C++][R] Provide CopyFiles for copying files between FileSystems

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8101:
URL: https://github.com/apache/arrow/pull/8101#discussion_r483012720



##########
File path: r/R/filesystem.R
##########
@@ -280,6 +280,21 @@ SubTreeFileSystem$create <- function(base_path, base_fs) {
   shared_ptr(SubTreeFileSystem, xp)
 }
 
+#' Copy files between FileSystems
+#'
+#' @param src_fs The FileSystem from which files will be copied.
+#' @param src_paths The paths of files to be copied.
+#' @param dest_fs The FileSystem into which files will be copied.
+#' @param dest_paths Where the copied files should be placed.
+#' @param chunk_size The maximum size of block to read before flushing
+#' to the destination file. A larger chunk_size will use more memory while
+#' copying but may help accommodate high latency FileSystems.
+#' @param use_threads Whether to copy files in parallel.

Review comment:
       will do




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org