Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/18 12:18:55 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10958: ARROW-13652: [Python] Expose copy_files in pyarrow.fs

jorisvandenbossche commented on a change in pull request #10958:
URL: https://github.com/apache/arrow/pull/10958#discussion_r691180439



##########
File path: python/pyarrow/_fs.pyx
##########
@@ -1124,3 +1124,43 @@ cdef void _cb_open_append_stream(
 cdef void _cb_normalize_path(handler, const c_string& path,
                              c_string* out) except *:
     out[0] = tobytes(handler.normalize_path(frombytes(path)))
+
+
+def _copy_files(FileSystem source_fs, str source_path,
+                FileSystem destination_fs, str destination_path):
+    # low-level helper exposed through pyarrow/fs.py::copy_files
+    cdef:
+        CFileLocator c_source
+        vector[CFileLocator] c_sources
+        CFileLocator c_destination
+        vector[CFileLocator] c_destinations
+        FileSystem fs
+        CStatus c_status
+        shared_ptr[CFileSystem] c_fs
+
+    c_source.filesystem = source_fs.unwrap()
+    c_source.path = tobytes(source_path)
+    c_sources.push_back(c_source)
+
+    c_destination.filesystem = destination_fs.unwrap()
+    c_destination.path = tobytes(destination_path)
+    c_destinations.push_back(c_destination)
+
+    with nogil:
+        check_status(CCopyFiles(
+            c_sources, c_destinations,
+            c_default_io_context(), 1024*1024, True

Review comment:
       It might be worth exposing `chunk_size` and `use_threads` as arguments of the Python function as well, rather than hard-coding `1024*1024` and `True` here.
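
       A rough sketch of what the forwarding could look like. The `_copy_files` below is only a stand-in for the Cython helper (it records its arguments instead of calling `CCopyFiles`), and the keyword-only signature is an assumption, not something this PR defines. The defaults mirror the values currently hard-coded in the diff.

```python
# Stand-in for the Cython helper `_copy_files`. It does not copy anything;
# it just returns the arguments it received so the forwarding can be inspected.
def _copy_files(source_fs, source_path, destination_fs, destination_path,
                chunk_size, use_threads):
    return {
        "source": (source_fs, source_path),
        "destination": (destination_fs, destination_path),
        "chunk_size": chunk_size,
        "use_threads": use_threads,
    }


# Hypothetical public wrapper: the two new options are keyword-only and
# default to the values that the diff currently hard-codes.
def copy_files(source_path, destination_path,
               source_fs=None, destination_fs=None,
               *, chunk_size=1024 * 1024, use_threads=True):
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive number of bytes")
    return _copy_files(source_fs, source_path,
                       destination_fs, destination_path,
                       chunk_size, bool(use_threads))


# Existing callers keep the current behaviour; new callers can tune it.
default_call = copy_files("bucket/data", "local-directory")
tuned_call = copy_files("bucket/data", "local-directory",
                        chunk_size=64 * 1024, use_threads=False)
```

       Making the two options keyword-only would keep the positional part of the signature unchanged for existing callers.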

##########
File path: python/pyarrow/fs.py
##########
@@ -182,6 +184,59 @@ def _resolve_filesystem_and_path(
     return filesystem, path
 
 
+def copy_files(source, destination,
+               source_filesystem=None, destination_filesystem=None):
+    """
+    Copy files between FileSystems.
+
+    This functions allows you to recursively copy directories of files from
+    one file system to another, such as from S3 to your local machine.
+
+    Parameters
+    ----------
+    source : string
+        Source file path or URI to a single file or directory.
+        If a directory, files will be copied recursively from this path.
+    destination : string
+        Destination file path or URI. If `source` is a file, `destination`
+        is also interpreted as the destination file (not directory).
+        Directories will be created as necessary.
+    source_filesystem : FileSystem, optional
+        Source filesystem, needs to be specified if `source` is not a URI,
+        otherwise inferred.
+    destination_filesystem : FileSystem, optional
+        Destination filesystem, needs to be specified if `destination` is not
+        a URI, otherwise inferred.
+
+    Examples
+    --------
+    Copy an S3 bucket's files to a local directory:
+
+    >>> copy_files("s3://your-bucket-name", "local-directory")
+
+    Using a FileSystem object:
+
+    >>> copy_files("your-bucket-name", "local-directory",
+    ...            source_filesystem=S3FileSystem(...))
+
+    """
+    source_fs, source_path = _resolve_filesystem_and_path(
+        source, source_filesystem
+    )
+    destination_fs, destination_path = _resolve_filesystem_and_path(
+        destination, destination_filesystem
+    )
+
+    file_info = source_fs.get_file_info(source_path)
+    if file_info.type == FileType.Directory:
+        source_sel = FileSelector(source_path, recursive=True)

Review comment:
       We should maybe also directly accept a FileSelector as input to the `copy_files` function (instead of only a string path/URI, for which we infer whether it's a directory).
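
       Roughly, the dispatch could look like the sketch below. The `FileSelector` dataclass is a stand-in for `pyarrow.fs.FileSelector` with only the two fields used here, and `normalize_source` and `is_directory` are hypothetical names; `is_directory` stands in for the `get_file_info(...).type == FileType.Directory` check in the diff.

```python
from dataclasses import dataclass


@dataclass
class FileSelector:
    """Stand-in for pyarrow.fs.FileSelector (only the fields used here)."""
    base_dir: str
    recursive: bool = False


def normalize_source(source, is_directory):
    """Return a FileSelector for directory sources, else the plain path.

    A selector passed in by the caller is used as-is, so settings such as
    `recursive=False` are respected instead of being inferred.
    """
    if isinstance(source, FileSelector):
        return source
    if is_directory(source):
        # Same behaviour as the diff: directories are copied recursively.
        return FileSelector(source, recursive=True)
    return source


# Toy directory check so that the sketch runs without a real filesystem.
known_dirs = {"bucket", "bucket/data"}
explicit = FileSelector("bucket/data", recursive=False)

from_selector = normalize_source(explicit, known_dirs.__contains__)
from_dir_path = normalize_source("bucket", known_dirs.__contains__)
from_file_path = normalize_source("bucket/file.parquet", known_dirs.__contains__)
```

       That would let callers opt out of the recursive copy, which the string-only input cannot express.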



