You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/04 12:41:25 UTC

[GitHub] [arrow] pitrou opened a new pull request #7349: ARROW-8766: [Python] Allow implementing filesystems in Python

pitrou opened a new pull request #7349:
URL: https://github.com/apache/arrow/pull/7349


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7349: ARROW-8766: [Python] Allow implementing filesystems in Python

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7349:
URL: https://github.com/apache/arrow/pull/7349#discussion_r435496748



##########
File path: python/pyarrow/_fs.pyx
##########
@@ -55,24 +56,56 @@ def _normalize_path(FileSystem filesystem, path):
     return frombytes(c_path_normalized)
 
 
+cdef object _wrap_file_type(CFileType ty):
+    return FileType(<int8_t> ty)
+
+
+cdef CFileType _unwrap_file_type(FileType ty) except *:
+    if ty == FileType.Unknown:
+        return CFileType_Unknown
+    elif ty == FileType.NotFound:
+        return CFileType_NotFound
+    elif ty == FileType.File:
+        return CFileType_File
+    elif ty == FileType.Directory:
+        return CFileType_Directory
+    assert 0
+
+
 cdef class FileInfo:
     """
     FileSystem entry info.
     """
 
-    def __init__(self):
-        raise TypeError("FileInfo cannot be instantiated directly, use "
-                        "FileSystem.get_file_info method instead.")
+    def __init__(self, path, FileType type=FileType.Unknown, *,
+                 mtime=None, size=None):
+        self.info.set_path(tobytes(path))
+        self.info.set_type(_unwrap_file_type(type))
+        if mtime is not None:
+            if isinstance(mtime, datetime):
+                self.info.set_mtime(PyDateTime_to_TimePoint(
+                    <PyDateTime_DateTime*> mtime))
+            else:
+                self.info.set_mtime(TimePoint_from_ns(mtime))

Review comment:
       Note to self: should instead add a separate `mtime_ns` argument.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7349: ARROW-8766: [Python] Allow implementing filesystems in Python

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7349:
URL: https://github.com/apache/arrow/pull/7349#issuecomment-638825160


   https://issues.apache.org/jira/browse/ARROW-8766


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7349: ARROW-8766: [Python] Allow implementing filesystems in Python

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #7349:
URL: https://github.com/apache/arrow/pull/7349#discussion_r437373988



##########
File path: python/pyarrow/_fs.pyx
##########
@@ -685,3 +756,234 @@ cdef class _MockFileSystem(FileSystem):
     cdef init(self, const shared_ptr[CFileSystem]& wrapped):
         FileSystem.init(self, wrapped)
         self.mockfs = <CMockFileSystem*> wrapped.get()
+
+
+cdef class PyFileSystem(FileSystem):
+    """
+    A FileSystem with behavior implemented in Python.
+
+    A PyFileSystem is backed by a FileSystemHandler instance.
+    """
+
+    def __init__(self, handler):
+        cdef:
+            CPyFileSystemVtable vtable
+            shared_ptr[CPyFileSystem] wrapped
+
+        if not isinstance(handler, FileSystemHandler):
+            raise TypeError("Expected a FileSystemHandler instance, got {0}"
+                            .format(type(handler)))
+
+        vtable.get_type_name = _cb_get_type_name
+        vtable.equals = _cb_equals
+        vtable.get_file_info = _cb_get_file_info
+        vtable.get_file_info_vector = _cb_get_file_info_vector
+        vtable.get_file_info_selector = _cb_get_file_info_selector
+        vtable.create_dir = _cb_create_dir
+        vtable.delete_dir = _cb_delete_dir
+        vtable.delete_dir_contents = _cb_delete_dir_contents
+        vtable.delete_file = _cb_delete_file
+        vtable.move = _cb_move
+        vtable.copy_file = _cb_copy_file
+        vtable.open_input_stream = _cb_open_input_stream
+        vtable.open_input_file = _cb_open_input_file
+        vtable.open_output_stream = _cb_open_output_stream
+        vtable.open_append_stream = _cb_open_append_stream
+
+        wrapped = CPyFileSystem.Make(handler, move(vtable))
+        self.init(<shared_ptr[CFileSystem]> wrapped)
+
+    cdef init(self, const shared_ptr[CFileSystem]& wrapped):
+        FileSystem.init(self, wrapped)
+        self.pyfs = <CPyFileSystem*> wrapped.get()
+
+    @property
+    def handler(self):
+        """
+        The filesystem's underlying handler.
+
+        Returns
+        -------
+        handler : FileSystemHandler
+        """
+        return <object> self.pyfs.handler()
+
+    def __reduce__(self):
+        return PyFileSystem, (self.handler,)
+
+
+class FileSystemHandler(ABC):
+    """
+    An abstract class exposing methods to implement PyFileSystem's behavior.
+    """
+
+    @abstractmethod
+    def get_type_name(self):
+        """
+        Implement PyFileSystem.type_name.
+        """
+
+    @abstractmethod
+    def get_file_info(self, paths):
+        """
+        Implement PyFileSystem.get_file_info(paths).
+        """
+
+    @abstractmethod
+    def get_file_info_selector(self, selector):
+        """
+        Implement PyFileSystem.get_file_info(selector).
+        """
+
+    @abstractmethod
+    def create_dir(self, path, recursive):
+        """
+        Implement PyFileSystem.create_dir(...).
+        """
+
+    @abstractmethod
+    def delete_dir(self, path):
+        """
+        Implement PyFileSystem.delete_dir(...).
+        """
+
+    @abstractmethod
+    def delete_dir_contents(self, path):
+        """
+        Implement PyFileSystem.delete_dir_contents(...).
+        """
+
+    @abstractmethod
+    def delete_file(self, path):
+        """
+        Implement PyFileSystem.delete_file(...).
+        """
+
+    @abstractmethod
+    def move(self, src, dest):
+        """
+        Implement PyFileSystem.move(...).

Review comment:
       I suppose methods like this are expected to just raise the appropriate FileNotFoundError in case `src` does not exist?

##########
File path: cpp/src/arrow/python/filesystem.h
##########
@@ -0,0 +1,115 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "arrow/filesystem/filesystem.h"
+#include "arrow/python/common.h"
+#include "arrow/python/visibility.h"
+#include "arrow/util/macros.h"
+
+namespace arrow {
+namespace py {
+namespace fs {
+
+class ARROW_PYTHON_EXPORT PyFileSystemVtable {
+ public:
+  std::function<void(PyObject*, std::string* out)> get_type_name;
+  std::function<bool(PyObject*, const arrow::fs::FileSystem& other)> equals;
+
+  std::function<void(PyObject*, const std::string& path, arrow::fs::FileInfo* out)>
+      get_file_info;
+  std::function<void(PyObject*, const std::vector<std::string>& paths,
+                     std::vector<arrow::fs::FileInfo>* out)>
+      get_file_info_vector;
+  std::function<void(PyObject*, const arrow::fs::FileSelector&,
+                     std::vector<arrow::fs::FileInfo>* out)>
+      get_file_info_selector;
+
+  std::function<void(PyObject*, const std::string& path, bool)> create_dir;
+  std::function<void(PyObject*, const std::string& path)> delete_dir;
+  std::function<void(PyObject*, const std::string& path)> delete_dir_contents;
+  std::function<void(PyObject*, const std::string& path)> delete_file;
+  std::function<void(PyObject*, const std::string& src, const std::string& dest)> move;
+  std::function<void(PyObject*, const std::string& src, const std::string& dest)>
+      copy_file;
+
+  std::function<void(PyObject*, const std::string& path,
+                     std::shared_ptr<io::InputStream>* out)>
+      open_input_stream;
+  std::function<void(PyObject*, const std::string& path,
+                     std::shared_ptr<io::RandomAccessFile>* out)>
+      open_input_file;
+  std::function<void(PyObject*, const std::string& path,
+                     std::shared_ptr<io::OutputStream>* out)>
+      open_output_stream;
+  std::function<void(PyObject*, const std::string& path,
+                     std::shared_ptr<io::OutputStream>* out)>
+      open_append_stream;
+};
+
+class ARROW_PYTHON_EXPORT PyFileSystem : public arrow::fs::FileSystem {

Review comment:
       Does this need some doc comments? (I don't know how "public" this is, in the end it is only to be used in the python bindings I think, and there the PyFileSystem class has docstrings for python users, so that might be sufficient)

##########
File path: cpp/src/arrow/python/filesystem.h
##########
@@ -0,0 +1,115 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "arrow/filesystem/filesystem.h"
+#include "arrow/python/common.h"
+#include "arrow/python/visibility.h"
+#include "arrow/util/macros.h"
+
+namespace arrow {
+namespace py {
+namespace fs {
+
+class ARROW_PYTHON_EXPORT PyFileSystemVtable {
+ public:
+  std::function<void(PyObject*, std::string* out)> get_type_name;
+  std::function<bool(PyObject*, const arrow::fs::FileSystem& other)> equals;
+
+  std::function<void(PyObject*, const std::string& path, arrow::fs::FileInfo* out)>
+      get_file_info;
+  std::function<void(PyObject*, const std::vector<std::string>& paths,
+                     std::vector<arrow::fs::FileInfo>* out)>
+      get_file_info_vector;
+  std::function<void(PyObject*, const arrow::fs::FileSelector&,
+                     std::vector<arrow::fs::FileInfo>* out)>
+      get_file_info_selector;
+
+  std::function<void(PyObject*, const std::string& path, bool)> create_dir;
+  std::function<void(PyObject*, const std::string& path)> delete_dir;
+  std::function<void(PyObject*, const std::string& path)> delete_dir_contents;
+  std::function<void(PyObject*, const std::string& path)> delete_file;
+  std::function<void(PyObject*, const std::string& src, const std::string& dest)> move;
+  std::function<void(PyObject*, const std::string& src, const std::string& dest)>
+      copy_file;
+
+  std::function<void(PyObject*, const std::string& path,
+                     std::shared_ptr<io::InputStream>* out)>
+      open_input_stream;
+  std::function<void(PyObject*, const std::string& path,
+                     std::shared_ptr<io::RandomAccessFile>* out)>
+      open_input_file;
+  std::function<void(PyObject*, const std::string& path,
+                     std::shared_ptr<io::OutputStream>* out)>
+      open_output_stream;
+  std::function<void(PyObject*, const std::string& path,
+                     std::shared_ptr<io::OutputStream>* out)>
+      open_append_stream;
+};
+
+class ARROW_PYTHON_EXPORT PyFileSystem : public arrow::fs::FileSystem {

Review comment:
       Maybe just one-liner indicating this is only used to implement pyarrow.fs.PyFileSystem could be added then, but not that important




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #7349: ARROW-8766: [Python] Allow implementing filesystems in Python

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7349:
URL: https://github.com/apache/arrow/pull/7349#issuecomment-641187317






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #7349: ARROW-8766: [Python] Allow implementing filesystems in Python

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #7349:
URL: https://github.com/apache/arrow/pull/7349


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7349: ARROW-8766: [Python] Allow implementing filesystems in Python

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7349:
URL: https://github.com/apache/arrow/pull/7349#issuecomment-641322746


   > Recognizing append mode as a "write" mode seems to work:
   
   You'll have to check that writing indeed appends at the end rather than e.g. truncating.
   
   More generally, while I originally added the append-open method, I'm not sure it will ever be useful, so in the meantime if an implementation wants to raise `NotImplementedError`, it's fine to me.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7349: ARROW-8766: [Python] Allow implementing filesystems in Python

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7349:
URL: https://github.com/apache/arrow/pull/7349#discussion_r437451340



##########
File path: cpp/src/arrow/python/filesystem.h
##########
@@ -0,0 +1,115 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "arrow/filesystem/filesystem.h"
+#include "arrow/python/common.h"
+#include "arrow/python/visibility.h"
+#include "arrow/util/macros.h"
+
+namespace arrow {
+namespace py {
+namespace fs {
+
+class ARROW_PYTHON_EXPORT PyFileSystemVtable {
+ public:
+  std::function<void(PyObject*, std::string* out)> get_type_name;
+  std::function<bool(PyObject*, const arrow::fs::FileSystem& other)> equals;
+
+  std::function<void(PyObject*, const std::string& path, arrow::fs::FileInfo* out)>
+      get_file_info;
+  std::function<void(PyObject*, const std::vector<std::string>& paths,
+                     std::vector<arrow::fs::FileInfo>* out)>
+      get_file_info_vector;
+  std::function<void(PyObject*, const arrow::fs::FileSelector&,
+                     std::vector<arrow::fs::FileInfo>* out)>
+      get_file_info_selector;
+
+  std::function<void(PyObject*, const std::string& path, bool)> create_dir;
+  std::function<void(PyObject*, const std::string& path)> delete_dir;
+  std::function<void(PyObject*, const std::string& path)> delete_dir_contents;
+  std::function<void(PyObject*, const std::string& path)> delete_file;
+  std::function<void(PyObject*, const std::string& src, const std::string& dest)> move;
+  std::function<void(PyObject*, const std::string& src, const std::string& dest)>
+      copy_file;
+
+  std::function<void(PyObject*, const std::string& path,
+                     std::shared_ptr<io::InputStream>* out)>
+      open_input_stream;
+  std::function<void(PyObject*, const std::string& path,
+                     std::shared_ptr<io::RandomAccessFile>* out)>
+      open_input_file;
+  std::function<void(PyObject*, const std::string& path,
+                     std::shared_ptr<io::OutputStream>* out)>
+      open_output_stream;
+  std::function<void(PyObject*, const std::string& path,
+                     std::shared_ptr<io::OutputStream>* out)>
+      open_append_stream;
+};
+
+class ARROW_PYTHON_EXPORT PyFileSystem : public arrow::fs::FileSystem {

Review comment:
       It's not public as a C++ API, indeed, but if there's some stuff there that needs explaining, feel free to point it out and I'll add some comments.

##########
File path: python/pyarrow/_fs.pyx
##########
@@ -685,3 +756,234 @@ cdef class _MockFileSystem(FileSystem):
     cdef init(self, const shared_ptr[CFileSystem]& wrapped):
         FileSystem.init(self, wrapped)
         self.mockfs = <CMockFileSystem*> wrapped.get()
+
+
+cdef class PyFileSystem(FileSystem):
+    """
+    A FileSystem with behavior implemented in Python.
+
+    A PyFileSystem is backed by a FileSystemHandler instance.
+    """
+
+    def __init__(self, handler):
+        cdef:
+            CPyFileSystemVtable vtable
+            shared_ptr[CPyFileSystem] wrapped
+
+        if not isinstance(handler, FileSystemHandler):
+            raise TypeError("Expected a FileSystemHandler instance, got {0}"
+                            .format(type(handler)))
+
+        vtable.get_type_name = _cb_get_type_name
+        vtable.equals = _cb_equals
+        vtable.get_file_info = _cb_get_file_info
+        vtable.get_file_info_vector = _cb_get_file_info_vector
+        vtable.get_file_info_selector = _cb_get_file_info_selector
+        vtable.create_dir = _cb_create_dir
+        vtable.delete_dir = _cb_delete_dir
+        vtable.delete_dir_contents = _cb_delete_dir_contents
+        vtable.delete_file = _cb_delete_file
+        vtable.move = _cb_move
+        vtable.copy_file = _cb_copy_file
+        vtable.open_input_stream = _cb_open_input_stream
+        vtable.open_input_file = _cb_open_input_file
+        vtable.open_output_stream = _cb_open_output_stream
+        vtable.open_append_stream = _cb_open_append_stream
+
+        wrapped = CPyFileSystem.Make(handler, move(vtable))
+        self.init(<shared_ptr[CFileSystem]> wrapped)
+
+    cdef init(self, const shared_ptr[CFileSystem]& wrapped):
+        FileSystem.init(self, wrapped)
+        self.pyfs = <CPyFileSystem*> wrapped.get()
+
+    @property
+    def handler(self):
+        """
+        The filesystem's underlying handler.
+
+        Returns
+        -------
+        handler : FileSystemHandler
+        """
+        return <object> self.pyfs.handler()
+
+    def __reduce__(self):
+        return PyFileSystem, (self.handler,)
+
+
+class FileSystemHandler(ABC):
+    """
+    An abstract class exposing methods to implement PyFileSystem's behavior.
+    """
+
+    @abstractmethod
+    def get_type_name(self):
+        """
+        Implement PyFileSystem.type_name.
+        """
+
+    @abstractmethod
+    def get_file_info(self, paths):
+        """
+        Implement PyFileSystem.get_file_info(paths).
+        """
+
+    @abstractmethod
+    def get_file_info_selector(self, selector):
+        """
+        Implement PyFileSystem.get_file_info(selector).
+        """
+
+    @abstractmethod
+    def create_dir(self, path, recursive):
+        """
+        Implement PyFileSystem.create_dir(...).
+        """
+
+    @abstractmethod
+    def delete_dir(self, path):
+        """
+        Implement PyFileSystem.delete_dir(...).
+        """
+
+    @abstractmethod
+    def delete_dir_contents(self, path):
+        """
+        Implement PyFileSystem.delete_dir_contents(...).
+        """
+
+    @abstractmethod
+    def delete_file(self, path):
+        """
+        Implement PyFileSystem.delete_file(...).
+        """
+
+    @abstractmethod
+    def move(self, src, dest):
+        """
+        Implement PyFileSystem.move(...).

Review comment:
       Right.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #7349: ARROW-8766: [Python] Allow implementing filesystems in Python

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7349:
URL: https://github.com/apache/arrow/pull/7349#issuecomment-639062788


   A first test case is working, using fsspec's in-memory filesystem:
   
   ```
   In [1]: import pyarrow.parquet as pq
      ...: import pyarrow.dataset as ds
      ...:
      ...: from pyarrow.fs import PyFileSystem
      ...: from pyarrow.tests.test_fs import FSSpecHandler
   
   In [2]: import fsspec
   
   In [3]: memfs = fsspec.filesystem("memory")
   
   In [4]: table = pa.table({'a': [1, 2, 3]})
   
   In [5]: with memfs.open("test", "wb") as f:
      ...:     pq.write_table(table, f)
      ...:
   
   In [6]: dataset = ds.dataset("test", filesystem=PyFileSystem(FSSpecHandler(memfs)))
   
   In [7]: dataset.to_table(filter=ds.field('a') > 1).to_pandas()
   Out[7]:
      a
   0  2
   1  3
   ```
   
   I only have a bit trouble finding a robust way to get a file handle for an open file from the fsspec filesystem, but so that is not related to this PR ;) For the rest it was quite easy to get this working!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org