You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/24 11:30:44 UTC

[GitHub] [arrow] maartenbreddels opened a new pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

maartenbreddels opened a new pull request #8755:
URL: https://github.com/apache/arrow/pull/8755


   Related https://github.com/vaexio/vaex/pull/1075
   With this change, I'm seeing a 5x speedup and still allow for 'classical' .read(). I'm not so sure why the 5x speedup happens, I expected less of a difference. Maybe memoryview.tobytes() doesn't release the GIL.
   
   In any case, this makes it possible to create a Python-Land-based file object with great performance.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-732902645


   https://issues.apache.org/jira/browse/ARROW-10709


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] maartenbreddels commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

maartenbreddels commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-733078145


   Ok, I think that would solve the issue, because `Read(int64_t nbytes, void* out)` would then *not* call `.read` and hence not see the memoryview.
   
   I'm ok with that solution, but such a file object would not be compatible with e.g. Pandas. Also, as you mentioned, returning a memoryview from `.read` is not really standard. Do you think we should introduce another method (read_buffer) to return a memoryview/buffer instead?
   
   (Btw, to make this concrete, I have a memory mapped file which I expose using the File, which I do not want to copy at all)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-733081662


   I'm really confused. Why would you be returning a `memoryview` from `read` again?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] maartenbreddels commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

maartenbreddels commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-733088615


   Happy to hop on a call tomorrow if that helps.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] maartenbreddels commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

maartenbreddels commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-733019321


   No, but if I call PythonFile.reader_buffer() from Python, it will call `Result<std::shared_ptr<Buffer>> Read(int64_t nbytes)`, which will call PyReadableFile.read(), and will do a zero memory copy if it returns a memoryview/buffer, all happy.
   
   But if I now pass this file object to say the parquet reader, it will (via a path unknown to me) call PyReadableFile.read() as well, but via `Result<int64_t> Read(int64_t nbytes, void* out)` which will fail because it will only accept a bytes object as return value of PyReadableFile.read.
   
   So, this PR enables the use of returning a memoryview from PyReadableFile.read, and that will work within all of `pyarrow`. This file object will however, not work with pandas (who also expects a bytes object AFAICT).
   
   So the other solution would be to call `PyReadableFile.read_buffer`, when implemented from `Result<std::shared_ptr<Buffer>> Read(int64_t nbytes)`.
   
   Does that make sense (took me a while to wrap my head about it).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou closed pull request #8755: ARROW-10709: [C++][Python] Allow PyReadableFile::Read() to call pyobj.read_buffer()

Posted by GitBox <gi...@apache.org>.

pitrou closed pull request #8755:
URL: https://github.com/apache/arrow/pull/8755


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] maartenbreddels commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

maartenbreddels commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-733646443


   Yes, works beautifully. I can mmap from python, and return a memoryview from my `read_buffer` method and there are no memory copies at all, while at the same time returning a bytes object from my `read` method making it compatible with other libraries (such as pandas).
   
   I did leave in the change of accepting a buffer via `Result<int64_t> PyReadableFile::Read(int64_t nbytes, void* out)` since it's more consistent with the `Result<std::shared_ptr<Buffer>> PyReadableFile::Read(int64_t nbytes)` calling `file_->Read` and accepting a buffer.
   
   Summary:
    * In Python land `read` can return a buffer object, but to be compatible (with e.g. pandas), it should return a bytes object.
    * To support zero mem copy and have better compatibility, `read_buffer` should be implemented and return a buffer object, and `read` should return a bytes object.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] wesm commented on a change in pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

wesm commented on a change in pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#discussion_r532731697



##########
File path: python/pyarrow/tests/test_io.py
##########
@@ -163,6 +163,34 @@ def test_python_file_readinto():
         assert len(dst_buf) == length
 
 
+def test_python_file_read_buffer():
+    length = 10
+    data = b'0123456798'
+    dst_buf = bytearray(data)
+
+    class DuckReader:
+        def close(self):
+            pass
+
+        @property
+        def closed(self):
+            return False
+
+        def read_buffer(self, nbytes):
+            assert nbytes == length
+            return memoryview(dst_buf)[:nbytes]
+
+    duck_reader = DuckReader()
+    with pa.PythonFile(duck_reader, mode='r') as f:
+        buf = f.read_buffer(length)

Review comment:
       This also needs to check that the type returned by `read(...)` is no longer bytes?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-733058548


   `Read(int64_t nbytes, void* out)` would call `readinto`.
   But `Read(int64_t nbytes)` can keep calling `read`, since it can then wrap the returned Python bytes object in a C++ buffer.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a change in pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#discussion_r533193607



##########
File path: cpp/src/arrow/python/io.cc
##########
@@ -199,25 +219,32 @@ Result<int64_t> PyReadableFile::Read(int64_t nbytes, void* out) {
     PyObject* bytes_obj = bytes.obj();
     DCHECK(bytes_obj != NULL);
 
-    if (!PyBytes_Check(bytes_obj)) {
+    Py_buffer py_buf_;
+    if (!PyObject_GetBuffer(bytes_obj, &py_buf_, PyBUF_ANY_CONTIGUOUS)) {
+      const uint8_t* data = reinterpret_cast<const uint8_t*>(py_buf_.buf);
+      ARROW_CHECK_NE(data, nullptr) << "Null pointer in Py_buffer";

Review comment:
       Hmm, it was a debug mode check originally. I think it was converted to ARROW_CHECK by mistake.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-732952249


   Well, the contract for `.read()` is normally to return a bytes object. I guess we could still be liberal here and make an exception.
   
   Another possibility would be to call `.readinto()`. It would work with all file objects providing an optimized `.readinto()`, not only your own file object with a non-bytes-returning `.read()`.
   
   Also, I don't understand what this has to do with `read_buffer`...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-733094016


   Hmm, so IIUC we could:
   * try to call `read_buffer` from `PyReadableFile.Read(int64_t)` (but stop if it fails once with a `AttributeError`, to avoid looking it up every time?)
   * call `readinto` from `PyReadableFile.Read(int64_t, void*)`
   
   Would that work?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on a change in pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

pitrou commented on a change in pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#discussion_r530985982



##########
File path: cpp/src/arrow/python/io.cc
##########
@@ -199,25 +219,32 @@ Result<int64_t> PyReadableFile::Read(int64_t nbytes, void* out) {
     PyObject* bytes_obj = bytes.obj();
     DCHECK(bytes_obj != NULL);
 
-    if (!PyBytes_Check(bytes_obj)) {
+    Py_buffer py_buf_;
+    if (!PyObject_GetBuffer(bytes_obj, &py_buf_, PyBUF_ANY_CONTIGUOUS)) {
+      const uint8_t* data = reinterpret_cast<const uint8_t*>(py_buf_.buf);
+      ARROW_CHECK_NE(data, nullptr) << "Null pointer in Py_buffer";
+      std::memcpy(out, data, py_buf_.len);
+      return py_buf_.len;

Review comment:
       You must call `PyBuffer_Release`, otherwise you have a memory leak.

##########
File path: cpp/src/arrow/python/io.cc
##########
@@ -199,25 +219,32 @@ Result<int64_t> PyReadableFile::Read(int64_t nbytes, void* out) {
     PyObject* bytes_obj = bytes.obj();
     DCHECK(bytes_obj != NULL);
 
-    if (!PyBytes_Check(bytes_obj)) {
+    Py_buffer py_buf_;

Review comment:
       Nit, but underscore-ending names are for instance members.

##########
File path: cpp/src/arrow/python/io.cc
##########
@@ -199,25 +219,32 @@ Result<int64_t> PyReadableFile::Read(int64_t nbytes, void* out) {
     PyObject* bytes_obj = bytes.obj();
     DCHECK(bytes_obj != NULL);
 
-    if (!PyBytes_Check(bytes_obj)) {
+    Py_buffer py_buf_;
+    if (!PyObject_GetBuffer(bytes_obj, &py_buf_, PyBUF_ANY_CONTIGUOUS)) {
+      const uint8_t* data = reinterpret_cast<const uint8_t*>(py_buf_.buf);
+      ARROW_CHECK_NE(data, nullptr) << "Null pointer in Py_buffer";

Review comment:
       Is this check required? It should basically never happen.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-733043581


   What I don't understand is that you talk about "zero copy" but your PR does a `memcpy` call.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] maartenbreddels commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

maartenbreddels commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-733734030


   @pitrou this is ready for review


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] maartenbreddels commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

maartenbreddels commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-733050262


   Yeah, agree, that is confusing. Basically, I need to fix that code path to accept a buffer (which does a memcpy), to allow me to use the other codepath (that passes the buffer). The problem is that the two codepaths call the same python method (read), but one of them accepts buffer and bytes, and the other (that i changed) only accepts bytes. If they both accept buffer and bytes, I can use the other codepaths.
   Do I make sense?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] maartenbreddels commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

maartenbreddels commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-733088432


   Because I want to :)
   
   Well, I have a memory mapped file open in Python, and I want to wrap that in my own Python File-like object that does some checks before every read (it's a caching mechanism) and pass that File like object down to arrow (which performs badly now). 
   
   So I have a buffer/memoryview (the memory mapped file) that I want to expose to Arrow via a File API without a memory copy.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] maartenbreddels commented on a change in pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

maartenbreddels commented on a change in pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#discussion_r533159508



##########
File path: python/pyarrow/tests/test_io.py
##########
@@ -163,6 +163,34 @@ def test_python_file_readinto():
         assert len(dst_buf) == length
 
 
+def test_python_file_read_buffer():
+    length = 10
+    data = b'0123456798'
+    dst_buf = bytearray(data)
+
+    class DuckReader:
+        def close(self):
+            pass
+
+        @property
+        def closed(self):
+            return False
+
+        def read_buffer(self, nbytes):
+            assert nbytes == length
+            return memoryview(dst_buf)[:nbytes]
+
+    duck_reader = DuckReader()
+    with pa.PythonFile(duck_reader, mode='r') as f:
+        buf = f.read_buffer(length)

Review comment:
       I think both `read` and `read_buffer` should be able to return bytes (which exposes the buffer interface), and a buffer/memoryview, for backward compatibility.

##########
File path: cpp/src/arrow/python/io.cc
##########
@@ -199,25 +219,32 @@ Result<int64_t> PyReadableFile::Read(int64_t nbytes, void* out) {
     PyObject* bytes_obj = bytes.obj();
     DCHECK(bytes_obj != NULL);
 
-    if (!PyBytes_Check(bytes_obj)) {
+    Py_buffer py_buf_;
+    if (!PyObject_GetBuffer(bytes_obj, &py_buf_, PyBUF_ANY_CONTIGUOUS)) {
+      const uint8_t* data = reinterpret_cast<const uint8_t*>(py_buf_.buf);
+      ARROW_CHECK_NE(data, nullptr) << "Null pointer in Py_buffer";

Review comment:
       According to https://docs.python.org/3/c-api/buffer.html#c.PyObject_GetBuffer it seems that I cannot, I was following what you've done in https://github.com/apache/arrow/blob/070c4810d4d19b44672ed215a5f4580396d5b3dc/cpp/src/arrow/python/common.cc#L178 so maybe in practice you've noticed issues.
   I think it was originally added in https://github.com/apache/arrow/commit/5dce01f76f0569d820089682aef32d87c806b61d 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] maartenbreddels commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

maartenbreddels commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-732877780


   If this approach looks good, I can add a test.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] maartenbreddels commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

maartenbreddels commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-736369145


   I think the failure is unrelated.
   
   I'm really excited about this (seemingly trivial) feature, this gives such a massive performance improvement (equal to memory mapping) while being able to do operations (like caching) on a file basis in Python land. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] maartenbreddels commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

maartenbreddels commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-732962016


   AFAIK read_buffer in Python land maps to https://github.com/apache/arrow/blob/a8c2d290a263548c6dc14b04c1735b3f17c363e1/cpp/src/arrow/python/io.h#L43 via https://github.com/apache/arrow/blob/a8c2d290a263548c6dc14b04c1735b3f17c363e1/python/pyarrow/includes/libarrow.pxd#L1129 (but I could be wrong).
   
   .readinto would still require a memcpy to from the the final destination. How about trying to call a `read_buffer` on the Python object?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] maartenbreddels commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

maartenbreddels commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-733055953


   I now also understand your point about readinto I think. I would be even better if the read that I changed would call .readinto right? That would make that codepath more efficient.
   
   So, I guess the simplest option is to have `Result<int64_t> PyReadableFile::Read(int64_t nbytes, void* out)` and `Result<std::shared_ptr<Buffer>> PyReadableFile::Read(int64_t nbytes)` call `.readinto` right?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-732981726


   Well, your PR doesn't change `Result<std::shared_ptr<Buffer>> Read(int64_t nbytes)`, does it?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #8755: ARROW-10709: [Python] Allow PythonFile.read() to always return a buffer

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #8755:
URL: https://github.com/apache/arrow/pull/8755#issuecomment-736373876


   > I think the failure is unrelated.
   
   It is. We're dropping Python 3.5 support soon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org