You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/05 13:48:25 UTC

[GitHub] [arrow] raulcd opened a new pull request, #12800: ARROW-15776: [Python] Expose IpcReadOptions

raulcd opened a new pull request, #12800:
URL: https://github.com/apache/arrow/pull/12800

   This PR intends to expose IpcReadOptions to pyarrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12800:
URL: https://github.com/apache/arrow/pull/12800#issuecomment-1110294487

   ['Python', 'R'] benchmarks have high level of regressions.
   [test-mac-arm](https://conbench.ursa.dev/compare/runs/e326cf0f75b24af09b7443f99755009d...8cbc8426a14f40df83d237a9f78cdd9e/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r845202997


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()

Review Comment:
   Naively in Python, I would interpret an empty list as not reading any columns at all



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12800:
URL: https://github.com/apache/arrow/pull/12800#issuecomment-1088727011

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
raulcd commented on PR #12800:
URL: https://github.com/apache/arrow/pull/12800#issuecomment-1090342901

   @jorisvandenbossche As discussed I think this can start review process. I was going to assign you as reviewer but I don't have permission.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
raulcd commented on PR #12800:
URL: https://github.com/apache/arrow/pull/12800#issuecomment-1092871718

   > Actually, we should also add it to the docs here:
   > 
   > https://github.com/apache/arrow/blob/b8299436c8b1a2d7cd3d6e019a2b750893a3af87/docs/source/python/api/ipc.rst?plain=1#L41
   
   Done and validated documentation generated:
   ![image](https://user-images.githubusercontent.com/639755/162447260-307db4bc-0b5e-41ee-ac5a-9a8c7e729a63.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
raulcd commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r845262179


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()

Review Comment:
   I don't have a bunch of experience with cython but from my tests vector doesn't seem to allow nullable:
   ```
   pyarrow/ipc.pxi:157: in pyarrow.lib.IpcReadOptions.included_fields.__set__
       self.c_options.included_fields = None
   stringsource:47: in vector.from_py.__pyx_convert_vector_from_py_int
       ???
   E   TypeError: 'NoneType' object is not iterable
   ```
   I am not sure I understand what would be the proposed change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
raulcd commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r846138841


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,65 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        if included_fields is not None:
+            self.included_fields = included_fields
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        cdef:
+            vector[int] included_fields
+        if value is None:
+            self.c_options.included_fields = included_fields

Review Comment:
   you are correct, thanks I missed to push it, pushed now here: 189c8a8f2c6584a98be96a9b6a72bee61506fb37



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
raulcd commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r845925279


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()

Review Comment:
   Thanks! I have pushed d813029bdf2f4a424e1e267c564acd246cc9aa9b which maintains the default C++ value for `vector[int]` in the case of None.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r845179940


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()

Review Comment:
   Hmm vector here isn't nullable so they'd be effectively the same? (We should probably accept both None and empty list though.)



##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()
+
+    @property
+    def memory_pool(self):
+        cdef:
+            MemoryPool pool = MemoryPool.__new__(MemoryPool)
+        pool.init(self.c_options.memory_pool)
+        return pool

Review Comment:
   It certainly seems like for consistency we should keep memory_pool separate, yeah.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
raulcd commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r846006032


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()
+
+    @property
+    def memory_pool(self):
+        cdef:
+            MemoryPool pool = MemoryPool.__new__(MemoryPool)
+        pool.init(self.c_options.memory_pool)
+        return pool

Review Comment:
   As discussed I have removed `MemoryPool` from `IpcReadOptions` and have exposed directly in `RecordBatchStreamReader/RecordBatchFileReader/open_stream/open_file`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r852846119


##########
python/pyarrow/ipc.py:
##########
@@ -89,10 +95,19 @@ class RecordBatchFileReader(lib._RecordBatchFileReader):
     footer_offset : int, default None
         If the file is embedded in some larger file, this is the byte offset to
         the very end of the file data
+    options : pyarrow.ipc.IpcReadOptions
+        Options for IPC serialization.
+

Review Comment:
   Nit: remove this empty line?



##########
python/pyarrow/ipc.py:
##########
@@ -39,10 +39,16 @@ class RecordBatchStreamReader(lib._RecordBatchStreamReader):
     ----------
     source : bytes/buffer-like, pyarrow.NativeFile, or file-like Python object
         Either an in-memory buffer, or a readable file object.
+    options : pyarrow.ipc.IpcReadOptions
+        Options for IPC deserialization.
+        If None, default values will be used.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified.

Review Comment:
   Let's use a consistent style, so:
   ```suggestion
           If None, default memory pool is used.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
raulcd commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r849317078


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,63 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        if included_fields is not None:
+            self.included_fields = included_fields
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        if value is None:

Review Comment:
   Thanks! Done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #12800:
URL: https://github.com/apache/arrow/pull/12800#issuecomment-1092829961

   Actually, we should also add it to the docs here: https://github.com/apache/arrow/blob/b8299436c8b1a2d7cd3d6e019a2b750893a3af87/docs/source/python/api/ipc.rst?plain=1#L41


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r845379953


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()

Review Comment:
   vector is not nullable so you can never get None, just an empty list. I think that's fine, but it's just nice to be able to set it to None and do the right thing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r846080309


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,65 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        if included_fields is not None:
+            self.included_fields = included_fields
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        cdef:
+            vector[int] included_fields
+        if value is None:
+            self.c_options.included_fields = included_fields

Review Comment:
   You _should_ be able to just `self.c_options.included_fields.clear()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r848428439


##########
python/pyarrow/ipc.py:
##########
@@ -39,10 +39,17 @@ class RecordBatchStreamReader(lib._RecordBatchStreamReader):
     ----------
     source : bytes/buffer-like, pyarrow.NativeFile, or file-like Python object
         Either an in-memory buffer, or a readable file object.
+    options : pyarrow.ipc.IpcReadOptions
+        Options for IPC serialization.
+
+        If None, default values will be used.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified.
     """
 
-    def __init__(self, source):
-        self._open(source)
+    def __init__(self, source, *, options=None, memory_pool=None):

Review Comment:
   Why not put the `memory_pool` parameter in `IpcReadOptions` as in C++?



##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,63 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        if included_fields is not None:
+            self.included_fields = included_fields
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        if value is None:

Review Comment:
   I'm not sure allowing None is useful. You can instead write the method argument as `list value not None`.



##########
python/pyarrow/tests/test_ipc.py:
##########
@@ -487,6 +538,51 @@ def test_stream_options_roundtrip(stream_fixture, options):
         reader.read_next_batch()
 
 
+def test_read_options():
+    options = pa.ipc.IpcReadOptions()
+    assert options.use_threads is True
+    assert options.ensure_native_endian is True
+    assert options.included_fields == []
+
+    options.ensure_native_endian = False
+    assert options.ensure_native_endian is False
+
+    options.use_threads = False
+    assert options.use_threads is False
+
+    options.included_fields = [0, 1]
+    assert options.included_fields == [0, 1]
+    options.included_fields = None
+    assert options.included_fields == []
+
+    options = pa.ipc.IpcReadOptions(
+        use_threads=False, ensure_native_endian=False,
+        included_fields=[1]
+    )
+    assert options.use_threads is False
+    assert options.ensure_native_endian is False
+    assert options.included_fields == [1]
+
+
+def test_read_options_included_fields(stream_fixture):
+    options1 = pa.ipc.IpcReadOptions()
+    options2 = pa.ipc.IpcReadOptions(included_fields=[1])
+    stream_fixture.write_batches()
+    source = stream_fixture.get_source()
+
+    reader1 = pa.ipc.open_stream(source, options=options1)
+    reader2 = pa.ipc.open_stream(
+        source, options=options2, memory_pool=pa.system_memory_pool())
+
+    result1 = reader1.read_all()
+    result2 = reader2.read_all()
+
+    assert result1.num_columns == 2
+    assert result2.num_columns == 1
+
+    assert result2[0] == result1[1]

Review Comment:
   Can we compare with expected results instead?



##########
python/pyarrow/ipc.py:
##########
@@ -39,10 +39,17 @@ class RecordBatchStreamReader(lib._RecordBatchStreamReader):
     ----------
     source : bytes/buffer-like, pyarrow.NativeFile, or file-like Python object
         Either an in-memory buffer, or a readable file object.
+    options : pyarrow.ipc.IpcReadOptions
+        Options for IPC serialization.
+
+        If None, default values will be used.

Review Comment:
   No need for the empty line IMHO.
   ```suggestion
           Options for IPC deserialization.
           If None, default values will be used.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r845140200


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()

Review Comment:
   We could also use None to signal all columns, instead of an empty list? (like `compression` in the IpcWriteOptions uses None to indicate the default which is a nullptr on the C++ side. Or the default vector is a empty one and not a nullptr?)



##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()
+
+    @property
+    def memory_pool(self):
+        cdef:
+            MemoryPool pool = MemoryPool.__new__(MemoryPool)
+        pool.init(self.c_options.memory_pool)
+        return pool

Review Comment:
   Something @raulcd and I were discussing: it is a bit strange to be creating a MemoryPool object here just for the property on the options class. 
   But, more in general, almost all other APIs in pyarrow expose a direct `memory_pool` keyword, and not through an options class (eg also the CSV readers have a `memory_pool` keyword _next_ to the different option classes). So it would also be an option here to expose `memory_pool` directly in RecordBatchStraem/FileReader/open_stream/open_file, and not through the IpcReadOptions class?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r845204125


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()

Review Comment:
   Right. Actually, I think that's a limitation of our current APIs: it's not possible to specify "read no columns"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r845266094


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()

Review Comment:
   Yes, if we want to use None this way, I think you need to do it like `compression` in the write options, and so do in the init something like:
   
   ```
   if included_fields is not None:
       self.included_fields = included_fields
   ```
   
   (so it uses the C++ default if None was specified)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12800:
URL: https://github.com/apache/arrow/pull/12800#issuecomment-1103607682

   Benchmark runs are scheduled for baseline = 6c10a389bbc35b67187930dc0db2a88671e76c2d and contender = 2a2c0873b3ae62d2d08225cc88e53ae004864a7f. 2a2c0873b3ae62d2d08225cc88e53ae004864a7f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/2c575cfb84fc42abb4abf7d4196d1151...91bc0ed6069e4b44aa81bd231785752c/)
   [Finished :arrow_down:0.29% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/e326cf0f75b24af09b7443f99755009d...8cbc8426a14f40df83d237a9f78cdd9e/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/fe181a4811e845788fefcf1712449032...206f3fff53774837bb9961d764653636/)
   [Finished :arrow_down:0.26% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/69f4a593f6e84e25ab93365d3ab1fa15...723f87fbdf324bc1a2c0ba46e8a4b3e6/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/539| `2a2c0873` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/526| `2a2c0873` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/525| `2a2c0873` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/536| `2a2c0873` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/538| `6c10a389` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/525| `6c10a389` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/524| `6c10a389` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/535| `6c10a389` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r846123286


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,65 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        if included_fields is not None:
+            self.included_fields = included_fields
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        cdef:
+            vector[int] included_fields
+        if value is None:
+            self.c_options.included_fields = included_fields

Review Comment:
   I don't see a change here, is there a missing commit?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r845266094


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()

Review Comment:
   Yes, if we want to use None this way, I think you need to do it like `compression` in the write options, and so do in the init something like:
   
   ```
   if included_fields is not None:
       self.included_fields = included_fields
   ```
   
   (so it uses the C++ default if None was specified)
   
   Doing it this way might keep the option open for the future to have `[]` mean to "read no columns"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r849312668


##########
python/pyarrow/ipc.py:
##########
@@ -39,10 +39,17 @@ class RecordBatchStreamReader(lib._RecordBatchStreamReader):
     ----------
     source : bytes/buffer-like, pyarrow.NativeFile, or file-like Python object
         Either an in-memory buffer, or a readable file object.
+    options : pyarrow.ipc.IpcReadOptions
+        Options for IPC serialization.
+
+        If None, default values will be used.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified.
     """
 
-    def __init__(self, source):
-        self._open(source)
+    def __init__(self, source, *, options=None, memory_pool=None):

Review Comment:
   Ok, that's a reasonable argument.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
raulcd commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r848530602


##########
python/pyarrow/ipc.py:
##########
@@ -39,10 +39,17 @@ class RecordBatchStreamReader(lib._RecordBatchStreamReader):
     ----------
     source : bytes/buffer-like, pyarrow.NativeFile, or file-like Python object
         Either an in-memory buffer, or a readable file object.
+    options : pyarrow.ipc.IpcReadOptions
+        Options for IPC serialization.
+
+        If None, default values will be used.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified.
     """
 
-    def __init__(self, source):
-        self._open(source)
+    def __init__(self, source, *, options=None, memory_pool=None):

Review Comment:
   This was the original implementation but based on the review comments from @jorisvandenbossche and @lidavidm in order to be consistent with other APIs in pyarrow we decided to expose it directly, the original comment is the following one:
   > But, more in general, almost all other APIs in pyarrow expose a direct memory_pool keyword, and not through an options class (eg also the CSV readers have a memory_pool keyword next to the different option classes). So it would also be an option here to expose memory_pool directly in RecordBatchStraem/FileReader/open_stream/open_file, and not through the IpcReadOptions class?
   
   and the thread were we discussed it:
   https://github.com/apache/arrow/pull/12800#discussion_r845144860



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12800:
URL: https://github.com/apache/arrow/pull/12800#issuecomment-1088726972

   https://issues.apache.org/jira/browse/ARROW-15776


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
raulcd commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r846121555


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,65 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        if included_fields is not None:
+            self.included_fields = included_fields
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        cdef:
+            vector[int] included_fields
+        if value is None:
+            self.c_options.included_fields = included_fields

Review Comment:
   Done, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
raulcd commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r845287402


##########
python/pyarrow/ipc.pxi:
##########
@@ -100,6 +100,74 @@ cdef _wrap_read_stats(CIpcReadStats c):
                      c.num_replaced_dictionaries)
 
 
+cdef class IpcReadOptions(_Weakrefable):
+    """
+    Serialization options for reading IPC format.
+
+    Parameters
+    ----------
+    use_threads : bool
+        Whether to use the global CPU thread pool to parallelize any
+        computational tasks like decompression.
+    ensure_native_endian : bool
+        Whether to convert incoming data to platform-native endianness.
+        Default is true.
+    included_fields : list
+        If empty (the default), return all deserialized fields.
+        If non-empty, the values are the indices of fields to read on
+        the top-level schema.
+    memory_pool : MemoryPool, default None
+        Uses default memory pool if not specified
+    """
+    __slots__ = ()
+
+    # cdef block is in lib.pxd
+
+    def __init__(self, *, bint ensure_native_endian=True,
+                 bint use_threads=True, list included_fields=None,
+                 MemoryPool memory_pool=None):
+        self.c_options = CIpcReadOptions.Defaults()
+        self.ensure_native_endian = ensure_native_endian
+        self.use_threads = use_threads
+        self.included_fields = included_fields
+        self.memory_pool = memory_pool
+
+    @property
+    def ensure_native_endian(self):
+        return self.c_options.ensure_native_endian
+
+    @ensure_native_endian.setter
+    def ensure_native_endian(self, bint value):
+        self.c_options.ensure_native_endian = value
+
+    @property
+    def use_threads(self):
+        return self.c_options.use_threads
+
+    @use_threads.setter
+    def use_threads(self, bint value):
+        self.c_options.use_threads = value
+
+    @property
+    def included_fields(self):
+        return self.c_options.included_fields
+
+    @included_fields.setter
+    def included_fields(self, list value):
+        self.c_options.included_fields = value or list()

Review Comment:
   my test above was incorrect but my question is the same, is it possible to set `included_fields` to null with the current API implementation?
   ```python
       def included_fields(self, list value):
           cdef:
               vector[int] included_fields
           if value is None:
               self.c_options.included_fields = included_fields
           else:
               self.c_options.included_fields = value
   ```
   As this still returns an empty list in the case of `None`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou closed pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions
URL: https://github.com/apache/arrow/pull/12800


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on a diff in pull request #12800: ARROW-15776: [Python] Expose IpcReadOptions

Posted by GitBox <gi...@apache.org>.
raulcd commented on code in PR #12800:
URL: https://github.com/apache/arrow/pull/12800#discussion_r852880184


##########
python/pyarrow/ipc.py:
##########
@@ -89,10 +95,19 @@ class RecordBatchFileReader(lib._RecordBatchFileReader):
     footer_offset : int, default None
         If the file is embedded in some larger file, this is the byte offset to
         the very end of the file data
+    options : pyarrow.ipc.IpcReadOptions
+        Options for IPC serialization.
+

Review Comment:
   Removed not required empty line and updated memory_pool comment in all occurrences as requested:
   38bdde3d4dd1cba26846a88f467f1af0416bf1a8



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org