You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Tao He (Jira)" <ji...@apache.org> on 2021/02/02 03:06:00 UTC
[jira] [Commented] (ARROW-11463) Allow configuration of
IpcWriterOptions 64Bit from PyArrow
[ https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276819#comment-17276819 ]
Tao He commented on ARROW-11463:
--------------------------------
The `pa.ipc.new_stream` accepts an option: [https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_stream.html#pyarrow.ipc.new_stream|https://arrow.apache.org/docs/python/generated/pyarrow.ipc.new_stream.html#pyarrow.ipc.new_stream,]
I think we just need to expose the `allow_64bit` field to python.
I could make a PR for that.
> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> ----------------------------------------------------------
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
> Issue Type: Task
> Components: Python
> Reporter: Leonard Lausen
> Priority: Major
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take` will be around 1000x slower compared to the `pyarrow.Table.take` on the table with combined chunks (1 chunk). Unfortunately, if such table contains large list data type, it's easy for the flattened table to contain more than 2**31 rows and serialization of the table with combined chunks (eg for Plasma store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays larger than 2^31 - 1 in length`
> I couldn't find a way to enable 64bit support for the serialization as called from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions 64 bit setting; further the Python serialization APIs do not allow specification of IpcWriteOptions)
> I was able to serialize successfully after changing the default and rebuilding
> {code:c++}
> modified cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
> /// \brief If true, allow field lengths that don't fit in a signed 32-bit int.
> ///
> /// Some implementations may not be able to parse streams created with this option.
> - bool allow_64bit = false;
> + bool allow_64bit = true;
>
> /// \brief The maximum permitted schema nesting depth.
> int max_recursion_depth = kMaxNestingDepth;
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)