You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/03/28 08:39:50 UTC

[GitHub] [arrow] jorisvandenbossche opened a new issue, #34755: [Python] How to handle chunked arrays output in pyarrow.array(...)

jorisvandenbossche opened a new issue, #34755:
URL: https://github.com/apache/arrow/issues/34755

   From https://github.com/apache/arrow/pull/34289#pullrequestreview-1355094099
   
   Currently, the `pyarrow.array(..)` constructor is meant to create Array object, but can return a ChunkedArray instead in two cases: 1) the object is too big to fit into a single array (eg offset gets too large for single StringArray), and 2) the object has a `__arrow_array__` that returns a ChunkedArray.
   
   However, if this starts to happen more and more, that can be annoying for places in our code where we assume `pyarrow.array(..)` gives us an Array, see for example https://github.com/apache/arrow/issues/33727#issuecomment-1387323624. For this specific case, we updated `pyarrow.array(..)` to special case chunked arrays with only 1 chunk to unpack this into a normal Array, since that's an easy zero-copy conversion (done in https://github.com/apache/arrow/pull/34289). 
   
   Longer term, what do we want to do with `pyarrow.array(..)` returning chunked arrays? 
   
   For example, passing a pandas.Series to `pyarrow.array(..)` can easily give a ChunkedArray:
   
   ```python
   >>> arr = pa.chunked_array([[1, 2], [3, 4]])
   >>> ser = pd.Series(arr, dtype=pd.ArrowDtype(arr.type))
   >>> ser
   0    1
   1    2
   2    3
   3    4
   dtype: int64[pyarrow]
   >>> pa.array(ser)
   <pyarrow.lib.ChunkedArray object at 0x7effe3f9ea90>
   [
     [
       1,
       2
     ],
     [
       3,
       4
     ]
   ]
   ```
   
   Some thoughts:
   
   - To ensure you can rely more on `pa.array(..)` to actually return an Array, we could concat chunked arrays in the example above, and then users could use `pa.asarray(..)` to get either Array or ChunkedArray
   - Keep `pa.array(..)` as flexible giving either Array/ChunkedArray, but add other function that is more strict, or a helper that ensures we always have an Array and concats chunks if necesssary, which could then be used internally where needed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #34755: [Python] How to handle chunked arrays output in pyarrow.array(...)

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34755:
URL: https://github.com/apache/arrow/issues/34755#issuecomment-1492612227

   What about adding a `chunk_if_needed` param to `pa.array`?  It could default to `True` and the existence of the parameter should inform the user of the possibilities.
   
   > here are 36 possible string array data types which are represented as 5 possible classes (ChunkedArray, StringArray, LargeStringArray, RunEndArray, DictionaryArray)
   
   Note that `StringArray`, `LargeStringArray`, `RunEndArray`, and `DictionaryArray` are all subclasses of `Array`.  So I still think `ChunkedArray` is a unique problem.
   
   In my fantasy world chunked arrays don't exist and tables are just lists of record batches :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on issue #34755: [Python] How to handle chunked arrays output in pyarrow.array(...)

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #34755:
URL: https://github.com/apache/arrow/issues/34755#issuecomment-1489183918

   Maybe this is a tangent, but I think this question gets at how complex we want arrays to be. I sometimes wish whether an array is chunked or not were an implementation detail, rather than a top-level type. This is especially when considered in combination with other array differences. A good example of this is string arrays: between chunked and contiguous, indices size, and encodings, there are 36 possible string array data types which are represented as 5 possible classes (ChunkedArray, StringArray, LargeStringArray, RunEndArray, DictionaryArray).
   
   <details>
   <summary>All 36 string arrays in PyArrow</summary>
   
   ```python
   strings = ["hello", "world"]
   # Can have i32 or i64 indices:
   pa.array(strings, pa.utf8())
   pa.array(strings, pa.large_utf8())
   # Can also be chunked
   pa.chunked_array(strings, pa.utf8())
   pa.chunked_array(strings, pa.large_utf8())
   # Can be dictionary encoded (with different indices width)
   pa.array(strings, pa.dictionary(pa.int32(), pa.utf8()))
   pa.array(strings, pa.dictionary(pa.int8(), pa.large_utf8()))
   # Can be run-end encoded
   pa.array(strings, pa.ree(pa.utf()))
   # Can be any combination of the above
   pa.chunked_array(strings, pa.ree(pa.dictionary(pa.int8(), pa.utf8())))
   ```
   
   ```python
   num_possible_indices = 2 # i32 or i64
   num_possible_chunking = 2 # contiguous or chunked
   num_possible_encodings = 3 # dictionary, ree, or ree + dictionary
   num_possible_dictionary_index = 4
   
   2 * 2 * ((2 * 4) + 1) = 36 possible string arrays in arrow
   ```
   
   These can be one of the following Python classes:
   ```
   ChunkedArray
   StringArray
   LargeStringArray
   RunEndArray
   DictionaryArray
   ```
   </details>
   
   I'd be in favor of keeping `pa.array()` returning either Array/ChunkedArray, since it's a high level function and I think I'd rather our higher-level APIs not care as much about the buffer layout.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [Python] How to handle chunked arrays output in pyarrow.array(...) [arrow]

Posted by "JoeEdwardsGitHub (via GitHub)" <gi...@apache.org>.
JoeEdwardsGitHub commented on issue #34755:
URL: https://github.com/apache/arrow/issues/34755#issuecomment-1991933357

   Cross referencing an external Apache Spark Pandas UDF bug which refers to this issue: https://issues.apache.org/jira/browse/SPARK-46776


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org