Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/05/31 14:18:50 UTC

[GitHub] [arrow] jorisvandenbossche commented on issue #35748: [Python] Implement efficient merging of chunked arrays

jorisvandenbossche commented on issue #35748:
URL: https://github.com/apache/arrow/issues/35748#issuecomment-1570330670

   We already have `pa.concat_arrays`, which returns a single (non-chunked) array. However, it expects a list of Arrays and doesn't work with ChunkedArrays. So to use it to concatenate a list of chunked arrays into a single array, we need some more gymnastics to flatten the chunks first. Currently:
   
   ```python
   >>> merged = pa.concat_arrays([chunk for arr in [a1, a2] for chunk in arr.chunks])
   >>> merged
   <pyarrow.lib.Int64Array object at 0x7fe2e69710c0>
   [
     1,
     2,
     3,
     6,
     7,
     4,
     7,
     8
   ]
   ```
   
   We should maybe update `pa.concat_arrays` to also accept ChunkedArrays.
   
   Of course, that doesn't make them unique. You can then get the unique values of the merged array:
   
   ```python
   >>> merged.unique()
   <pyarrow.lib.Int64Array object at 0x7fe2e6bd12a0>
   [
     1,
     2,
     3,
     6,
     7,
     4,
     8
   ]
   ```
   
   But for larger arrays, it might be more efficient to get the unique values before actually concatenating, since we can also calculate the unique values directly for a ChunkedArray. We can convert the list of chunked arrays into one chunked array (which is zero-copy) and then get the unique values of that:
   
   ```python
   >>> merged_chunked = pa.chunked_array([chunk for arr in [a1, a2] for chunk in arr.chunks])
   >>> merged_chunked.unique()
   <pyarrow.lib.Int64Array object at 0x7fe2e692f880>
   [
     1,
     2,
     3,
     6,
     7,
     4,
     8
   ]
   ```

