Posted to issues@arrow.apache.org by "Artem KOZHEVNIKOV (Jira)" <ji...@apache.org> on 2019/08/21 21:04:00 UTC
[jira] [Comment Edited] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use
[ https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912675#comment-16912675 ]
Artem KOZHEVNIKOV edited comment on ARROW-5454 at 8/21/19 9:03 PM:
-------------------------------------------------------------------
If it were in pure Python, we could do something like the following (relying on `pa.Array.take`):
{code:python}
import numpy as np
import pyarrow as pa

def take_on_chunked_array(charr, indices):
    # np.array copies, so normalizing negative indices does not mutate the caller's array
    indices = np.array(indices, dtype=np.int64)
    if indices.max() >= len(charr):  # an index equal to len(charr) is already out of bounds
        raise IndexError()
    indices[indices < 0] += len(charr)
    if indices.min() < 0:
        raise IndexError()
    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()
    sort_idx = np.argsort(indices)
    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx)  # inverse sort indices
    # btw, we could check if indices are already sorted to avoid an extra copy in this case
    limit_idx = [(0, 0, 0)]
    for i, cum_length in enumerate(cum_lengths):
        limit_idx.append((i, limit_idx[-1][-1], np.searchsorted(indices, cum_length)))
    limit_idx = limit_idx[1:]
    cum_lengths -= lengths  # now holds the start offset of each chunk
    res_array = pa.concat_arrays([
        charr.chunks[i].take(pa.array(indices[j_start:j_end] - cum_lengths[i]))
        for i, j_start, j_end in limit_idx if j_start < j_end])
    return res_array.take(pa.array(sort_idx))

charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]),
                          pa.array([5, 6, 7, 8])])
# both calls should return the same values
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()
{code}
Do we want something similar in C++? Should we reuse the `cpp:Array:Take` method, or do we want to avoid the extra copy? We can certainly avoid the global sort of the indices as well.
> [C++] Implement Take on ChunkedArray for DataFrame use
> ------------------------------------------------------
>
> Key: ARROW-5454
> URL: https://issues.apache.org/jira/browse/ARROW-5454
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 1.0.0
>
>
> Follow up to ARROW-2667