Posted to issues@arrow.apache.org by "Artem KOZHEVNIKOV (Jira)" <ji...@apache.org> on 2019/08/21 21:04:00 UTC
[jira] [Comment Edited] (ARROW-5454) [C++] Implement Take on ChunkedArray for DataFrame use

    [ https://issues.apache.org/jira/browse/ARROW-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912675#comment-16912675 ] 

Artem KOZHEVNIKOV edited comment on ARROW-5454 at 8/21/19 9:03 PM:
-------------------------------------------------------------------

If it were in pure Python, we could do something like the following (relying on `pa.Array.take`):
{code:python}
import numpy as np
import pyarrow as pa


def take_on_chunked_array(charr, indices):
    # Copy the indices so the caller's array is not modified in place.
    indices = np.array(indices, dtype=np.int64)
    n = len(charr)
    if indices.size == 0:
        return pa.array([], type=charr.type)
    if indices.max() >= n or indices.min() < -n:
        raise IndexError("index out of bounds")
    indices[indices < 0] += n  # resolve negative indices

    lengths = np.fromiter(map(len, charr.chunks), dtype=np.int64)
    cum_lengths = lengths.cumsum()

    # Sort the indices so that each chunk owns one contiguous slice of them.
    sort_idx = np.argsort(indices)
    indices = indices[sort_idx]
    sort_idx = np.argsort(sort_idx)  # inverse permutation, restores the requested order
    # btw, we could check if indices are already sorted to avoid an extra copy in this case

    # limit_idx holds (chunk number, start, end): the slice of sorted indices in each chunk.
    limit_idx = []
    start = 0
    for i, cum_length in enumerate(cum_lengths):
        end = np.searchsorted(indices, cum_length)
        limit_idx.append((i, start, end))
        start = end
    offsets = cum_lengths - lengths  # global index of each chunk's first element
    res_array = pa.concat_arrays([
        charr.chunks[i].take(pa.array(indices[j_start:j_end] - offsets[i]))
        for i, j_start, j_end in limit_idx if j_start < j_end
    ])
    return res_array.take(pa.array(sort_idx))


charr = pa.chunked_array([pa.array([0, 1]), pa.array([2, 3, 4]),
                          pa.array([5, 6, 7, 8])])
# Both expressions should give the same result.
take_on_chunked_array(charr, np.array([6, 0, 3])).to_numpy()
pa.concat_arrays(charr.chunks).take(pa.array([6, 0, 3])).to_numpy()
{code}
Do we want something similar in C++? Should we reuse the `cpp:Array:Take` method, or do we want to avoid the extra copy? We can certainly avoid the global sort of the indices as well.
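
To illustrate the last point, here is a minimal sketch of resolving each index to its chunk without sorting the indices, using a binary search over the cumulative chunk lengths. It is only an illustration under stated assumptions: plain Python lists stand in for Arrow chunks, and the names `resolve_indices` and `take_on_chunks` are hypothetical, not part of any Arrow API.

```python
from bisect import bisect_right
from itertools import accumulate


def resolve_indices(chunk_lengths, indices):
    """Map each logical index to a (chunk_number, offset_in_chunk) pair."""
    total = sum(chunk_lengths)
    # ends[k] is the logical index one past the last element of chunk k.
    ends = list(accumulate(chunk_lengths))
    resolved = []
    for index in indices:
        if not -total <= index < total:
            raise IndexError(f"index {index} is out of bounds for length {total}")
        if index < 0:
            index += total
        # First chunk whose end lies strictly beyond the index;
        # empty chunks are skipped automatically.
        chunk_number = bisect_right(ends, index)
        start = ends[chunk_number] - chunk_lengths[chunk_number]
        resolved.append((chunk_number, index - start))
    return resolved


def take_on_chunks(chunks, indices):
    """Gather elements from a list of chunks in the order given by indices."""
    resolved = resolve_indices([len(chunk) for chunk in chunks], indices)
    return [chunks[chunk_number][offset] for chunk_number, offset in resolved]


# Plain lists used as stand-ins for the three chunks in the example above.
chunks = [[0, 1], [2, 3, 4], [5, 6, 7, 8]]
result = take_on_chunks(chunks, [6, 0, 3, -1])
print(result)
```

Each lookup costs O(log k) for k chunks, so the output is produced directly in the requested order and no inverse permutation is needed. The trade-off is that values are gathered one at a time here, whereas the sorted approach lets each chunk be handled with a single `take` call.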


> [C++] Implement Take on ChunkedArray for DataFrame use
> ------------------------------------------------------
>
>                 Key: ARROW-5454
>                 URL: https://issues.apache.org/jira/browse/ARROW-5454
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Follow up to ARROW-2667


