
Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/01/04 22:28:11 UTC

[jira] [Updated] (ARROW-7394) [C++][DataFrame] Implement zero-copy optimizations when performing Filter

     [ https://issues.apache.org/jira/browse/ARROW-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-7394:
-----------------------------------
    Fix Version/s:     (was: 3.0.0)
                   4.0.0

> [C++][DataFrame] Implement zero-copy optimizations when performing Filter
> -------------------------------------------------------------------------
>
>                 Key: ARROW-7394
>                 URL: https://issues.apache.org/jira/browse/ARROW-7394
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: dataframe
>             Fix For: 4.0.0
>
>
> For high-selectivity filters (most elements included), it may be wasteful and slow to copy large contiguous ranges of array chunks into the resulting ChunkedArray. Instead, we can scan the filter boolean array and slice off chunks of the source data rather than copying. 
> We will need to empirically determine how large the contiguous range needs to be in order to merit the slice-based approach versus simple/naive materialization. For example, in a filter array like
> 1 0 1 0 1 0 1 0 1
> it would not make sense to slice 5 times because slicing carries some overhead. But if we had
> 1 ... 1 [100 1's] 0 1 ... 1 [100 1's] 0 1 ... 1 [100 1's] 0 1 ... 1 [100 1's] 
> then performing 4 slices may be faster than doing a copy materialization. 
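
The run-scanning idea described above can be sketched roughly as follows. This is only an illustration using the standard library, not Arrow's actual API: the Run struct, the PlanFilter function, and the min_slice_length parameter are hypothetical names, and the threshold value is exactly the quantity the issue says must be determined empirically.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical plan entry for illustration; not an Arrow type.
struct Run {
  int64_t offset;  // start index of the run in the source data
  int64_t length;  // number of consecutive selected elements
  bool zero_copy;  // true: slice the source; false: copy the elements
};

// Scan the boolean filter once and classify each contiguous run of 1s.
// min_slice_length is a placeholder threshold; a real value would have
// to come from benchmarks, as the issue notes.
std::vector<Run> PlanFilter(const std::vector<bool>& filter,
                            int64_t min_slice_length) {
  std::vector<Run> plan;
  const int64_t n = static_cast<int64_t>(filter.size());
  int64_t i = 0;
  while (i < n) {
    if (!filter[i]) {  // skip elements that are filtered out
      ++i;
      continue;
    }
    const int64_t start = i;
    while (i < n && filter[i]) ++i;  // extend the run of selected elements
    const int64_t length = i - start;
    // Long runs are worth a slice; short runs are cheaper to copy.
    plan.push_back({start, length, length >= min_slice_length});
  }
  return plan;
}
```

With an assumed threshold of 64, the alternating filter from the description yields five runs of length 1, all marked for copying, while the second filter yields four runs of length 100, all marked for slicing. A fuller version would presumably merge adjacent short runs into a single copied chunk so that the output does not fragment into many tiny chunks.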


