You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@arrow.apache.org by Leonhard Gruenschloss <le...@gruenschloss.org> on 2021/08/13 10:16:43 UTC

[C++] ListArray filtering

Hi,

I'd like to filter a ListArray, based on whether a particular value is
present in each list. Is there a better approach than the one described
below? Particularly, are there any existing compute functions that I could
use instead?

Here's a concrete example, with rows consisting of variable-length lists of
strings:
["a", "b", "x"]
["c", "d"]
["e", "x", "a"]
["c"]
["d, "e"]

If the element to search for is "x", only the first and third row would be
retained after filtering:
["a", "b", "x"]
["e", "x", "a"]

To implement this, the following should work, but is there a better way?

(1) Run the "equal" compute function on the values of the list:
[false, false, true, false, false, false, true, false, false, false, false]

(2) Linearly scan the result of (1) in lockstep with the list's offsets, to
keep track of which rows matched:
[true, false, true, false, false]

(3) Expand the result of (2) by the list lengths:
[true, true, true, false, false, true, true, true, false, false, false]

(4) Use the "filter" compute function (using the result from (3)) to copy
only the matching values.
["a", "b", "x", "e", "x", "a"]

(5) Using the result of (2), sum up lengths to compute new offsets:
[0, 3, 6]

(2), (3), and (5) are of course not difficult to implement, but is there
maybe a trick to use existing compute functions instead? Particularly for
non-C++ implementations that could make a big performance difference.

Cheers,
Leo

>

Re: [C++] ListArray filtering

Posted by Leonhard Gruenschloss <le...@gruenschloss.org>.
>
> Is there a shortcut in Arrow to transform a list of indices to a
> corresponding boolean mask ([0, 2] --> [true, false, true, false, false])?
>

I just realized that one doesn't even need this transformation when using
the "take" compute function! So it all boils down to this short list of
compute functions, without the need for any "low level" code:

(1) Run the "equal" compute function on the values of the list:
[false, false, true, false, false, false, true, false, false, false, false]

(2) Run the "list_parent_indices" compute function:
[0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 4]

(3) Use the result of (1) to "filter" the result of (2):
[0, 2]

(4) Run "unique" on the result of (3) in case the same value may appear
multiple times in a list:
[0, 2]

(5) Use the result of (4) to "take" from the list:
["a", "b", "x"]
["e", "x", "a"]

On Fri, Aug 13, 2021 at 11:24 PM Leonhard Gruenschloss <
leonhard@gruenschloss.org> wrote:

> Hi Niranda,
>
> Thanks a lot for the quick response! Yes, you're absolutely right, that
> should work! I somehow missed that the filter compute function takes _any_
> input type.
>
> So that only leaves (2)... Using (1) to filter the result of
> list_parent_indices gets pretty close:
> filter([0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 4], [false, false, true, false,
> false, false, true, false, false, false, false]) = [0, 2]
>
> Is there a shortcut in Arrow to transform a list of indices to a
> corresponding boolean mask ([0, 2] --> [true, false, true, false, false])?
>
> Cheers,
> Leo
>
> On Fri, Aug 13, 2021 at 10:45 PM Niranda Perera <ni...@gmail.com>
> wrote:
>
>> Hi Leo,
>>
>> Can't you call compute.filter with the resultant bool array from step
>> (2)? 🤔
>>
>> On Fri, Aug 13, 2021, 06:17 Leonhard Gruenschloss <
>> leonhard@gruenschloss.org> wrote:
>>
>>> Hi,
>>>
>>> I'd like to filter a ListArray, based on whether a particular value is
>>> present in each list. Is there a better approach than the one described
>>> below? Particularly, are there any existing compute functions that I could
>>> use instead?
>>>
>>> Here's a concrete example, with rows consisting of variable-length lists
>>> of strings:
>>> ["a", "b", "x"]
>>> ["c", "d"]
>>> ["e", "x", "a"]
>>> ["c"]
>>> ["d, "e"]
>>>
>>> If the element to search for is "x", only the first and third row would
>>> be retained after filtering:
>>> ["a", "b", "x"]
>>> ["e", "x", "a"]
>>>
>>> To implement this, the following should work, but is there a better way?
>>>
>>> (1) Run the "equal" compute function on the values of the list:
>>> [false, false, true, false, false, false, true, false, false, false,
>>> false]
>>>
>>> (2) Linearly scan the result of (1) in lockstep with the list's offsets,
>>> to keep track of which rows matched:
>>> [true, false, true, false, false]
>>>
>>> (3) Expand the result of (2) by the list lengths:
>>> [true, true, true, false, false, true, true, true, false, false, false]
>>>
>>> (4) Use the "filter" compute function (using the result from (3)) to
>>> copy only the matching values.
>>> ["a", "b", "x", "e", "x", "a"]
>>>
>>> (5) Using the result of (2), sum up lengths to compute new offsets:
>>> [0, 3, 6]
>>>
>>> (2), (3), and (5) are of course not difficult to implement, but is there
>>> maybe a trick to use existing compute functions instead? Particularly for
>>> non-C++ implementations that could make a big performance difference.
>>>
>>> Cheers,
>>> Leo
>>>
>>>>

Re: [C++] ListArray filtering

Posted by Leonhard Gruenschloss <le...@gruenschloss.org>.
Hi Niranda,

Thanks a lot for the quick response! Yes, you're absolutely right, that
should work! I somehow missed that the filter compute function takes _any_
input type.

So that only leaves (2)... Using (1) to filter the result of
list_parent_indices gets pretty close:
filter([0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 4], [false, false, true, false,
false, false, true, false, false, false, false]) = [0, 2]

Is there a shortcut in Arrow to transform a list of indices to a
corresponding boolean mask ([0, 2] --> [true, false, true, false, false])?

Cheers,
Leo

On Fri, Aug 13, 2021 at 10:45 PM Niranda Perera <ni...@gmail.com>
wrote:

> Hi Leo,
>
> Can't you call compute.filter with the resultant bool array from step (2)?
> 🤔
>
> On Fri, Aug 13, 2021, 06:17 Leonhard Gruenschloss <
> leonhard@gruenschloss.org> wrote:
>
>> Hi,
>>
>> I'd like to filter a ListArray, based on whether a particular value is
>> present in each list. Is there a better approach than the one described
>> below? Particularly, are there any existing compute functions that I could
>> use instead?
>>
>> Here's a concrete example, with rows consisting of variable-length lists
>> of strings:
>> ["a", "b", "x"]
>> ["c", "d"]
>> ["e", "x", "a"]
>> ["c"]
>> ["d, "e"]
>>
>> If the element to search for is "x", only the first and third row would
>> be retained after filtering:
>> ["a", "b", "x"]
>> ["e", "x", "a"]
>>
>> To implement this, the following should work, but is there a better way?
>>
>> (1) Run the "equal" compute function on the values of the list:
>> [false, false, true, false, false, false, true, false, false, false,
>> false]
>>
>> (2) Linearly scan the result of (1) in lockstep with the list's offsets,
>> to keep track of which rows matched:
>> [true, false, true, false, false]
>>
>> (3) Expand the result of (2) by the list lengths:
>> [true, true, true, false, false, true, true, true, false, false, false]
>>
>> (4) Use the "filter" compute function (using the result from (3)) to copy
>> only the matching values.
>> ["a", "b", "x", "e", "x", "a"]
>>
>> (5) Using the result of (2), sum up lengths to compute new offsets:
>> [0, 3, 6]
>>
>> (2), (3), and (5) are of course not difficult to implement, but is there
>> maybe a trick to use existing compute functions instead? Particularly for
>> non-C++ implementations that could make a big performance difference.
>>
>> Cheers,
>> Leo
>>
>>>

Re: [C++] ListArray filtering

Posted by Niranda Perera <ni...@gmail.com>.
Hi Leo,

Can't you call compute.filter with the resultant bool array from step (2)?
🤔

On Fri, Aug 13, 2021, 06:17 Leonhard Gruenschloss <le...@gruenschloss.org>
wrote:

> Hi,
>
> I'd like to filter a ListArray, based on whether a particular value is
> present in each list. Is there a better approach than the one described
> below? Particularly, are there any existing compute functions that I could
> use instead?
>
> Here's a concrete example, with rows consisting of variable-length lists
> of strings:
> ["a", "b", "x"]
> ["c", "d"]
> ["e", "x", "a"]
> ["c"]
> ["d, "e"]
>
> If the element to search for is "x", only the first and third row would be
> retained after filtering:
> ["a", "b", "x"]
> ["e", "x", "a"]
>
> To implement this, the following should work, but is there a better way?
>
> (1) Run the "equal" compute function on the values of the list:
> [false, false, true, false, false, false, true, false, false, false, false]
>
> (2) Linearly scan the result of (1) in lockstep with the list's offsets,
> to keep track of which rows matched:
> [true, false, true, false, false]
>
> (3) Expand the result of (2) by the list lengths:
> [true, true, true, false, false, true, true, true, false, false, false]
>
> (4) Use the "filter" compute function (using the result from (3)) to copy
> only the matching values.
> ["a", "b", "x", "e", "x", "a"]
>
> (5) Using the result of (2), sum up lengths to compute new offsets:
> [0, 3, 6]
>
> (2), (3), and (5) are of course not difficult to implement, but is there
> maybe a trick to use existing compute functions instead? Particularly for
> non-C++ implementations that could make a big performance difference.
>
> Cheers,
> Leo
>
>>