Posted to user@arrow.apache.org by Suresh V <su...@gmail.com> on 2022/11/02 17:22:38 UTC

Filter a list array based on the contents of the list.

Hi ..

Is there a compute function I can use to filter an array with list entries
based on the contents of the list?

For example:
arr = pa.array([[1,2],[3],[3,4,5]]). I want to run a compute function which
returns true if the entries contain 3 or 4.

Expected output is:
pa.array([False, True, True]).

The closest I could find was map lookup, which expects the entries to be maps.

Thanks

Re: Filter a list array based on the contents of the list.

Posted by "Lee, David" <Da...@blackrock.com>.
If your original arrays are really large, you should be able to apply the series of compute functions to a chunked array by iterating over its chunks. This lowers the memory overhead of creating new objects, because the intermediate arrays only have to exist for one chunk at a time.

This message may contain information that is confidential or privileged. If you are not the intended recipient, please advise the sender immediately and delete this message. See http://www.blackrock.com/corporate/compliance/email-disclaimers for further information.  Please refer to http://www.blackrock.com/corporate/compliance/privacy-policy for more information about BlackRock’s Privacy Policy.

For a list of BlackRock's office addresses worldwide, see http://www.blackrock.com/corporate/about-us/contacts-locations.

© 2022 BlackRock, Inc. All rights reserved.

Re: Filter a list array based on the contents of the list.

Posted by Suresh V <su...@gmail.com>.
Thanks Joris and She. This is exactly what I was looking for. With the new
custom functions feature of pyarrow, it might be possible to do it in a
single pass .. though the cost of jumping to Python might be prohibitively
expensive.

Re: Filter a list array based on the contents of the list.

Posted by Chang She <ch...@eto.ai>.
+1. We have the same issue. A direct kernel would be very useful.

Re: Filter a list array based on the contents of the list.

Posted by Joris Van den Bossche <jo...@gmail.com>.
While there are indeed some workarounds possible by composing the
existing kernels (as David shows), we should ideally have a direct
kernel for this kind of operation, but that kernel currently doesn't
exist.

I recently ran into a similar issue, and I opened
https://issues.apache.org/jira/browse/ARROW-18097 about a
"list_contains" scalar kernel, which would already allow checking
against a single value. Maybe we then also want a "list_is_in" kernel
for checking with multiple values (although one could already combine
multiple "list_contains" calls).

Joris

Re: Filter a list array based on the contents of the list.

Posted by Suresh V <su...@gmail.com>.
Hi David .. Thank you very much for the response. I apologize for not
posing the question correctly.

The method you have does give the right answer, but it results in multiple
new objects and multiple data passes.

I was looking for a kernel which avoids that as I am dealing with really
large arrays. Please let me know if I am not being clear.

Thanks again for your help.

RE: Filter a list array based on the contents of the list.

Posted by "Lee, David" <Da...@blackrock.com>.
Slight correction for 3 or 4 instead of just 3..

result = pc.is_in(list(range(len(arr))), pc.filter(indices, pc.is_in(flat_arr, pa.array([3,4]))))

RE: Filter a list array based on the contents of the list.

Posted by "Lee, David" <Da...@blackrock.com>.
This works..

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([[1,2],[3],[3,4,5]])

indices = pc.list_parent_indices(arr)
flat_arr = pc.list_flatten(arr)


result = pc.is_in(list(range(len(arr))), pc.filter(indices, pc.equal(flat_arr, 3)))

>>> result
<pyarrow.lib.BooleanArray object at 0x00000243EA2D4D00>
[
  false,
  true,
  true
]

