Posted to user@arrow.apache.org by Alessandro Molina <al...@ursacomputing.com> on 2021/12/01 11:26:35 UTC

Re: Find a value indices in an array

Yes please, I think it makes sense and should be fairly straightforward

On Mon, Nov 29, 2021 at 5:38 PM Niranda Perera <ni...@gmail.com>
wrote:

> Should I open a JIRA on this?
>
> On Mon, Nov 29, 2021, 10:52 Alessandro Molina <
> alessandro@ursacomputing.com> wrote:
>
>> Oh, oops, sorry, my fault. I understood the question reversed :D
>>
>> I think that if we had a compute function that returns indices of a
>> matching value that could also be applied to masks to retrieve the indices
>> of any "true" value thus also solving your question if combined with is_in
>> (or any other predicate at that point). That might be a reasonable addition
>> to compute functions.
>>
>>
>> On Sun, Nov 28, 2021 at 7:00 AM Niranda Perera <ni...@gmail.com>
>> wrote:
>>
>>> Hi guys, sorry for the late reply.
>>>
>>> Yes, Joris is right. I want the converse (I think 😊) of index_in. I
>>> was discussing this with Eduardo on Zulip [1].
>>>
>>> I was hoping that I could do this.
>>> ```
>>> values = pa.array([1, 2, 2, 3, 4, 1])
>>> to_find = pa.array([1, 2, 1])
>>> # expected = [0, 5, 1, 2, 0, 5], received = [0, 1, 0]
>>> indices = pc.index_in(to_find, value_set=values)
>>> ```
>>> So, index_in does not handle duplicated values in the value set (I am
>>> guessing it creates a hashmap of values, and not a multimap).
>>>
>>> One suggestion was to use the `index` aggregation (`pc.index`). I think
>>> that might work in a loop, as follows, but I haven't tested this.
>>> ```
>>> indices = []
>>> for f in to_find:
>>>   idx = -1
>>>   while True:
>>>     # pc.index returns an Int64Scalar; convert it to a Python int
>>>     idx = pc.index(values, f, start=idx + 1, end=len(values)).as_py()
>>>     if idx == -1:
>>>       break
>>>     else:
>>>       indices.append(idx)
>>> ```
>>>
>>> But I was wondering whether it would make sense to provide a method that
>>> finds all indices of a value (the inner while loop)?
>>>
>>> Best
>>>
>>> [1]
>>> https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Find.20a.20value.20indices.20in.20an.20array/near/262351923
>>>
>>>
>>> On Thu, Nov 25, 2021 at 3:14 PM Joris Van den Bossche <
>>> jorisvandenbossche@gmail.com> wrote:
>>>
>>>> I think "index_in" works the other way around? It gives,
>>>> for each value of the array, the index in the set. While if I
>>>> understand the question correctly, Niranda is looking for the index
>>>> into the array for elements that are present in the set.
>>>>
>>>> Something like that could be achieved by using "is_in", and then
>>>> getting the indices of the True values:
>>>>
>>>> >>> pc.is_in(pa.array([1, 2, 3]), value_set=pa.array([1, 3]))
>>>> <pyarrow.lib.BooleanArray object at 0x7fcc96896a00>
>>>> [
>>>>   true,
>>>>   false,
>>>>   true
>>>> ]
>>>>
>>>> To get the location of the True values, in numpy this is called
>>>> "nonzero", and we have an open JIRA for adding this as a kernel
>>>> (https://issues.apache.org/jira/browse/ARROW-13035)
>>>>
>>>> On Thu, 25 Nov 2021 at 11:17, Alessandro Molina
>>>> <al...@ursacomputing.com> wrote:
>>>> >
>>>> > I think index_in is what you are looking for
>>>> >
>>>> > >>> pc.index_in(pa.array([1, 2, 3]), value_set=pa.array([1, 3]))
>>>> > <pyarrow.lib.Int32Array object at 0x11e2a6580>
>>>> > [
>>>> >   0,
>>>> >   null,
>>>> >   1
>>>> > ]
>>>> >
>>>> > On Sat, Nov 20, 2021 at 4:49 AM Niranda Perera <niranda.perera@gmail.com> wrote:
>>>> >>
>>>> >> Hi all, is there a compute API for finding the indices of a value (or of a set of values) in an Array?
>>>> >> ex:
>>>> >> ```python
>>>> >> a = [1, 2, 2, 3, 4, 1]
>>>> >> values= pa.array([1, 2, 1])
>>>> >>
>>>> >> index = find_index(a, 1) # = [0, 5]
>>>> >> indices = find_indices(a, values) # = [0, 1, 2, 5]
>>>> >> ```
>>>> >> I am currently using `compute.is_in` and traversing the true indices of the result Bitmap. Is there a better way?
>>>> >>
>>>> >> Best
>>>> >> --
>>>> >> Niranda Perera
>>>> >> https://niranda.dev/
>>>> >> @n1r44
>>>> >>
>>>>
>>>
>>>
>>> --
>>> Niranda Perera
>>> https://niranda.dev/
>>> @n1r44 <https://twitter.com/N1R44>
>>>
>>>

Re: Find a value indices in an array

Posted by Niranda Perera <ni...@gmail.com>.
@Alessandro I opened a JIRA
<https://issues.apache.org/jira/browse/ARROW-14946>. Maybe we could discuss
things further there.

-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
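
For reference, the multimap behavior asked about in the thread (every position of every requested value, duplicates included) can be sketched in plain Python. This is not how pyarrow implements `index_in`, and `find_all_indices` is a hypothetical name used only for this illustration.

```python
from collections import defaultdict


def find_all_indices(values, to_find):
    """For each item of `to_find`, list every position where it occurs in `values`."""
    # Build a multimap: value -> all positions at which it appears.
    positions = defaultdict(list)
    for i, v in enumerate(values):
        positions[v].append(i)
    # Look up each requested value in order; absent values contribute nothing.
    result = []
    for f in to_find:
        result.extend(positions.get(f, []))
    return result


values = [1, 2, 2, 3, 4, 1]
to_find = [1, 2, 1]
print(find_all_indices(values, to_find))  # [0, 5, 1, 2, 0, 5]
```

Building the multimap once makes each lookup independent of the array length, whereas the loop over `pc.index` in the thread rescans the array for every requested value.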