Posted to issues@arrow.apache.org by "niranda perera (Jira)" <ji...@apache.org> on 2021/04/26 20:41:00 UTC

[jira] [Created] (ARROW-12554) Allow duplicates in the value_set for compute::is_in

niranda perera created ARROW-12554:
--------------------------------------

             Summary: Allow duplicates in the value_set for compute::is_in  
                 Key: ARROW-12554
                 URL: https://issues.apache.org/jira/browse/ARROW-12554
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
    Affects Versions: 4.0.0
            Reporter: niranda perera


In the Arrow release-4.0.0 branch, the `compute::is_in` operation rejects duplicate values in the `value_set` [1]. This was not the case as of Arrow 2.0.
 
Is this strict restriction required? Ultimately, a hash set is built from the `value_set` values, and there is no harm in having duplicates while doing so, is there?
PS: I understand that the parameter name "value_set" suggests that the values need to be unique, but from a usability perspective this could be relaxed IMO, as pandas `isin` does [2].
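
To illustrate the hash-set argument, here is a minimal pure-Python sketch. It is not Arrow's implementation: the helper `is_in` below is a hypothetical stand-in that only assumes the lookup is backed by a hash set, as described above. Under that assumption, duplicates in the value set collapse when the set is built and the result is unchanged.

```python
def is_in(values, value_set):
    """Hypothetical stand-in for a set-lookup kernel (not Arrow's code).

    Returns one boolean per element of `values`, telling whether that
    element occurs anywhere in `value_set`.
    """
    # Building a hash set discards duplicates, so they cannot change the result.
    lookup = set(value_set)
    return [v in lookup for v in values]


values = [1, 2, 3, 4]
with_duplicates = is_in(values, [2, 2, 4, 4, 4])
without_duplicates = is_in(values, [2, 4])

# Both calls yield the same membership mask.
assert with_duplicates == without_duplicates
print(with_duplicates)  # [False, True, False, True]
```

The only cost of accepting duplicates in this sketch is a few redundant insertions while the set is built.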
 
 
[1] [https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc#L53]
[2] [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html]


