Posted to jira@arrow.apache.org by "Eduardo Ponce (Jira)" <ji...@apache.org> on 2021/09/01 08:51:00 UTC

[jira] [Comment Edited] (ARROW-13410) [C++] Implement min_max kernel for array[string]

    [ https://issues.apache.org/jira/browse/ARROW-13410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407971#comment-17407971 ] 

Eduardo Ponce edited comment on ARROW-13410 at 9/1/21, 8:50 AM:
----------------------------------------------------------------

A simple algorithm for finding the min/max of a pair of strings:
{code:c++}
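const size_t n = std::min(s1.size(), s2.size());
for (size_t i = 0; i < n; ++i) {
    // Compare as unsigned bytes so the order does not depend on char signedness
    const auto c1 = static_cast<unsigned char>(s1[i]);
    const auto c2 = static_cast<unsigned char>(s2[i]);
    // They are different, one is greater than the other
    if (c1 > c2) return s1;  // for min return s2
    if (c1 < c2) return s2;  // for min return s1
}
// One string is a prefix of the other (or they are equal): the longer string is max
return (s1.size() > s2.size()) ? s1 : s2;  // for min return the shorter string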
{code}

Then use this function for a running min/max to get the result.
For UTF-8 inputs, code points are decoded and compared instead of raw bytes.
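
To make the running min/max step concrete, here is a minimal self-contained sketch. It is not the Arrow kernel: {{BytewiseLess}} and {{MinMaxStrings}} are hypothetical names, the input is assumed to be a non-empty {{std::vector<std::string>}}, and null handling is left out. It compares raw bytes only; for well-formed UTF-8 that ordering should coincide with code point ordering.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not an Arrow API): true if a orders before b,
// comparing the strings as sequences of unsigned bytes.
bool BytewiseLess(const std::string& a, const std::string& b) {
  const std::size_t n = std::min(a.size(), b.size());
  for (std::size_t i = 0; i < n; ++i) {
    const auto ca = static_cast<unsigned char>(a[i]);
    const auto cb = static_cast<unsigned char>(b[i]);
    if (ca != cb) return ca < cb;
  }
  // One string is a prefix of the other (or they are equal):
  // the shorter string orders first.
  return a.size() < b.size();
}

// Hypothetical helper: running min/max over a non-empty vector.
// Assumption: no nulls; a real kernel would have to skip or report them.
std::pair<std::string, std::string> MinMaxStrings(
    const std::vector<std::string>& values) {
  assert(!values.empty());
  std::string lo = values.front();
  std::string hi = values.front();
  for (const std::string& v : values) {
    if (BytewiseLess(v, lo)) lo = v;  // new running min
    if (BytewiseLess(hi, v)) hi = v;  // new running max
  }
  return {lo, hi};
}
```

Called on the values from the issue description ('c', 'a', 'b'), this returns the pair ("a", "c"). Using a single "less than" predicate for both min and max keeps the tie and prefix cases consistent between the two results.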


> [C++] Implement min_max kernel for array[string]
> ------------------------------------------------
>
>                 Key: ARROW-13410
>                 URL: https://issues.apache.org/jira/browse/ARROW-13410
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>    Affects Versions: 4.0.1
>            Reporter: Tom Augspurger
>            Assignee: Eduardo Ponce
>            Priority: Minor
>
> As noted in https://github.com/pandas-dev/pandas/issues/42597, `pyarrow.compute.min_max` on a string dtype array currently raises `ArrowNotImplementedError`. Here is an example from Python:
> {{
> In [1]: import pyarrow, pyarrow.compute
> In [2]: a = pyarrow.array(['c', 'a', 'b'])
> In [4]: pyarrow.compute.min_max(a)
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> <ipython-input-4-d557440fe5aa> in <module>
> ----> 1 pyarrow.compute.min_max(a)
> ~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/compute.py in min_max(array, options, memory_pool, **kwargs)
> ~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/_compute.pyx in pyarrow._compute.Function.call()
> ~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> ~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowNotImplementedError: Function min_max has no kernel matching input types (array[string])
> }}


