Posted to user@arrow.apache.org by Yue Ni <ni...@gmail.com> on 2020/05/30 08:00:41 UTC

Cast string array to number/boolean with invalid values

Hi there,

I see that Arrow compute provides a Cast API that lets users cast from
string to number/boolean values, but sometimes the strings contain invalid
values that cannot be cast to a number/boolean (sorry, the data is really
messy), for example a string array like ["1", "2", "3", "None", ""]. I
wonder if there is any way to handle those invalid values during casting.

From the code I have read (cast.h/cast.cc), it seems the cast fails and
returns as soon as it hits an invalid value. I wonder if there is any way
to ask the Cast API to return NULL for invalid values instead, so that they
are easier to process later.

Since it is rarely possible to guarantee that all string values in an array
are valid, **any** invalid value in an array or in the entire data set makes
the cast fail. Users of the Cast API then have to figure out by themselves
which value in the array is invalid, which is not easy to do
programmatically (only an error status message is set in the context). IMHO
the following would be a better default strategy when casting from string
to number/boolean:
1) when an invalid value is found, set NULL as its value
2) set an error status indicating that the array contains some invalid values
3) continue casting the remaining elements in the array
But I believe there are also users who prefer bailing out as soon as
possible, so it would be great to provide different cast options that make
both strategies possible.
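
To make the two strategies concrete, here is a minimal sketch in plain
Python. It is not the Arrow Cast API: the function name
cast_strings_to_int and its strict flag are hypothetical, and validity is
decided by Python's own int() parsing rules, which may differ from what
Arrow accepts.

```python
from typing import List, Optional, Tuple


def cast_strings_to_int(
    values: List[Optional[str]], strict: bool = False
) -> Tuple[List[Optional[int]], List[int]]:
    """Hypothetical helper illustrating the proposal; not part of Arrow.

    strict=False: invalid strings become None and their indices are
    reported, while the remaining elements are still cast.
    strict=True: bail out on the first invalid string.
    """
    result: List[Optional[int]] = []
    invalid_indices: List[int] = []
    for i, text in enumerate(values):
        if text is None:
            # An input null stays null and is not counted as invalid.
            result.append(None)
            continue
        try:
            result.append(int(text))
        except ValueError:
            if strict:
                raise ValueError(
                    f"invalid value at index {i}: {text!r}"
                ) from None
            result.append(None)
            invalid_indices.append(i)
    return result, invalid_indices


# Lenient mode on the example array from this message.
casted, bad = cast_strings_to_int(["1", "2", "3", "None", ""])
print(casted)  # [1, 2, 3, None, None]
print(bad)     # [3, 4]
```

Returning the indices of the invalid values plays the role of the error
status in step 2: the caller learns which elements were bad without
parsing an error message.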

Thanks so much.

Regards,
Yue

Re: Cast string array to number/boolean with invalid values

Posted by Yue Ni <ni...@gmail.com>.
Thanks Neal and Wes. https://issues.apache.org/jira/browse/ARROW-1489 is
exactly what I am searching for.

On Sat, May 30, 2020 at 11:02 PM Wes McKinney <we...@gmail.com> wrote:

> It's https://issues.apache.org/jira/browse/ARROW-1489
>
> On Sat, May 30, 2020 at 9:56 AM Neal Richardson
> <ne...@gmail.com> wrote:
> >
> > Sounds reasonable, could you please open a JIRA issue?
> >
> > Neal

Re: Cast string array to number/boolean with invalid values

Posted by Wes McKinney <we...@gmail.com>.
It's https://issues.apache.org/jira/browse/ARROW-1489

On Sat, May 30, 2020 at 9:56 AM Neal Richardson
<ne...@gmail.com> wrote:
>
> Sounds reasonable, could you please open a JIRA issue?
>
> Neal

Re: Cast string array to number/boolean with invalid values

Posted by Neal Richardson <ne...@gmail.com>.
Sounds reasonable, could you please open a JIRA issue?

Neal