Posted to dev@flink.apache.org by Stavros Kontopoulos <st...@gmail.com> on 2017/02/10 22:48:19 UTC

Flink ML - NaN Handling

Hello guys,

Is there a story for this (might have been discussed earlier)? I see
differences between scikit-learn and numpy. Do we standardize on
scikit-learn?

PS. I am working on the preprocessing stuff.

Best,
Stavros

Re: Flink ML - NaN Handling

Posted by Till Rohrmann <tr...@apache.org>.

Hi Stavros,

your idea to add an imputer is really good. Please open a JIRA issue for
that.

You're right that failing fast is usually the better behaviour in case of
an undefined value such as NaN or infinity. Thus, I think it makes sense to
define for the different components their value range and fail if an
incoming value is not contained in this range.
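
One way to picture this, as a minimal Python sketch rather than anything
that exists in Flink ML (the `ValueRange` class and its `check` method are
hypothetical names): each component declares the interval it accepts and
rejects anything outside it, including NaN and infinity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueRange:
    """Closed interval [low, high] of inputs a component accepts (hypothetical)."""
    low: float
    high: float

    def check(self, values):
        """Return `values` unchanged, or raise on the first invalid entry."""
        for i, v in enumerate(values):
            # Reject NaN and infinity explicitly, even if the bounds are infinite.
            if not math.isfinite(v) or not (self.low <= v <= self.high):
                raise ValueError(
                    f"value {v!r} at index {i} is outside [{self.low}, {self.high}]"
                )
        return values


unit_interval = ValueRange(0.0, 1.0)
unit_interval.check([0.0, 0.5, 1.0])  # passes silently

try:
    unit_interval.check([0.5, float("nan")])
except ValueError as err:
    print(err)
```

Failing inside `check` means a bad value is reported at the stage that
received it instead of surfacing later as a NaN in some downstream result.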

Cheers,
Till

On Sun, Feb 12, 2017 at 8:55 PM, Stavros Kontopoulos <
st.kontopoulos@gmail.com> wrote:

> Btw, I think we should add an Imputer if we follow scikit-learn, as
> described in the "Imputation of missing values" section of its guide to
> preparing the dataset:
> http://scikit-learn.org/stable/modules/preprocessing.html
> What do you think? Should I open a JIRA issue for it?
>
> The NaN question also applies to data generated by one pipeline stage and
> fed to the next. As far as I can see, we should throw an exception in all
> such cases.
> For example, in sklearn:
>
> >>> from sklearn import preprocessing
> >>> X = [[ 1., -1.,  2.],
> ...      [ 2.,  0.,  float('NaN')]]
>
> >>> preprocessing.normalize(X, norm='l2')
> Traceback (most recent call last):
> ....
> ValueError: Input contains NaN, infinity or a value too large for
> dtype('float64').
>
> I don't see that in Flink ML's code; my understanding is that NaNs are
> propagated, correct?
> For example, when I run the MinMaxScalerIT tests with NaN in the data, I get
> a result like:
> DenseVector(0.34528405956977387, 0.5, NaN)
> ...
> which is reasonable given the implementation, but should it be allowed?
>
>
> On Sun, Feb 12, 2017 at 9:03 PM, Stavros Kontopoulos <
> st.kontopoulos@gmail.com> wrote:
>
> > Ok cool thnx Till.
> >
> > On Sun, Feb 12, 2017 at 4:59 PM, Till Rohrmann <tr...@apache.org>
> > wrote:
> >
> >> Hi Stavros,
> >>
> >> so far we've stuck mainly to scikit-learn in terms of semantics. Thus, I
> >> would recommend following scikit-learn's approach to handling NaNs.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Fri, Feb 10, 2017 at 11:48 PM, Stavros Kontopoulos <
> >> st.kontopoulos@gmail.com> wrote:
> >>
> >> > Hello guys,
> >> >
> >> > Is there a story for this (might have been discussed earlier)? I see
> >> > differences between scikit-learn and numpy. Do we standardize on
> >> > scikit-learn?
> >> >
> >> > PS. I am working on the preprocessing stuff.
> >> >
> >> > Best,
> >> > Stavros
> >> >
> >>
> >
> >
>

Re: Flink ML - NaN Handling

Posted by Stavros Kontopoulos <st...@gmail.com>.

Btw, I think we should add an Imputer if we follow scikit-learn, as
described in the "Imputation of missing values" section of its guide to
preparing the dataset:
http://scikit-learn.org/stable/modules/preprocessing.html
What do you think? Should I open a JIRA issue for it?
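
As a rough illustration of what such an imputer would do, here is a minimal
Python sketch of mean imputation. It is neither scikit-learn's nor Flink
ML's implementation; the function name `impute_mean` and the list-of-rows
layout are assumptions made for the example.

```python
import math


def impute_mean(rows):
    """Replace NaN entries with the mean of the observed values in their column.

    Hypothetical helper for illustration; `rows` is a non-empty list of
    equal-length lists of floats.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        if not observed:
            raise ValueError(f"column {j} has no observed values to impute from")
        means.append(sum(observed) / len(observed))
    # Build a new matrix so the caller's data is left untouched.
    return [
        [means[j] if math.isnan(v) else v for j, v in enumerate(row)]
        for row in rows
    ]


X = [[1.0, -1.0, 2.0],
     [2.0, 0.0, float("nan")]]
print(impute_mean(X))  # [[1.0, -1.0, 2.0], [2.0, 0.0, 2.0]]
```

Running this as an explicit pipeline stage would let the later stages reject
NaN outright, since missing values have already been filled in.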

The NaN question also applies to data generated by one pipeline stage and
fed to the next. As far as I can see, we should throw an exception in all
such cases.
For example, in sklearn:

>>> from sklearn import preprocessing
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  float('NaN')]]

>>> preprocessing.normalize(X, norm='l2')
Traceback (most recent call last):
....
ValueError: Input contains NaN, infinity or a value too large for
dtype('float64').

I don't see that in Flink ML's code; my understanding is that NaNs are
propagated, correct?
For example, when I run the MinMaxScalerIT tests with NaN in the data, I get
a result like:
DenseVector(0.34528405956977387, 0.5, NaN)
...
which is reasonable given the implementation, but should it be allowed?
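
The propagation itself is just what floating-point arithmetic does: any
expression involving NaN evaluates to NaN. A minimal Python sketch of the
two behaviours follows. It is not Flink ML's actual MinMaxScaler code; the
helper names and the per-value formulation are assumptions for the example.

```python
import math


def min_max_scale(value, lo, hi):
    """Scale `value` from [lo, hi] into [0, 1] with no input validation."""
    return (value - lo) / (hi - lo)


def min_max_scale_strict(value, lo, hi):
    """Same scaling, but reject NaN and infinity up front."""
    if not math.isfinite(value):
        raise ValueError(f"input contains NaN or infinity: {value!r}")
    return min_max_scale(value, lo, hi)


print(min_max_scale(1.5, 1.0, 2.0))                       # 0.5
print(math.isnan(min_max_scale(float("nan"), 1.0, 2.0)))  # True: NaN passes through silently

try:
    min_max_scale_strict(float("nan"), 1.0, 2.0)
except ValueError as err:
    print("rejected:", err)
```

The strict variant corresponds to the sklearn behaviour shown above, where
the bad input is reported at the point where it enters the transformer.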

On Sun, Feb 12, 2017 at 9:03 PM, Stavros Kontopoulos <
st.kontopoulos@gmail.com> wrote:

> Ok cool thnx Till.
>
> On Sun, Feb 12, 2017 at 4:59 PM, Till Rohrmann <tr...@apache.org>
> wrote:
>
>> Hi Stavros,
>>
>> so far we've stuck mainly to scikit-learn in terms of semantics. Thus, I
>> would recommend following scikit-learn's approach to handling NaNs.
>>
>> Cheers,
>> Till
>>
>> On Fri, Feb 10, 2017 at 11:48 PM, Stavros Kontopoulos <
>> st.kontopoulos@gmail.com> wrote:
>>
>> > Hello guys,
>> >
>> > Is there a story for this (might have been discussed earlier)? I see
>> > differences between scikit-learn and numpy. Do we standardize on
>> > scikit-learn?
>> >
>> > PS. I am working on the preprocessing stuff.
>> >
>> > Best,
>> > Stavros
>> >
>>
>
>

Re: Flink ML - NaN Handling

Posted by Stavros Kontopoulos <st...@gmail.com>.

Ok cool thnx Till.

On Sun, Feb 12, 2017 at 4:59 PM, Till Rohrmann <tr...@apache.org> wrote:

> Hi Stavros,
>
> so far we've stuck mainly to scikit-learn in terms of semantics. Thus, I
> would recommend following scikit-learn's approach to handling NaNs.
>
> Cheers,
> Till
>
> On Fri, Feb 10, 2017 at 11:48 PM, Stavros Kontopoulos <
> st.kontopoulos@gmail.com> wrote:
>
> > Hello guys,
> >
> > Is there a story for this (might have been discussed earlier)? I see
> > differences between scikit-learn and numpy. Do we standardize on
> > scikit-learn?
> >
> > PS. I am working on the preprocessing stuff.
> >
> > Best,
> > Stavros
> >
>

Re: Flink ML - NaN Handling

Posted by Till Rohrmann <tr...@apache.org>.

Hi Stavros,

so far we've stuck mainly to scikit-learn in terms of semantics. Thus, I
would recommend following scikit-learn's approach to handling NaNs.

Cheers,
Till

On Fri, Feb 10, 2017 at 11:48 PM, Stavros Kontopoulos <
st.kontopoulos@gmail.com> wrote:

> Hello guys,
>
> Is there a story for this (might have been discussed earlier)? I see
> differences between scikit-learn and numpy. Do we standardize on
> scikit-learn?
>
> PS. I am working on the preprocessing stuff.
>
> Best,
> Stavros
>