Posted to dev@spark.apache.org by Joeri Hermans <jo...@cern.ch> on 2016/11/21 20:25:31 UTC
MinMaxScaler behaviour
Hi all,
I observed some odd behaviour while applying feature transformations with MinMaxScaler. More specifically, I was wondering whether this behaviour is intended and makes sense, especially because I explicitly defined min and max.
Basically, I am preprocessing the MNIST dataset and scaling the features to the range [0, 1] using the following code:
# Clear the dataset in the case you ran this cell before.
dataset = dataset.select("features", "label", "label_encoded")
# Apply MinMax normalization to the features.
scaler = MinMaxScaler(min=0.0, max=1.0, inputCol="features", outputCol="features_normalized")
# Compute summary statistics and generate MinMaxScalerModel.
scaler_model = scaler.fit(dataset)
# Rescale each feature to range [min, max].
dataset = scaler_model.transform(dataset)
Complete code is here: https://github.com/JoeriHermans/dist-keras/blob/development/examples/mnist.ipynb (Normalization section)
The original MNIST images are shown in original.png, whereas the processed images are shown in processed.png. Note the 0.5 artifacts. I checked the source code of this particular estimator/transformer and found the following.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala#L191
According to the documentation:
* <p><blockquote>
* $$
* Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
* $$
* </blockquote></p>
*
* For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$.
So basically, when the difference between E_{max} and E_{min} is 0, we assign 0.5 as the raw value. I am wondering whether this is helpful in any situation. Why not assign 0?
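To make the behaviour concrete, here is a minimal plain-Python sketch of the formula above, including the degenerate branch. This is a hand-rolled stand-in for illustration, not the actual MLlib implementation:

```python
def min_max_scale(column, lo=0.0, hi=1.0):
    """Rescale one feature column to [lo, hi] using the documented formula.

    When the column is constant (E_max == E_min), the midpoint
    0.5 * (hi + lo) is emitted instead of dividing by zero.
    """
    e_min, e_max = min(column), max(column)
    if e_max == e_min:
        return [0.5 * (hi + lo)] * len(column)
    return [(x - e_min) / (e_max - e_min) * (hi - lo) + lo for x in column]

# A pixel that is 0 in every image (e.g. the MNIST border) is a
# constant column, so every value collapses to the midpoint:
print(min_max_scale([0.0, 0.0, 0.0]))        # → [0.5, 0.5, 0.5]
# A varying pixel column scales normally to the endpoints:
print(min_max_scale([0.0, 128.0, 255.0]))    # endpoints become 0.0 and 1.0
```

This is exactly the source of the 0.5 artifacts: border pixels are 0 across the whole dataset, so each of those pixel columns hits the E_max == E_min case.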
Kind regards,
Joeri
RE: MinMaxScaler behaviour
Posted by Joeri Hermans <jo...@cern.ch>.
I see. I think I read the documentation a little too quickly :)
My apologies.
Kind regards,
Joeri
________________________________________
From: Sean Owen [sowen@cloudera.com]
Sent: 21 November 2016 21:32
To: Joeri Hermans; dev@spark.apache.org
Subject: Re: MinMaxScaler behaviour
It's a degenerate case of course. 0, 0.5 and 1 all make about as much sense. Is there a strong convention elsewhere to use 0?
Min/max scaling is the wrong thing to do for a data set like this anyway. What you probably intend to do is scale each image so that its max intensity is 1 and min intensity is 0, but that's different. Scaling each pixel across all images doesn't make as much sense.
---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
Re: MinMaxScaler behaviour
Posted by Sean Owen <so...@cloudera.com>.
It's a degenerate case of course. 0, 0.5 and 1 all make about as much sense. Is there a strong convention elsewhere to use 0?
Min/max scaling is the wrong thing to do for a data set like this anyway. What you probably intend to do is scale each image so that its max intensity is 1 and min intensity is 0, but that's different. Scaling each pixel across all images doesn't make as much sense.
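The distinction can be shown with a toy sketch in plain Python (hypothetical helper names, not the Spark API). MinMaxScaler computes min/max per feature, i.e. per pixel position across all images; per-image scaling computes min/max within each image:

```python
def scale_per_feature(images):
    """What MinMaxScaler does: min/max over each pixel position across all images."""
    n_pix = len(images[0])
    cols = [[img[i] for img in images] for i in range(n_pix)]
    scaled_cols = []
    for col in cols:
        lo, hi = min(col), max(col)
        # a constant pixel column collapses to the midpoint 0.5
        scaled_cols.append([0.5 if hi == lo else (v - lo) / (hi - lo) for v in col])
    # transpose back into one row per image
    return [[scaled_cols[i][j] for i in range(n_pix)] for j in range(len(images))]

def scale_per_image(images):
    """The per-image alternative: min/max within each image."""
    out = []
    for img in images:
        lo, hi = min(img), max(img)
        out.append([0.5 if hi == lo else (v - lo) / (hi - lo) for v in img])
    return out

# First pixel is constant (0) across images, like an MNIST border pixel:
images = [[0.0, 10.0, 20.0], [0.0, 50.0, 100.0]]
print(scale_per_feature(images)[0])  # → [0.5, 0.0, 0.0]  (constant pixel -> 0.5)
print(scale_per_image(images)[0])    # → [0.0, 0.5, 1.0]  (scaled within the image)
```

Per-feature scaling turns every always-zero border pixel into 0.5, while per-image scaling keeps the background at 0, which matches the artifacts in processed.png.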