Posted to dev@spark.apache.org by dataginjaninja <ri...@gmail.com> on 2014/05/28 14:03:45 UTC

Standard preprocessing/scaling

I searched on this but didn't find anything general, so I apologize if this
has been addressed.

Many algorithms (SGD, SVM, ...) either will not converge or will run forever
if the data is not scaled. scikit-learn has a preprocessing function
(http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html)
that subtracts the mean and divides by the standard deviation, with a few
options as well.
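
To be concrete, the transformation I mean is just (x - mean) / stddev per
column. A tiny Scala sketch for a single column, just to show what I mean
(untested, and the zero-variance guard is mine):

    // One column of data; standardize to zero mean and unit variance.
    val xs = Array(2.0, 4.0, 6.0, 8.0)
    val mean = xs.sum / xs.length
    // Population standard deviation, which I believe matches scikit-learn's scale().
    val std = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.length)
    // Guard against a constant column, where std is zero.
    val standardized =
      if (std == 0.0) xs.map(_ => 0.0) else xs.map(x => (x - mean) / std)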

Is there something in the works for this?



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Standard-preprocessing-scaling-tp6826.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Standard preprocessing/scaling

Posted by dataginjaninja <ri...@gmail.com>.
I do see the issue with centering sparse data. Actually, the centering is
less important than the scaling by the standard deviation; it is the lack of
unit variance that causes the convergence issues and long runtimes.

Will RowMatrix compute the variance of a column?



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Standard-preprocessing-scaling-tp6826p6849.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Standard preprocessing/scaling

Posted by DB Tsai <db...@stanford.edu>.
Sometimes in this case I just standardize without centering, and I still get
good results.
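
Roughly like this (an untested sketch against the RowMatrix API; the helper
name and the zero-variance guard are just for illustration):

    import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    // Divide each column by its standard deviation without subtracting the
    // mean, so zeros stay zero and sparse vectors stay sparse.
    def scaleToUnitVariance(rows: RDD[Vector]): RDD[Vector] = {
      val std = new RowMatrix(rows).computeColumnSummaryStatistics()
        .variance.toArray.map(math.sqrt)
      rows.map {
        case sv: SparseVector =>
          // Only the stored non-zeros need to be touched.
          val values = sv.indices.zip(sv.values).map { case (i, x) =>
            if (std(i) == 0.0) x else x / std(i)
          }
          Vectors.sparse(sv.size, sv.indices, values)
        case v =>
          Vectors.dense(v.toArray.zipWithIndex.map { case (x, i) =>
            if (std(i) == 0.0) x else x / std(i)
          })
      }
    }

Since zero divided by the standard deviation is still zero, the sparsity
pattern is untouched.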

Sent from my Google Nexus 5

Re: Standard preprocessing/scaling

Posted by Xiangrui Meng <me...@gmail.com>.
RowMatrix has a method to compute column summary statistics. There is
a trade-off here because centering may densify the data. A utility
function that centers data would be useful for dense datasets.
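
For dense data, such a utility could look something like this (untested
sketch; the helper name and the zero-variance handling are illustrative):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    // Center each column at zero and scale it to unit variance, using the
    // column summary statistics computed by RowMatrix.
    def standardize(rows: RDD[Vector]): RDD[Vector] = {
      val summary = new RowMatrix(rows).computeColumnSummaryStatistics()
      val mean = summary.mean.toArray
      val std = summary.variance.toArray.map(math.sqrt)
      rows.map { v =>
        val values = v.toArray.zipWithIndex.map { case (x, i) =>
          if (std(i) == 0.0) 0.0 else (x - mean(i)) / std(i)
        }
        Vectors.dense(values)
      }
    }

The v.toArray here is exactly the trade-off: a sparse row comes back fully
dense once the mean is subtracted.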
-Xiangrui
