Posted to user@spark.apache.org by Aslan Bekirov <as...@gmail.com> on 2014/06/11 15:25:30 UTC

Normalizations in MLBase

Hi All,

I have to normalize a set of values in the range 0-500 to the range [0, 1].

Is there a utility method in MLBase for normalizing a large data set?

BR,
Aslan

Re: Normalizations in MLBase

Posted by Aslan Bekirov <as...@gmail.com>.
Thanks a lot DB.

I will test it and let you know the results.

BR,
Aslan



Re: Normalizations in MLBase

Posted by DB Tsai <db...@stanford.edu>.
Hi Aslan,

I'm not sure whether the mlbase code is maintained for the current Spark
master. The following is the code we use for standardization at my
company. I intend to clean it up and submit a PR. You could use it
for now.

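  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.rdd.RDD

  def standardize(data: RDD[Vector]): RDD[Vector] = {
    val summarizer = new RowMatrix(data).computeColumnSummaryStatistics()
    // Copy the statistics into plain arrays for indexed access in the closure.
    val mean = summarizer.mean.toArray
    val variance = summarizer.variance.toArray

    // The standardization will always densify the output, so the output
    // will be stored in a dense vector.
    data.map { x =>
      val values = x.toArray
      val n = values.length
      val output = new Array[Double](n)
      var i = 0
      while (i < n) {
        if (variance(i) == 0.0) {
          // A constant column has no spread to scale by; flag it as NaN.
          output(i) = Double.NaN
        } else {
          output(i) = (values(i) - mean(i)) / math.sqrt(variance(i))
        }
        i += 1
      }
      Vectors.dense(output)
    }
  }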

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
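
To make the arithmetic in the function above concrete, here is a small local sketch that runs without Spark. It standardizes each column of a tiny in-memory data set the same way: subtract the column mean, divide by the square root of the column variance, and emit NaN for a zero-variance column. The names (rows, dims, standardized) and the sample values are made up for illustration, and the n - 1 denominator for the variance is an assumption of this sketch.

```scala
// Local, non-distributed illustration; all names and values are hypothetical.
val rows = Seq(Array(1.0, 5.0), Array(3.0, 5.0), Array(5.0, 5.0))
val n = rows.size
val dims = rows.head.length

// Per-column mean and sample variance (n - 1 denominator, an assumption here).
val mean = Array.tabulate(dims)(j => rows.map(_(j)).sum / n)
val variance = Array.tabulate(dims)(j =>
  rows.map(r => math.pow(r(j) - mean(j), 2)).sum / (n - 1))

// Standardize each row; a zero-variance column becomes NaN.
val standardized = rows.map { r =>
  Array.tabulate(dims) { j =>
    if (variance(j) == 0.0) Double.NaN
    else (r(j) - mean(j)) / math.sqrt(variance(j))
  }
}
```

The first column comes out centered on zero with unit spread, while the second column, which is constant, is marked as NaN in every row.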



Re: Normalizations in MLBase

Posted by Aslan Bekirov <as...@gmail.com>.
Hi DB,

I found a piece of code that uses z-score normalization (znorm) to normalize the data.


/**
 * build training data set from sample and summary data
 */
 val train_data = sample_data.map( v =>
   Array.tabulate[Double](field_cnt)(
     i => zscore(v._2(i),sample_mean(i),sample_stddev(i))
   )
 ).cache

Please comment if you see anything wrong.

BR,
Aslan
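
The snippet above calls a zscore function whose definition is not shown in the thread. A minimal sketch of what such a helper might look like, assuming it takes a value, a mean, and a standard deviation, is given below. The camelCase names and the sample numbers are illustrative only, and returning NaN for a zero standard deviation is a choice made for this sketch.

```scala
// Hypothetical z-score helper; the thread does not show the original definition.
def zscore(x: Double, mean: Double, stddev: Double): Double =
  if (stddev == 0.0) Double.NaN else (x - mean) / stddev

// Illustrative values only: one record with two fields and per-field statistics.
val fieldCnt = 2
val sampleMean = Array(10.0, 100.0)
val sampleStddev = Array(2.0, 50.0)
val record = Array(14.0, 50.0)

// Same Array.tabulate pattern as in the message, applied to a single record.
val normalized = Array.tabulate[Double](fieldCnt)(i =>
  zscore(record(i), sampleMean(i), sampleStddev(i)))
```

Each output value is the number of standard deviations the input lies from its field mean, so the results are not confined to [0, 1].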




Re: Normalizations in MLBase

Posted by Aslan Bekirov <as...@gmail.com>.
Thanks a lot, DB.

I will try z-score normalization using a map transformation.


BR,
Aslan



Re: Normalizations in MLBase

Posted by DB Tsai <db...@stanford.edu>.
Hi Aslan,

Currently, we don't have a utility function for this. However, you
can easily implement it with another map transformation. I'm working
on this feature now, and there will be a couple of different
normalization options that users can choose from.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
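
For the case in the original question, where the bounds 0 and 500 are known in advance, the map transformation mentioned above could be as simple as a linear rescale. The sketch below uses plain Scala collections so that it runs without Spark. minMaxScale is a hypothetical helper, not an MLlib API, and the sample values are made up.

```scala
// Hypothetical helper, not an MLlib API: linearly rescale x from [min, max] to [0, 1].
def minMaxScale(x: Double, min: Double, max: Double): Double = {
  require(max > min, "max must be greater than min")
  (x - min) / (max - min)
}

// Known bounds from the question: values lie in the range 0-500.
val values = Seq(0.0, 125.0, 250.0, 500.0)
val scaled = values.map(v => minMaxScale(v, 0.0, 500.0))
// With an RDD[Double] named rdd (an assumption), the same call would be
// rdd.map(v => minMaxScale(v, 0.0, 500.0)).
```

If the bounds are not known up front, they would first have to be computed from the data before the map step.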

