Posted to reviews@spark.apache.org by mengxr <gi...@git.apache.org> on 2014/08/07 00:10:49 UTC

[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/1814

    [SPARK-2852][MLLIB] Separate model from IDF/StandardScaler algorithms

    This is part of SPARK-2828:
    
    1. Separate the IDF model from the IDF algorithm (the algorithm now generates the model).
    2. Separate the StandardScaler model from the StandardScaler algorithm.
    
    CC: @dbtsai
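    
    A minimal usage sketch of the separated fit/transform pattern (illustrative of
    the intent only, not a final API surface):
    
        import org.apache.spark.mllib.feature.{IDF, StandardScaler}
        import org.apache.spark.mllib.linalg.Vector
        import org.apache.spark.rdd.RDD
    
        // tf: an RDD of term-frequency vectors
        def tfidf(tf: RDD[Vector]): RDD[Vector] = {
          val idfModel = new IDF().fit(tf)  // the algorithm produces an IDFModel
          idfModel.transform(tf)            // the model, not the algorithm, transforms
        }
    
        // data: an RDD of feature vectors
        def scale(data: RDD[Vector]): RDD[Vector] = {
          val scalerModel = new StandardScaler(false, true).fit(data)  // returns a StandardScalerModel
          scalerModel.transform(data)
        }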

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark feature-api-update

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1814.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1814
    
----
commit 89f3486c9d91add44464068e3ea515b0d3e7d261
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-08-06T21:52:34Z

    update IDF to separate Model from Algorithm

commit 48a0fffba40b1d0a3dc6f56063c1ccd0288e3712
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-08-06T22:08:40Z

    separate Model from StandardScaler algorithm

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1814#issuecomment-51438966
  
    QA tests have started for PR 1814. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18111/consoleFull




[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1814#discussion_r15920617
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
    @@ -177,18 +115,72 @@ private object IDF {
         private def isEmpty: Boolean = m == 0L
     
         /** Returns the current IDF vector. */
    -    def idf(): BDV[Double] = {
    +    def idf(): Vector = {
           if (isEmpty) {
             throw new IllegalStateException("Haven't seen any document yet.")
           }
           val n = df.length
    -      val inv = BDV.zeros[Double](n)
    +      val inv = new Array[Double](n)
           var j = 0
           while (j < n) {
             inv(j) = math.log((m + 1.0)/ (df(j) + 1.0))
             j += 1
           }
    -      inv
    +      Vectors.dense(inv)
         }
       }
     }
    +
    +/**
    + * :: Experimental ::
    + * Represents an IDF model that can transform term frequency vectors.
    + */
    +@Experimental
    +class IDFModel private[mllib] (val idf: Vector) extends Serializable {
    +
    +  /**
    +   * Transforms term frequency (TF) vectors to TF-IDF vectors.
    +   * @param dataset an RDD of term frequency vectors
    +   * @return an RDD of TF-IDF vectors
    +   */
    +  def transform(dataset: RDD[Vector]): RDD[Vector] = {
    +    val bcIdf = dataset.context.broadcast(idf)
    +    dataset.mapPartitions { iter =>
    +      val thisIdf = bcIdf.value
    +      iter.map { v =>
    +        val n = v.size
    +        v match {
    +          case sv: SparseVector =>
    +            val nnz = sv.indices.size
    +            val newValues = new Array[Double](nnz)
    +            var k = 0
    +            while (k < nnz) {
    +              newValues(k) = sv.values(k) * thisIdf(sv.indices(k))
    +              k += 1
    +            }
    +            Vectors.sparse(n, sv.indices, newValues)
    +          case dv: DenseVector =>
    +            val newValues = new Array[Double](n)
    +            var j = 0
    +            while (j < n) {
    +              newValues(j) = dv.values(j) * thisIdf(j)
    +              j += 1
    +            }
    +            Vectors.dense(newValues)
    +          case other =>
    +            throw new UnsupportedOperationException(
    --- End diff --
    
    We might want to use different error messages. In that case, having a util function doesn't save us much.




[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1814#discussion_r15908504
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
    @@ -177,18 +115,72 @@ private object IDF {
         private def isEmpty: Boolean = m == 0L
     
         /** Returns the current IDF vector. */
    -    def idf(): BDV[Double] = {
    +    def idf(): Vector = {
           if (isEmpty) {
             throw new IllegalStateException("Haven't seen any document yet.")
           }
           val n = df.length
    -      val inv = BDV.zeros[Double](n)
    +      val inv = new Array[Double](n)
           var j = 0
           while (j < n) {
             inv(j) = math.log((m + 1.0)/ (df(j) + 1.0))
             j += 1
           }
    -      inv
    +      Vectors.dense(inv)
         }
       }
     }
    +
    +/**
    + * :: Experimental ::
    + * Represents an IDF model that can transform term frequency vectors.
    + */
    +@Experimental
    +class IDFModel private[mllib] (val idf: Vector) extends Serializable {
    +
    +  /**
    +   * Transforms term frequency (TF) vectors to TF-IDF vectors.
    +   * @param dataset an RDD of term frequency vectors
    +   * @return an RDD of TF-IDF vectors
    +   */
    +  def transform(dataset: RDD[Vector]): RDD[Vector] = {
    +    val bcIdf = dataset.context.broadcast(idf)
    +    dataset.mapPartitions { iter =>
    +      val thisIdf = bcIdf.value
    +      iter.map { v =>
    +        val n = v.size
    +        v match {
    +          case sv: SparseVector =>
    +            val nnz = sv.indices.size
    +            val newValues = new Array[Double](nnz)
    +            var k = 0
    +            while (k < nnz) {
    +              newValues(k) = sv.values(k) * thisIdf(sv.indices(k))
    +              k += 1
    +            }
    +            Vectors.sparse(n, sv.indices, newValues)
    +          case dv: DenseVector =>
    +            val newValues = new Array[Double](n)
    +            var j = 0
    +            while (j < n) {
    +              newValues(j) = dv.values(j) * thisIdf(j)
    +              j += 1
    +            }
    +            Vectors.dense(newValues)
    +          case other =>
    +            throw new UnsupportedOperationException(
    --- End diff --
    
    The following exception is used for unsupported vector types in appendBias and StandardScaler; maybe we could have a global definition of this in a util:

        case v => throw new IllegalArgumentException("Do not support vector type " + v.getClass)
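    
    A rough sketch of what such a shared helper could look like (the object and
    method names here are just placeholders):
    
        private[mllib] object VectorTypeErrors {
          /** Fail consistently on vector types that a transformer does not handle. */
          def unsupported(v: Any): Nothing =
            throw new IllegalArgumentException("Do not support vector type " + v.getClass)
        }
    
        // at the call site: case other => VectorTypeErrors.unsupported(other)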




[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the pull request:

    https://github.com/apache/spark/pull/1814#issuecomment-51511617
  
    LGTM. Merged into both master and branch-1.1. Thanks! 




[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1814#discussion_r15920592
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
    @@ -35,38 +35,47 @@ import org.apache.spark.rdd.RDD
      * @param withStd True by default. Scales the data to unit standard deviation.
      */
     @Experimental
    -class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
    +class StandardScaler(withMean: Boolean, withStd: Boolean) {
     
       def this() = this(false, true)
     
       require(withMean || withStd, s"withMean and withStd both equal to false. Doing nothing.")
     
    -  private var mean: BV[Double] = _
    -  private var factor: BV[Double] = _
    -
       /**
        * Computes the mean and variance and stores as a model to be used for later scaling.
        *
        * @param data The data used to compute the mean and variance to build the transformation model.
    -   * @return This StandardScalar object.
    +   * @return a StandardScalarModel
        */
    -  def fit(data: RDD[Vector]): this.type = {
    +  def fit(data: RDD[Vector]): StandardScalerModel = {
         val summary = data.treeAggregate(new MultivariateOnlineSummarizer)(
           (aggregator, data) => aggregator.add(data),
           (aggregator1, aggregator2) => aggregator1.merge(aggregator2))
     
    -    mean = summary.mean.toBreeze
    -    factor = summary.variance.toBreeze
    -    require(mean.length == factor.length)
    +    val mean = summary.mean.toBreeze
    +    val factor = summary.variance.toBreeze
    +    require(mean.size == factor.size)
     
         var i = 0
    -    while (i < factor.length) {
    +    while (i < factor.size) {
           factor(i) = if (factor(i) != 0.0) 1.0 / math.sqrt(factor(i)) else 0.0
           i += 1
         }
     
    -    this
    +    new StandardScalerModel(withMean, withStd, mean, factor)
       }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Represents a StandardScaler model that can transform vectors.
    + */
    +@Experimental
    +class StandardScalerModel private[mllib] (
    +    val withMean: Boolean,
    +    val withStd: Boolean,
    +    val mean: BV[Double],
    +    val factor: BV[Double]) extends VectorTransformer {
     
    --- End diff --
    
    done.




[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1814#issuecomment-51409813
  
    QA results for PR 1814:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class StandardScaler(withMean: Boolean, withStd: Boolean) {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18063/consoleFull




[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1814#discussion_r15908219
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
    @@ -35,38 +35,47 @@ import org.apache.spark.rdd.RDD
      * @param withStd True by default. Scales the data to unit standard deviation.
      */
     @Experimental
    -class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
    +class StandardScaler(withMean: Boolean, withStd: Boolean) {
     
       def this() = this(false, true)
     
       require(withMean || withStd, s"withMean and withStd both equal to false. Doing nothing.")
     
    -  private var mean: BV[Double] = _
    -  private var factor: BV[Double] = _
    -
       /**
        * Computes the mean and variance and stores as a model to be used for later scaling.
        *
        * @param data The data used to compute the mean and variance to build the transformation model.
    -   * @return This StandardScalar object.
    +   * @return a StandardScalarModel
        */
    -  def fit(data: RDD[Vector]): this.type = {
    +  def fit(data: RDD[Vector]): StandardScalerModel = {
         val summary = data.treeAggregate(new MultivariateOnlineSummarizer)(
           (aggregator, data) => aggregator.add(data),
           (aggregator1, aggregator2) => aggregator1.merge(aggregator2))
     
    -    mean = summary.mean.toBreeze
    -    factor = summary.variance.toBreeze
    -    require(mean.length == factor.length)
    +    val mean = summary.mean.toBreeze
    +    val factor = summary.variance.toBreeze
    +    require(mean.size == factor.size)
     
         var i = 0
    -    while (i < factor.length) {
    +    while (i < factor.size) {
           factor(i) = if (factor(i) != 0.0) 1.0 / math.sqrt(factor(i)) else 0.0
           i += 1
         }
     
    -    this
    +    new StandardScalerModel(withMean, withStd, mean, factor)
       }
    +}
    +
    +/**
    + * :: Experimental ::
    + * Represents a StandardScaler model that can transform vectors.
    + */
    +@Experimental
    +class StandardScalerModel private[mllib] (
    +    val withMean: Boolean,
    +    val withStd: Boolean,
    +    val mean: BV[Double],
    +    val factor: BV[Double]) extends VectorTransformer {
     
    --- End diff --
    
    Since users may want to know the variance of the training set, should we have a constructor like
    
        class StandardScalerModel private[mllib] (
            val withMean: Boolean,
            val withStd: Boolean,
            val mean: BV[Double],
            val variance: BV[Double]) {

          // Derive the scaling factor lazily from the stored variance.
          lazy val factor = {
            val temp = variance.copy  // copy so we do not mutate the stored variance
            var i = 0
            while (i < temp.size) {
              temp(i) = if (temp(i) != 0.0) 1.0 / math.sqrt(temp(i)) else 0.0
              i += 1
            }
            temp
          }
        }




[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1814#issuecomment-51442882
  
    QA results for PR 1814:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class StandardScaler(withMean: Boolean, withStd: Boolean) extends Logging {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18111/consoleFull




[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1814




[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1814#issuecomment-51405165
  
    QA tests have started for PR 1814. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18063/consoleFull




[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1814#discussion_r15908318
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
    @@ -35,38 +35,47 @@ import org.apache.spark.rdd.RDD
      * @param withStd True by default. Scales the data to unit standard deviation.
      */
     @Experimental
    -class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
    +class StandardScaler(withMean: Boolean, withStd: Boolean) {
     
    --- End diff --
    
    This class is only used to keep the withMean and withStd state. Is it possible to move those parameters into the fit function by overloading, and make this an object?
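    
    A rough sketch of the signatures that suggestion implies (the object name is
    hypothetical, and the bodies just delegate to the existing class to keep the
    sketch short):
    
        import org.apache.spark.mllib.linalg.Vector
        import org.apache.spark.rdd.RDD
    
        object StandardScalerFit {
          // Default matches `new StandardScaler()`: no centering, unit standard deviation.
          def fit(data: RDD[Vector]): StandardScalerModel =
            fit(data, withMean = false, withStd = true)
    
          // Overload carrying the parameters that the class currently keeps as state.
          def fit(data: RDD[Vector], withMean: Boolean, withStd: Boolean): StandardScalerModel =
            new StandardScaler(withMean, withStd).fit(data)
        }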




[GitHub] spark pull request: [SPARK-2852][MLLIB] Separate model from IDF/St...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1814#discussion_r15920590
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala ---
    @@ -35,38 +35,47 @@ import org.apache.spark.rdd.RDD
      * @param withStd True by default. Scales the data to unit standard deviation.
      */
     @Experimental
    -class StandardScaler(withMean: Boolean, withStd: Boolean) extends VectorTransformer {
    +class StandardScaler(withMean: Boolean, withStd: Boolean) {
     
    --- End diff --
    
    The current API is more consistent with other algorithms like `LinearRegressionWithSGD`. We can add such overloads later, once we figure out the right way to specify parameters.
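    
    For comparison, the existing pattern this mirrors, sketched against the current
    MLlib API (an algorithm object holds its parameters and produces a separate
    model):
    
        import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
        import org.apache.spark.rdd.RDD
    
        def train(points: RDD[LabeledPoint]): LinearRegressionModel = {
          val lr = new LinearRegressionWithSGD()  // parameters live on the algorithm object
          lr.run(points)                          // training returns a LinearRegressionModel
        }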

