Posted to reviews@spark.apache.org by mengxr <gi...@git.apache.org> on 2014/07/31 00:26:30 UTC

[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/1671

    [SPARK-2511][MLLIB] add HashingTF and IDF

    This is roughly the TF-IDF implementation used in the Databricks Cloud Demo (http://databricks.com/cloud/).
    
    Both `HashingTF` and `IDF` are implemented as transformers, similar to scikit-learn.
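
    A minimal usage sketch (an editor's illustration based on the API in
    this patch, not text from the PR; `sc` is assumed to be an existing
    SparkContext and the sample corpus is made up):

        import org.apache.spark.mllib.feature.{HashingTF, IDF}
        import org.apache.spark.mllib.linalg.Vector
        import org.apache.spark.rdd.RDD

        // Each document is a bag of terms, e.g. tokenized words.
        val documents: RDD[Seq[String]] = sc.parallelize(Seq(
          Seq("spark", "mllib", "tf", "idf"),
          Seq("spark", "hashing", "trick")))

        // Hash each term into a fixed-size term frequency vector.
        val hashingTF = new HashingTF()
        val tf: RDD[Vector] = hashingTF.transform(documents)

        // Fit the IDF model on the corpus, then rescale the term frequencies.
        val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)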

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark tfidf

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1671.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1671
    
----
commit 3cc03487759f1f4d5da935c4d5eabc9b8ccdac6e
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-07-11T06:23:16Z

    add HashingTF and IDF

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50724374
  
    QA tests have started for PR 1671. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17569/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50740749
  
    QA results for PR 1671:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class HashingTF(val numFeatures: Int) extends Serializable {
      class IDF {
      class DocumentFrequencyAggregator extends Serializable {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17575/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50701382
  
    QA results for PR 1671:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class HashingTF(val numFeatures: Int) extends Serializable {
      class IDF {
      class DocumentFrequencyAggregator extends Serializable {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17529/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50706260
  
    QA results for PR 1671:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class HashingTF(val numFeatures: Int) extends Serializable {
      class IDF {
      class DocumentFrequencyAggregator extends Serializable {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17531/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50736215
  
    QA tests have started for PR 1671. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17575/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1671#discussion_r15629090
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
    @@ -0,0 +1,194 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import breeze.linalg.{DenseVector => BDV}
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
    +import org.apache.spark.mllib.rdd.RDDFunctions._
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Inverse document frequency (IDF).
    + * The standard formulation is used: `idf = log((m + 1) / (d(t) + 1))`, where `m` is the total
    + * number of documents and `d(t)` is the number of documents that contain term `t`.
    + */
    +@Experimental
    +class IDF {
    +
    +  // TODO: Allow different IDF formulations.
    +
    +  private var brzIdf: BDV[Double] = _
    +
    +  /**
    +   * Computes the inverse document frequency.
    +   * @param dataset an RDD of term frequency vectors
    +   */
    +  def fit(dataset: RDD[Vector]): this.type = {
    +    brzIdf = dataset.treeAggregate(new IDF.DocumentFrequencyAggregator)(
    +      seqOp = (df, v) => df.add(v),
    +      combOp = (df1, df2) => df1.merge(df2)
    +    ).idf()
    +    this
    +  }
    +
    +  /**
    +   * Computes the inverse document frequency.
    +   * @param dataset a JavaRDD of term frequency vectors
    +   */
    +  def fit(dataset: JavaRDD[Vector]): this.type = {
    --- End diff --
    
    Ah okay, sorry, I missed that



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50715814
  
    QA tests have started for PR 1671. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17557/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50726519
  
    Looks good to me API-wise, though I haven't checked the math in detail.
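
    (Editor's note: the formula documented in IDF.scala is
    `idf = log((m + 1) / (d(t) + 1))`; a quick numeric spot check with
    made-up numbers:)

        // m = 4 documents in total; term t appears in d(t) = 1 of them.
        val idf = math.log((4.0 + 1.0) / (1.0 + 1.0))  // log(2.5) ≈ 0.9163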



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-51371929
  
    Fair enough.
    
    The standard approach for the same token appearing multiple times in a
    training example is additive, so it would act the same as the HashingTF
    computation for text (and as a one-hot style encoder for categorical
    features, and just normally for real-valued features).
    
    To deal with collisions, one typically takes signed features by either
    (1) applying two hashes, one of which determines the sign (cf. VW), or
    (2) using just one signed hash function and taking the sign of the hash
    value as the sign of the feature (cf. scikit-learn). The idea is that
    collisions cancel each other out.
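
    (Editor's sketch of option (2), for illustration only; it is not part
    of this patch. The `term.##` hashing and the non-negative mod mirror
    HashingTF, while `signedTermFrequencies` is a hypothetical helper:)

        import scala.collection.mutable

        def signedTermFrequencies(document: Iterable[_], numFeatures: Int): Seq[(Int, Double)] = {
          val values = mutable.HashMap.empty[Int, Double]
          document.foreach { term =>
            val h = term.##
            // Same bucket as HashingTF: a non-negative mod of the hash code.
            val i = ((h % numFeatures) + numFeatures) % numFeatures
            // The sign of the hash value becomes the sign of the feature,
            // so colliding terms tend to cancel rather than pile up.
            val sign = if (h >= 0) 1.0 else -1.0
            values.put(i, values.getOrElse(i, 0.0) + sign)
          }
          values.toSeq
        }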
    
    
    On Wed, Aug 6, 2014 at 7:01 PM, Xiangrui Meng <no...@github.com>
    wrote:
    
    > @MLnick <https://github.com/MLnick> FeatureHasher might be too general.
    > One thing it is not clear from its name is how to resolve conflicts: same
    > word appears more than once, or different words having the same hash value.
    > We can add FeatureHasher (or called HashingVectorizer) later. I think it
    > might be useful to keep HashingTF as a special case.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/1671#issuecomment-51364477>.
    >



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50735679
  
    Jenkins, retest this please.



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1671#discussion_r15629148
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala ---
    @@ -0,0 +1,79 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import java.lang.{Iterable => JavaIterable}
    +
    +import scala.collection.convert.WrapAsScala
    +import scala.collection.mutable
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: Experimental ::
    + * Maps a sequence of terms to their term frequencies using the hashing trick.
    + *
    + * @param numFeatures number of features (default: 1000000)
    + */
    +@Experimental
    +class HashingTF(val numFeatures: Int) extends Serializable {
    +
    +  def this() = this(1000000)
    +
    +  /**
    +   * Returns the index of the input term.
    +   */
    +  def indexOf(term: Any): Int = Utils.nonNegativeMod(term.##, numFeatures)
    +
    +  /**
    +   * Transforms the input document into a sparse term frequency vector.
    +   */
    +  def transform(document: Iterable[_]): Vector = {
    +    val termFrequencies = mutable.HashMap.empty[Int, Double]
    +    document.foreach { term =>
    +      val i = indexOf(term)
    +      termFrequencies.put(i, termFrequencies.getOrElse(i, 0.0) + 1.0)
    +    }
    +    Vectors.sparse(numFeatures, termFrequencies.toSeq)
    +  }
    +
    +  /**
    +   * Transforms the input document into a sparse term frequency vector (Java version).
    +   */
    +  def transform(document: JavaIterable[_]): Vector = {
    +    transform(WrapAsScala.iterableAsScalaIterable(document))
    --- End diff --
    
    Just curious, is this more efficient than importing `scala.collection.JavaConverters._` and doing `document.asScala`? I think that's a more common way to do this. (It's better than the older JavaConversions, which just had a conversion from java.Iterable to scala.Iterable).



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50703957
  
    QA tests have started for PR 1671. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17531/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1671#discussion_r15618551
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/text/HashingTF.scala ---
    @@ -0,0 +1,61 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature.text
    +
    +import scala.collection.mutable
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: Experimental ::
    + * Maps a sequence of terms to their term frequencies using the hashing trick.
    + *
    + * @param numFeatures number of features (default: 1000000)
    + */
    +@Experimental
    +class HashingTF(val numFeatures: Int) extends Serializable {
    +
    +  def this() = this(1000000)
    +
    +  /**
    +   * Returns the index of the input term.
    +   */
    +  def indexOf(term: Any): Int = Utils.nonNegativeMod(term.##, numFeatures)
    +
    +  /**
    +   * Transforms the input document into a sparse term frequency vector.
    +   */
    +  def transform(document: Iterable[_]): Vector = {
    +    val termFrequencies = mutable.HashMap.empty[Int, Double]
    +    document.foreach { term =>
    +      val i = indexOf(term)
    +      termFrequencies.put(i, termFrequencies.getOrElse(i, 0.0) + 1.0)
    +    }
    +    Vectors.sparse(numFeatures, termFrequencies.toSeq)
    +  }
    +
    +  /**
    +   * Transforms the input document to term frequency vectors.
    +   */
    +  def transform[D <: Iterable[_]](dataset: RDD[D]): RDD[Vector] = {
    +    dataset.map(this.transform)
    +  }
    --- End diff --
    
    Can you add a Java-friendly version of these tools? This looks like `scala.Iterable`. For the Java one you will probably have to make it take `JavaRDD`, since we can't have methods that differ only in the element type of the RDD.
    
    Once you add it, please add a Java test that uses it as well so we make sure it compiles in Java with no surprises.



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50691489
  
    QA tests have started for PR 1671. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17501/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50718111
  
    QA results for PR 1671:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class HashingTF(val numFeatures: Int) extends Serializable {
      class IDF {
      class DocumentFrequencyAggregator extends Serializable {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17557/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1671#discussion_r15624672
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
    @@ -0,0 +1,194 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import breeze.linalg.{DenseVector => BDV}
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
    +import org.apache.spark.mllib.rdd.RDDFunctions._
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Inverse document frequency (IDF).
    + * The standard formulation is used: `idf = log((m + 1) / (d(t) + 1))`, where `m` is the total
    + * number of documents and `d(t)` is the number of documents that contain term `t`.
    + */
    +@Experimental
    +class IDF {
    +
    +  // TODO: Allow different IDF formulations.
    +
    +  private var brzIdf: BDV[Double] = _
    +
    +  /**
    +   * Computes the inverse document frequency.
    +   * @param dataset an RDD of term frequency vectors
    +   */
    +  def fit(dataset: RDD[Vector]): this.type = {
    +    brzIdf = dataset.treeAggregate(new IDF.DocumentFrequencyAggregator)(
    +      seqOp = (df, v) => df.add(v),
    +      combOp = (df1, df2) => df1.merge(df2)
    +    ).idf()
    +    this
    +  }
    +
    +  /**
    +   * Computes the inverse document frequency.
    +   * @param dataset a JavaRDD of term frequency vectors
    +   */
    +  def fit(dataset: JavaRDD[Vector]): this.type = {
    --- End diff --
    
    Will the return type of this show up okay in Java? It would be good to add a Java example that uses this just to check that everything works.



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1671#discussion_r15627165
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
    @@ -0,0 +1,194 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import breeze.linalg.{DenseVector => BDV}
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}
    +import org.apache.spark.mllib.rdd.RDDFunctions._
    +import org.apache.spark.rdd.RDD
    +
    +/**
    + * :: Experimental ::
    + * Inverse document frequency (IDF).
    + * The standard formulation is used: `idf = log((m + 1) / (d(t) + 1))`, where `m` is the total
    + * number of documents and `d(t)` is the number of documents that contain term `t`.
    + */
    +@Experimental
    +class IDF {
    +
    +  // TODO: Allow different IDF formulations.
    +
    +  private var brzIdf: BDV[Double] = _
    +
    +  /**
    +   * Computes the inverse document frequency.
    +   * @param dataset an RDD of term frequency vectors
    +   */
    +  def fit(dataset: RDD[Vector]): this.type = {
    +    brzIdf = dataset.treeAggregate(new IDF.DocumentFrequencyAggregator)(
    +      seqOp = (df, v) => df.add(v),
    +      combOp = (df1, df2) => df1.merge(df2)
    +    ).idf()
    +    this
    +  }
    +
    +  /**
    +   * Computes the inverse document frequency.
    +   * @param dataset a JavaRDD of term frequency vectors
    +   */
    +  def fit(dataset: JavaRDD[Vector]): this.type = {
    --- End diff --
    
    Yes. In the Java test suite, I put `idf.fit(...).transform(...)`.



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1671#discussion_r15775472
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala ---
    @@ -0,0 +1,79 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import java.lang.{Iterable => JavaIterable}
    +
    +import scala.collection.JavaConverters._
    +import scala.collection.mutable
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: Experimental ::
    + * Maps a sequence of terms to their term frequencies using the hashing trick.
    + *
    + * @param numFeatures number of features (default: 1000000)
    + */
    +@Experimental
    +class HashingTF(val numFeatures: Int) extends Serializable {
    +
    +  def this() = this(1000000)
    +
    --- End diff --
    
    @mengxr @mateiz this looks really awesome!
    
    I know this is merged already, but one comment: when using the hashing trick, should the vector size not usually be a power of 2? This is mentioned [here](http://scikit-learn.org/stable/modules/feature_extraction.html#implementation-details) for example.
    
    Pretty much every library uses powers of 2 - Vowpal Wabbit, sophia-ml, scikit-learn and shogun for example.
    
    So it may be worth mentioning, and making the default 2^20 (or 2^18, which is also a common default).
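
    (Editor's note: with the constructor in this patch that would be, e.g.:)

        val tf = new HashingTF(1 << 20)  // 2^20 = 1,048,576 features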





[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50691546
  
    QA results for PR 1671:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class HashingTF(val numFeatures: Int) extends Serializable {
      class IDF {
      class DocumentFrequencyAggregator extends Serializable {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17501/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1671#discussion_r15618968
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/text/HashingTF.scala ---
    @@ -0,0 +1,61 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature.text
    +
    +import scala.collection.mutable
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: Experimental ::
    + * Maps a sequence of terms to their term frequencies using the hashing trick.
    + *
    + * @param numFeatures number of features (default: 1000000)
    + */
    +@Experimental
    +class HashingTF(val numFeatures: Int) extends Serializable {
    +
    +  def this() = this(1000000)
    +
    +  /**
    +   * Returns the index of the input term.
    +   */
    +  def indexOf(term: Any): Int = Utils.nonNegativeMod(term.##, numFeatures)
    +
    +  /**
    +   * Transforms the input document into a sparse term frequency vector.
    +   */
    +  def transform(document: Iterable[_]): Vector = {
    +    val termFrequencies = mutable.HashMap.empty[Int, Double]
    +    document.foreach { term =>
    +      val i = indexOf(term)
    +      termFrequencies.put(i, termFrequencies.getOrElse(i, 0.0) + 1.0)
    +    }
    +    Vectors.sparse(numFeatures, termFrequencies.toSeq)
    +  }
    +
    +  /**
    +   * Transforms the input document to term frequency vectors.
    +   */
    +  def transform[D <: Iterable[_]](dataset: RDD[D]): RDD[Vector] = {
    +    dataset.map(this.transform)
    +  }
    --- End diff --
    
    Sure.



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50735051
  
    QA results for PR 1671:
    - This patch FAILED unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class HashingTF(val numFeatures: Int) extends Serializable {
      class IDF {
      class DocumentFrequencyAggregator extends Serializable {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17571/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1671#discussion_r15887125
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala ---
    @@ -0,0 +1,79 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import java.lang.{Iterable => JavaIterable}
    +
    +import scala.collection.JavaConverters._
    +import scala.collection.mutable
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: Experimental ::
    + * Maps a sequence of terms to their term frequencies using the hashing trick.
    + *
    + * @param numFeatures number of features (default: 1000000)
    + */
    +@Experimental
    +class HashingTF(val numFeatures: Int) extends Serializable {
    +
    +  def this() = this(1000000)
    +
    --- End diff --
    
    Good point. I will update the default value.





[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-51106074
  
    @mengxr @mateiz just a broad note on this: it seems like this could (should?) be a broader `FeatureHasher`, implemented as a transform similar to scikit-learn's approach. Vectorization for TF is sort of a special case.
    
    Thoughts or plans for this sort of more generic feature transformer/extraction interface and for pipelines?





[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-51364477
  
    @MLnick `FeatureHasher` might be too general. One thing it is not clear from its name is how to resolve conflicts: same word appears more than once, or different words having the same hash value. We can add `FeatureHasher` (or called `HashingVectorizer`) later. I think it might be useful to keep `HashingTF` as a special case.





[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-51375372
  
    Thanks for explaining the different approaches. What would be best for `HashingTF`? Adding the counts, or using random signs and bounding at 0? I know that with random signs we can preserve the structure, but I'm not sure whether this is standard practice for NLP.





[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50810069
  
    @mateiz Thanks for reviewing the code! I merged this into master.



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1671



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50730369
  
    QA tests have started for PR 1671. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17571/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50701075
  
    Jenkins, test this please.



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50729982
  
    QA results for PR 1671:
    - This patch PASSES unit tests.
    - This patch merges cleanly.
    - This patch adds the following public classes (experimental):
      class HashingTF(val numFeatures: Int) extends Serializable {
      class IDF {
      class DocumentFrequencyAggregator extends Serializable {

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17569/consoleFull



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1671#discussion_r15630475
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala ---
    @@ -0,0 +1,79 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.feature
    +
    +import java.lang.{Iterable => JavaIterable}
    +
    +import scala.collection.convert.WrapAsScala
    +import scala.collection.mutable
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.api.java.JavaRDD
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.util.Utils
    +
    +/**
    + * :: Experimental ::
    + * Maps a sequence of terms to their term frequencies using the hashing trick.
    + *
    + * @param numFeatures number of features (default: 1000000)
    + */
    +@Experimental
    +class HashingTF(val numFeatures: Int) extends Serializable {
    +
    +  def this() = this(1000000)
    +
    +  /**
    +   * Returns the index of the input term.
    +   */
    +  def indexOf(term: Any): Int = Utils.nonNegativeMod(term.##, numFeatures)
    +
    +  /**
    +   * Transforms the input document into a sparse term frequency vector.
    +   */
    +  def transform(document: Iterable[_]): Vector = {
    +    val termFrequencies = mutable.HashMap.empty[Int, Double]
    +    document.foreach { term =>
    +      val i = indexOf(term)
    +      termFrequencies.put(i, termFrequencies.getOrElse(i, 0.0) + 1.0)
    +    }
    +    Vectors.sparse(numFeatures, termFrequencies.toSeq)
    +  }
    +
    +  /**
    +   * Transforms the input document into a sparse term frequency vector (Java version).
    +   */
    +  def transform(document: JavaIterable[_]): Vector = {
    +    transform(WrapAsScala.iterableAsScalaIterable(document))
    --- End diff --
    
    IntelliJ hints that many functions under `JavaConverters` are deprecated, so I followed the message and found the underlying converter. But it seems to be unnecessary. Changed it back to use `JavaConverters`.
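
    (Editor's note: the `JavaConverters` form Matei suggested, presumably
    close to what the patch ended up with:)

        import java.lang.{Iterable => JavaIterable}
        import scala.collection.JavaConverters._

        def transform(document: JavaIterable[_]): Vector = {
          transform(document.asScala)
        }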



[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by MLnick <gi...@git.apache.org>.
Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-51380681
  
    True, for text feature extraction you'd want only positive count values
    (scikit-learn has a non-negative flag: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher)
    
    Not suggesting any change to `HashingTF` now; just considering the more general use cases this opens up (i.e. VW-style hashing, including namespaces, etc.).
    
    This can perhaps be revisited for 1.2+: the default could be as-is (non-negative features), with an option to use the signed approach. `HashingTF` then becomes a more general-use feature hasher (the name could change, or, depending on how `HashingTF` evolves, it could become a `HashingVectorizer`-like thing).





[GitHub] spark pull request: [SPARK-2511][MLLIB] add HashingTF and IDF

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1671#issuecomment-50701377
  
    QA tests have started for PR 1671. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17529/consoleFull

