You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by sr...@apache.org on 2016/04/20 12:45:13 UTC

spark git commit: [SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF

Repository: spark
Updated Branches:
  refs/heads/master 17db4bfea -> ed9d80385


[SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF

## What changes were proposed in this pull request?

Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.

## How was this patch tested?

unit tests and doc generation

Author: Yuhao Yang <hh...@gmail.com>

Closes #12454 from hhbyyh/tfdoc.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ed9d8038
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ed9d8038
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ed9d8038

Branch: refs/heads/master
Commit: ed9d80385486cd39a84a689ef467795262af919a
Parents: 17db4bf
Author: Yuhao Yang <hh...@gmail.com>
Authored: Wed Apr 20 11:45:08 2016 +0100
Committer: Sean Owen <so...@cloudera.com>
Committed: Wed Apr 20 11:45:08 2016 +0100

----------------------------------------------------------------------
 docs/ml-features.md                                  | 15 ++++++++++++---
 .../apache/spark/examples/ml/JavaTfIdfExample.java   |  2 ++
 examples/src/main/python/ml/tf_idf_example.py        |  2 ++
 .../org/apache/spark/examples/ml/TfIdfExample.scala  |  2 ++
 4 files changed, 18 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/ed9d8038/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 876d21f..11d5acb 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -22,10 +22,19 @@ This section covers algorithms for working with features, roughly divided into t
 
 [Term Frequency-Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common text pre-processing step.  In Spark ML, TF-IDF is separate into two parts: TF (+hashing) and IDF.
 
-**TF**: `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into fixed-length feature vectors.  In text processing, a "set of terms" might be a bag of words.
-The algorithm combines Term Frequency (TF) counts with the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+**TF**: Both `HashingTF` and `CountVectorizer` can be used to generate the term frequency vectors. 
 
-**IDF**: `IDF` is an `Estimator` which fits on a dataset and produces an `IDFModel`.  The `IDFModel` takes feature vectors (generally created from `HashingTF`) and scales each column.  Intuitively, it down-weights columns which appear frequently in a corpus.
+`HashingTF` is a `Transformer` which takes sets of terms and converts those sets into 
+fixed-length feature vectors.  In text processing, a "set of terms" might be a bag of words.
+The algorithm combines Term Frequency (TF) counts with the 
+[hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+
+`CountVectorizer` converts text documents to vectors of term counts. Refer to [CountVectorizer
+](ml-features.html#countvectorizer) for more details.
+
+**IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`.  The 
+`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and scales each column.  
+Intuitively, it down-weights columns which appear frequently in a corpus.
 
 Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency.
 

http://git-wip-us.apache.org/repos/asf/spark/blob/ed9d8038/examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java
----------------------------------------------------------------------
diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java
index 37a3d0d..107c835 100644
--- a/examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java
+++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaTfIdfExample.java
@@ -63,6 +63,8 @@ public class JavaTfIdfExample {
       .setOutputCol("rawFeatures")
       .setNumFeatures(numFeatures);
     Dataset<Row> featurizedData = hashingTF.transform(wordsData);
+    // alternatively, CountVectorizer can also be used to get term frequency vectors
+
     IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
     IDFModel idfModel = idf.fit(featurizedData);
     Dataset<Row> rescaledData = idfModel.transform(featurizedData);

http://git-wip-us.apache.org/repos/asf/spark/blob/ed9d8038/examples/src/main/python/ml/tf_idf_example.py
----------------------------------------------------------------------
diff --git a/examples/src/main/python/ml/tf_idf_example.py b/examples/src/main/python/ml/tf_idf_example.py
index c923133..141324d 100644
--- a/examples/src/main/python/ml/tf_idf_example.py
+++ b/examples/src/main/python/ml/tf_idf_example.py
@@ -37,6 +37,8 @@ if __name__ == "__main__":
     wordsData = tokenizer.transform(sentenceData)
     hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
     featurizedData = hashingTF.transform(wordsData)
+    # alternatively, CountVectorizer can also be used to get term frequency vectors
+
     idf = IDF(inputCol="rawFeatures", outputCol="features")
     idfModel = idf.fit(featurizedData)
     rescaledData = idfModel.transform(featurizedData)

http://git-wip-us.apache.org/repos/asf/spark/blob/ed9d8038/examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala
----------------------------------------------------------------------
diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala
index 28115f9..396f073 100644
--- a/examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala
@@ -43,6 +43,8 @@ object TfIdfExample {
     val hashingTF = new HashingTF()
       .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
     val featurizedData = hashingTF.transform(wordsData)
+    // alternatively, CountVectorizer can also be used to get term frequency vectors
+
     val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
     val idfModel = idf.fit(featurizedData)
     val rescaledData = idfModel.transform(featurizedData)


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org