You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/23 11:31:25 UTC

[GitHub] [spark] zhengruifeng opened a new pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

zhengruifeng opened a new pull request #29852:
URL: https://github.com/apache/spark/pull/29852


   ### What changes were proposed in this pull request?
   `HashingTF` use `util.collection.OpenHashMap` instead of `mutable.HashMap`
   
   
   ### Why are the changes needed?
   according to `util.collection.OpenHashMap` 's doc:
   
   > This map is about 5X faster than java.util.HashMap, while using much less space overhead.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   existing testsuites
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #29852:
URL: https://github.com/apache/spark/pull/29852#issuecomment-698162567


   ping @huaxingao 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #29852:
URL: https://github.com/apache/spark/pull/29852#discussion_r494713601



##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala
##########
@@ -91,20 +90,13 @@ class HashingTF @Since("3.0.0") private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val outputSchema = transformSchema(dataset.schema)
-    val localNumFeatures = $(numFeatures)
-    val localBinary = $(binary)
+    val n = $(numFeatures)
+    val updateFunc = if ($(binary)) (v: Double) => 1.0 else (v: Double) => v + 1.0
 
     val hashUDF = udf { terms: Seq[_] =>
-      val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
-      terms.foreach { term =>
-        val i = indexOf(term)
-        if (localBinary) {
-          termFrequencies(i) = 1.0
-        } else {
-          termFrequencies(i) += 1.0
-        }
-      }
-      Vectors.sparse(localNumFeatures, termFrequencies.toSeq)
+      val map = new OpenHashMap[Int, Double]()

Review comment:
       yes, the comment in `util.collection.OpenHashMap` only refer to java hashmap, but to my knowledge, scala hashmap is slower than java hashmap.
   
   and there is was also a [performance test](https://gist.github.com/rxin/44657c3f40d4e4294857) between java hashmap and scala hashmap: scala hashmap < java hashmap < scala openhashmap
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a change in pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #29852:
URL: https://github.com/apache/spark/pull/29852#discussion_r494390254



##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala
##########
@@ -91,20 +90,13 @@ class HashingTF @Since("3.0.0") private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val outputSchema = transformSchema(dataset.schema)
-    val localNumFeatures = $(numFeatures)
-    val localBinary = $(binary)
+    val n = $(numFeatures)
+    val updateFunc = if ($(binary)) (v: Double) => 1.0 else (v: Double) => v + 1.0
 
     val hashUDF = udf { terms: Seq[_] =>
-      val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
-      terms.foreach { term =>
-        val i = indexOf(term)
-        if (localBinary) {
-          termFrequencies(i) = 1.0
-        } else {
-          termFrequencies(i) += 1.0
-        }
-      }
-      Vectors.sparse(localNumFeatures, termFrequencies.toSeq)
+      val map = new OpenHashMap[Int, Double]()

Review comment:
       This seems fine but is it faster than Scala's Map? the comment refers to the Java HashMap.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #29852:
URL: https://github.com/apache/spark/pull/29852#issuecomment-698162567


   ping @huaxingao 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen closed pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
srowen closed pull request #29852:
URL: https://github.com/apache/spark/pull/29852


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #29852:
URL: https://github.com/apache/spark/pull/29852#issuecomment-699494570


   Merged to master


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a change in pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #29852:
URL: https://github.com/apache/spark/pull/29852#discussion_r494713601



##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala
##########
@@ -91,20 +90,13 @@ class HashingTF @Since("3.0.0") private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val outputSchema = transformSchema(dataset.schema)
-    val localNumFeatures = $(numFeatures)
-    val localBinary = $(binary)
+    val n = $(numFeatures)
+    val updateFunc = if ($(binary)) (v: Double) => 1.0 else (v: Double) => v + 1.0
 
     val hashUDF = udf { terms: Seq[_] =>
-      val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
-      terms.foreach { term =>
-        val i = indexOf(term)
-        if (localBinary) {
-          termFrequencies(i) = 1.0
-        } else {
-          termFrequencies(i) += 1.0
-        }
-      }
-      Vectors.sparse(localNumFeatures, termFrequencies.toSeq)
+      val map = new OpenHashMap[Int, Double]()

Review comment:
       yes, the comment in `util.collection.OpenHashMap` only refer to java hashmap, but to my knowledge, scala hashmap is slower than java hashmap.
   
   and there is was also a [performance test](https://gist.github.com/rxin/44657c3f40d4e4294857) between java hashmap and scala hashmap: scala hashmap < java hashmap < scala openhashmap
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29852:
URL: https://github.com/apache/spark/pull/29852#issuecomment-697348502






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29852:
URL: https://github.com/apache/spark/pull/29852#issuecomment-697307454


   **[Test build #129028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129028/testReport)** for PR 29852 at commit [`000ca35`](https://github.com/apache/spark/commit/000ca3537d49a32b30b4f4bead00a2bdff790a43).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #29852:
URL: https://github.com/apache/spark/pull/29852#issuecomment-699570008


   Thanks @srowen and @huaxingao for help reviewing


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29852:
URL: https://github.com/apache/spark/pull/29852#issuecomment-697305023






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29852:
URL: https://github.com/apache/spark/pull/29852#issuecomment-697348502






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29852:
URL: https://github.com/apache/spark/pull/29852#issuecomment-697305023






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29852:
URL: https://github.com/apache/spark/pull/29852#issuecomment-697307454


   **[Test build #129028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129028/testReport)** for PR 29852 at commit [`000ca35`](https://github.com/apache/spark/commit/000ca3537d49a32b30b4f4bead00a2bdff790a43).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29852:
URL: https://github.com/apache/spark/pull/29852#issuecomment-697347990


   **[Test build #129028 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129028/testReport)** for PR 29852 at commit [`000ca35`](https://github.com/apache/spark/commit/000ca3537d49a32b30b4f4bead00a2bdff790a43).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a change in pull request #29852: [SPARK-21481][ML][FOLLOWUP][Trivial] HashingTF use util.collection.OpenHashMap instead of mutable.HashMap

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #29852:
URL: https://github.com/apache/spark/pull/29852#discussion_r494390254



##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala
##########
@@ -91,20 +90,13 @@ class HashingTF @Since("3.0.0") private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
     val outputSchema = transformSchema(dataset.schema)
-    val localNumFeatures = $(numFeatures)
-    val localBinary = $(binary)
+    val n = $(numFeatures)
+    val updateFunc = if ($(binary)) (v: Double) => 1.0 else (v: Double) => v + 1.0
 
     val hashUDF = udf { terms: Seq[_] =>
-      val termFrequencies = mutable.HashMap.empty[Int, Double].withDefaultValue(0.0)
-      terms.foreach { term =>
-        val i = indexOf(term)
-        if (localBinary) {
-          termFrequencies(i) = 1.0
-        } else {
-          termFrequencies(i) += 1.0
-        }
-      }
-      Vectors.sparse(localNumFeatures, termFrequencies.toSeq)
+      val map = new OpenHashMap[Int, Double]()

Review comment:
       This seems fine but is it faster than Scala's Map? the comment refers to the Java HashMap.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org