Posted to reviews@spark.apache.org by Ishiihara <gi...@git.apache.org> on 2014/08/10 07:02:49 UTC

[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

GitHub user Ishiihara opened a pull request:

    https://github.com/apache/spark/pull/1871

    [SPARK-2907] [MLlib] Use mutable.HashMap to represent model in Word2Vec

    Change list:
    1. Use mutable.HashMap to represent syn0Global and syn1Global to reduce shuffle size.
    2. Introduce a local vocabulary to perform more precise learning rate updates.
    3. Replace layer1Size with vectorSize to set the vector size correctly. Previously, layer1Size always took the default value of vectorSize.
    
    For 100 partitions, using mutable.HashMap reduces shuffle size from 8.1G to 4G.
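
The shuffle saving in item 1 presumably comes from each partition sending only the rows it actually touched instead of a full dense array. A minimal sketch of that idea in plain Scala, without Spark; the names `SparseModel`, `touch`, and `merge` are hypothetical and not taken from the patch, and plain summation stands in for whatever combine rule the patch uses:

```scala
import scala.collection.mutable

object SparseModelSketch {
  // Hypothetical alias: word index -> that word's vector (one row of syn0 or syn1).
  type SparseModel = mutable.HashMap[Int, Array[Float]]

  // Add `delta` to the row for `wordIndex`; the row is allocated on first touch only.
  def touch(model: SparseModel, wordIndex: Int, delta: Array[Float]): Unit = {
    val row = model.getOrElseUpdate(wordIndex, new Array[Float](delta.length))
    var i = 0
    while (i < delta.length) {
      row(i) += delta(i)
      i += 1
    }
  }

  // Combine two partial models in place; only rows present in `right` are visited.
  // Assumption: rows are summed. The patch itself may weight them differently.
  def merge(left: SparseModel, right: SparseModel): SparseModel = {
    right.foreach { case (wordIndex, row) => touch(left, wordIndex, row) }
    left
  }
}
```

A partition that sees only a small part of the vocabulary then produces a map with correspondingly few entries, which is what would be serialized during the shuffle.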

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Ishiihara/spark Word2Vec-improve

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1871.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1871
    
----
commit 8d6befe21e26cc843fc96e4c2934a15c0797ce51
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-01T07:45:22Z

    initial commit

commit 0aafb1b02a19fe4f1689543baf1882a49a7ff11a
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-01T15:34:11Z

    Add comments, minor fixes

commit e4a04d32be284f9a7ab2d3f57d745342912930a7
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-01T15:46:38Z

    minor fix

commit 57dc50d3f24beda8eb0348c0baf8dc343065fd2d
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-01T16:20:10Z

    code formatting

commit 2e92b5991ad8f3f73bbeab9a056f452c4b532b3c
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-02T01:17:38Z

    modify according to feedback

commit 720b5a3ea697a881fc7d7c286b65ef110421f89e
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-02T05:53:03Z

    Add test for Word2Vec algorithm, minor fixes

commit 6bcc8be34f6253bc7d4f9d4dcb478bf91f108c86
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-03T18:15:09Z

    add multiple iteration support

commit 7efbb6f91ca94f9243dbb7a16ea3fc9b6f548b99
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-03T19:16:19Z

    use broadcast version of vocab in aggregate

commit 1a8fb4127b9433945e75beea16fc2d485a249219
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-03T23:24:35Z

    use weighted sum in combOp

commit e93e7263d74879379257e6fff40d5efc8417f2ce
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-04T03:53:21Z

    use treeAggregate instead of aggregate

commit 384c77185544d6f80de96bd366e19760eacbd936
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-08-04T04:33:05Z

    remove minCount and window from constructor
    change model to use float instead of double

commit c14da411d4da1b6553759afff7952ac746c9fa15
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-08-04T05:09:58Z

    fix styles

commit 26a948d7e4b8f8cbc91cc7db5cf0acc7d6f08131
Author: Liquan Pei <li...@gmail.com>
Date:   2014-08-04T05:15:27Z

    Merge pull request #1 from mengxr/Ishiihara-master
    
    some updates

commit e2484414d65c3b8aebffa79c3cac34452cf53d38
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-04T05:47:53Z

    minor style change

commit 2ba948384e96e79e95a529f032d4768f24236547
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-04T05:59:40Z

    minor fix for Word2Vec test

commit 74b647b3edb87212c57cf6c5e77d627b0aebb67f
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-07T00:28:53Z

    confict resolution

commit e73fd4c8688cc7bbbf49fa68456fb1c83a29d0e6
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-10T03:44:15Z

    Merge remote-tracking branch 'upstream/master'

commit a8ccea59e65708d1be708a602369084b90c6fc49
Author: Liquan Pei <lp...@gopivotal.com>
Date:   2014-08-10T04:44:17Z

    use mutable.HashMap to represent model

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-52702570
  
    We merged #1932 instead. 




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51707392
  
    QA tests have started for PR 1871. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18272/consoleFull




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by Ishiihara <gi...@git.apache.org>.
Github user Ishiihara commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-52702675
  
    @mateiz This is taken care of by https://github.com/apache/spark/pull/1932, which is already merged into master and 1.1. In that PR, each partition outputs its model as a PrimitiveKeyOpenHashMap. Because the implementation is different, we started another PR.




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51713377
  
    QA tests have started for PR 1871. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18279/consoleFull




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51712724
  
    QA results for PR 1871:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18277/consoleFull




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51708105
  
    QA results for PR 1871:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18272/consoleFull




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51851620
  
    QA results for PR 1871:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18335/consoleFull




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51730712
  
    Just FYI, mutable.HashMap can be pretty inefficient in space usage, compared e.g. to java.util.HashMap or to Spark's AppendOnlyMap. In this case it will depend on how many keys there are and how big the arrays of floats are (if that's the bulk of the data, it won't matter).
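
The point about the float arrays being "the bulk of the data" can be made concrete with a back-of-the-envelope estimate. A rough sketch in plain Scala; `overheadFraction` is a hypothetical helper, and the per-entry overhead passed to it is an assumed figure for illustration, not a measurement of any particular map implementation:

```scala
object MapOverheadEstimate {
  // Fraction of total bytes spent on per-entry map bookkeeping rather than vector data.
  // `overheadBytesPerEntry` is an ASSUMED number supplied by the caller, not a measurement.
  def overheadFraction(vectorSize: Int, overheadBytesPerEntry: Int): Double = {
    val payloadBytes = vectorSize * 4L // each Float occupies 4 bytes
    overheadBytesPerEntry.toDouble / (overheadBytesPerEntry + payloadBytes)
  }
}
```

Under an assumed 64 bytes of bookkeeping per entry, `overheadFraction(100, 64)` comes to roughly 14% of the total, whereas `overheadFraction(4, 64)` is 80%. With 100-element vectors the choice of map matters much less than it would for small values.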




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by Ishiihara <gi...@git.apache.org>.
Github user Ishiihara commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51720995
  
    @mengxr It is about 1-2 minutes slower with vector size = 100, across different numbers of partitions.




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51712712
  
    QA tests have started for PR 1871. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18277/consoleFull




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-52701541
  
    @Ishiihara Why did you close this? Has it been fixed elsewhere now?




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by Ishiihara <gi...@git.apache.org>.
Github user Ishiihara commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51878228
  
    @mateiz The performance of PrimitiveKeyOpenHashMap is on par with mutable.HashMap. In the one-partition case, PrimitiveKeyOpenHashMap is slightly faster than using a big array.




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51720842
  
    @Ishiihara Did you compare the speed?




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by Ishiihara <gi...@git.apache.org>.
Github user Ishiihara commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51724432
  
    @mengxr Some benchmark results:
    Environment: OS X 10.9, 8 GB memory, 2.5 GHz i5 CPU, 4 threads
    startingAlpha = 0.0025
    vectorSize = 100
    Driver memory: 2g
    
    syn0 and syn1 as mutable.HashMap
    
    | numPartition | numIteration | time      | total shuffle write |
    | ------------ | ------------ | --------- | ------------------- |
    | 1            | 1            | 9m30.828s | 42.6 MB             |
    | 4            | 1            | 5m47.192s | 43.6 MB             |
    | 10           | 1            | 6m12.333s | 490.4 MB            |
    | 100          | 1            | 6m24.663s | 2.0 GB              |
    
    syn0 and syn1 as big Array
    
    | numPartition | numIteration | time      | total shuffle write |
    | ------------ | ------------ | --------- | ------------------- |
    | 1            | 1            | 9m1.675s  | 42.6 MB             |
    | 4            | 1            | 5m3.130s  | 43.6 MB             |
    | 10           | 1            | 5m24.283s | 580 MB              |
    | 100          | 1            | 5m52.446s | 4.1 GB              |
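
To compare the two tables directly, the 100-partition rows can be turned into relative differences. A small sketch in plain Scala; the object and helper names are hypothetical, and the inputs are simply the figures reported above:

```scala
object BenchmarkDelta {
  // Parse a duration such as "6m24.663s" into seconds.
  def seconds(t: String): Double = {
    val parts = t.stripSuffix("s").split("m")
    parts(0).toDouble * 60 + parts(1).toDouble
  }

  // Relative change from `base` to `other`, as a fraction of `base`.
  def relativeChange(base: Double, other: Double): Double = (other - base) / base

  def main(args: Array[String]): Unit = {
    // 100 partitions: big Array is the baseline, mutable.HashMap is the variant.
    val slowdown = relativeChange(seconds("5m52.446s"), seconds("6m24.663s"))
    val shuffleChange = relativeChange(4.1, 2.0) // shuffle write in GB
    println(f"time change: ${slowdown * 100}%.1f%%")
    println(f"shuffle write change: ${shuffleChange * 100}%.1f%%")
  }
}
```

For these inputs the HashMap run is roughly 9% slower while writing roughly 51% less shuffle data.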




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51865485
  
    Just wondering, any noticeable perf difference with this change?




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by Ishiihara <gi...@git.apache.org>.
Github user Ishiihara closed the pull request at:

    https://github.com/apache/spark/pull/1871




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51706424
  
    QA tests have started for PR 1871. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18267/consoleFull




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51846871
  
    QA tests have started for PR 1871. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18335/consoleFull




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51707054
  
    QA results for PR 1871:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18267/consoleFull




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51730738
  
    Even better might be Spark's PrimitiveKeyOpenHashMap here. Again, if there are lots of keys.
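
For readers unfamiliar with the idea, a primitive-key open hash map stores keys in a flat `Array[Int]` and resolves collisions by probing, so there are no boxed keys and no per-entry node objects. The following is an illustrative sketch only and is not Spark's PrimitiveKeyOpenHashMap; the class name `IntKeyVectorMap`, the linear-probing scheme, and the 0.7 load factor are all assumptions chosen for brevity:

```scala
import scala.util.hashing.byteswap32

// Illustrative open-addressing map from Int keys to float vectors.
final class IntKeyVectorMap(initialCapacity: Int = 16) {
  require(initialCapacity > 0 && (initialCapacity & (initialCapacity - 1)) == 0,
    "capacity must be a power of two")

  private var keys = new Array[Int](initialCapacity)
  private var values = new Array[Array[Float]](initialCapacity) // null marks an empty slot
  private var count = 0

  def size: Int = count

  // Linear probing: start at the hashed slot, walk until the key or an empty slot is found.
  private def slotFor(key: Int, ks: Array[Int], vs: Array[Array[Float]]): Int = {
    val mask = ks.length - 1
    var pos = byteswap32(key) & mask
    while (vs(pos) != null && ks(pos) != key) pos = (pos + 1) & mask
    pos
  }

  def get(key: Int): Option[Array[Float]] = Option(values(slotFor(key, keys, values)))

  def update(key: Int, value: Array[Float]): Unit = {
    require(value != null, "null is reserved as the empty-slot marker")
    val pos = slotFor(key, keys, values)
    if (values(pos) == null) count += 1
    keys(pos) = key
    values(pos) = value
    if (count * 10 > keys.length * 7) grow() // keep the load factor at or below 0.7
  }

  // Double the capacity and re-insert every occupied slot.
  private def grow(): Unit = {
    val newKeys = new Array[Int](keys.length * 2)
    val newValues = new Array[Array[Float]](values.length * 2)
    var i = 0
    while (i < keys.length) {
      if (values(i) != null) {
        val pos = slotFor(keys(i), newKeys, newValues)
        newKeys(pos) = keys(i)
        newValues(pos) = values(i)
      }
      i += 1
    }
    keys = newKeys
    values = newValues
  }
}
```

The saving over a node-based map grows with the number of keys, which matches the "if there are lots of keys" caveat above.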




[GitHub] spark pull request: [SPARK-2907] [MLlib] Use mutable.HashMap to re...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1871#issuecomment-51714618
  
    QA results for PR 1871:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18279/consoleFull

