Posted to user@spark.apache.org by Carsten Schnober <sc...@ukp.informatik.tu-darmstadt.de> on 2015/07/08 09:44:41 UTC

Word2Vec distributed?

Hi,
I've been experimenting with the Spark Word2Vec implementation in the
MLlib package.
It seems to me that only the preparatory steps are actually performed in
a distributed way, i.e. stages 0-2 that prepare the data. In stage 3
(mapPartitionsWithIndex at Word2Vec.scala:312), only one node seems to
be working, using one CPU.
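
For reference, here is a minimal sketch of the kind of setup described
above. It is not the original job: the object name, the tiny in-memory
corpus, the local[4] master and the partition count of 4 are all
assumptions made for illustration. As far as I can tell from the MLlib
API docs, Word2Vec exposes setNumPartitions and documents its default
as 1, which would be consistent with a single task doing all the work
in the training stage.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

// Hypothetical driver program; names and data are illustrative only.
object Word2VecPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Word2VecPartitionsSketch")
      .setMaster("local[4]") // assumption: local mode with 4 worker threads
    val sc = new SparkContext(conf)
    try {
      // Toy corpus, repeated so that every word occurs many times.
      val sentences = Seq(
        "spark runs word2vec on a cluster",
        "word2vec learns vectors for words"
      )
      val corpus = sc
        .parallelize(Seq.fill(200)(sentences).flatten, 4)
        .map(_.split(" ").toSeq) // fit() expects an RDD of token sequences

      val model = new Word2Vec()
        .setMinCount(1)      // keep all words of the toy vocabulary
        .setNumPartitions(4) // documented default is 1; 4 is an arbitrary choice
        .fit(corpus)

      // Print the nearest neighbours of one word as a quick sanity check.
      model.findSynonyms("spark", 3).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The API docs recommend keeping the number of partitions small for
accuracy, so raising it is presumably a trade-off between training
speed and vector quality rather than a free speed-up.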

I suppose this is related to the discussion in [1], essentially stating
that the original algorithm allows for multi-threading, but not for
distributed computation due to frequent internal communication.

As far as I understand, this issue has not been fully resolved in Spark,
has it? I just want to check whether I am interpreting the current
situation correctly.

Thanks!
Carsten

[1] https://issues.apache.org/jira/browse/SPARK-2510

-- 
Carsten Schnober
Doctoral Researcher
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
schnober@ukp.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources
(AIPHES): www.aiphes.tu-darmstadt.de
PhD program: Knowledge Discovery in Scientific Literature (KDSL)
www.kdsl.tu-darmstadt.de
