Posted to reviews@spark.apache.org by Ishiihara <gi...@git.apache.org> on 2014/09/11 11:59:39 UTC

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

GitHub user Ishiihara opened a pull request:

    https://github.com/apache/spark/pull/2356

    [SPARK-3486][MLlib][PySpark] PySpark support for Word2Vec

    @mengxr
    Added PySpark support for Word2Vec
    Change list
    (1) PySpark support for Word2Vec
    (2) SerDe support for string sequences on both the Python and JVM sides
    (3) Test for SerDe of string sequences on the JVM side
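
As a rough illustration of item (2), a sequence of strings can be round-tripped through Python's standard `pickle` module, which is what `PickleSerializer` builds on in PySpark. This is only a sketch: the JVM-side `SerDe` is not exercised, and the variable names and the choice of protocol 2 are assumptions made for the example.

```python
import pickle

# A tokenized sentence, the kind of value passed between Python and the JVM.
sentence = ["a", "b", "a", "c"]

# Serialize to bytes (protocol 2 is assumed here for illustration), then restore.
payload = pickle.dumps(sentence, protocol=2)
restored = pickle.loads(payload)

print(restored == sentence)  # True
```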

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Ishiihara/spark Word2Vec-python

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2356.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2356
    
----
commit c867fdfdf623c2e9905a376d35987dbe2914e329
Author: Liquan Pei <li...@gmail.com>
Date:   2014-09-10T08:51:44Z

    add Word2Vec to pyspark

commit 0ad3ac1efed6258607a79c0d45345d70a17dee47
Author: Liquan Pei <li...@gmail.com>
Date:   2014-09-10T10:02:56Z

    minor fix

commit 48d5e721a58924f33ebef31b9e67454f45480d5c
Author: Liquan Pei <li...@gmail.com>
Date:   2014-09-11T09:50:30Z

    Functionality improvement

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18122597
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,124 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +
    +from pyspark import PickleSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    --- End diff --
    
    Done



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18486387
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,192 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        """
    +        :param word: a word
    +        :return: vector representation of word
    +
    +        Note: local use only
    +        TODO: make transform usable in RDD operations from python side
    +        """
    +        result = self._java_model.transform(word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        """
    +        :param x: a word or a vector representation of word
    +        :param num: number of synonyms to find
    +        :return: array of (word, cosineSimilarity)
    +
    +        Note: local use only
    +        TODO: make findSynonyms usable in RDD operations from python side
    --- End diff --
    
    ditto: move TODO to implementation.
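
One way to read this suggestion is to keep the user-facing docstring free of internal notes and carry the TODO as a comment in the method body. The sketch below assumes that reading; `Word2VecModelSketch` and `_FakeJavaModel` are hypothetical stand-ins so the example runs without a JVM, and they are not the PR's code.

```python
class _FakeJavaModel(object):
    """Stub for the Py4J model handle (hypothetical, for this sketch only)."""

    def findSynonyms(self, x, num):
        return [("b", 0.9), ("c", 0.5)][:num]


class Word2VecModelSketch(object):
    """Hypothetical stand-in for Word2VecModel."""

    def __init__(self, java_model):
        self._java_model = java_model

    def findSynonyms(self, x, num):
        """
        :param x: a word or a vector representation of word
        :param num: number of synonyms to find
        :return: array of (word, cosineSimilarity)

        Note: local use only
        """
        # TODO: make findSynonyms usable in RDD operations from python side
        return self._java_model.findSynonyms(x, num)


model = Word2VecModelSketch(_FakeJavaModel())
print(model.findSynonyms("a", 1))  # [('b', 0.9)]
```

With this layout the TODO stays visible to maintainers but no longer appears in generated API documentation.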



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18121798
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,124 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +
    +from pyspark import PickleSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        result = self._java_model.transform(word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        jlist = self._java_model.findSynonyms(x, num)
    +        words, similarity = PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(jlist)))
    +        return zip(words, similarity)
    +
    +
    +class Word2Vec(object):
    +    """
    +    Word2Vec creates vector representation of words in a text corpus.
    +    The algorithm first constructs a vocabulary from the corpus
    +    and then learns vector representation of words in the vocabulary.
    +    The vector representation can be used as features in
    +    natural language processing and machine learning algorithms.
    +
    +    We used skip-gram model in our implementation and hierarchical softmax
    +    method to train the model. The variable names in the implementation
    +    matches the original C implementation.
    +    For original C implementation, see https://code.google.com/p/word2vec/
    +    For research papers, see
    +    Efficient Estimation of Word Representations in Vector Space
    +    and
    +    Distributed Representations of Words and Phrases and their Compositionality.
    --- End diff --
    
    insert a new line



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58271779
  
    retest this please



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58281398
  
    LGTM. Merged into master. Thanks! I created a JIRA as a reminder to add a Python code example to the user guide: https://issues.apache.org/jira/browse/SPARK-3838 . It is not a high-priority task; it is there in case we forget before the 1.2 release.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18184057
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,54 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModelWrapper instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]],
    --- End diff --
    
    4-space indentation



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18063655
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -40,11 +40,12 @@ import org.apache.spark.mllib.tree.impurity._
     import org.apache.spark.mllib.tree.model.DecisionTreeModel
     import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
     import org.apache.spark.mllib.stat.correlation.CorrelationNames
    +import org.apache.spark.mllib.feature.Word2Vec
    --- End diff --
    
    order imports alphabetically (https://plugins.jetbrains.com/plugin/7350)



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18120761
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModel instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]]
    +    ): Word2VecModel = {
    +    val data = dataJRDD.rdd.map(_.toArray(new Array[String](0)).toSeq).cache()
    +    val word2vec = new Word2Vec()
    +    val model = word2vec.fit(data)
    +    model
    +  }
    +
    +  /**
    +   * Java stub for Python mllib Word2VecModel transform
    +   * @param model Word2VecModel instance
    +   * @param word a word
    +   * @return serialized vector representation of word
    +   */
    +  def Word2VecModelTransform(
    +    model: Word2VecModel,
    +    word: String
    +    ): Vector = {
    +    model.transform(word)
    +  }
    +
    +  /**
    +   * Java stub for Python mllib Word2VecModel findSynonyms
    +   * @param model Word2VecModel instance
    +   * @param word a word
    +   * @param num number of synonyms to find
    +   * @return a java LinkedList containing serialized version of
    +   * synonyms and similarities
    +   */
    +  def Word2VecModelSynonyms(
    --- End diff --
    
    @mengxr Thanks for pointing this out. I refactored the code to introduce a model wrapper class, which greatly simplifies the implementation on both the Python and JVM sides.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18486706
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,58 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD input JavaRDD
    +   * @param vectorSize size of vector
    +   * @param learningRate initial learning rate
    +   * @param numPartitions number of partitions
    +   * @param numIterations number of iterations
    +   * @param seed initial seed for random generator
    +   * @return A handle to java Word2VecModelWrapper instance at python side
    +   */
    +  def trainWord2Vec(
    +      dataJRDD: JavaRDD[java.util.ArrayList[String]],
    +      vectorSize: Int,
    +      learningRate: Double,
    +      numPartitions: Int,
    +      numIterations: Int,
    +      seed: Long): Word2VecModelWrapper = {
    +    val data = dataJRDD.rdd.cache()
    +    val word2vec = new Word2Vec()
    +        .setVectorSize(vectorSize)
    +        .setLearningRate(learningRate)
    +        .setNumPartitions(numPartitions)
    +        .setNumIterations(numIterations)
    +        .setSeed(seed)
    +    val model = word2vec.fit(data)
    +    new Word2VecModelWrapper(model)
    --- End diff --
    
    unpersist `data` after training explicitly because the user won't have access to it.
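
The pattern being asked for can be sketched as "cache, train, always release". The example below is written in Python rather than Scala and uses a hypothetical `FakeRDD` that only mimics the `cache()`/`unpersist()` calls of a Spark RDD, so it runs without Spark; `train_with_cleanup` is likewise an illustrative name, not part of the PR.

```python
class FakeRDD(object):
    """Hypothetical stand-in exposing cache()/unpersist() like a Spark RDD."""

    def __init__(self, rows):
        self.rows = rows
        self.is_cached = False

    def cache(self):
        self.is_cached = True
        return self

    def unpersist(self):
        self.is_cached = False
        return self


def train_with_cleanup(rdd, train):
    """Cache the input for training, then release it even if training fails."""
    data = rdd.cache()
    try:
        return train(data)
    finally:
        # The caller never receives `data`, so it must be released here.
        data.unpersist()


rdd = FakeRDD([["a", "b"], ["a", "c"]])
vocab_size = train_with_cleanup(
    rdd, lambda d: len({w for row in d.rows for w in row}))
print(vocab_size, rdd.is_cached)  # 3 False
```

Using `try`/`finally` is a design choice in this sketch: it keeps the cached data from leaking when training raises.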



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-57047728
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20915/



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18486524
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,192 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        """
    +        :param word: a word
    +        :return: vector representation of word
    +
    +        Note: local use only
    +        TODO: make transform usable in RDD operations from python side
    +        """
    +        result = self._java_model.transform(word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        """
    +        :param x: a word or a vector representation of word
    +        :param num: number of synonyms to find
    +        :return: array of (word, cosineSimilarity)
    +
    +        Note: local use only
    +        TODO: make findSynonyms usable in RDD operations from python side
    +        """
    +        ser = PickleSerializer()
    +        if type(x) == str:
    +            jlist = self._java_model.findSynonyms(x, num)
    +        else:
    +            bytes = bytearray(ser.dumps(_convert_to_vector(x)))
    +            vec = self._sc._jvm.SerDe.loads(bytes)
    +            jlist = self._java_model.findSynonyms(vec, num)
    +        words, similarity = ser.loads(str(self._sc._jvm.SerDe.dumps(jlist)))
    +        return zip(words, similarity)
    +
    +
    +class Word2Vec(object):
    +    """
    +    Word2Vec creates vector representation of words in a text corpus.
    +    The algorithm first constructs a vocabulary from the corpus
    +    and then learns vector representation of words in the vocabulary.
    +    The vector representation can be used as features in
    +    natural language processing and machine learning algorithms.
    +
    +    We used skip-gram model in our implementation and hierarchical softmax
    +    method to train the model. The variable names in the implementation
    +    matches the original C implementation.
    +    For original C implementation, see https://code.google.com/p/word2vec/
    +    For research papers, see
    +    Efficient Estimation of Word Representations in Vector Space
    +    and
    +    Distributed Representations of Words and Phrases and their Compositionality.
    +
    +    >>> sentence = "a b " * 100 + "a c " * 10
    +    >>> localDoc = [sentence, sentence]
    +    >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
    +    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
    +    >>> syms = model.findSynonyms("a", 2)
    +    >>> str(syms[0][0])
    +    'b'
    +    >>> str(syms[1][0])
    +    'c'
    +    >>> len(syms)
    +    2
    +    >>> vec = model.transform("a")
    +    >>> len(vec)
    +    10
    +    >>> syms = model.findSynonyms(vec, 2)
    +    >>> str(syms[0][0])
    +    'b'
    +    >>> str(syms[1][0])
    +    'c'
    +    >>> len(syms)
    +    2
    +    """
    +    def __init__(self):
    +        """
    +        Construct Word2Vec instance
    +        """
    +        self.vectorSize = 100
    +        self.learningRate = 0.025
    +        self.numPartitions = 1
    +        self.numIterations = 1
    +        self.seed = 42L
    +
    +    def setVectorSize(self, vectorSize):
    +        """
    +        Sets vector size (default: 100).
    +        """
    +        self.vectorSize = vectorSize
    +        return self
    +
    +    def setLearningRate(self, learningRate):
    +        """
    +        Sets initial learning rate (default: 0.025).
    +        """
    +        self.learningRate = learningRate
    +        return self
    +
    +    def setNumPartitions(self, numPartitions):
    +        """
    +        Sets number of partitions (default: 1). Use a small number for accuracy.
    +        """
    +        self.numPartitions = numPartitions
    +        return self
    +
    +    def setNumIterations(self, numIterations):
    +        """
    +        Sets number of iterations (default: 1), which should be smaller than or equal to number of
    +        partitions.
    +        """
    +        self.numIterations = numIterations
    +        return self
    +
    +    def setSeed(self, seed):
    +        """
    +        Sets random seed (default: a random long integer).
    +        """
    +        self.seed = seed
    +        return self
    +
    +    def fit(self, data):
    +        """
    +        Computes the vector representation of each word in vocabulary.
    +
    +        :param data: training data.
    +        :return: python Word2VecModel instance
    +        """
    +        sc = data.context
    +        ser = PickleSerializer()
    +        vectorSize = self.vectorSize
    +        learningRate = self.learningRate
    +        numPartitions = self.numPartitions
    +        numIterations = self.numIterations
    +        seed = self.seed
    +
    +        # cached = data._reserialize(AutoBatchedSerializer(ser)).cache()
    +        model = sc._jvm.PythonMLLibAPI().trainWord2Vec(
    +            data._to_java_object_rdd(), vectorSize,
    +            learningRate, numPartitions, numIterations, seed)
    +        return Word2VecModel(sc, model)
    +
    +
    +def _test():
    --- End diff --
    
    After we rename `Word2Vec.py` to `feature.py`, please add it to `python/run-tests.py` so it gets tested automatically.
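
For context, the docstring examples in the diff only act as tests if something runs them through `doctest`. The sketch below shows that mechanism without Spark: `tokenize` is a hypothetical helper, and the explicit finder/runner is used so the example does not depend on how the module is loaded. In a real module, `doctest.testmod()` is the usual entry point.

```python
import doctest


def tokenize(line):
    """
    Split a line into words (hypothetical helper for this sketch).

    >>> tokenize("a b a c")
    ['a', 'b', 'a', 'c']
    """
    return line.split(" ")


def _test():
    # Collect the docstring examples of `tokenize` and execute them.
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    for test in finder.find(tokenize, name="tokenize", module=False,
                            globs={"tokenize": tokenize}):
        runner.run(test)
    results = runner.summarize(verbose=False)
    return results.failed, results.attempted


failed, attempted = _test()
print(failed, attempted)  # 0 1
```

A test script can then call `_test()` and fail the build when the first value is non-zero.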



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-55308654
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20163/consoleFull) for PR 2356 at commit [`ca1e5ff`](https://github.com/apache/spark/commit/ca1e5ffe60e51d4e6435a22d086689a00be38c1a).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Word2VecModel(object):`
      * `class Word2Vec(object):`




[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-55299387
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20163/consoleFull) for PR 2356 at commit [`ca1e5ff`](https://github.com/apache/spark/commit/ca1e5ffe60e51d4e6435a22d086689a00be38c1a).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18117490
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModel instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]]
    +    ): Word2VecModel = {
    +    val data = dataJRDD.rdd.map(_.toArray(new Array[String](0)).toSeq).cache()
    --- End diff --
    
    @mengxr @davies Thank you for pointing this out. I am inclined to cache the words RDD inside word2vec.fit, because it is used twice: first when calling learnVocab(words) and again when creating the newSentences RDD. This will not increase memory usage, as the computations on the words RDD and the newSentences RDD do not overlap. Thus, we can unpersist the words RDD before caching newSentences.
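
The cache lifecycle described here can be sketched in plain Python. This is only an illustration under stated assumptions: `TrackedDataset` is a hypothetical stand-in that records cache state, not a Spark RDD, and `fit_sketch` is not the MLlib implementation. The stand-in is eager, whereas real RDDs are lazy, so in Spark the derived data would have to be materialized before the input is released.

```python
class TrackedDataset(object):
    """Hypothetical stand-in for an RDD; it only records whether it is cached."""
    cached_names = set()  # names of the datasets currently held in the cache

    def __init__(self, name, rows):
        self.name = name
        self.rows = list(rows)

    def cache(self):
        TrackedDataset.cached_names.add(self.name)
        return self

    def unpersist(self):
        TrackedDataset.cached_names.discard(self.name)
        return self


def fit_sketch(words):
    """Use `words` twice while it is cached, then release it before caching the result."""
    words.cache()
    peak = len(TrackedDataset.cached_names)

    # First use: build the vocabulary (word -> index).
    vocab = {}
    for sentence in words.rows:
        for word in sentence:
            vocab.setdefault(word, len(vocab))

    # Second use: encode every sentence as a list of vocabulary indices.
    encoded = [[vocab[word] for word in sentence] for sentence in words.rows]

    # Release the input before caching the derived dataset.
    words.unpersist()
    new_sentences = TrackedDataset("newSentences", encoded).cache()
    peak = max(peak, len(TrackedDataset.cached_names))
    return vocab, new_sentences, peak


vocab, new_sentences, peak = fit_sketch(TrackedDataset("words", [["a", "b"], ["a", "c"]]))
```

In this sketch `peak` never exceeds one cached dataset, which is the property the comment relies on.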



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18184093
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,151 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +from numpy import random
    +
    +from sys import maxint
    +
    +from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        """
    +        local use only
    +        TODO: make transform usable in RDD operations from python side
    +        """
    +        result = self._java_model.transform(word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        """
    +        local use only
    --- End diff --
    
    Ditto. For Python, it is important to describe the return type.
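
One way to address this is a docstring that states the return type explicitly. The following is a minimal sketch, not the code that was merged: `StubJavaModel` is a hypothetical stand-in for the Py4J model handle and returns canned data, and the wrapper skips the SerDe round trip shown in the diff.

```python
class StubJavaModel(object):
    """Hypothetical stand-in for the Java model handle; returns canned data."""

    def findSynonyms(self, word, num):
        candidates = [("b", 0.9), ("c", 0.4), ("d", 0.1)]
        top = candidates[:num]
        # Mirror the diff: words and similarities come back as two parallel lists.
        return [w for w, _ in top], [s for _, s in top]


class Word2VecModelSketch(object):
    """Trimmed-down model wrapper used only to show the docstring style."""

    def __init__(self, java_model):
        self._java_model = java_model

    def findSynonyms(self, word, num):
        """
        Find synonyms of a word.

        Note: local use only.

        :param word: a word (str)
        :param num: number of synonyms to return (int)
        :return: list of (word, similarity) tuples, at most `num` long
        """
        words, similarity = self._java_model.findSynonyms(word, num)
        return list(zip(words, similarity))


model = Word2VecModelSketch(StubJavaModel())
syms = model.findSynonyms("a", 2)
```

With the return value documented as a list of tuples, a caller knows that `syms[0][0]` is a word without reading the implementation.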



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-56878764
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20816/consoleFull) for   PR 2356 at commit [`78bbb53`](https://github.com/apache/spark/commit/78bbb533be9f9a11cb81fe4278e0833ade7fe833).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18184098
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,151 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +from numpy import random
    +
    +from sys import maxint
    +
    +from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        """
    +        local use only
    +        TODO: make transform usable in RDD operations from python side
    +        """
    +        result = self._java_model.transform(word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        """
    +        local use only
    +        TODO: make findSynonyms usable in RDD operations from python side
    +        """
    +        jlist = self._java_model.findSynonyms(x, num)
    +        words, similarity = PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(jlist)))
    +        return zip(words, similarity)
    +
    +
    +class Word2Vec(object):
    +    """
    +    Word2Vec creates vector representation of words in a text corpus.
    +    The algorithm first constructs a vocabulary from the corpus
    +    and then learns vector representation of words in the vocabulary.
    +    The vector representation can be used as features in
    +    natural language processing and machine learning algorithms.
    +
    +    We used skip-gram model in our implementation and hierarchical softmax
    +    method to train the model. The variable names in the implementation
    +    matches the original C implementation.
    +    For original C implementation, see https://code.google.com/p/word2vec/
    +    For research papers, see
    +    Efficient Estimation of Word Representations in Vector Space
    +    and
    +    Distributed Representations of Words and Phrases and their Compositionality.
    +
    +    >>> sentence = "a b " * 100 + "a c " * 10
    +    >>> localDoc = [sentence, sentence]
    +    >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
    +    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
    +    >>> syms = model.findSynonyms("a", 2)
    +    >>> str(syms[0][0])
    +    'b'
    +    >>> str(syms[1][0])
    +    'c'
    +    >>> len(syms)
    +    2
    +    >>> vec = model.transform("a")
    +    >>> len(vec)
    +    10
    +    """
    +    def __init__(self):
    --- End diff --
    
    missing doc for all methods in `Word2Vec`
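
As an illustration of what documented methods could look like, here is a minimal sketch. `DocumentedWord2Vec` is a hypothetical, trimmed-down class; the default values are the ones that appear in the quoted diff elsewhere in this thread, and the docstring wording is an assumption rather than the text that was eventually merged.

```python
class DocumentedWord2Vec(object):
    """Hypothetical, trimmed-down Word2Vec that shows documented setters only."""

    def __init__(self):
        """Construct Word2Vec with vectorSize=100 and an initial learning rate of 0.025."""
        self.vectorSize = 100
        self.startingAlpha = 0.025

    def setVectorSize(self, vectorSize):
        """
        Set the vector size (default: 100).

        :param vectorSize: dimension of each word vector (int)
        :return: this instance, so that setter calls can be chained
        """
        self.vectorSize = vectorSize
        return self

    def setLearningRate(self, learningRate):
        """
        Set the initial learning rate (default: 0.025).

        :param learningRate: starting learning rate (float)
        :return: this instance, so that setter calls can be chained
        """
        self.startingAlpha = learningRate
        return self


w2v = DocumentedWord2Vec().setVectorSize(10).setLearningRate(0.05)
```

Documenting the return value matters here because chaining only works if every setter returns the instance.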



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18117584
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModel instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]]
    +    ): Word2VecModel = {
    +    val data = dataJRDD.rdd.map(_.toArray(new Array[String](0)).toSeq).cache()
    --- End diff --
    
    @mengxr @davies Also, this method saves roughly 40s on the text8 data in Scala, and it eliminates the need to cache on the Python side.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18061848
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModel instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]]
    +    ): Word2VecModel = {
    +    val data = dataJRDD.rdd.map(_.toArray(new Array[String](0)).toSeq).cache()
    --- End diff --
    
    Maybe it's better to cache the serialized data from Python; that will reduce GC pressure (and use less memory).



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-55253328
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20153/consoleFull) for   PR 2356 at commit [`68e7276`](https://github.com/apache/spark/commit/68e7276896eeeb546f6f212f5a2f8ae5470cf0b5).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58270752
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21408/consoleFull) for   PR 2356 at commit [`b13a0b9`](https://github.com/apache/spark/commit/b13a0b9d47cb3a6604ece9773bcbbd2877db6299).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-56420682
  
    We need to modify the implementation to use the new SerDe mechanism. Working on that now. 



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-57046286
  
    @mengxr Repartition is very slow when caching on the Python side. It takes 9 minutes to do the repartition, whereas caching in Java takes only 5s.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18063706
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,123 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +
    +from functools import wraps
    +
    +from pyspark import PickleSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        pythonAPI = self._sc._jvm.PythonMLLibAPI()
    +        result = pythonAPI.Word2VecModelTransform(self._java_model, word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        SerDe = self._sc._jvm.SerDe
    +        ser = PickleSerializer()
    +        pythonAPI = self._sc._jvm.PythonMLLibAPI()
    +        if type(x) == str:
    +            jlist = pythonAPI.Word2VecModelSynonyms(self._java_model, x, num)
    +        else:
    +            bytes = bytearray(ser.dumps(_convert_to_vector(x)))
    +            vec = self._sc._jvm.SerDe.loads(bytes)
    +            jlist = pythonAPI.Word2VecModelSynonyms(self._java_model, vec, num)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(jlist)))
    +
    +
    +class Word2Vec(object):
    +    """
    +    Word2Vec creates vector representation of words in a text corpus.
    +    The algorithm first constructs a vocabulary from the corpus
    +    and then learns vector representation of words in the vocabulary.
    +    The vector representation can be used as features in
    +    natural language processing and machine learning algorithms.
    +
    +    We used skip-gram model in our implementation and hierarchical softmax
    +    method to train the model. The variable names in the implementation
    +    matches the original C implementation.
    +    For original C implementation, see https://code.google.com/p/word2vec/
    +    For research papers, see
    +    Efficient Estimation of Word Representations in Vector Space
    +    and
    +    Distributed Representations of Words and Phrases and their Compositionality.
    +    """
    +    def __init__(self):
    +        self.vectorSize = 100
    +        self.startingAlpha = 0.025
    +        self.numPartitions = 1
    +        self.numIterations = 1
    +
    +    def setVectorSize(self, vectorSize):
    +        self.vectorSize = vectorSize
    +        return self
    +
    +    def setLearningRate(self, learningRate):
    +        self.startingAlpha = learningRate
    +        return self
    +
    +    def setNumPartitions(self, numPartitions):
    +        self.numPartitions = numPartitions
    +        return self
    +
    +    def setNumIterations(self, numIterations):
    +        self.numIterations = numIterations
    +        return self
    +
    +    def fit(self, data):
    +        """
    +        :param data: Input RDD
    +        """
    +        sc = data.context
    +        model = sc._jvm.PythonMLLibAPI().trainWord2Vec(data._to_java_object_rdd())
    --- End diff --
    
    parameters are not applied
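
The fix implied here is to forward the configured values when `fit` calls into the JVM. The following is a minimal sketch under stated assumptions: `StubPythonMLLibAPI` is a hypothetical stand-in for `sc._jvm.PythonMLLibAPI()`, and the extended `trainWord2Vec` signature is assumed for illustration, since the Scala stub quoted in this thread takes only the data RDD.

```python
class StubPythonMLLibAPI(object):
    """Hypothetical stand-in for sc._jvm.PythonMLLibAPI(); it only records its arguments."""

    def trainWord2Vec(self, data, vectorSize, learningRate, numPartitions, numIterations):
        # Assumed signature: the real stub would need matching extra parameters.
        self.received = (vectorSize, learningRate, numPartitions, numIterations)
        return "model-handle"


class Word2VecSketch(object):
    """Trimmed-down Word2Vec whose fit() forwards every configured parameter."""

    def __init__(self):
        self.vectorSize = 100
        self.startingAlpha = 0.025
        self.numPartitions = 1
        self.numIterations = 1

    def setVectorSize(self, vectorSize):
        self.vectorSize = vectorSize
        return self

    def setNumIterations(self, numIterations):
        self.numIterations = numIterations
        return self

    def fit(self, data, api):
        # Pass the settings along with the data so that the setters take effect.
        return api.trainWord2Vec(data, self.vectorSize, self.startingAlpha,
                                 self.numPartitions, self.numIterations)


api = StubPythonMLLibAPI()
handle = Word2VecSketch().setVectorSize(10).setNumIterations(3).fit([["a", "b"]], api)
```

Here `api.received` holds the values set through the builder calls plus the untouched defaults, which is what the review comment asks for.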



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58254457
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21400/



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-57047727
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20915/consoleFull) for   PR 2356 at commit [`b7447eb`](https://github.com/apache/spark/commit/b7447eb1bdba4244ac5457489bbccaa118890f74).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Word2VecModel(object):`
      * `class Word2Vec(object):`




[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58278852
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21411/



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58278879
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21412/



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18061937
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,123 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +
    +from functools import wraps
    --- End diff --
    
    This is not used.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-57039639
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20894/consoleFull) for   PR 2356 at commit [`b9a7383`](https://github.com/apache/spark/commit/b9a73831c15b88955d69ee5ea359117d1441b298).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class IDF(val minDocFreq: Int) `
      * `  class DocumentFrequencyAggregator(val minDocFreq: Int) extends Serializable `
      * `class Word2VecModel(object):`
      * `class Word2Vec(object):`




[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-56878776
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20816/



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-55684719
  
    @davies Thanks for working on MLlib's SerDe! It definitely simplifies future Python API implementations. We will wait for #2378.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18184091
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,124 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +
    +from pyspark import PickleSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    --- End diff --
    
    Missing the main documentation. The doc only says "local use only" but not what this function does.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-57046312
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20915/consoleFull) for   PR 2356 at commit [`b7447eb`](https://github.com/apache/spark/commit/b7447eb1bdba4244ac5457489bbccaa118890f74).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18063661
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModel instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]]
    +    ): Word2VecModel = {
    +    val data = dataJRDD.rdd.map(_.toArray(new Array[String](0)).toSeq).cache()
    +    val word2vec = new Word2Vec()
    +    val model = word2vec.fit(data)
    +    model
    +  }
    +
    +  /**
    +   * Java stub for Python mllib Word2VecModel transform
    +   * @param model Word2VecModel instance
    +   * @param word a word
    +   * @return serialized vector representation of word
    +   */
    +  def Word2VecModelTransform(
    +    model: Word2VecModel,
    --- End diff --
    
    Ditto: put this on a single line.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18063701
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,123 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +
    +from functools import wraps
    +
    +from pyspark import PickleSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        pythonAPI = self._sc._jvm.PythonMLLibAPI()
    +        result = pythonAPI.Word2VecModelTransform(self._java_model, word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        SerDe = self._sc._jvm.SerDe
    +        ser = PickleSerializer()
    +        pythonAPI = self._sc._jvm.PythonMLLibAPI()
    +        if type(x) == str:
    +            jlist = pythonAPI.Word2VecModelSynonyms(self._java_model, x, num)
    +        else:
    +            bytes = bytearray(ser.dumps(_convert_to_vector(x)))
    +            vec = self._sc._jvm.SerDe.loads(bytes)
    +            jlist = pythonAPI.Word2VecModelSynonyms(self._java_model, vec, num)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(jlist)))
    +
    +
    +class Word2Vec(object):
    +    """
    +    Word2Vec creates vector representation of words in a text corpus.
    +    The algorithm first constructs a vocabulary from the corpus
    +    and then learns vector representation of words in the vocabulary.
    +    The vector representation can be used as features in
    +    natural language processing and machine learning algorithms.
    +
    +    We used skip-gram model in our implementation and hierarchical softmax
    +    method to train the model. The variable names in the implementation
    +    matches the original C implementation.
    +    For original C implementation, see https://code.google.com/p/word2vec/
    +    For research papers, see
    +    Efficient Estimation of Word Representations in Vector Space
    +    and
    +    Distributed Representations of Words and Phrases and their Compositionality.
    +    """
    +    def __init__(self):
    +        self.vectorSize = 100
    +        self.startingAlpha = 0.025
    +        self.numPartitions = 1
    +        self.numIterations = 1
    +
    +    def setVectorSize(self, vectorSize):
    +        self.vectorSize = vectorSize
    +        return self
    +
    +    def setLearningRate(self, learningRate):
    +        self.startingAlpha = learningRate
    +        return self
    +
    +    def setNumPartitions(self, numPartitions):
    +        self.numPartitions = numPartitions
    +        return self
    +
    +    def setNumIterations(self, numIterations):
    +        self.numIterations = numIterations
    +        return self
    +
    +    def fit(self, data):
    +        """
    +        :param data: Input RDD
    +        """
    --- End diff --
    
    need some simple test code
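
One common way to supply simple test code in a Python module is a doctest in the docstring. The following is only a minimal sketch: `cosine_similarity` is a hypothetical helper, not part of this pull request, chosen so the example runs without Spark.

```python
import doctest
import math


def cosine_similarity(u, v):
    """
    Cosine similarity of two equal-length, non-zero vectors.

    >>> cosine_similarity([1.0, 0.0], [1.0, 0.0])
    1.0
    >>> cosine_similarity([1.0, 0.0], [0.0, 1.0])
    0.0
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Parse the docstring examples and run them, collecting (failed, attempted).
parser = doctest.DocTestParser()
doc_test = parser.get_doctest(
    cosine_similarity.__doc__,
    {"cosine_similarity": cosine_similarity},
    "cosine_similarity",
    None,
    0,
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(doc_test)
```

Keeping the examples in the docstring means they serve as both usage documentation and a regression check.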



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18063664
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModel instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]]
    +    ): Word2VecModel = {
    +    val data = dataJRDD.rdd.map(_.toArray(new Array[String](0)).toSeq).cache()
    +    val word2vec = new Word2Vec()
    +    val model = word2vec.fit(data)
    +    model
    +  }
    +
    +  /**
    +   * Java stub for Python mllib Word2VecModel transform
    +   * @param model Word2VecModel instance
    +   * @param word a word
    +   * @return serialized vector representation of word
    +   */
    +  def Word2VecModelTransform(
    +    model: Word2VecModel,
    +    word: String
    +    ): Vector = {
    +    model.transform(word)
    +  }
    +
    +  /**
    +   * Java stub for Python mllib Word2VecModel findSynonyms
    +   * @param model Word2VecModel instance
    +   * @param word a word
    +   * @param num number of synonyms to find
    +   * @return a java LinkedList containing serialized version of
    +   * synonyms and similarities
    +   */
    +  def Word2VecModelSynonyms(
    --- End diff --
    
    Shall we define a `Word2VecModelWrapper` that implements `transform` and `findSynonyms`? Let `trainWord2Vec` return the wrapper, and then you can call `.transform` and `.findSynonyms` on it directly from the Python side.
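
A rough sketch of the wrapper idea, written in Python rather than Scala so that it runs without Spark. `FakeModel` and `train_word2vec` are hypothetical stand-ins, and the dot-product ranking is only a placeholder for the real similarity search; none of this is the code that was merged.

```python
class FakeModel(object):
    """Hypothetical stand-in for the JVM-side Word2VecModel."""

    def __init__(self, vectors):
        self._vectors = vectors  # dict: word -> list of floats

    def transform(self, word):
        return self._vectors[word]

    def findSynonyms(self, word, num):
        # Rank the other words by dot product with the query word's vector.
        query = self._vectors[word]
        scored = [
            (other, sum(a * b for a, b in zip(query, vec)))
            for other, vec in self._vectors.items()
            if other != word
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:num]


class Word2VecModelWrapper(object):
    """Exposes only the calls the Python side needs and delegates each one."""

    def __init__(self, model):
        self._model = model

    def transform(self, word):
        return self._model.transform(word)

    def findSynonyms(self, word, num):
        return self._model.findSynonyms(word, num)


def train_word2vec(vectors):
    # Hypothetical trainer: returns the wrapper instead of the raw model,
    # so callers invoke methods on the returned handle rather than passing
    # the model back into separate stub functions.
    return Word2VecModelWrapper(FakeModel(vectors))


wrapper = train_word2vec({"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]})
vector_a = wrapper.transform("a")
synonyms_a = wrapper.findSynonyms("a", 1)
```

With this shape, the per-method stubs such as `Word2VecModelTransform` and `Word2VecModelSynonyms` are no longer needed, because the handle returned by training carries its own methods.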



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18184065
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,54 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModelWrapper instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]],
    +    vectorSize: Int,
    +    startingAlpha: Double,
    +    numPartitions: Int,
    +    numIterations: Int,
    +    seed: Long
    +    ): Word2VecModelWrapper = {
    +    val data = dataJRDD.rdd.cache()
    +    val word2vec = new Word2Vec()
    +                    .setVectorSize(vectorSize)
    --- End diff --
    
    use 2-space indentation



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18184061
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,54 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModelWrapper instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]],
    +    vectorSize: Int,
    +    startingAlpha: Double,
    --- End diff --
    
    use `learningRate` instead



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18117647
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,123 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +
    +from functools import wraps
    +
    +from pyspark import PickleSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        pythonAPI = self._sc._jvm.PythonMLLibAPI()
    +        result = pythonAPI.Word2VecModelTransform(self._java_model, word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        SerDe = self._sc._jvm.SerDe
    +        ser = PickleSerializer()
    +        pythonAPI = self._sc._jvm.PythonMLLibAPI()
    +        if type(x) == str:
    +            jlist = pythonAPI.Word2VecModelSynonyms(self._java_model, x, num)
    +        else:
    +            bytes = bytearray(ser.dumps(_convert_to_vector(x)))
    +            vec = self._sc._jvm.SerDe.loads(bytes)
    +            jlist = pythonAPI.Word2VecModelSynonyms(self._java_model, vec, num)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(jlist)))
    +
    +
    +class Word2Vec(object):
    +    """
    +    Word2Vec creates vector representation of words in a text corpus.
    +    The algorithm first constructs a vocabulary from the corpus
    +    and then learns vector representation of words in the vocabulary.
    +    The vector representation can be used as features in
    +    natural language processing and machine learning algorithms.
    +
    +    We used skip-gram model in our implementation and hierarchical softmax
    +    method to train the model. The variable names in the implementation
    +    matches the original C implementation.
    +    For original C implementation, see https://code.google.com/p/word2vec/
    +    For research papers, see
    +    Efficient Estimation of Word Representations in Vector Space
    +    and
    +    Distributed Representations of Words and Phrases and their Compositionality.
    +    """
    +    def __init__(self):
    +        self.vectorSize = 100
    +        self.startingAlpha = 0.025
    +        self.numPartitions = 1
    +        self.numIterations = 1
    +
    +    def setVectorSize(self, vectorSize):
    +        self.vectorSize = vectorSize
    +        return self
    +
    +    def setLearningRate(self, learningRate):
    +        self.startingAlpha = learningRate
    +        return self
    +
    +    def setNumPartitions(self, numPartitions):
    +        self.numPartitions = numPartitions
    +        return self
    +
    +    def setNumIterations(self, numIterations):
    +        self.numIterations = numIterations
    +        return self
    +
    +    def fit(self, data):
    +        """
    +        :param data: Input RDD
    +        """
    +        sc = data.context
    +        model = sc._jvm.PythonMLLibAPI().trainWord2Vec(data._to_java_object_rdd())
    --- End diff --
    
    Can you elaborate? Thanks!



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18486384
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,192 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        """
    +        :param word: a word
    +        :return: vector representation of word
    +
    +        Note: local use only
    +        TODO: make transform usable in RDD operations from python side
    --- End diff --
    
    Let's move TODOs inside the implementation. Those are not for users.
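
A minimal sketch of the requested layout, using a hypothetical stand-in class (a plain lookup table replaces the JVM model handle) so that it runs without Spark. The docstring states what the method does and keeps the user-facing note, while the TODO becomes a comment in the body.

```python
class Word2VecModelSketch(object):
    """Hypothetical stand-in; a dict replaces the Java model handle."""

    def __init__(self, vectors):
        self._vectors = vectors  # dict: word -> list of floats

    def transform(self, word):
        """
        Transforms a word to its vector representation.

        :param word: a word
        :return: vector representation of word

        Note: local use only
        """
        # TODO: make transform usable in RDD operations from python side
        return self._vectors[word]


transform_doc = Word2VecModelSketch.transform.__doc__
```

Because the TODO is a comment rather than docstring text, it no longer appears in generated API documentation or in `help()` output.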



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-55258752
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20153/consoleFull) for   PR 2356 at commit [`68e7276`](https://github.com/apache/spark/commit/68e7276896eeeb546f6f212f5a2f8ae5470cf0b5).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Word2VecModel(object):`
      * `class Word2Vec(object):`




[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18486395
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,192 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        """
    +        :param word: a word
    +        :return: vector representation of word
    +
    +        Note: local use only
    +        TODO: make transform usable in RDD operations from python side
    +        """
    +        result = self._java_model.transform(word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        """
    +        :param x: a word or a vector representation of word
    +        :param num: number of synonyms to find
    +        :return: array of (word, cosineSimilarity)
    +
    +        Note: local use only
    +        TODO: make findSynonyms usable in RDD operations from python side
    +        """
    +        ser = PickleSerializer()
    +        if type(x) == str:
    +            jlist = self._java_model.findSynonyms(x, num)
    +        else:
    +            bytes = bytearray(ser.dumps(_convert_to_vector(x)))
    +            vec = self._sc._jvm.SerDe.loads(bytes)
    +            jlist = self._java_model.findSynonyms(vec, num)
    +        words, similarity = ser.loads(str(self._sc._jvm.SerDe.dumps(jlist)))
    +        return zip(words, similarity)
    +
    +
    +class Word2Vec(object):
    +    """
    +    Word2Vec creates vector representation of words in a text corpus.
    +    The algorithm first constructs a vocabulary from the corpus
    +    and then learns vector representation of words in the vocabulary.
    +    The vector representation can be used as features in
    +    natural language processing and machine learning algorithms.
    +
    +    We used skip-gram model in our implementation and hierarchical softmax
    +    method to train the model. The variable names in the implementation
    +    matches the original C implementation.
    +    For original C implementation, see https://code.google.com/p/word2vec/
    +    For research papers, see
    +    Efficient Estimation of Word Representations in Vector Space
    +    and
    +    Distributed Representations of Words and Phrases and their Compositionality.
    +
    +    >>> sentence = "a b " * 100 + "a c " * 10
    +    >>> localDoc = [sentence, sentence]
    +    >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
    +    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
    +    >>> syms = model.findSynonyms("a", 2)
    +    >>> str(syms[0][0])
    +    'b'
    +    >>> str(syms[1][0])
    +    'c'
    +    >>> len(syms)
    +    2
    +    >>> vec = model.transform("a")
    +    >>> len(vec)
    +    10
    +    >>> syms = model.findSynonyms(vec, 2)
    +    >>> str(syms[0][0])
    +    'b'
    +    >>> str(syms[1][0])
    +    'c'
    +    >>> len(syms)
    +    2
    +    """
    +    def __init__(self):
    +        """
    +        Construct Word2Vec instance
    +        """
    +        self.vectorSize = 100
    +        self.learningRate = 0.025
    +        self.numPartitions = 1
    +        self.numIterations = 1
    +        self.seed = 42L
    +
    +    def setVectorSize(self, vectorSize):
    +        """
    +        Sets vector size (default: 100).
    +        """
    +        self.vectorSize = vectorSize
    +        return self
    +
    +    def setLearningRate(self, learningRate):
    +        """
    +        Sets initial learning rate (default: 0.025).
    +        """
    +        self.learningRate = learningRate
    +        return self
    +
    +    def setNumPartitions(self, numPartitions):
    +        """
    +        Sets number of partitions (default: 1). Use a small number for accuracy.
    +        """
    +        self.numPartitions = numPartitions
    +        return self
    +
    +    def setNumIterations(self, numIterations):
    +        """
    +        Sets number of iterations (default: 1), which should be smaller than or equal to number of
    +        partitions.
    +        """
    +        self.numIterations = numIterations
    +        return self
    +
    +    def setSeed(self, seed):
    +        """
    +        Sets random seed (default: a random long integer).
    +        """
    +        self.seed = seed
    +        return self
    +
    +    def fit(self, data):
    +        """
    +        Computes the vector representation of each word in vocabulary.
    +
    +        :param data: training data.
    --- End diff --
    
    need the type info on the input data
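
One way to address this comment is sketched below. It assumes `fit` expects an RDD whose elements are sequences of strings, which is what the doctest in the diff passes in (each line split on spaces). The class name `Word2VecSketch` is a hypothetical stand-in, not the code from the PR.

```python
class Word2VecSketch(object):
    """Hypothetical stand-in for the Word2Vec class under review."""

    def fit(self, data):
        """
        Computes the vector representation of each word in vocabulary.

        :param data: training data. Assumed here to be an RDD of
                     sequences of strings (one sequence per sentence).
        :return: python Word2VecModel instance
        """
        # Body omitted on purpose: this sketch only illustrates the docstring.
        raise NotImplementedError


# The docstring now states the expected input type.
fit_doc = Word2VecSketch.fit.__doc__
```

Stating the element type in the `:param data:` line tells callers that each line must be tokenized before it is passed to `fit`.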



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-57888194
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21279/



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58270914
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21408/consoleFull) for   PR 2356 at commit [`b13a0b9`](https://github.com/apache/spark/commit/b13a0b9d47cb3a6604ece9773bcbbd2877db6299).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Word2VecModel(object):`
      * `class Word2Vec(object):`




[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2356



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58252347
  
    test this please



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-55375085
  
    @davies Could you take a look at this PR and see whether there is an easier way for SerDe? Thanks!



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18117608
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModel instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]]
    +    ): Word2VecModel = {
    +    val data = dataJRDD.rdd.map(_.toArray(new Array[String](0)).toSeq).cache()
    +    val word2vec = new Word2Vec()
    +    val model = word2vec.fit(data)
    +    model
    +  }
    +
    +  /**
    +   * Java stub for Python mllib Word2VecModel transform
    +   * @param model Word2VecModel instance
    +   * @param word a word
    +   * @return serialized vector representation of word
    +   */
    +  def Word2VecModelTransform(
    +    model: Word2VecModel,
    --- End diff --
    
    Done.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18117593
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,123 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +
    +from functools import wraps
    --- End diff --
    
    Done.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18486413
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,192 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        """
    +        :param word: a word
    +        :return: vector representation of word
    +
    +        Note: local use only
    +        TODO: make transform usable in RDD operations from python side
    +        """
    +        result = self._java_model.transform(word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        """
    +        :param x: a word or a vector representation of word
    +        :param num: number of synonyms to find
    +        :return: array of (word, cosineSimilarity)
    +
    +        Note: local use only
    +        TODO: make findSynonyms usable in RDD operations from python side
    +        """
    +        ser = PickleSerializer()
    +        if type(x) == str:
    +            jlist = self._java_model.findSynonyms(x, num)
    +        else:
    +            bytes = bytearray(ser.dumps(_convert_to_vector(x)))
    +            vec = self._sc._jvm.SerDe.loads(bytes)
    +            jlist = self._java_model.findSynonyms(vec, num)
    +        words, similarity = ser.loads(str(self._sc._jvm.SerDe.dumps(jlist)))
    +        return zip(words, similarity)
    +
    +
    +class Word2Vec(object):
    +    """
    +    Word2Vec creates vector representation of words in a text corpus.
    +    The algorithm first constructs a vocabulary from the corpus
    +    and then learns vector representation of words in the vocabulary.
    +    The vector representation can be used as features in
    +    natural language processing and machine learning algorithms.
    +
    +    We used skip-gram model in our implementation and hierarchical softmax
    +    method to train the model. The variable names in the implementation
    +    matches the original C implementation.
    +    For original C implementation, see https://code.google.com/p/word2vec/
    +    For research papers, see
    +    Efficient Estimation of Word Representations in Vector Space
    +    and
    +    Distributed Representations of Words and Phrases and their Compositionality.
    +
    +    >>> sentence = "a b " * 100 + "a c " * 10
    +    >>> localDoc = [sentence, sentence]
    +    >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
    +    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
    +    >>> syms = model.findSynonyms("a", 2)
    +    >>> str(syms[0][0])
    +    'b'
    +    >>> str(syms[1][0])
    +    'c'
    +    >>> len(syms)
    +    2
    +    >>> vec = model.transform("a")
    +    >>> len(vec)
    +    10
    +    >>> syms = model.findSynonyms(vec, 2)
    +    >>> str(syms[0][0])
    +    'b'
    +    >>> str(syms[1][0])
    +    'c'
    +    >>> len(syms)
    +    2
    +    """
    +    def __init__(self):
    +        """
    +        Construct Word2Vec instance
    +        """
    +        self.vectorSize = 100
    +        self.learningRate = 0.025
    +        self.numPartitions = 1
    +        self.numIterations = 1
    +        self.seed = 42L
    +
    +    def setVectorSize(self, vectorSize):
    +        """
    +        Sets vector size (default: 100).
    +        """
    +        self.vectorSize = vectorSize
    +        return self
    +
    +    def setLearningRate(self, learningRate):
    +        """
    +        Sets initial learning rate (default: 0.025).
    +        """
    +        self.learningRate = learningRate
    +        return self
    +
    +    def setNumPartitions(self, numPartitions):
    +        """
    +        Sets number of partitions (default: 1). Use a small number for accuracy.
    +        """
    +        self.numPartitions = numPartitions
    +        return self
    +
    +    def setNumIterations(self, numIterations):
    +        """
    +        Sets number of iterations (default: 1), which should be smaller than or equal to number of
    +        partitions.
    +        """
    +        self.numIterations = numIterations
    +        return self
    +
    +    def setSeed(self, seed):
    +        """
    +        Sets random seed (default: a random long integer).
    +        """
    +        self.seed = seed
    +        return self
    +
    +    def fit(self, data):
    +        """
    +        Computes the vector representation of each word in vocabulary.
    +
    +        :param data: training data.
    +        :return: python Word2VecModel instance
    +        """
    +        sc = data.context
    +        ser = PickleSerializer()
    +        vectorSize = self.vectorSize
    +        learningRate = self.learningRate
    +        numPartitions = self.numPartitions
    +        numIterations = self.numIterations
    +        seed = self.seed
    +
    +        # cached = data._reserialize(AutoBatchedSerializer(ser)).cache()
    --- End diff --
    
    remove unused code



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-56420439
  
    Now that #2378 has been merged, is this unblocked?



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18184088
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,151 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +from numpy import random
    +
    +from sys import maxint
    --- End diff --
    
    organize imports:
    
    ~~~
    from sys ...
    
    from numpy ...
    
    from pyspark.serializers ...
    from pyspark.mllib ...
    ~~~



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-56869195
  
    @mengxr PR updated to use new pickle SerDe. 



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58270916
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21408/



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-57039641
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20894/



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18121812
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,124 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +
    +from pyspark import PickleSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    --- End diff --
    
    add doc and note this is local



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-57881048
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21279/consoleFull) for   PR 2356 at commit [`a73fa19`](https://github.com/apache/spark/commit/a73fa19786bca754ecf8567bc83bdce1f90569ee).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58118924
  
    @Ishiihara Another file to update is `python/docs/pyspark.mllib.rst`. We need a section for the `pyspark.mllib.feature` module.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-57888189
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21279/consoleFull) for   PR 2356 at commit [`a73fa19`](https://github.com/apache/spark/commit/a73fa19786bca754ecf8567bc83bdce1f90569ee).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `        case e: Exception => logError("Source class " + classPath + " cannot be instantiated", e)`
      * `  case class AddWebUIFilter(filterName:String, filterParams: Map[String, String], proxyBase :String)`
      * `class Word2VecModel(object):`
      * `class Word2Vec(object):`




[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58267109
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/279/consoleFull) for   PR 2356 at commit [`daf88a6`](https://github.com/apache/spark/commit/daf88a6d6d901185b4699ee6f7325865f3174e07).
     * This patch **fails** unit tests.
     * This patch **does not** merge cleanly!




[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58278875
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21412/consoleFull) for   PR 2356 at commit [`476ea34`](https://github.com/apache/spark/commit/476ea34c9f576d425a05604f77cc3cab43fd5bae).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Word2VecModel(object):`
      * `class Word2Vec(object):`




[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18486381
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,192 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        """
    +        :param word: a word
    --- End diff --
    
    Please put a one-sentence summary of the method before the parameter list.
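
As a rough illustration of the comment above (not code from the PR), a docstring that leads with a one-sentence summary and then lists the parameters might look like the sketch below. `Word2VecModelSketch` and its dict-backed storage are hypothetical stand-ins so that the example runs without Spark; the summary wording is also an assumption.

```python
class Word2VecModelSketch(object):
    """Hypothetical stand-in for the PR's Word2VecModel; not the real class."""

    def __init__(self, vectors):
        # vectors: plain dict mapping word -> list of floats (an assumption
        # made here so the sketch runs without a JVM or SparkContext).
        self._vectors = vectors

    def transform(self, word):
        """
        Return the vector representation of a single word.

        :param word: a word in the vocabulary
        :return: the vector (list of floats) stored for that word
        """
        return self._vectors[word]


model = Word2VecModelSketch({"a": [0.1, 0.2]})
# The first non-blank docstring line is the summary a reader sees first.
summary = model.transform.__doc__.strip().splitlines()[0]
```

Tools such as `help()` and most documentation generators show that first line as the short description, which is why it reads better ahead of the `:param` entries.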



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58270419
  
    test this please



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-55425713
  
    @mengxr I'm looking into this. Could we hold this PR for a few days until we find a scalable way to do serialization?



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-56869584
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20816/consoleFull) for   PR 2356 at commit [`78bbb53`](https://github.com/apache/spark/commit/78bbb533be9f9a11cb81fe4278e0833ade7fe833).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18121803
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,124 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +
    +from pyspark import PickleSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        result = self._java_model.transform(word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        jlist = self._java_model.findSynonyms(x, num)
    +        words, similarity = PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(jlist)))
    +        return zip(words, similarity)
    +
    +
    +class Word2Vec(object):
    +    """
    +    Word2Vec creates vector representation of words in a text corpus.
    +    The algorithm first constructs a vocabulary from the corpus
    +    and then learns vector representation of words in the vocabulary.
    +    The vector representation can be used as features in
    +    natural language processing and machine learning algorithms.
    +
    +    We used skip-gram model in our implementation and hierarchical softmax
    +    method to train the model. The variable names in the implementation
    +    matches the original C implementation.
    +    For original C implementation, see https://code.google.com/p/word2vec/
    +    For research papers, see
    +    Efficient Estimation of Word Representations in Vector Space
    +    and
    +    Distributed Representations of Words and Phrases and their Compositionality.
    +    >>> sentence = "a b " * 100 + "a c " * 10
    +    >>> localDoc = [sentence, sentence]
    +    >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
    +    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
    +    >>> syms = model.findSynonyms("a", 2)
    +    >>> str(syms[0][0])
    +    'b'
    +    >>> str(syms[1][0])
    +    'c'
    +    """
    +    def __init__(self):
    +        self.vectorSize = 100
    +        self.startingAlpha = 0.025
    +        self.numPartitions = 1
    +        self.numIterations = 1
    +        self.seed = 42L
    +
    +    def setVectorSize(self, vectorSize):
    +        self.vectorSize = vectorSize
    +        return self
    +
    +    def setLearningRate(self, learningRate):
    +        self.startingAlpha = learningRate
    +        return self
    +
    +    def setNumPartitions(self, numPartitions):
    +        self.numPartitions = numPartitions
    +        return self
    +
    +    def setNumIterations(self, numIterations):
    +        self.numIterations = numIterations
    +        return self
    +
    +    def setSeed(self, seed):
    +        self.seed = seed
    +        return self
    +
    +    def fit(self, data):
    +        sc = data.context
    +        model = sc._jvm.PythonMLLibAPI().trainWord2Vec(data._to_java_object_rdd())
    --- End diff --
    
    The parameters set on the Python side are not passed to Scala in this call.
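
A minimal sketch of the fix this comment asks for, assuming the argument order of the `trainWord2Vec` signature that appears in later revisions quoted in this thread (data, vectorSize, learningRate, numPartitions, numIterations, seed). `FakePythonMLLibAPI` and `Word2VecSketch` are hypothetical stand-ins so the example runs without Spark; in the PR the call goes through `sc._jvm.PythonMLLibAPI()` and `data._to_java_object_rdd()`.

```python
class FakePythonMLLibAPI(object):
    """Hypothetical stand-in for the JVM-side API; it only records its arguments."""

    def trainWord2Vec(self, data, vectorSize, learningRate,
                      numPartitions, numIterations, seed):
        return {
            "vectorSize": vectorSize,
            "learningRate": learningRate,
            "numPartitions": numPartitions,
            "numIterations": numIterations,
            "seed": seed,
        }


class Word2VecSketch(object):
    """Hypothetical trimmed-down version of the PR's Word2Vec builder."""

    def __init__(self, api):
        self._api = api
        self.vectorSize = 100
        self.learningRate = 0.025
        self.numPartitions = 1
        self.numIterations = 1
        self.seed = 42  # the PR uses the Python 2 literal 42L

    def setVectorSize(self, vectorSize):
        self.vectorSize = vectorSize
        return self

    def setSeed(self, seed):
        self.seed = seed
        return self

    def fit(self, data):
        # Forward every configured value. Passing only `data` would drop
        # whatever the caller set through the setters.
        return self._api.trainWord2Vec(
            data, self.vectorSize, self.learningRate,
            self.numPartitions, self.numIterations, self.seed)


sent = Word2VecSketch(FakePythonMLLibAPI()).setVectorSize(10).setSeed(7).fit([["a", "b"]])
```

Here `sent` holds the values the backend received, so a test can check that the setters actually took effect.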



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18486326
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,58 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD input JavaRDD
    +   * @param vectorSize size of vector
    +   * @param learningRate initial learning rate
    +   * @param numPartitions number of partitions
    +   * @param numIterations number of iterations
    +   * @param seed initial seed for random generator
    +   * @return A handle to java Word2VecModelWrapper instance at python side
    +   */
    +  def trainWord2Vec(
    +      dataJRDD: JavaRDD[java.util.ArrayList[String]],
    +      vectorSize: Int,
    +      learningRate: Double,
    +      numPartitions: Int,
    +      numIterations: Int,
    +      seed: Long): Word2VecModelWrapper = {
    +    val data = dataJRDD.rdd.cache()
    --- End diff --
    
    Please change the storage level to `MEMORY_AND_DISK_SER`. With `spark.rdd.compress=true`, memory usage drops from 900MB to 70MB.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-57037893
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20894/consoleFull) for   PR 2356 at commit [`b9a7383`](https://github.com/apache/spark/commit/b9a73831c15b88955d69ee5ea359117d1441b298).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18184121
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,151 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +from numpy import random
    +
    +from sys import maxint
    +
    +from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        """
    +        local use only
    +        TODO: make transform usable in RDD operations from python side
    +        """
    +        result = self._java_model.transform(word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        """
    +        local use only
    +        TODO: make findSynonyms usable in RDD operations from python side
    +        """
    +        jlist = self._java_model.findSynonyms(x, num)
    +        words, similarity = PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(jlist)))
    +        return zip(words, similarity)
    +
    +
    +class Word2Vec(object):
    +    """
    +    Word2Vec creates vector representation of words in a text corpus.
    +    The algorithm first constructs a vocabulary from the corpus
    +    and then learns vector representation of words in the vocabulary.
    +    The vector representation can be used as features in
    +    natural language processing and machine learning algorithms.
    +
    +    We used skip-gram model in our implementation and hierarchical softmax
    +    method to train the model. The variable names in the implementation
    +    matches the original C implementation.
    +    For original C implementation, see https://code.google.com/p/word2vec/
    +    For research papers, see
    +    Efficient Estimation of Word Representations in Vector Space
    +    and
    +    Distributed Representations of Words and Phrases and their Compositionality.
    +
    +    >>> sentence = "a b " * 100 + "a c " * 10
    +    >>> localDoc = [sentence, sentence]
    +    >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
    +    >>> model = Word2Vec().setVectorSize(10).setSeed(42L).fit(doc)
    +    >>> syms = model.findSynonyms("a", 2)
    +    >>> str(syms[0][0])
    +    'b'
    +    >>> str(syms[1][0])
    +    'c'
    +    >>> len(syms)
    +    2
    +    >>> vec = model.transform("a")
    +    >>> len(vec)
    +    10
    +    """
    +    def __init__(self):
    +        self.vectorSize = 100
    +        self.startingAlpha = 0.025
    --- End diff --
    
    Use `learningRate` instead of `startingAlpha` as the attribute name.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58271977
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21411/consoleFull) for   PR 2356 at commit [`476ea34`](https://github.com/apache/spark/commit/476ea34c9f576d425a05604f77cc3cab43fd5bae).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58251926
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21399/



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18121806
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,42 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModelWrapper instance at python side
    +   */
    +  def trainWord2Vec(dataJRDD: JavaRDD[java.util.ArrayList[String]]): Word2VecModelWrapper = {
    +    val data = dataJRDD.rdd.cache()
    --- End diff --
    
    `cache()` is not necessary here.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18063657
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModel instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]]
    +    ): Word2VecModel = {
    +    val data = dataJRDD.rdd.map(_.toArray(new Array[String](0)).toSeq).cache()
    --- End diff --
    
    +1 on @davies 's suggestion
    
    You don't need any type conversion here. `word2vec.fit(dataJRDD)` should work.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18063656
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModel instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]]
    --- End diff --
    
    Put the method declaration on a single line (if it fits).



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18184062
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,54 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModelWrapper instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]],
    +    vectorSize: Int,
    +    startingAlpha: Double,
    +    numPartitions: Int,
    +    numIterations: Int,
    +    seed: Long
    +    ): Word2VecModelWrapper = {
    --- End diff --
    
    Merge this line into the one above.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58255069
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/279/consoleFull) for   PR 2356 at commit [`daf88a6`](https://github.com/apache/spark/commit/daf88a6d6d901185b4699ee6f7325865f3174e07).
     * This patch **does not** merge cleanly!



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-55248343
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20148/consoleFull) for   PR 2356 at commit [`48d5e72`](https://github.com/apache/spark/commit/48d5e721a58924f33ebef31b9e67454f45480d5c).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Word2VecModel(object):`
      * `class Word2Vec(object):`




[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-55248249
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20148/consoleFull) for   PR 2356 at commit [`48d5e72`](https://github.com/apache/spark/commit/48d5e721a58924f33ebef31b9e67454f45480d5c).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58257811
  
    @Ishiihara Could you try merging master? The Python doc configuration may have changed.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58272541
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21412/consoleFull) for   PR 2356 at commit [`476ea34`](https://github.com/apache/spark/commit/476ea34c9f576d425a05604f77cc3cab43fd5bae).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18486738
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,58 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD input JavaRDD
    +   * @param vectorSize size of vector
    +   * @param learningRate initial learning rate
    +   * @param numPartitions number of partitions
    +   * @param numIterations number of iterations
    +   * @param seed initial seed for random generator
    +   * @return A handle to java Word2VecModelWrapper instance at python side
    +   */
    +  def trainWord2Vec(
    +      dataJRDD: JavaRDD[java.util.ArrayList[String]],
    +      vectorSize: Int,
    +      learningRate: Double,
    +      numPartitions: Int,
    +      numIterations: Int,
    +      seed: Long): Word2VecModelWrapper = {
    +    val data = dataJRDD.rdd.cache()
    +    val word2vec = new Word2Vec()
    +        .setVectorSize(vectorSize)
    --- End diff --
    
    Use 2-space indentation for the chained builder methods.


[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58271152
  
    test this please


[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18122598
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,124 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +"""
    +Python package for Word2Vec in MLlib.
    +"""
    +
    +from pyspark import PickleSerializer
    +
    +from pyspark.mllib.linalg import _convert_to_vector
    +
    +__all__ = ['Word2Vec', 'Word2VecModel']
    +
    +
    +class Word2VecModel(object):
    +    """
    +    class for Word2Vec model
    +    """
    +    def __init__(self, sc, java_model):
    +        """
    +        :param sc:  Spark context
    +        :param java_model:  Handle to Java model object
    +        """
    +        self._sc = sc
    +        self._java_model = java_model
    +
    +    def __del__(self):
    +        self._sc._gateway.detach(self._java_model)
    +
    +    def transform(self, word):
    +        result = self._java_model.transform(word)
    +        return PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(result)))
    +
    +    def findSynonyms(self, x, num):
    +        jlist = self._java_model.findSynonyms(x, num)
    +        words, similarity = PickleSerializer().loads(str(self._sc._jvm.SerDe.dumps(jlist)))
    +        return zip(words, similarity)
    +
    +
    +class Word2Vec(object):
    +    """
    +    Word2Vec creates vector representation of words in a text corpus.
    +    The algorithm first constructs a vocabulary from the corpus
    +    and then learns vector representation of words in the vocabulary.
    +    The vector representation can be used as features in
    +    natural language processing and machine learning algorithms.
    +
    +    We used skip-gram model in our implementation and hierarchical softmax
    +    method to train the model. The variable names in the implementation
    +    matches the original C implementation.
    +    For original C implementation, see https://code.google.com/p/word2vec/
    +    For research papers, see
    +    Efficient Estimation of Word Representations in Vector Space
    +    and
    +    Distributed Representations of Words and Phrases and their Compositionality.
    --- End diff --
    
    Done
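
The `Word2VecModel` class in the diff above keeps a handle to a JVM object and detaches it in `__del__`. A minimal sketch of that pattern follows, runnable without Spark. `FakeGateway`, `FakeJavaModel`, and `ModelHandle` are hypothetical stand-ins introduced for illustration; they are not part of PySpark or Py4J, and the synonym values are made up.

```python
class FakeGateway(object):
    """Hypothetical stand-in for the Py4J gateway; records detached handles."""

    def __init__(self):
        self.detached = []

    def detach(self, java_object):
        self.detached.append(java_object)


class FakeJavaModel(object):
    """Hypothetical stand-in for the JVM-side model wrapper."""

    def findSynonyms(self, word, num):
        # Made-up data: parallel lists of words and similarity scores.
        words = ["b", "c", "d"][:num]
        similarity = [0.9, 0.5, 0.1][:num]
        return words, similarity


class ModelHandle(object):
    """Python-side wrapper that owns a handle to a JVM-side object."""

    def __init__(self, gateway, java_model):
        self._gateway = gateway
        self._java_model = java_model

    def __del__(self):
        # Release the JVM-side reference when the Python wrapper is collected.
        self._gateway.detach(self._java_model)

    def findSynonyms(self, word, num):
        words, similarity = self._java_model.findSynonyms(word, num)
        # list(...) gives a concrete list of pairs on Python 3 as well,
        # where zip returns an iterator.
        return list(zip(words, similarity))


gateway = FakeGateway()
java_model = FakeJavaModel()
model = ModelHandle(gateway, java_model)
synonyms = model.findSynonyms("a", 2)
del model  # CPython's reference counting runs __del__ immediately here
```

Relying on `__del__` ties the release to garbage collection, so on interpreters without reference counting the detach may happen later than the `del` statement.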


[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58119086
  
    @mengxr I will take care of that and the other comments.


[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18486359
  
    --- Diff: python/pyspark/mllib/Word2Vec.py ---
    @@ -0,0 +1,192 @@
    +#
    --- End diff --
    
    Please rename the file to `feature.py` so that `Word2Vec` lives under the `mllib.feature` package.


[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-56888002
  
    Could you add some tests?


[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18184054
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,54 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    --- End diff --
    
    Add the other parameters (or remove this line).


[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18118109
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModel instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]]
    +    ): Word2VecModel = {
    +    val data = dataJRDD.rdd.map(_.toArray(new Array[String](0)).toSeq).cache()
    --- End diff --
    
    @davies Are we always using batched serialization? The pythonToJava function in PythonRDD returns JavaRDD[Any]. Should I use JavaRDD[java.util.ArrayList[String]] as the return type? Thanks!
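
The question above turns on what batched serialization means for the receiving side: each serialized record holds a list of elements rather than a single element, so the reader has to flatten the batches again. A minimal sketch of that idea using only the standard `pickle` module follows. `dump_batches` and `load_batches` are hypothetical helper names, and this is not PySpark's actual serializer code.

```python
import pickle
from itertools import islice


def dump_batches(items, batch_size):
    """Yield one pickled blob per batch of up to batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield pickle.dumps(batch, protocol=2)


def load_batches(blobs):
    """Unpickle each blob and flatten the batches back into single items."""
    for blob in blobs:
        for item in pickle.loads(blob):
            yield item


# Each "sentence" is a list of words, like the string sequences in this PR.
sentences = [["a", "b"], ["c"], ["d", "e", "f"]]
blobs = list(dump_batches(sentences, batch_size=2))
restored = list(load_batches(blobs))
```

With unbatched serialization every blob would decode to one sentence; with batching every blob decodes to a list of sentences, which is why the element type seen on the JVM side depends on whether batching is in use.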


[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2356#issuecomment-58278846
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21411/consoleFull) for   PR 2356 at commit [`476ea34`](https://github.com/apache/spark/commit/476ea34c9f576d425a05604f77cc3cab43fd5bae).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class Word2VecModel(object):`
      * `class Word2Vec(object):`



[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

Posted by Ishiihara <gi...@git.apache.org>.

Github user Ishiihara commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2356#discussion_r18117604
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * Java stub for Python mllib Word2Vec fit(). This stub returns a
    +   * handle to the Java object instead of the content of the Java object.
    +   * Extra care needs to be taken in the Python code to ensure it gets freed on
    +   * exit; see the Py4J documentation.
    +   * @param dataJRDD Input JavaRDD
    +   * @return A handle to java Word2VecModel instance at python side
    +   */
    +  def trainWord2Vec(
    +    dataJRDD: JavaRDD[java.util.ArrayList[String]]
    --- End diff --
    
    Done.

