
Posted to reviews@spark.apache.org by yinxusen <gi...@git.apache.org> on 2015/04/20 18:07:17 UTC

[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

GitHub user yinxusen opened a pull request:

    https://github.com/apache/spark/pull/5596

    [ML][SPARK-6529] Add Word2Vec transformer

    See JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-6529).
    
    There are some notes:
    
    1. I added `learningRate` to sharedParams, since it is a common parameter for ML algorithms.
    2. Transform does not yet support finding synonyms from a `Vector`; that will be added in follow-up JIRA issues.
    3. Word2Vec differs from other ML models in that its training set and its transformed set are different. The training set is an `RDD[Iterable[String]]` that represents documents, but the transformed set is an `RDD[String]` that represents unique words. So you have to switch your `inputCol` between these two stages.
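
As a rough illustration of note 3, the two input shapes can be sketched in plain Scala without Spark. The object name, the sample documents, and the `vocabulary` helper are assumptions made for illustration and are not part of the PR:

```scala
// Plain-Scala sketch (no Spark dependency); every name here is illustrative.
object Word2VecInputShapes {
  // Training input: each document is a sequence of tokens.
  val documents: Seq[Seq[String]] = Seq(
    Seq("a", "b", "a", "c"),
    Seq("a", "b")
  )

  // Transform input: the unique words drawn from those documents.
  def vocabulary(docs: Seq[Seq[String]]): Seq[String] =
    docs.flatten.distinct.sorted

  def main(args: Array[String]): Unit =
    println(vocabulary(documents).mkString(", "))
}
```

In the PR itself the first shape would live in the column used for `fit` and the second in the column used for `transform`, which is why `inputCol` has to change between the two stages.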

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yinxusen/spark SPARK-6529

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5596.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5596
    
----
commit 6a514f16fd12f7b7dbf9fe33e442b33958d1cd20
Author: Xusen Yin <yi...@gmail.com>
Date:   2015-04-19T09:22:33Z

    add word2vec transformer

commit 02767fb59d2b3583000a15d3a337b6a41c6be71f
Author: Xusen Yin <yi...@gmail.com>
Date:   2015-04-20T04:43:48Z

    add shared params

commit fe3afe99214f72517a3a695063dc710110f8dd31
Author: Xusen Yin <yi...@gmail.com>
Date:   2015-04-20T06:53:29Z

    add test suite and pass it

commit e29680a091806bcb3ee6c9b8a44e407b4bd040fa
Author: Xusen Yin <yi...@gmail.com>
Date:   2015-04-20T15:34:09Z

    fix errors

commit 618abd0cc3727896448c227ccccac351a0e592a6
Author: Xusen Yin <yi...@gmail.com>
Date:   2015-04-20T15:57:37Z

    refine comments

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-94496921
  
      [Test build #30592 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30592/consoleFull) for PR 5596 at commit [`618abd0`](https://github.com/apache/spark/commit/618abd0cc3727896448c227ccccac351a0e592a6).



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97594780
  
    @yinxusen Could you close this PR and re-open it? It forces GitHub to update the diff.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95513009
  
      [Test build #30830 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30830/consoleFull) for PR 5596 at commit [`566ec20`](https://github.com/apache/spark/commit/566ec202f6fc0975041f25d6640cf60154b7d6e8).



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5596#discussion_r29059297
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala ---
    @@ -0,0 +1,63 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +import org.apache.spark.sql.{Row, SQLContext}
    +
    +class Word2VecSuite extends FunSuite with MLlibTestSparkContext {
    +
    +  test("Word2Vec") {
    +    val sqlContext = new SQLContext(sc)
    +    import sqlContext.implicits._
    +
    +    val sentence = "a b " * 100 + "a c " * 10
    +    val numOfWords = sentence.split(" ").size
    +    val doc = sc.parallelize(Seq(sentence, sentence)).map(line => line.split(" "))
    +
    +    val codes = Map(
    +      "a" -> Array(-0.2811822295188904,-0.6356269121170044,-0.3020961284637451),
    +      "b" -> Array(1.0309048891067505,-1.29472815990448,0.22276712954044342),
    +      "c" -> Array(-0.08456747233867645,0.5137411952018738,0.11731560528278351)
    +    )
    +
    +    val expected = doc.map { sentence =>
    +      Vectors.dense(sentence.map(codes.apply).reduce((word1, word2) =>
    +        word1.zip(word2).map { case (v1, v2) => v1 + v2 }
    +      ).map(_ / numOfWords))
    +    }
    +
    +    val docDF = doc.zip(expected).toDF("text", "expected")
    +
    +    val model = new Word2Vec()
    +      .setVectorSize(3)
    +      .setInputCol("text")
    +      .setOutputCol("result")
    +      .fit(docDF)
    +
    +    model.transform(docDF).select("result", "expected").collect().foreach {
    +      case Row(vector1: Vector, vector2: Vector) =>
    +        assert(vector1 ~== vector2 absTol 1E-5, "Transformed vector is different with expected.")
    --- End diff --
    
    Please run this test multiple times and see whether this is deterministic.
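
For context, the `expected` column in the quoted test is the element-wise mean of the codes of the words in each document. A minimal plain-Scala sketch of that averaging is shown below; the object name and the toy codes are assumptions for illustration, not values from the PR, and the helper assumes a non-empty document whose words all have codes:

```scala
// Plain-Scala sketch of averaging word codes; names and values are illustrative.
object AverageCodes {
  // Element-wise mean of the codes of all words in one non-empty document.
  def average(doc: Seq[String], codes: Map[String, Array[Double]]): Array[Double] = {
    val sum = doc.map(codes).reduce { (x, y) =>
      x.zip(y).map { case (a, b) => a + b }
    }
    sum.map(_ / doc.size)
  }

  def main(args: Array[String]): Unit = {
    val codes = Map("a" -> Array(1.0, 2.0), "b" -> Array(3.0, 4.0))
    println(average(Seq("a", "b"), codes).mkString("[", ", ", "]"))
  }
}
```

The averaging step itself is deterministic, so any run-to-run variation in the test would have to come from the trained codes rather than from this computation.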



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by oefirouz <gi...@git.apache.org>.

Github user oefirouz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5596#discussion_r28739862
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -0,0 +1,238 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.AlphaComponent
    +import org.apache.spark.ml.{Estimator, Model}
    +import org.apache.spark.ml.param.{HasInputCol, ParamMap, Params, _}
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Params for [[Word2Vec]] and [[Word2VecModel]].
    + */
    +private[feature] trait Word2VecParams extends Params
    +  with HasInputCol with HasMaxIter with HasLearningRate {
    +
    +  /**
    +   * The dimension of the code that you want to transform from words.
    +   */
    +  val vectorSize = new IntParam(
    +    this, "vectorSize", "the dimension of codes after transforming from words", Some(100))
    +
    +  /** @group getParam */
    +  def getVectorSize: Int = get(vectorSize)
    +
    +  /**
    +   * Number of partitions for sentences of words.
    +   */
    +  val numPartitions = new IntParam(
    +    this, "numPartitions", "number of partitions for sentences of words", Some(1))
    +
    +  /** @group getParam */
    +  def getNumPartitions: Int = get(numPartitions)
    +
    +  /**
    +   * A random seed to random an initial vector.
    +   */
    +  val seed = new LongParam(
    +    this, "seed", "a random seed to random an initial vector", Some(Utils.random.nextLong()))
    +
    +  /** @group getParam */
    +  def getSeed: Long = get(seed)
    +
    +  /**
    +   * The minimum count of words that can be kept in training set.
    --- End diff --
    
    This wording is unclear; perhaps it would be easier to copy the comments from the implementation?
    
    For example: "The minimum number of times a token must appear to be included in the word2vec model's vocabulary"
    
    https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
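
The suggested wording can be read as a simple frequency filter. A minimal plain-Scala sketch of that reading follows; the object and helper names are hypothetical, and this is not the code from the linked implementation:

```scala
// Plain-Scala sketch of a minCount vocabulary filter; names are hypothetical.
object MinCountFilter {
  // Keep only the tokens that occur at least `minCount` times.
  def vocabulary(tokens: Seq[String], minCount: Int): Set[String] =
    tokens
      .groupBy(identity)
      .collect { case (word, occurrences) if occurrences.size >= minCount => word }
      .toSet

  def main(args: Array[String]): Unit = {
    val tokens = Seq("a", "b", "a", "c", "a", "b")
    println(vocabulary(tokens, minCount = 2))
  }
}
```

With the sample tokens above and `minCount = 2`, `c` appears only once and is dropped, while `a` and `b` are kept.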



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97561638
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95190641
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30758/
    Test FAILed.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95542936
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30830/
    Test PASSed.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97314155
  
    @mengxr Is it because the former Spark HEAD contained some errors? What if I merge the PR with the newest HEAD?



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-96874430
  
    I'm not sure if that was a valid failure. I'll try re-running the tests.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97517089
  
    @yinxusen The easiest way to fix merge issues is to update your master branch and then use "git rebase master" (and fix whatever conflicts come up).



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97561641
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31302/
    Test PASSed.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5596#discussion_r28745911
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -0,0 +1,238 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.AlphaComponent
    +import org.apache.spark.ml.{Estimator, Model}
    +import org.apache.spark.ml.param.{HasInputCol, ParamMap, Params, _}
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Params for [[Word2Vec]] and [[Word2VecModel]].
    + */
    +private[feature] trait Word2VecParams extends Params
    +  with HasInputCol with HasMaxIter with HasLearningRate {
    +
    +  /**
    +   * The dimension of the code that you want to transform from words.
    +   */
    +  val vectorSize = new IntParam(
    +    this, "vectorSize", "the dimension of codes after transforming from words", Some(100))
    +
    +  /** @group getParam */
    +  def getVectorSize: Int = get(vectorSize)
    +
    +  /**
    +   * Number of partitions for sentences of words.
    +   */
    +  val numPartitions = new IntParam(
    +    this, "numPartitions", "number of partitions for sentences of words", Some(1))
    +
    +  /** @group getParam */
    +  def getNumPartitions: Int = get(numPartitions)
    +
    +  /**
    +   * A random seed to random an initial vector.
    +   */
    +  val seed = new LongParam(
    +    this, "seed", "a random seed to random an initial vector", Some(Utils.random.nextLong()))
    +
    +  /** @group getParam */
    +  def getSeed: Long = get(seed)
    +
    +  /**
    +   * The minimum count of words that can be kept in training set.
    +   */
    +  val minCount = new IntParam(
    +    this, "minCount", "the minimum count of words to filter words", Some(5))
    +
    +  /** @group getParam */
    +  def getMinCount: Int = get(minCount)
    +
    +  /**
    +   * The column name of the output column - synonyms.
    +   */
    +  val synonymsCol = new Param[String](this, "synonymsCol", "Synonyms column name")
    +
    +  /** @group getParam */
    +  def getSynonymsCol: String = get(synonymsCol)
    +
    +  /**
    +   * The column name of the output column - code.
    +   */
    +  val codeCol = new Param[String](this, "codeCol", "Code column name")
    +
    +  /** @group getParam */
    +  def getCodeCol: String = get(codeCol)
    +
    +  /**
    +   * The number of synonyms that you want to have.
    +   */
    +  val numSynonyms = new IntParam(this, "numSynonyms", "number of synonyms to find", Some(0))
    +
    +  /** @group getParam */
    +  def getNumSynonyms: Int = get(numSynonyms)
    +}
    +
    +/**
    + * :: AlphaComponent ::
    + * Word2Vec trains a model of `Map(String, Vector)`, i.e. transforms a word into a code for further
    + * natural language processing or machine learning process.
    + */
    +@AlphaComponent
    +class Word2Vec extends Estimator[Word2VecModel] with Word2VecParams {
    +
    +  /** @group setParam */
    +  def setInputCol(value: String): this.type = set(inputCol, value)
    +
    +  /** @group setParam */
    +  def setVectorSize(value: Int) = set(vectorSize, value)
    +
    +  /** @group setParam */
    +  def setLearningRate(value: Double) = set(learningRate, value)
    +
    +  /** @group setParam */
    +  def setNumPartitions(value: Int) = set(numPartitions, value)
    +
    +  /** @group setParam */
    +  def setMaxIter(value: Int) = set(maxIter, value)
    +
    +  /** @group setParam */
    +  def setSeed(value: Long) = set(seed, value)
    +
    +  /** @group setParam */
    +  def setMinCount(value: Int) = set(minCount, value)
    +
    +  override def fit(dataset: DataFrame, paramMap: ParamMap): Word2VecModel = {
    +    transformSchema(dataset.schema, paramMap, logging = true)
    +    val map = this.paramMap ++ paramMap
    +    val input = dataset.select(map(inputCol)).map { case Row(v: Seq[String]) => v }
    +    val wordVectors = new feature.Word2Vec()
    +      .setLearningRate(map(learningRate))
    +      .setMinCount(map(minCount))
    +      .setNumIterations(map(maxIter))
    +      .setNumPartitions(map(numPartitions))
    +      .setSeed(map(seed))
    +      .setVectorSize(map(vectorSize))
    +      .fit(input)
    +    val model = new Word2VecModel(this, map, wordVectors)
    +    Params.inheritValues(map, this, model)
    +    model
    +  }
    +
    +  override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
    +    val map = this.paramMap ++ paramMap
    +    val inputType = schema(map(inputCol)).dataType
    +    require(inputType.isInstanceOf[ArrayType],
    +      s"Input column ${map(inputCol)} must be a Iterable[String] column")
    +    schema
    +  }
    +}
    +
    +/**
    + * :: AlphaComponent ::
    + * Model fitted by [[Word2Vec]].
    + */
    +@AlphaComponent
    +class Word2VecModel private[ml] (
    +    override val parent: Word2Vec,
    +    override val fittingParamMap: ParamMap,
    +    wordVectors: feature.Word2VecModel)
    +  extends Model[Word2VecModel] with Word2VecParams {
    +
    +  /** @group setParam */
    +  def setInputCol(value: String): this.type = set(inputCol, value)
    +
    +  /** @group setParam */
    +  def setSynonymsCol(value: String): this.type = set(synonymsCol, value)
    +
    +  /** @group setParam */
    +  def setNumSynonyms(value: Int): this.type = set(numSynonyms, value)
    +
    +  /** @group setParam */
    +  def setCodeCol(value: String): this.type = set(codeCol, value)
    +
    +  /**
    +   * The transforming process of `Word2Vec` model has two approaches - 1. Transform a word of
    +   * `String` into a code of `Vector`; 2. Find n (given by you) synonyms of a given word.
    +   *
    +   * Note. Currently we only support finding synonyms for word of `String`, not `Vector`.
    +   */
    +  override def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame = {
    +    transformSchema(dataset.schema, paramMap, logging = true)
    +    val map = this.paramMap ++ paramMap
    +
    +    var tmpData = dataset
    +    var numColsOutput = 0
    +
    +    if (map(codeCol) != "") {
    +      val word2vec: String => Vector = (word) => wordVectors.transform(word)
    +      tmpData = tmpData.withColumn(map(codeCol),
    +        callUDF(word2vec, new VectorUDT, col(map(inputCol))))
    +      numColsOutput += 1
    +    }
    +
    +    if (map(synonymsCol) != "" & map(numSynonyms) > 0) {
    +      // TODO We will add finding synonyms for code of `Vector`.
    +      val findSynonyms = udf { (word: String) =>
    +        wordVectors.findSynonyms(word, map(numSynonyms)).toMap : Map[String, Double]
    +      }
    +      tmpData = tmpData.withColumn(map(synonymsCol), findSynonyms(col(map(inputCol))))
    +      numColsOutput += 1
    +    }
    +
    +    if (numColsOutput == 0) {
    +      this.logWarning(s"$uid: Word2VecModel.transform() was called as NOOP" +
    +        s" since no output columns were set.")
    +    }
    +
    +    tmpData
    +  }
    +
    +  override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
    +    val map = this.paramMap ++ paramMap
    +
    +    val inputType = schema(map(inputCol)).dataType
    +    require(inputType.isInstanceOf[StringType],
    +      s"Input column ${map(inputCol)} must be a string column")
    +
    +    var outputFields = schema.fields
    +
    +    if (map(codeCol) != "") {
    +      require(!schema.fieldNames.contains(map(codeCol)),
    +        s"Output column ${map(codeCol)} already exists.")
    +      outputFields = outputFields :+ StructField(map(codeCol), new VectorUDT, nullable = false)
    +    }
    +
    +    if (map(synonymsCol) != "") {
    +      require(!schema.fieldNames.contains(map(synonymsCol)),
    +        s"Output column ${map(synonymsCol)} already exists.")
    +      require(map(numSynonyms) > 0,
    +        s"Number of synonyms should larger than 0")
    --- End diff --
    
    Typo: this should read "should be larger than 0".



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-96856458
  
      [Test build #31078 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31078/consoleFull) for PR 5596 at commit [`23d77fa`](https://github.com/apache/spark/commit/23d77fae32cbc59af869b4346f7e5c8a966d3678).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase `
      * `trait HasStepSize extends Params `
    
     * This patch does not change any dependencies.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97529296
  
     Merged build triggered.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97315270
  
    Build finished. Test PASSed.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5596#discussion_r28873658
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -0,0 +1,238 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.AlphaComponent
    +import org.apache.spark.ml.{Estimator, Model}
    +import org.apache.spark.ml.param.{HasInputCol, ParamMap, Params, _}
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Params for [[Word2Vec]] and [[Word2VecModel]].
    + */
    +private[feature] trait Word2VecParams extends Params
    +  with HasInputCol with HasMaxIter with HasLearningRate {
    +
    +  /**
    +   * The dimension of the code that you want to transform from words.
    +   */
    +  val vectorSize = new IntParam(
    +    this, "vectorSize", "the dimension of codes after transforming from words", Some(100))
    +
    +  /** @group getParam */
    +  def getVectorSize: Int = get(vectorSize)
    +
    +  /**
    +   * Number of partitions for sentences of words.
    +   */
    +  val numPartitions = new IntParam(
    +    this, "numPartitions", "number of partitions for sentences of words", Some(1))
    +
    +  /** @group getParam */
    +  def getNumPartitions: Int = get(numPartitions)
    +
    +  /**
    +   * A random seed to random an initial vector.
    +   */
    +  val seed = new LongParam(
    +    this, "seed", "a random seed to random an initial vector", Some(Utils.random.nextLong()))
    +
    +  /** @group getParam */
    +  def getSeed: Long = get(seed)
    +
    +  /**
    +   * The minimum count of words that can be kept in training set.
    +   */
    +  val minCount = new IntParam(
    +    this, "minCount", "the minimum count of words to filter words", Some(5))
    +
    +  /** @group getParam */
    +  def getMinCount: Int = get(minCount)
    +
    +  /**
    +   * The column name of the output column - synonyms.
    +   */
    +  val synonymsCol = new Param[String](this, "synonymsCol", "Synonyms column name")
    +
    +  /** @group getParam */
    +  def getSynonymsCol: String = get(synonymsCol)
    +
    +  /**
    +   * The column name of the output column - code.
    +   */
    +  val codeCol = new Param[String](this, "codeCol", "Code column name")
    +
    +  /** @group getParam */
    +  def getCodeCol: String = get(codeCol)
    +
    +  /**
    +   * The number of synonyms that you want to have.
    +   */
    +  val numSynonyms = new IntParam(this, "numSynonyms", "number of synonyms to find", Some(0))
    +
    +  /** @group getParam */
    +  def getNumSynonyms: Int = get(numSynonyms)
    +}
    +
    +/**
    + * :: AlphaComponent ::
    + * Word2Vec trains a model of `Map(String, Vector)`, i.e. transforms a word into a code for further
    + * natural language processing or machine learning process.
    + */
    +@AlphaComponent
    +class Word2Vec extends Estimator[Word2VecModel] with Word2VecParams {
    +
    +  /** @group setParam */
    +  def setInputCol(value: String): this.type = set(inputCol, value)
    +
    +  /** @group setParam */
    +  def setVectorSize(value: Int) = set(vectorSize, value)
    +
    +  /** @group setParam */
    +  def setLearningRate(value: Double) = set(learningRate, value)
    +
    +  /** @group setParam */
    +  def setNumPartitions(value: Int) = set(numPartitions, value)
    +
    +  /** @group setParam */
    +  def setMaxIter(value: Int) = set(maxIter, value)
    +
    +  /** @group setParam */
    +  def setSeed(value: Long) = set(seed, value)
    +
    +  /** @group setParam */
    +  def setMinCount(value: Int) = set(minCount, value)
    +
    +  override def fit(dataset: DataFrame, paramMap: ParamMap): Word2VecModel = {
    +    transformSchema(dataset.schema, paramMap, logging = true)
    +    val map = this.paramMap ++ paramMap
    +    val input = dataset.select(map(inputCol)).map { case Row(v: Seq[String]) => v }
    +    val wordVectors = new feature.Word2Vec()
    +      .setLearningRate(map(learningRate))
    +      .setMinCount(map(minCount))
    +      .setNumIterations(map(maxIter))
    +      .setNumPartitions(map(numPartitions))
    +      .setSeed(map(seed))
    +      .setVectorSize(map(vectorSize))
    +      .fit(input)
    +    val model = new Word2VecModel(this, map, wordVectors)
    +    Params.inheritValues(map, this, model)
    +    model
    +  }
    +
    +  override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
    +    val map = this.paramMap ++ paramMap
    +    val inputType = schema(map(inputCol)).dataType
    +    require(inputType.isInstanceOf[ArrayType],
    +      s"Input column ${map(inputCol)} must be a Iterable[String] column")
    --- End diff --
    
    Sure, will do it.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97529323
  
    Merged build started.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95189684
  
    @jkbradley I added `stepSize` as a shared param in the codegen file.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97567675
  
    It looks like the rebase worked.  I'll let @mengxr give the final OK since he's done the review.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5596#discussion_r28745764
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -0,0 +1,238 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.AlphaComponent
    +import org.apache.spark.ml.{Estimator, Model}
    +import org.apache.spark.ml.param.{HasInputCol, ParamMap, Params, _}
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Params for [[Word2Vec]] and [[Word2VecModel]].
    + */
    +private[feature] trait Word2VecParams extends Params
    +  with HasInputCol with HasMaxIter with HasLearningRate {
    +
    +  /**
    +   * The dimension of the code that you want to transform from words.
    +   */
    +  val vectorSize = new IntParam(
    +    this, "vectorSize", "the dimension of codes after transforming from words", Some(100))
    +
    +  /** @group getParam */
    +  def getVectorSize: Int = get(vectorSize)
    +
    +  /**
    +   * Number of partitions for sentences of words.
    +   */
    +  val numPartitions = new IntParam(
    +    this, "numPartitions", "number of partitions for sentences of words", Some(1))
    +
    +  /** @group getParam */
    +  def getNumPartitions: Int = get(numPartitions)
    +
    +  /**
    +   * A random seed to random an initial vector.
    +   */
    +  val seed = new LongParam(
    +    this, "seed", "a random seed to random an initial vector", Some(Utils.random.nextLong()))
    +
    +  /** @group getParam */
    +  def getSeed: Long = get(seed)
    +
    +  /**
    +   * The minimum count of words that can be kept in training set.
    +   */
    +  val minCount = new IntParam(
    +    this, "minCount", "the minimum count of words to filter words", Some(5))
    +
    +  /** @group getParam */
    +  def getMinCount: Int = get(minCount)
    +
    +  /**
    +   * The column name of the output column - synonyms.
    +   */
    +  val synonymsCol = new Param[String](this, "synonymsCol", "Synonyms column name")
    +
    +  /** @group getParam */
    +  def getSynonymsCol: String = get(synonymsCol)
    +
    +  /**
    +   * The column name of the output column - code.
    +   */
    +  val codeCol = new Param[String](this, "codeCol", "Code column name")
    +
    +  /** @group getParam */
    +  def getCodeCol: String = get(codeCol)
    +
    +  /**
    +   * The number of synonyms that you want to have.
    +   */
    +  val numSynonyms = new IntParam(this, "numSynonyms", "number of synonyms to find", Some(0))
    +
    +  /** @group getParam */
    +  def getNumSynonyms: Int = get(numSynonyms)
    +}
    +
    +/**
    + * :: AlphaComponent ::
    + * Word2Vec trains a model of `Map(String, Vector)`, i.e. transforms a word into a code for further
    + * natural language processing or machine learning process.
    + */
    +@AlphaComponent
    +class Word2Vec extends Estimator[Word2VecModel] with Word2VecParams {
    +
    +  /** @group setParam */
    +  def setInputCol(value: String): this.type = set(inputCol, value)
    +
    +  /** @group setParam */
    +  def setVectorSize(value: Int) = set(vectorSize, value)
    +
    +  /** @group setParam */
    +  def setLearningRate(value: Double) = set(learningRate, value)
    +
    +  /** @group setParam */
    +  def setNumPartitions(value: Int) = set(numPartitions, value)
    +
    +  /** @group setParam */
    +  def setMaxIter(value: Int) = set(maxIter, value)
    +
    +  /** @group setParam */
    +  def setSeed(value: Long) = set(seed, value)
    +
    +  /** @group setParam */
    +  def setMinCount(value: Int) = set(minCount, value)
    +
    +  override def fit(dataset: DataFrame, paramMap: ParamMap): Word2VecModel = {
    +    transformSchema(dataset.schema, paramMap, logging = true)
    +    val map = this.paramMap ++ paramMap
    +    val input = dataset.select(map(inputCol)).map { case Row(v: Seq[String]) => v }
    +    val wordVectors = new feature.Word2Vec()
    +      .setLearningRate(map(learningRate))
    +      .setMinCount(map(minCount))
    +      .setNumIterations(map(maxIter))
    +      .setNumPartitions(map(numPartitions))
    +      .setSeed(map(seed))
    +      .setVectorSize(map(vectorSize))
    +      .fit(input)
    +    val model = new Word2VecModel(this, map, wordVectors)
    +    Params.inheritValues(map, this, model)
    +    model
    +  }
    +
    +  override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
    +    val map = this.paramMap ++ paramMap
    +    val inputType = schema(map(inputCol)).dataType
    +    require(inputType.isInstanceOf[ArrayType],
    +      s"Input column ${map(inputCol)} must be a Iterable[String] column")
    --- End diff --
    
    Also check the elementType of inputType?
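
One way to act on this suggestion is to pattern match on the column's `ArrayType`, so that the element type is checked together with the container type. The following is only a minimal sketch, not the code that was eventually merged: the helper name `isStringArrayColumn` and the column names are made up for illustration, and it assumes `spark-sql` is on the classpath.

```scala
import org.apache.spark.sql.types._

object ElementTypeCheck {
  // Hypothetical helper: true only when the column is an array whose elements are strings.
  def isStringArrayColumn(schema: StructType, colName: String): Boolean =
    schema(colName).dataType match {
      case ArrayType(StringType, _) => true // containsNull is deliberately ignored
      case _ => false
    }

  def main(args: Array[String]): Unit = {
    // Illustrative schema: one valid input column and one with the wrong element type.
    val schema = StructType(Seq(
      StructField("words", ArrayType(StringType, containsNull = true)),
      StructField("ids", ArrayType(IntegerType, containsNull = false))))
    require(isStringArrayColumn(schema, "words"), "words must be an Array[String] column")
    println(isStringArrayColumn(schema, "ids"))
  }
}
```

Matching on `ArrayType(StringType, _)` rejects both non-array columns and arrays of other element types in a single check, which `isInstanceOf[ArrayType]` alone does not.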



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95946705
  
    @mengxr ready to review



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95981773
  
      [Test build #30942 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30942/consoleFull) for   PR 5596 at commit [`66e7bd3`](https://github.com/apache/spark/commit/66e7bd3ca4ac0e0f96498d13123c702a323afbff).



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-96011239
  
      [Test build #30942 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30942/consoleFull) for   PR 5596 at commit [`66e7bd3`](https://github.com/apache/spark/commit/66e7bd3ca4ac0e0f96498d13123c702a323afbff).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase `
      * `trait HasStepSize extends Params `
    
     * This patch does not change any dependencies.
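
For readers unfamiliar with the `Word2Vec` estimator listed above, the sketch below shows how such an estimator is typically driven from a DataFrame. It is an assumption-laden illustration rather than the API as it stood in this PR: it uses the `SparkSession` entry point and `org.apache.spark.ml.linalg.Vector` from later Spark releases (2.0 and up), and the column names `text` and `result` are arbitrary.

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

object Word2VecUsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("word2vec-sketch")
      .getOrCreate()

    // Each row is one document, represented as a sequence of words.
    val docs = spark.createDataFrame(Seq(
      "Hi I heard about Spark".split(" "),
      "I wish Java could use case classes".split(" "),
      "Logistic regression models are neat".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    // minCount = 0 keeps every word of this tiny corpus in the vocabulary.
    val word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)

    val model = word2Vec.fit(docs)      // Estimator -> Word2VecModel
    val result = model.transform(docs)  // adds one vector per document
    result.select("result").collect().foreach(row => println(row.getAs[Vector](0)))

    spark.stop()
  }
}
```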



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-94528814
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30592/



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5596#discussion_r29059168
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -0,0 +1,190 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.AlphaComponent
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.ml.{Estimator, Model}
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{VectorUDT, Vectors}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.sql.{DataFrame, Row}
    +
    +/**
    + * Params for [[Word2Vec]] and [[Word2VecModel]].
    + */
    +private[feature] trait Word2VecBase extends Params
    +  with HasInputCol with HasOutputCol with HasMaxIter with HasStepSize {
    +
    +  /**
    +   * The dimension of the code that you want to transform from words.
    +   */
    +  final val vectorSize = new IntParam(
    +    this, "vectorSize", "the dimension of codes after transforming from words")
    +
    +  setDefault(vectorSize -> 100)
    +
    +  /** @group getParam */
    +  def getVectorSize: Int = getOrDefault(vectorSize)
    +
    +  /**
    +   * Number of partitions for sentences of words.
    +   */
    +  final val numPartitions = new IntParam(
    +    this, "numPartitions", "number of partitions for sentences of words")
    +
    +  setDefault(numPartitions -> 1)
    +
    +  /** @group getParam */
    +  def getNumPartitions: Int = getOrDefault(numPartitions)
    +
    +  /**
    +   * A random seed to random an initial vector.
    +   */
    +  final val seed = new LongParam(this, "seed", "a random seed to random an initial vector")
    --- End diff --
    
    If https://github.com/apache/spark/pull/5626 gets merged first, please update this PR to use shared params.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97293054
  
     Build triggered.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97529580
  
      [Test build #31302 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31302/consoleFull) for   PR 5596 at commit [`ee2b37a`](https://github.com/apache/spark/commit/ee2b37ac50a816f7fa1c779a97cb2db881f0cc10).



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5596



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5596#discussion_r28745663
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -0,0 +1,238 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.AlphaComponent
    +import org.apache.spark.ml.{Estimator, Model}
    +import org.apache.spark.ml.param.{HasInputCol, ParamMap, Params, _}
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
    +import org.apache.spark.sql.{DataFrame, Row}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.util.Utils
    +
    +/**
    + * Params for [[Word2Vec]] and [[Word2VecModel]].
    + */
    +private[feature] trait Word2VecParams extends Params
    +  with HasInputCol with HasMaxIter with HasLearningRate {
    +
    +  /**
    +   * The dimension of the code that you want to transform from words.
    +   */
    +  val vectorSize = new IntParam(
    +    this, "vectorSize", "the dimension of codes after transforming from words", Some(100))
    +
    +  /** @group getParam */
    +  def getVectorSize: Int = get(vectorSize)
    +
    +  /**
    +   * Number of partitions for sentences of words.
    +   */
    +  val numPartitions = new IntParam(
    +    this, "numPartitions", "number of partitions for sentences of words", Some(1))
    +
    +  /** @group getParam */
    +  def getNumPartitions: Int = get(numPartitions)
    +
    +  /**
    +   * A random seed to random an initial vector.
    +   */
    +  val seed = new LongParam(
    +    this, "seed", "a random seed to random an initial vector", Some(Utils.random.nextLong()))
    +
    +  /** @group getParam */
    +  def getSeed: Long = get(seed)
    +
    +  /**
    +   * The minimum count of words that can be kept in training set.
    --- End diff --
    
    Sure, I'll refine it in the next commit. Thx 
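
The doc comment under discussion describes `minCount`. In the underlying MLlib `Word2Vec`, a word has to appear at least `minCount` times to be kept in the vocabulary. The plain-Scala sketch below illustrates that filtering idea only; it is not Spark's implementation, and the object and function names are hypothetical.

```scala
object MinCountSketch {
  // Keep only the words that occur at least `minCount` times across all sentences.
  def vocabulary(sentences: Seq[Seq[String]], minCount: Int): Set[String] =
    sentences.flatten
      .groupBy(identity)
      .collect { case (word, occurrences) if occurrences.size >= minCount => word }
      .toSet

  def main(args: Array[String]): Unit = {
    // "a" occurs three times, "b" twice, "c" once.
    val sentences = Seq(Seq("a", "b", "a"), Seq("a", "c", "b"))
    println(vocabulary(sentences, minCount = 2))
  }
}
```

With `minCount = 2`, the word `c` is dropped because it occurs only once, while `a` and `b` are retained.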



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97293061
  
    Build started.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95190631
  
      [Test build #30758 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30758/consoleFull) for   PR 5596 at commit [`b54399f`](https://github.com/apache/spark/commit/b54399f8729bcd71c3ded9708248f450c5fa192d).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase `
      * `trait HasStepSize extends Params `
    
     * This patch does not change any dependencies.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5596#discussion_r29060128
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -0,0 +1,190 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.AlphaComponent
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.ml.{Estimator, Model}
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{VectorUDT, Vectors}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.sql.{DataFrame, Row}
    +
    +/**
    + * Params for [[Word2Vec]] and [[Word2VecModel]].
    + */
    +private[feature] trait Word2VecBase extends Params
    +  with HasInputCol with HasOutputCol with HasMaxIter with HasStepSize {
    +
    +  /**
    +   * The dimension of the code that you want to transform from words.
    +   */
    +  final val vectorSize = new IntParam(
    +    this, "vectorSize", "the dimension of codes after transforming from words")
    +
    +  setDefault(vectorSize -> 100)
    +
    +  /** @group getParam */
    +  def getVectorSize: Int = getOrDefault(vectorSize)
    +
    +  /**
    +   * Number of partitions for sentences of words.
    +   */
    +  final val numPartitions = new IntParam(
    +    this, "numPartitions", "number of partitions for sentences of words")
    +
    +  setDefault(numPartitions -> 1)
    +
    +  /** @group getParam */
    +  def getNumPartitions: Int = getOrDefault(numPartitions)
    +
    +  /**
    +   * A random seed to random an initial vector.
    +   */
    +  final val seed = new LongParam(this, "seed", "a random seed to random an initial vector")
    --- End diff --
    
    sure



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5596#discussion_r28847697
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala ---
    @@ -0,0 +1,87 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.mllib.linalg.{Vector, Vectors}
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.mllib.util.TestingUtils._
    +import org.apache.spark.sql.{Row, SQLContext}
    +
    +class Word2VecSuite extends FunSuite with MLlibTestSparkContext {
    +
    +  test("Word2Vec") {
    +    val sqlContext = new SQLContext(sc)
    +    import sqlContext.implicits._
    +
    +    val sentence = "a b " * 100 + "a c " * 10
    +    val localDoc = Seq(sentence, sentence)
    +    val doc = sc.parallelize(localDoc)
    +      .map(line => line.split(" "))
    +    val docDF = doc.map(text => Tuple1(text)).toDF("text")
    +
    +    val model = new Word2Vec()
    +      .setVectorSize(3)
    +      .setSeed(42L)
    +      .setInputCol("text")
    +      .setMaxIter(1)
    +      .fit(docDF)
    +
    +    val words = sc.parallelize(Seq("a", "b", "c"))
    +    val codes = Map(
    +      "a" -> Vectors.dense(-0.2811822295188904,-0.6356269121170044,-0.3020961284637451),
    +      "b" -> Vectors.dense(1.0309048891067505,-1.29472815990448,0.22276712954044342),
    +      "c" -> Vectors.dense(-0.08456747233867645,0.5137411952018738,0.11731560528278351)
    +    )
    +
    +    val synonyms = Map(
    +      "a" -> Map("b" -> 0.3680490553379059),
    +      "b" -> Map("a" -> 0.3680490553379059),
    +      "c" -> Map("b" -> -0.8148014545440674)
    +    )
    +    val wordsDF = words.map(word => Tuple3(word, codes(word), synonyms(word)))
    +      .toDF("word", "realCode", "realSynonyms")
    +
    +    val res = model
    +      .setInputCol("word")
    +      .setCodeCol("code")
    +      .setSynonymsCol("syn")
    +      .setNumSynonyms(1)
    +      .transform(wordsDF)
    +
    +    assert(
    --- End diff --
    
    Move the `assert` into the `foreach` so that a failure reports which row did not match.
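
One way to read this suggestion, as a minimal sketch: assert on each row inside a `foreach` and attach a message naming the offending word, rather than making one aggregate `assert` over the whole result. The names `PerItemAssert`, `approxEqual`, `checkAll`, and the tolerance are hypothetical, and the sketch uses plain Scala collections instead of the DataFrame from the test suite.

```scala
object PerItemAssert {
  // Compares two vectors element-wise within an absolute tolerance.
  def approxEqual(a: Seq[Double], b: Seq[Double], eps: Double): Boolean =
    a.length == b.length && a.zip(b).forall { case (x, y) => math.abs(x - y) <= eps }

  // Asserts each (word, expected, actual) triple separately, so a failure
  // message names the word whose vector did not match.
  def checkAll(rows: Seq[(String, Seq[Double], Seq[Double])], eps: Double = 1e-5): Unit =
    rows.foreach { case (word, expected, actual) =>
      assert(approxEqual(expected, actual, eps),
        s"word '$word': expected $expected but got $actual")
    }
}
```

With this shape, a mismatch for a single word surfaces in the error message instead of a bare "assertion failed" for the whole collection.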



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-96874740
  
      [Test build #727 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/727/consoleFull) for   PR 5596 at commit [`23d77fa`](https://github.com/apache/spark/commit/23d77fae32cbc59af869b4346f7e5c8a966d3678).



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-96855295
  
      [Test build #31078 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31078/consoleFull) for   PR 5596 at commit [`23d77fa`](https://github.com/apache/spark/commit/23d77fae32cbc59af869b4346f7e5c8a966d3678).



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-94633001
  
    @jkbradley Yes, I think `stepSize` is good enough. In that case we can break the naming consistency between `Word2Vec` in mllib and in ml: `learningRate` in the mllib package becomes `stepSize` in ml. There is one more naming difference, namely `numIterations` in mllib versus `maxIter` in ml.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95263894
  
    @yinxusen Run `dev/scalastyle`.
    The failure is not caused by the generated sharedParams, since style checking skips that file.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95190057
  
      [Test build #30758 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30758/consoleFull) for   PR 5596 at commit [`b54399f`](https://github.com/apache/spark/commit/b54399f8729bcd71c3ded9708248f450c5fa192d).



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5596#discussion_r29059173
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -0,0 +1,190 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.AlphaComponent
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.ml.{Estimator, Model}
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{VectorUDT, Vectors}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.sql.{DataFrame, Row}
    +
    +/**
    + * Params for [[Word2Vec]] and [[Word2VecModel]].
    + */
    +private[feature] trait Word2VecBase extends Params
    +  with HasInputCol with HasOutputCol with HasMaxIter with HasStepSize {
    +
    +  /**
    +   * The dimension of the code that you want to transform from words.
    +   */
    +  final val vectorSize = new IntParam(
    +    this, "vectorSize", "the dimension of codes after transforming from words")
    +
    +  setDefault(vectorSize -> 100)
    +
    +  /** @group getParam */
    +  def getVectorSize: Int = getOrDefault(vectorSize)
    +
    +  /**
    +   * Number of partitions for sentences of words.
    +   */
    +  final val numPartitions = new IntParam(
    +    this, "numPartitions", "number of partitions for sentences of words")
    +
    +  setDefault(numPartitions -> 1)
    +
    +  /** @group getParam */
    +  def getNumPartitions: Int = getOrDefault(numPartitions)
    +
    +  /**
    +   * A random seed to random an initial vector.
    +   */
    +  final val seed = new LongParam(this, "seed", "a random seed to random an initial vector")
    +
    +  setDefault(seed -> 42L)
    +
    +  /** @group getParam */
    +  def getSeed: Long = getOrDefault(seed)
    +
    +  /**
    +   * The minimum number of times a token must appear to be included in the word2vec model's
    +   * vocabulary.
    +   */
    +  final val minCount = new IntParam(this, "minCount", "the minimum number of times a token must " +
    +    "appear to be included in the word2vec model's vocabulary")
    +
    +  setDefault(minCount -> 5)
    +
    +  /** @group getParam */
    +  def getMinCount: Int = getOrDefault(minCount)
    +
    +  setDefault(stepSize -> 0.025)
    +  setDefault(maxIter -> 1)
    +
    +  /**
    +   * Validate and transform the input schema.
    +   */
    +  protected def validateAndTransformSchema(schema: StructType, paramMap: ParamMap): StructType = {
    +    val map = extractParamMap(paramMap)
    +    SchemaUtils.checkColumnType(schema, map(inputCol), new ArrayType(StringType, true))
    +    SchemaUtils.appendColumn(schema, map(outputCol), new VectorUDT)
    +  }
    +}
    +
    +/**
    + * :: AlphaComponent ::
    + * Word2Vec trains a model of `Map(String, Vector)`, i.e. transforms a word into a code for further
    + * natural language processing or machine learning process.
    + */
    +@AlphaComponent
    +final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase {
    +
    +  /** @group setParam */
    +  def setInputCol(value: String): this.type = set(inputCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  /** @group setParam */
    +  def setVectorSize(value: Int): this.type = set(vectorSize, value)
    +
    +  /** @group setParam */
    +  def setStepSize(value: Double): this.type = set(stepSize, value)
    +
    +  /** @group setParam */
    +  def setNumPartitions(value: Int): this.type = set(numPartitions, value)
    +
    +  /** @group setParam */
    +  def setMaxIter(value: Int): this.type = set(maxIter, value)
    +
    +  /** @group setParam */
    +  def setSeed(value: Long): this.type = set(seed, value)
    +
    +  /** @group setParam */
    +  def setMinCount(value: Int): this.type = set(minCount, value)
    +
    +  override def fit(dataset: DataFrame, paramMap: ParamMap): Word2VecModel = {
    +    transformSchema(dataset.schema, paramMap, logging = true)
    +    val map = extractParamMap(paramMap)
    +    val input = dataset.select(map(inputCol)).map { case Row(v: Seq[String]) => v }
    +    val wordVectors = new feature.Word2Vec()
    +      .setLearningRate(map(stepSize))
    +      .setMinCount(map(minCount))
    +      .setNumIterations(map(maxIter))
    +      .setNumPartitions(map(numPartitions))
    +      .setSeed(map(seed))
    +      .setVectorSize(map(vectorSize))
    +      .fit(input)
    +    val model = new Word2VecModel(this, map, wordVectors)
    +    Params.inheritValues(map, this, model)
    +    model
    +  }
    +
    +  override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
    +    validateAndTransformSchema(schema, paramMap)
    +  }
    +}
    +
    +/**
    + * :: AlphaComponent ::
    + * Model fitted by [[Word2Vec]].
    + */
    +@AlphaComponent
    +class Word2VecModel private[ml] (
    +    override val parent: Word2Vec,
    +    override val fittingParamMap: ParamMap,
    +    wordVectors: feature.Word2VecModel)
    +  extends Model[Word2VecModel] with Word2VecBase {
    +
    +  /** @group setParam */
    +  def setInputCol(value: String): this.type = set(inputCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  /**
    +   * Transform a sentence column to a vector column to represent the whole sentence. The transform
    +   * is performed by averaging all word vectors it contains.
    +   */
    +  override def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame = {
    +    transformSchema(dataset.schema, paramMap, logging = true)
    +    val map = extractParamMap(paramMap)
    +    val bWordVectors = dataset.sqlContext.sparkContext.broadcast(wordVectors)
    +    val word2Vec = udf { v: Seq[String] =>
    +      if (v.size == 0) {
    +        Vectors.zeros(map(vectorSize))
    +      } else {
    +        Vectors.dense(
    +          v.map(bWordVectors.value.getVectors).foldLeft(Array.fill[Double](map(vectorSize))(0)) {
    +            (cum, vec) => cum.zip(vec).map(x => x._1 + x._2)
    --- End diff --
    
    Use BLAS's `axpy` and skip words that are not in the model. The functional style usually creates temporary objects, which hurts performance.
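
A minimal sketch of this suggestion, assuming each word vector is stored as an `Array[Float]`. The `axpy` below is a hand-written stand-in for the BLAS routine (y := a * x + y, updated in place), and `AverageWordVectors`, `average`, and the choice to divide by the number of words actually found are illustrative assumptions, not the PR's code.

```scala
object AverageWordVectors {
  // Stand-in for BLAS axpy: y := a * x + y, updating y in place (no temporaries).
  def axpy(a: Double, x: Array[Float], y: Array[Double]): Unit = {
    require(x.length == y.length, "dimension mismatch")
    var i = 0
    while (i < x.length) {
      y(i) += a * x(i)
      i += 1
    }
  }

  // Averages the vectors of the words present in `vectors`.
  // Words missing from the model are skipped; if none are found, the result is all zeros.
  def average(
      sentence: Seq[String],
      vectors: Map[String, Array[Float]],
      size: Int): Array[Double] = {
    val sum = new Array[Double](size)
    var found = 0
    sentence.foreach { word =>
      vectors.get(word).foreach { v =>
        axpy(1.0, v, sum)
        found += 1
      }
    }
    if (found > 0) {
      var i = 0
      while (i < size) {
        sum(i) /= found
        i += 1
      }
    }
    sum
  }
}
```

Compared with the `foldLeft` plus `zip` and `map` version in the diff, this accumulates into a single preallocated array, so no intermediate array is created per word.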



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97315272
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31225/



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95192973
  
    @mengxr Maybe the generated file `sharedParams` breaks the maximum line length rule.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95981895
  
    @mengxr Fixed the problems. I ran the test 10 times and it looks good.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-96769165
  
      [Test build #30998 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30998/consoleFull) for   PR 5596 at commit [`23d77fa`](https://github.com/apache/spark/commit/23d77fae32cbc59af869b4346f7e5c8a966d3678).



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5596#discussion_r29059360
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -0,0 +1,190 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.AlphaComponent
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.ml.{Estimator, Model}
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{VectorUDT, Vectors}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.sql.{DataFrame, Row}
    +
    +/**
    + * Params for [[Word2Vec]] and [[Word2VecModel]].
    + */
    +private[feature] trait Word2VecBase extends Params
    +  with HasInputCol with HasOutputCol with HasMaxIter with HasStepSize {
    +
    +  /**
    +   * The dimension of the code that you want to transform from words.
    +   */
    +  final val vectorSize = new IntParam(
    +    this, "vectorSize", "the dimension of codes after transforming from words")
    +
    +  setDefault(vectorSize -> 100)
    +
    +  /** @group getParam */
    +  def getVectorSize: Int = getOrDefault(vectorSize)
    +
    +  /**
    +   * Number of partitions for sentences of words.
    +   */
    +  final val numPartitions = new IntParam(
    +    this, "numPartitions", "number of partitions for sentences of words")
    +
    +  setDefault(numPartitions -> 1)
    +
    +  /** @group getParam */
    +  def getNumPartitions: Int = getOrDefault(numPartitions)
    +
    +  /**
    +   * A random seed to random an initial vector.
    +   */
    +  final val seed = new LongParam(this, "seed", "a random seed to random an initial vector")
    +
    +  setDefault(seed -> 42L)
    +
    +  /** @group getParam */
    +  def getSeed: Long = getOrDefault(seed)
    +
    +  /**
    +   * The minimum number of times a token must appear to be included in the word2vec model's
    +   * vocabulary.
    +   */
    +  final val minCount = new IntParam(this, "minCount", "the minimum number of times a token must " +
    +    "appear to be included in the word2vec model's vocabulary")
    +
    +  setDefault(minCount -> 5)
    +
    +  /** @group getParam */
    +  def getMinCount: Int = getOrDefault(minCount)
    +
    +  setDefault(stepSize -> 0.025)
    +  setDefault(maxIter -> 1)
    +
    +  /**
    +   * Validate and transform the input schema.
    +   */
    +  protected def validateAndTransformSchema(schema: StructType, paramMap: ParamMap): StructType = {
    +    val map = extractParamMap(paramMap)
    +    SchemaUtils.checkColumnType(schema, map(inputCol), new ArrayType(StringType, true))
    +    SchemaUtils.appendColumn(schema, map(outputCol), new VectorUDT)
    +  }
    +}
    +
    +/**
    + * :: AlphaComponent ::
    + * Word2Vec trains a model of `Map(String, Vector)`, i.e. transforms a word into a code for further
    + * natural language processing or machine learning process.
    + */
    +@AlphaComponent
    +final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase {
    +
    +  /** @group setParam */
    +  def setInputCol(value: String): this.type = set(inputCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  /** @group setParam */
    +  def setVectorSize(value: Int): this.type = set(vectorSize, value)
    +
    +  /** @group setParam */
    +  def setStepSize(value: Double): this.type = set(stepSize, value)
    +
    +  /** @group setParam */
    +  def setNumPartitions(value: Int): this.type = set(numPartitions, value)
    +
    +  /** @group setParam */
    +  def setMaxIter(value: Int): this.type = set(maxIter, value)
    +
    +  /** @group setParam */
    +  def setSeed(value: Long): this.type = set(seed, value)
    +
    +  /** @group setParam */
    +  def setMinCount(value: Int): this.type = set(minCount, value)
    +
    +  override def fit(dataset: DataFrame, paramMap: ParamMap): Word2VecModel = {
    +    transformSchema(dataset.schema, paramMap, logging = true)
    +    val map = extractParamMap(paramMap)
    +    val input = dataset.select(map(inputCol)).map { case Row(v: Seq[String]) => v }
    +    val wordVectors = new feature.Word2Vec()
    +      .setLearningRate(map(stepSize))
    +      .setMinCount(map(minCount))
    +      .setNumIterations(map(maxIter))
    +      .setNumPartitions(map(numPartitions))
    +      .setSeed(map(seed))
    +      .setVectorSize(map(vectorSize))
    +      .fit(input)
    +    val model = new Word2VecModel(this, map, wordVectors)
    +    Params.inheritValues(map, this, model)
    +    model
    +  }
    +
    +  override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
    +    validateAndTransformSchema(schema, paramMap)
    +  }
    +}
    +
    +/**
    + * :: AlphaComponent ::
    + * Model fitted by [[Word2Vec]].
    + */
    +@AlphaComponent
    +class Word2VecModel private[ml] (
    +    override val parent: Word2Vec,
    +    override val fittingParamMap: ParamMap,
    +    wordVectors: feature.Word2VecModel)
    +  extends Model[Word2VecModel] with Word2VecBase {
    +
    +  /** @group setParam */
    +  def setInputCol(value: String): this.type = set(inputCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  /**
    +   * Transform a sentence column to a vector column to represent the whole sentence. The transform
    +   * is performed by averaging all word vectors it contains.
    +   */
    +  override def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame = {
    +    transformSchema(dataset.schema, paramMap, logging = true)
    +    val map = extractParamMap(paramMap)
    +    val bWordVectors = dataset.sqlContext.sparkContext.broadcast(wordVectors)
    +    val word2Vec = udf { v: Seq[String] =>
    +      if (v.size == 0) {
    +        Vectors.zeros(map(vectorSize))
    --- End diff --
    
    Output a sparse vector.
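
To illustrate why a sparse vector suits the empty-sentence case, here is a self-contained sketch. `SparseVec` and `SparseZero` are hypothetical stand-ins, not Spark types; in MLlib the corresponding call would presumably be `Vectors.sparse(size, Array.empty[Int], Array.empty[Double])`.

```scala
object SparseZero {
  // Minimal stand-in for a sparse vector: a logical size plus parallel arrays
  // holding the indices and values of the non-zero entries.
  final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) {
    // Expands to a dense array; only needed when a consumer requires it.
    def toDense: Array[Double] = {
      val out = new Array[Double](size)
      var i = 0
      while (i < indices.length) {
        out(indices(i)) = values(i)
        i += 1
      }
      out
    }
  }

  // An all-zero vector stores no entries, whatever its logical size.
  def zeros(size: Int): SparseVec =
    SparseVec(size, Array.empty[Int], Array.empty[Double])
}
```

A dense zero vector of size `vectorSize` allocates one `Double` per dimension for every empty sentence, while the sparse form stores only the size.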



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97561582
  
      [Test build #31302 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31302/consoleFull) for   PR 5596 at commit [`ee2b37a`](https://github.com/apache/spark/commit/ee2b37ac50a816f7fa1c779a97cb2db881f0cc10).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase `
      * `trait HasStepSize extends Params `
    
     * This patch does not change any dependencies.



[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97292571
  
    test this please


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-94618313
  
    A few initial thoughts:
    * ```learningRate```: This is the same as ```stepSize```.  We should probably only use one of the two.  I'm OK with either but would vote for ```stepSize``` if I had to choose, since that's more common in the optimization literature, which ML is referencing more and more.  Ping @mengxr -- opinions?
    * Very good point about the different input column types.  I'm adding a note to the JIRA about it since it's a design issue we should discuss a little.


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95542923
  
      [Test build #30830 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30830/consoleFull) for   PR 5596 at commit [`566ec20`](https://github.com/apache/spark/commit/566ec202f6fc0975041f25d6640cf60154b7d6e8).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase `
      * `trait HasStepSize extends Params `
    
     * This patch does not change any dependencies.


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95514660
  
    @jkbradley Thanks! Can't believe I only learned about the tool just now.


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-96854811
  
    test this please


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-96392739
  
    @mengxr I have merged it with #5626. You can retest it when possible.


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-95189512
  
    @mengxr As we discussed, I average the vectors of all words in a sentence to produce the output column.


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97293130
  
      [Test build #31225 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31225/consoleFull) for   PR 5596 at commit [`23d77fa`](https://github.com/apache/spark/commit/23d77fae32cbc59af869b4346f7e5c8a966d3678).


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5596#discussion_r29059175
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -0,0 +1,190 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.AlphaComponent
    +import org.apache.spark.ml.param._
    +import org.apache.spark.ml.param.shared._
    +import org.apache.spark.ml.util.SchemaUtils
    +import org.apache.spark.ml.{Estimator, Model}
    +import org.apache.spark.mllib.feature
    +import org.apache.spark.mllib.linalg.{VectorUDT, Vectors}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types._
    +import org.apache.spark.sql.{DataFrame, Row}
    +
    +/**
    + * Params for [[Word2Vec]] and [[Word2VecModel]].
    + */
    +private[feature] trait Word2VecBase extends Params
    +  with HasInputCol with HasOutputCol with HasMaxIter with HasStepSize {
    +
    +  /**
    +   * The dimension of the code that you want to transform from words.
    +   */
    +  final val vectorSize = new IntParam(
    +    this, "vectorSize", "the dimension of codes after transforming from words")
    +
    +  setDefault(vectorSize -> 100)
    +
    +  /** @group getParam */
    +  def getVectorSize: Int = getOrDefault(vectorSize)
    +
    +  /**
    +   * Number of partitions for sentences of words.
    +   */
    +  final val numPartitions = new IntParam(
    +    this, "numPartitions", "number of partitions for sentences of words")
    +
    +  setDefault(numPartitions -> 1)
    +
    +  /** @group getParam */
    +  def getNumPartitions: Int = getOrDefault(numPartitions)
    +
    +  /**
    +   * A random seed to random an initial vector.
    +   */
    +  final val seed = new LongParam(this, "seed", "a random seed to random an initial vector")
    +
    +  setDefault(seed -> 42L)
    +
    +  /** @group getParam */
    +  def getSeed: Long = getOrDefault(seed)
    +
    +  /**
    +   * The minimum number of times a token must appear to be included in the word2vec model's
    +   * vocabulary.
    +   */
    +  final val minCount = new IntParam(this, "minCount", "the minimum number of times a token must " +
    +    "appear to be included in the word2vec model's vocabulary")
    +
    +  setDefault(minCount -> 5)
    +
    +  /** @group getParam */
    +  def getMinCount: Int = getOrDefault(minCount)
    +
    +  setDefault(stepSize -> 0.025)
    +  setDefault(maxIter -> 1)
    +
    +  /**
    +   * Validate and transform the input schema.
    +   */
    +  protected def validateAndTransformSchema(schema: StructType, paramMap: ParamMap): StructType = {
    +    val map = extractParamMap(paramMap)
    +    SchemaUtils.checkColumnType(schema, map(inputCol), new ArrayType(StringType, true))
    +    SchemaUtils.appendColumn(schema, map(outputCol), new VectorUDT)
    +  }
    +}
    +
    +/**
    + * :: AlphaComponent ::
    + * Word2Vec trains a model of `Map(String, Vector)`, i.e. transforms a word into a code for further
    + * natural language processing or machine learning process.
    + */
    +@AlphaComponent
    +final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase {
    +
    +  /** @group setParam */
    +  def setInputCol(value: String): this.type = set(inputCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  /** @group setParam */
    +  def setVectorSize(value: Int): this.type = set(vectorSize, value)
    +
    +  /** @group setParam */
    +  def setStepSize(value: Double): this.type = set(stepSize, value)
    +
    +  /** @group setParam */
    +  def setNumPartitions(value: Int): this.type = set(numPartitions, value)
    +
    +  /** @group setParam */
    +  def setMaxIter(value: Int): this.type = set(maxIter, value)
    +
    +  /** @group setParam */
    +  def setSeed(value: Long): this.type = set(seed, value)
    +
    +  /** @group setParam */
    +  def setMinCount(value: Int): this.type = set(minCount, value)
    +
    +  override def fit(dataset: DataFrame, paramMap: ParamMap): Word2VecModel = {
    +    transformSchema(dataset.schema, paramMap, logging = true)
    +    val map = extractParamMap(paramMap)
    +    val input = dataset.select(map(inputCol)).map { case Row(v: Seq[String]) => v }
    +    val wordVectors = new feature.Word2Vec()
    +      .setLearningRate(map(stepSize))
    +      .setMinCount(map(minCount))
    +      .setNumIterations(map(maxIter))
    +      .setNumPartitions(map(numPartitions))
    +      .setSeed(map(seed))
    +      .setVectorSize(map(vectorSize))
    +      .fit(input)
    +    val model = new Word2VecModel(this, map, wordVectors)
    +    Params.inheritValues(map, this, model)
    +    model
    +  }
    +
    +  override def transformSchema(schema: StructType, paramMap: ParamMap): StructType = {
    +    validateAndTransformSchema(schema, paramMap)
    +  }
    +}
    +
    +/**
    + * :: AlphaComponent ::
    + * Model fitted by [[Word2Vec]].
    + */
    +@AlphaComponent
    +class Word2VecModel private[ml] (
    +    override val parent: Word2Vec,
    +    override val fittingParamMap: ParamMap,
    +    wordVectors: feature.Word2VecModel)
    +  extends Model[Word2VecModel] with Word2VecBase {
    +
    +  /** @group setParam */
    +  def setInputCol(value: String): this.type = set(inputCol, value)
    +
    +  /** @group setParam */
    +  def setOutputCol(value: String): this.type = set(outputCol, value)
    +
    +  /**
    +   * Transform a sentence column to a vector column to represent the whole sentence. The transform
    +   * is performed by averaging all word vectors it contains.
    +   */
    +  override def transform(dataset: DataFrame, paramMap: ParamMap): DataFrame = {
    +    transformSchema(dataset.schema, paramMap, logging = true)
    +    val map = extractParamMap(paramMap)
    +    val bWordVectors = dataset.sqlContext.sparkContext.broadcast(wordVectors)
    +    val word2Vec = udf { v: Seq[String] =>
    +      if (v.size == 0) {
    +        Vectors.zeros(map(vectorSize))
    +      } else {
    +        Vectors.dense(
    +          v.map(bWordVectors.value.getVectors).foldLeft(Array.fill[Double](map(vectorSize))(0)) {
    +            (cum, vec) => cum.zip(vec).map(x => x._1 + x._2)
    +          }.map(_ / v.size)
    --- End diff --
    
    Similar here. Use BLAS's dscal.
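
A rough sketch of the accumulate-then-scale pattern this comment points at, written in plain Scala so it runs without Spark. The `axpy` and `scal` helpers below are hand-written stand-ins that follow the contracts of the BLAS level-1 routines `daxpy` and `dscal`; they are not Spark's or netlib's API. The object name, the `Map[String, Array[Double]]` lookup, and the use of `Double` rather than the `Float` arrays the MLlib model stores are all assumptions made for illustration.

```scala
object AverageWordVectors {
  /** y := a * x + y, updated in place (same contract as BLAS daxpy). */
  def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
    require(x.length == y.length, "dimension mismatch")
    var i = 0
    while (i < y.length) {
      y(i) += a * x(i)
      i += 1
    }
  }

  /** x := a * x, updated in place (same contract as BLAS dscal). */
  def scal(a: Double, x: Array[Double]): Unit = {
    var i = 0
    while (i < x.length) {
      x(i) *= a
      i += 1
    }
  }

  /**
   * Average the vectors of all words in a sentence using one accumulator.
   * Returns an all-zero array for an empty sentence. Like the quoted diff,
   * this throws if a word is missing from the lookup table.
   */
  def average(
      sentence: Seq[String],
      vectors: Map[String, Array[Double]],
      dim: Int): Array[Double] = {
    val sum = new Array[Double](dim)
    if (sentence.nonEmpty) {
      sentence.foreach(word => axpy(1.0, vectors(word), sum))
      scal(1.0 / sentence.size, sum)
    }
    sum
  }
}
```

Compared with the `foldLeft` over `zip` and `map` in the quoted diff, this allocates a single array per sentence instead of two new collections per word, and the final division becomes one in-place scaling pass.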


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by yinxusen <gi...@git.apache.org>.

Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97529316
  
    Well ... I just rebased it. Can you merge it? @jkbradley If not, I will open a new PR and close this one.


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97315263
  
      [Test build #31225 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31225/consoleFull) for   PR 5596 at commit [`23d77fa`](https://github.com/apache/spark/commit/23d77fae32cbc59af869b4346f7e5c8a966d3678).
     * This patch **passes all tests**.
     * This patch **does not merge cleanly**.
     * This patch adds the following public classes _(experimental)_:
      * `final class Word2Vec extends Estimator[Word2VecModel] with Word2VecBase `
      * `trait HasStepSize extends Params `
    
     * This patch does not change any dependencies.


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-94642692
  
    @yinxusen  True, we'll have to introduce some inconsistencies between .ml and .mllib no matter what.  For iterations, I like "max" since it's more precise than "num" for some algorithms.


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-97600239
  
    Okay, LGTM. I verified the diff on my machine and merged this into master. Thanks!


[GitHub] spark pull request: [ML][SPARK-6529] Add Word2Vec transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5596#issuecomment-96011257
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30942/

