Posted to reviews@spark.apache.org by burakkose <gi...@git.apache.org> on 2016/03/22 01:02:42 UTC

[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

GitHub user burakkose opened a pull request:

    https://github.com/apache/spark/pull/11871

    [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover

    Apache Spark is a global engine, so it should appeal to as many users as possible. I added multiple-language support to StopWordsRemover, using NLTK's lists for every language except English (the English list is unchanged). I also added some helper methods such as setLanguage, setAdditionalWords and setIgnoredWords.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/burakkose/spark StopWordsImp

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11871.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11871
    
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199830083
  
    @burakkose the `LICENSE` file and other license info probably need updating if you're including this. I can help if you'll point me to the source of this data and its license.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r57397655
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -31,51 +31,16 @@ import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
      */
     private[spark] object StopWords {
    --- End diff --
    
    This class is package private and hence users do not have access to `readStopWords`. We can either make it public or move the methods to `object StopWordsRemover`. If we take the former approach, we should check Java compatibility. I like the latter approach better.
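
The second option above (moving the loader onto the public companion object) can be sketched in plain Scala without any Spark dependency. This is only an illustration: `StopWordsRemoverSketch` and the tiny in-memory word lists are hypothetical stand-ins, and the method name `loadStopWords` is borrowed from the diff quoted later in this thread rather than from a released API.

```scala
// Hypothetical stand-in for the transformer class; not Spark's actual StopWordsRemover.
class StopWordsRemoverSketch(val stopWords: Array[String])

// Public companion object: callers can reach the loader without access to a
// package-private helper such as `private[spark] object StopWords`.
object StopWordsRemoverSketch {
  // Tiny in-memory lists for illustration only (assumption: the real lists
  // would be read from bundled resource files).
  private val defaults: Map[String, Array[String]] = Map(
    "english" -> Array("a", "an", "the"),
    "turkish" -> Array("ve", "bir", "bu")
  )

  /** Returns the default stop words for `language`, or fails for an unknown language. */
  def loadStopWords(language: String): Array[String] =
    defaults.getOrElse(
      language,
      throw new IllegalArgumentException(s"Unsupported language: $language"))
}
```

A caller would then write `new StopWordsRemoverSketch(StopWordsRemoverSketch.loadStopWords("english"))`, which keeps the helper object itself private.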




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200137181
  
    **[Test build #53868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53868/consoleFull)** for PR 11871 at commit [`2e7c54e`](https://github.com/apache/spark/commit/2e7c54e5c17e7c5672a43ffc28acb207e94bf28a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r56936907
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -98,13 +67,16 @@ class StopWordsRemover(override val uid: String)
     
       /**
        * the stop words set to be filtered out
    -   * Default: [[StopWords.English]]
    +   * Default: [[StopWords.languageMap("english")]]
    --- End diff --
    
    It won't show up correctly since `object StopWords` is private. `StopWords.English` is not correct either. We can just mention that by default it is unset and it will be set to the default list for the selected language during transformation.
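
The "unset by default, resolved at transformation time" behavior described above could look roughly like the following sketch. It is not the PR's code: `DefaultResolutionSketch`, `resolve`, and the sample word lists are hypothetical, and it models "unset" with an `Option` instead of the empty-array default used in the diffs in this thread.

```scala
object DefaultResolutionSketch {
  // Illustrative per-language defaults (assumption: real lists come from resource files).
  private val defaults: Map[String, Array[String]] = Map(
    "english" -> Array("a", "an", "the"),
    "french"  -> Array("le", "la", "les")
  )

  /**
   * Picks the stop words to use at transform time:
   * an explicitly set list wins; otherwise fall back to the language default.
   */
  def resolve(userStopWords: Option[Array[String]], language: String): Set[String] =
    userStopWords match {
      case Some(words) => words.toSet
      case None        => defaults.getOrElse(language, Array.empty[String]).toSet
    }
}
```

One reason to consider `Option` here: with an empty array as the "unset" marker, a caller who deliberately sets an empty list still gets the language default, whereas `Some(Array.empty)` is kept as an empty set.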




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r57875878
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +71,26 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  setDefault(stopWords -> Array.empty[String], caseSensitive -> false)
     
       override def transform(dataset: DataFrame): DataFrame = {
    +    val stopWordsSet = if ($(stopWords).isEmpty) {
    +      StopWordsRemover.loadStopWords("english").toSet
    +    } else {
    +      $(stopWords).toSet
    +    }
    +
         val outputSchema = transformSchema(dataset.schema)
         val t = if ($(caseSensitive)) {
    -        val stopWordsSet = $(stopWords).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !stopWordsSet.contains(s))
    -        }
    -      } else {
    -        val toLower = (s: String) => if (s != null) s.toLowerCase else s
    -        val lowerStopWords = $(stopWords).map(toLower(_)).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !lowerStopWords.contains(toLower(s)))
    -        }
    +      udf { terms: Seq[String] =>
    +        terms.filter(s => !stopWordsSet.contains(s))
    --- End diff --
    
    I think this can be `terms.filterNot(stopWordsSet.contains)`?
    It seems like this code path will always pay the cost of making a set out of the stop words. It's not huge, but I wonder whether it makes sense to build the set once and store a reference to it?
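
Both suggestions in this comment can be sketched outside Spark as follows. `StopWordFilterSketch` is a hypothetical class, not part of the PR; it assumes non-null terms and immutable constructor arguments, so the set can safely be built once in a `lazy val`.

```scala
// Hypothetical, Spark-free sketch of the two suggestions above.
class StopWordFilterSketch(stopWords: Array[String], caseSensitive: Boolean) {
  // Built at most once per instance, on first use, instead of on every call.
  private lazy val stopWordsSet: Set[String] =
    if (caseSensitive) stopWords.toSet
    else stopWords.map(_.toLowerCase).toSet

  /** Drops every term that appears in the stop word set (terms assumed non-null). */
  def filter(terms: Seq[String]): Seq[String] =
    if (caseSensitive) terms.filterNot(stopWordsSet.contains)
    else terms.filterNot(t => stopWordsSet.contains(t.toLowerCase))
}
```

In a transformer whose parameters can be changed after construction, a cached set like this would have to be rebuilt whenever the stop word list or the case-sensitivity flag changes, which is part of the trade-off being asked about.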




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199906689
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200801207
  
    **[Test build #54028 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54028/consoleFull)** for PR 11871 at commit [`7efda40`](https://github.com/apache/spark/commit/7efda40e39663deef0b0884a7bfca13b5d10d706).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r58301673
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +71,26 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  setDefault(stopWords -> Array.empty[String], caseSensitive -> false)
     
       override def transform(dataset: DataFrame): DataFrame = {
    +    val stopWordsSet = if ($(stopWords).isEmpty) {
    +      StopWordsRemover.loadStopWords("english").toSet
    +    } else {
    +      $(stopWords).toSet
    +    }
    +
         val outputSchema = transformSchema(dataset.schema)
         val t = if ($(caseSensitive)) {
    -        val stopWordsSet = $(stopWords).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !stopWordsSet.contains(s))
    -        }
    -      } else {
    -        val toLower = (s: String) => if (s != null) s.toLowerCase else s
    -        val lowerStopWords = $(stopWords).map(toLower(_)).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !lowerStopWords.contains(toLower(s)))
    -        }
    +      udf { terms: Seq[String] =>
    +        terms.filter(s => !stopWordsSet.contains(s))
    +      }
    +    } else {
    +      val toLower = (s: String) => if (s != null) s.toLowerCase else s
    +      val lowerStopWords = stopWordsSet.map(toLower(_)).toSet
    --- End diff --
    
    That null check was already there before my edit. I had the same thought as you, but a user might do something like this:
    ```
    // Some other operation assigns the word; this is just an example
    val word: String = null
    val stopWords = Array(word)
    val remover = new StopWordsRemover()
          .setInputCol("raw")
          .setOutputCol("filtered")
          .setStopWords(stopWords)
    ```
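
To make the null case concrete, here is a small Spark-free sketch of the null-safe lowercasing from the diff. The `toLower` function mirrors the one in the quoted code; `NullSafeLowerSketch` and `removeStopWords` are hypothetical names used only for illustration.

```scala
object NullSafeLowerSketch {
  // Same guard as in the diff: a null entry is passed through instead of
  // triggering a NullPointerException on toLowerCase.
  val toLower: String => String = s => if (s != null) s.toLowerCase else s

  /** Case-insensitive removal that tolerates null in either the terms or the stop words. */
  def removeStopWords(terms: Seq[String], stopWords: Array[String]): Seq[String] = {
    val lowered = stopWords.map(toLower).toSet
    terms.filterNot(t => lowered.contains(toLower(t)))
  }
}
```

With this guard, a null stop word only matches null terms, and a null term is kept unless null was explicitly listed. Note that `toLowerCase` without an argument uses the JVM's default locale.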




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r59193903
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +71,26 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  setDefault(stopWords -> Array.empty[String], caseSensitive -> false)
     
       override def transform(dataset: DataFrame): DataFrame = {
    +    val stopWordsSet = if ($(stopWords).isEmpty) {
    +      StopWordsRemover.loadStopWords("english").toSet
    +    } else {
    +      $(stopWords).toSet
    +    }
    +
         val outputSchema = transformSchema(dataset.schema)
         val t = if ($(caseSensitive)) {
    -        val stopWordsSet = $(stopWords).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !stopWordsSet.contains(s))
    -        }
    -      } else {
    -        val toLower = (s: String) => if (s != null) s.toLowerCase else s
    -        val lowerStopWords = $(stopWords).map(toLower(_)).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !lowerStopWords.contains(toLower(s)))
    -        }
    +      udf { terms: Seq[String] =>
    +        terms.filter(s => !stopWordsSet.contains(s))
    +      }
    +    } else {
    +      val toLower = (s: String) => if (s != null) s.toLowerCase else s
    +      val lowerStopWords = stopWordsSet.map(toLower(_)).toSet
    --- End diff --
    
    Sorry for the late response; I have midterms these days. I will work on these issues as soon as possible.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-201489258
  
    **[Test build #54184 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54184/consoleFull)** for PR 11871 at commit [`789342f`](https://github.com/apache/spark/commit/789342f2d26759db180868a9f59b02c8f85cc835).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-217013268
  
    **[Test build #57806 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57806/consoleFull)** for PR 11871 at commit [`dec0634`](https://github.com/apache/spark/commit/dec0634a574124ab53c706b14982a6c81a282c97).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200954644
  
    @srowen, the spelling is OK now; this needs a re-test. Does this PR need anything else?




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r56936989
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +95,74 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  /**
    +    * the language of stop words
    +    * Default: "english"
    +    * @group param
    +    */
    +  val language: Param[String] = new Param[String](this, "language", "stopwords language")
    --- End diff --
    
    In the doc, we should mention supported languages.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200137391
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53868/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199634641
  
    **[Test build #53745 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53745/consoleFull)** for PR 11871 at commit [`6d215b3`](https://github.com/apache/spark/commit/6d215b31a205c4a79e8cc0ef6963d239941e80ff).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-203039860
  
    **[Test build #54446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54446/consoleFull)** for PR 11871 at commit [`789342f`](https://github.com/apache/spark/commit/789342f2d26759db180868a9f59b02c8f85cc835).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216961493
  
    **[Test build #57785 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57785/consoleFull)** for PR 11871 at commit [`01471ec`](https://github.com/apache/spark/commit/01471ec2a74ff86dfaa417509d0f90e2db80b768).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-217013298
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216281122
  
    @burakkose I sent out a PR at https://github.com/apache/spark/pull/12843 and it would be great if you can help review it. I think we should get this one into Spark 2.0. There is also a TODO to add locale support. If you have time, could you start working on https://issues.apache.org/jira/browse/SPARK-15064? Thanks!




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200057572
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r58301587
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +71,26 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  setDefault(stopWords -> Array.empty[String], caseSensitive -> false)
     
       override def transform(dataset: DataFrame): DataFrame = {
    +    val stopWordsSet = if ($(stopWords).isEmpty) {
    +      StopWordsRemover.loadStopWords("english").toSet
    +    } else {
    +      $(stopWords).toSet
    +    }
    +
         val outputSchema = transformSchema(dataset.schema)
         val t = if ($(caseSensitive)) {
    -        val stopWordsSet = $(stopWords).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !stopWordsSet.contains(s))
    -        }
    -      } else {
    -        val toLower = (s: String) => if (s != null) s.toLowerCase else s
    -        val lowerStopWords = $(stopWords).map(toLower(_)).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !lowerStopWords.contains(toLower(s)))
    -        }
    +      udf { terms: Seq[String] =>
    +        terms.filter(s => !stopWordsSet.contains(s))
    --- End diff --
    
    See question below about null words




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r58300978
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +71,26 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  setDefault(stopWords -> Array.empty[String], caseSensitive -> false)
     
       override def transform(dataset: DataFrame): DataFrame = {
    +    val stopWordsSet = if ($(stopWords).isEmpty) {
    +      StopWordsRemover.loadStopWords("english").toSet
    +    } else {
    +      $(stopWords).toSet
    +    }
    +
         val outputSchema = transformSchema(dataset.schema)
         val t = if ($(caseSensitive)) {
    -        val stopWordsSet = $(stopWords).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !stopWordsSet.contains(s))
    -        }
    -      } else {
    -        val toLower = (s: String) => if (s != null) s.toLowerCase else s
    -        val lowerStopWords = $(stopWords).map(toLower(_)).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !lowerStopWords.contains(toLower(s)))
    -        }
    +      udf { terms: Seq[String] =>
    +        terms.filter(s => !stopWordsSet.contains(s))
    --- End diff --
    
    Can you give more information about that case? Which approach would you prefer?




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-217013300
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57806/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-202534323
  
    Jenkins retest this please




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200440838
  
    **[Test build #53946 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53946/consoleFull)** for PR 11871 at commit [`7efda40`](https://github.com/apache/spark/commit/7efda40e39663deef0b0884a7bfca13b5d10d706).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200933745
  
    **[Test build #54058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54058/consoleFull)** for PR 11871 at commit [`a066e8b`](https://github.com/apache/spark/commit/a066e8b34ec4824fa26a1e306e197b66400f5ccb).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199945398
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53788/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200044982
  
    **[Test build #53822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53822/consoleFull)** for PR 11871 at commit [`a308622`](https://github.com/apache/spark/commit/a30862231c3944c55c96cc94e162f61614aee6d5).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-203040222
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200801405
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54028/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-202981940
  
    **[Test build #54446 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54446/consoleFull)** for PR 11871 at commit [`789342f`](https://github.com/apache/spark/commit/789342f2d26759db180868a9f59b02c8f85cc835).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216697656
  
    **[Test build #57689 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57689/consoleFull)** for PR 11871 at commit [`cb786ee`](https://github.com/apache/spark/commit/cb786eef0f75aa19d9416ed1c07b90510b4bf70b).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199906693
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53784/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r58301543
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +71,26 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  setDefault(stopWords -> Array.empty[String], caseSensitive -> false)
     
       override def transform(dataset: DataFrame): DataFrame = {
    +    val stopWordsSet = if ($(stopWords).isEmpty) {
    +      StopWordsRemover.loadStopWords("english").toSet
    +    } else {
    +      $(stopWords).toSet
    +    }
    +
         val outputSchema = transformSchema(dataset.schema)
         val t = if ($(caseSensitive)) {
    -        val stopWordsSet = $(stopWords).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !stopWordsSet.contains(s))
    -        }
    -      } else {
    -        val toLower = (s: String) => if (s != null) s.toLowerCase else s
    -        val lowerStopWords = $(stopWords).map(toLower(_)).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !lowerStopWords.contains(toLower(s)))
    -        }
    +      udf { terms: Seq[String] =>
    +        terms.filter(s => !stopWordsSet.contains(s))
    --- End diff --
    
    Yes, I will fix it. Do you have any other suggestions for the pull request, such as additional features?
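
For readers following the quoted diff, the filtering behaviour under review can be modelled outside Spark. The following is a minimal stdlib-only Python sketch, not Spark's implementation; the function name `remove_stop_words` and its parameters are hypothetical. It mirrors the two branches in the diff: an exact match when matching is case sensitive, and a null-safe lower-casing of both the stop words and the terms otherwise.

```python
def remove_stop_words(terms, stop_words, case_sensitive=False):
    """Return the terms that are not stop words (hypothetical helper)."""
    if case_sensitive:
        # Exact match: "The" is kept when only "the" is a stop word.
        stop_set = set(stop_words)
        return [t for t in terms if t not in stop_set]

    # Null-safe lower-casing, analogous to the toLower helper in the diff.
    def to_lower(s):
        return s.lower() if s is not None else s

    stop_set = {to_lower(w) for w in stop_words}
    return [t for t in terms if to_lower(t) not in stop_set]


tokens = ["The", "cat", None, "sat"]
insensitive = remove_stop_words(tokens, ["the"])
sensitive = remove_stop_words(tokens, ["the"], case_sensitive=True)
```

Building the set once, outside the per-row filter, keeps each lookup constant-time, which is the same reason the diff hoists `stopWordsSet` out of the UDF body.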




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200057577
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53822/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200755657
  
    **[Test build #54028 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54028/consoleFull)** for PR 11871 at commit [`7efda40`](https://github.com/apache/spark/commit/7efda40e39663deef0b0884a7bfca13b5d10d706).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199945394
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200588724
  
    retest this please




[GitHub] spark issue #11871: [SPARK-14050][ML] Add multiple languages support and add...

Posted by Halawa13 <gi...@git.apache.org>.
Github user Halawa13 commented on the issue:

    https://github.com/apache/spark/pull/11871
  
    Hello,
    
    I'm new to PySpark programming and have a problem with this code:
    
        from pyspark.ml.feature import StopWordsRemover
        df = sc.createDataFrame([(0, ["je", "suis", "malade", "comme", "la", "dernierer"]), (1, ["si", "non", "tu", "vas", "bien"])], ["label", "raw"])
        remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
        remover.transform(df).show(truncate=False)
    
    I want to use `loadDefaultStopWords("french")`, but I don't know how to use it.
    I tried
    
        remover.loadDefaultStopWords("french").transform(df).show(truncate=False)
    
    but it does not work.
    





[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-201108669
  
    ~~~scala
        val remover = new StopWordsRemover()
          .setInputCol("raw")
          .setOutputCol("filtered")
          .setStopWords(StopWordsRemover.loadStopWords("danish"))
    ~~~
    
    It doesn't complicate the code by much but the interface is cleaner and it is slightly easier for us to maintain the Python API (just adding a class method to `StopWordsRemover` in Python).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200754041
  
    Jenkins retest this please




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r57291924
  
    --- Diff: licenses/LICENCE-postgresql.txt ---
    @@ -0,0 +1,24 @@
    +PostgreSQL Database Management System
    --- End diff --
    
    Yes, though one nit: "LICENSE" vs "LICENCE". We're using the US spelling consistently.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-201489396
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r56936780
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -31,51 +31,20 @@ import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
      */
     private[spark] object StopWords {
     
    -  /**
    -   * Use the same default stopwords list as scikit-learn.
    -   * The original list can be found from "Glasgow Information Retrieval Group"
    -   * [[http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words]]
    -   */
    -  val English = Array( "a", "about", "above", "across", "after", "afterwards", "again",
    -    "against", "all", "almost", "alone", "along", "already", "also", "although", "always",
    -    "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
    -    "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
    -    "around", "as", "at", "back", "be", "became", "because", "become",
    -    "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
    -    "below", "beside", "besides", "between", "beyond", "bill", "both",
    -    "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
    -    "could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
    -    "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    -    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
    -    "everything", "everywhere", "except", "few", "fifteen", "fify", "fill",
    -    "find", "fire", "first", "five", "for", "former", "formerly", "forty",
    -    "found", "four", "from", "front", "full", "further", "get", "give", "go",
    -    "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
    -    "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
    -    "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    -    "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
    -    "latterly", "least", "less", "ltd", "made", "many", "may", "me",
    -    "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
    -    "move", "much", "must", "my", "myself", "name", "namely", "neither",
    -    "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
    -    "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    -    "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
    -    "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
    -    "please", "put", "rather", "re", "same", "see", "seem", "seemed",
    -    "seeming", "seems", "serious", "several", "she", "should", "show", "side",
    -    "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
    -    "something", "sometime", "sometimes", "somewhere", "still", "such",
    -    "system", "take", "ten", "than", "that", "the", "their", "them",
    -    "themselves", "then", "thence", "there", "thereafter", "thereby",
    -    "therefore", "therein", "thereupon", "these", "they", "thick", "thin",
    -    "third", "this", "those", "though", "three", "through", "throughout",
    -    "thru", "thus", "to", "together", "too", "top", "toward", "towards",
    -    "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    -    "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
    -    "whence", "whenever", "where", "whereafter", "whereas", "whereby",
    -    "wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
    -    "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    -    "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves")
    +  /** Read stop words list from resources */
    +  def readStopWords(language: String): Array[String] = {
    +    val is = getClass.getResourceAsStream(s"/org/apache/spark/ml/feature/stopwords/$language.txt")
    +    scala.io.Source.fromInputStream(is).getLines().toArray
    +  }
    +
    +  /** Supported languages list must be lowercase */
    +  val supportedLanguages = Array("danish", "dutch", "english", "finnish", "french", "german",
    +    "hungarian", "italian", "norwegian", "portuguese", "russian", "spanish", "swedish", "turkish")
    +
    +  /** Languages and stopwords map */
    +  val languageMap = supportedLanguages.map{
    --- End diff --
    
    Loading all lists by default is unnecessary. Check the supported languages in `readStopWords` and load only the requested list there.
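
The suggestion above (validate the language first, then read only the list that was asked for) can be illustrated with a small sketch. This is a stdlib-only Python model rather than the Scala code under review; `RESOURCES`, `load_stop_words`, `_cache`, and `reads` are hypothetical names, and the in-memory dictionary stands in for the bundled `stopwords/<language>.txt` files with a few illustrative words each.

```python
# Hypothetical in-memory stand-in for the bundled stop-word resource files.
# The entries are tiny illustrative samples, not the real lists.
RESOURCES = {
    "danish": "og\ni\njeg\n",
    "english": "a\nabout\nabove\n",
    "french": "au\naux\navec\n",
}

SUPPORTED_LANGUAGES = frozenset(RESOURCES)

_cache = {}  # language -> list of stop words, filled on first request
reads = []   # records which resources were actually read (illustration only)


def load_stop_words(language):
    """Validate the language, then read only the requested list."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"{language!r} is not supported; "
            f"choose one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    if language not in _cache:
        reads.append(language)
        _cache[language] = RESOURCES[language].splitlines()
    return _cache[language]


danish = load_stop_words("danish")
danish_again = load_stop_words("danish")  # served from the cache
```

Compared with building a map of every language at object initialization, nothing is read until a caller names a language, and an unsupported name fails with a clear error instead of a missing-resource failure later on.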




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200122376
  
    **[Test build #53868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53868/consoleFull)** for PR 11871 at commit [`2e7c54e`](https://github.com/apache/spark/commit/2e7c54e5c17e7c5672a43ffc28acb207e94bf28a).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200940198
  
    **[Test build #54058 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54058/consoleFull)** for PR 11871 at commit [`a066e8b`](https://github.com/apache/spark/commit/a066e8b34ec4824fa26a1e306e197b66400f5ccb).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200940267
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216262333
  
    I'm going to send a PR based on this so we can catch 2.0.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199926929
  
    **[Test build #53788 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53788/consoleFull)** for PR 11871 at commit [`4d1812a`](https://github.com/apache/spark/commit/4d1812aae64b0b15312940b1a6c42e19f9686480).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r56936403
  
    --- Diff: mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt ---
    @@ -0,0 +1,319 @@
    +a
    --- End diff --
    
    If the other lists are from NLTK, maybe we should use their English stopwords too. It would be good to make a quick comparison and check the differences.
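
A quick comparison of two stop-word lists, as proposed above, comes down to a pair of set differences. The sketch below is a stdlib-only Python illustration; `compare_stop_words` is a hypothetical helper, and the two sample lists are small made-up inputs, not the actual contents of the current English list or of NLTK's.

```python
def compare_stop_words(current, candidate):
    """Report which words appear in only one list and which are shared."""
    current_set, candidate_set = set(current), set(candidate)
    return {
        "only_in_current": sorted(current_set - candidate_set),
        "only_in_candidate": sorted(candidate_set - current_set),
        "shared": sorted(current_set & candidate_set),
    }


# Tiny illustrative samples; in practice each list would be read from its file.
current_sample = ["a", "about", "bill", "the"]
candidate_sample = ["a", "about", "the", "don"]

diff = compare_stop_words(current_sample, candidate_sample)
for key, words in diff.items():
    print(f"{key}: {len(words)} word(s) {words}")
```

Running the same function on the two real files would show how many words would start or stop being filtered if the default English list were switched.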




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r57875912
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +71,26 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  setDefault(stopWords -> Array.empty[String], caseSensitive -> false)
     
       override def transform(dataset: DataFrame): DataFrame = {
    +    val stopWordsSet = if ($(stopWords).isEmpty) {
    +      StopWordsRemover.loadStopWords("english").toSet
    +    } else {
    +      $(stopWords).toSet
    +    }
    +
         val outputSchema = transformSchema(dataset.schema)
         val t = if ($(caseSensitive)) {
    -        val stopWordsSet = $(stopWords).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !stopWordsSet.contains(s))
    -        }
    -      } else {
    -        val toLower = (s: String) => if (s != null) s.toLowerCase else s
    -        val lowerStopWords = $(stopWords).map(toLower(_)).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !lowerStopWords.contains(toLower(s)))
    -        }
    +      udf { terms: Seq[String] =>
    +        terms.filter(s => !stopWordsSet.contains(s))
    +      }
    +    } else {
    +      val toLower = (s: String) => if (s != null) s.toLowerCase else s
    +      val lowerStopWords = stopWordsSet.map(toLower(_)).toSet
    --- End diff --
    
    When would the stopwords set ever have a null?
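
A minimal sketch of the point behind this question, in plain Scala without Spark. It assumes the stop word list itself never contains null, so only the input terms need a null guard. `CaseInsensitiveFilter` and `removeStopWords` are hypothetical names, not part of the patch.

```scala
object CaseInsensitiveFilter {
  // Hypothetical helper: assumes stopWords has no null entries,
  // so the null check is applied to the input terms only.
  def removeStopWords(terms: Seq[String], stopWords: Set[String]): Seq[String] = {
    val lowerStopWords = stopWords.map(_.toLowerCase)
    // Null terms are kept, matching the behaviour of the toLower guard in the diff.
    terms.filter(s => s == null || !lowerStopWords.contains(s.toLowerCase))
  }
}
```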




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-208486421
  
    @burakkose There were some merge conflicts introduced by recent commits, so please rebase onto master when you update this PR. Thanks!




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200544196
  
    Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r57397670
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -31,51 +31,16 @@ import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
      */
     private[spark] object StopWords {
     
    -  /**
    -   * Use the same default stopwords list as scikit-learn.
    -   * The original list can be found from "Glasgow Information Retrieval Group"
    -   * [[http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words]]
    -   */
    -  val English = Array( "a", "about", "above", "across", "after", "afterwards", "again",
    -    "against", "all", "almost", "alone", "along", "already", "also", "although", "always",
    -    "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
    -    "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
    -    "around", "as", "at", "back", "be", "became", "because", "become",
    -    "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
    -    "below", "beside", "besides", "between", "beyond", "bill", "both",
    -    "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
    -    "could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
    -    "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    -    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
    -    "everything", "everywhere", "except", "few", "fifteen", "fify", "fill",
    -    "find", "fire", "first", "five", "for", "former", "formerly", "forty",
    -    "found", "four", "from", "front", "full", "further", "get", "give", "go",
    -    "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
    -    "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
    -    "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    -    "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
    -    "latterly", "least", "less", "ltd", "made", "many", "may", "me",
    -    "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
    -    "move", "much", "must", "my", "myself", "name", "namely", "neither",
    -    "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
    -    "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    -    "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
    -    "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
    -    "please", "put", "rather", "re", "same", "see", "seem", "seemed",
    -    "seeming", "seems", "serious", "several", "she", "should", "show", "side",
    -    "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
    -    "something", "sometime", "sometimes", "somewhere", "still", "such",
    -    "system", "take", "ten", "than", "that", "the", "their", "them",
    -    "themselves", "then", "thence", "there", "thereafter", "thereby",
    -    "therefore", "therein", "thereupon", "these", "they", "thick", "thin",
    -    "third", "this", "those", "though", "three", "through", "throughout",
    -    "thru", "thus", "to", "together", "too", "top", "toward", "towards",
    -    "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    -    "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
    -    "whence", "whenever", "where", "whereafter", "whereas", "whereby",
    -    "wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
    -    "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    -    "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves")
    +  /** Read stop words list from resources */
    +  def readStopWords(language: String): Array[String] = {
    +    require(supportedLanguages.contains(language), s"$language is not in language list")
    +    val is = getClass.getResourceAsStream(s"/org/apache/spark/ml/feature/stopwords/$language.txt")
    +    scala.io.Source.fromInputStream(is)(scala.io.Codec.UTF8).getLines().toArray
    +  }
    +
    +  /** Supported languages list must be lowercase */
    +  private val supportedLanguages = Set("danish", "dutch", "english", "finnish", "french", "german",
    --- End diff --
    
    If we make it public, we need to return `Array[String]` instead of `Set[String]` for Java compatibility.
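
One way to read this suggestion, as a hedged sketch in plain Scala: keep a `Set` internally for membership checks and expose an `Array[String]`, which Java callers see as `String[]`. The object name, the accessor names, and the three-language list are illustrative assumptions, not the patch's actual API or its full language list.

```scala
object StopWordLanguages {
  // Internal representation: a Set gives cheap membership checks.
  // Only a few languages are listed here for illustration.
  private val supported: Set[String] = Set("danish", "dutch", "english")

  // Public accessor: a sorted Array[String] is visible to Java as String[].
  def supportedLanguages: Array[String] = supported.toArray.sorted

  def isSupported(language: String): Boolean = supported.contains(language)
}
```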




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200801404
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11871




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-201489403
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54184/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199927885
  
    @mengxr, can you review again? Also, I have a charset problem: what is the best way to fix "java.nio.charset.MalformedInputException: Input length = 1"?
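
For context, `MalformedInputException` is raised by a charset decoder when the bytes are not valid in the charset used to read them, for example a Latin-1 encoded file read as UTF-8. The sketch below reproduces that with a strict decoder. It is an illustration only; `CharsetCheck` and `decodeStrict` are hypothetical names, and whether the stop word files in this PR were actually Latin-1 is an assumption.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, Charset, CodingErrorAction}

object CharsetCheck {
  // Strictly decode bytes with the given charset.
  // Left carries the decoding error instead of throwing it.
  def decodeStrict(
      bytes: Array[Byte],
      charset: Charset): Either[CharacterCodingException, String] = {
    val decoder = charset.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try Right(decoder.decode(ByteBuffer.wrap(bytes)).toString)
    catch { case e: CharacterCodingException => Left(e) }
  }
}
```

Under that assumption, the usual fixes are to re-encode the resource files as UTF-8 or to read them with the charset they were written in.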




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-201069222
  
    @mengxr, Alright then, these are my plans.
    
    I am going to merge `StopWords` into `StopWordsRemover` (deleting `readStopWords`), and the stop words list will be read by `loadStopWords` in the `StopWordsRemover` object. Users who want to load a stop words list directly can then do so:
    
    `val stopWords = StopWordsRemover.loadStopWords("english").toSet`
    
    But I have a question about the `language` param. If we don't keep it, how should users do this?
    
    ```scala
    val remover = new StopWordsRemover()
      .setInputCol("raw")
      .setOutputCol("filtered")
      .setLanguage("danish")
    ```
    Should we require them to do this instead?
    
    ```scala
    val stopWords = StopWordsRemover.loadStopWords("danish").toSet ++ Set("python", "scala")
    val remover = new StopWordsRemover()
      .setInputCol("raw")
      .setOutputCol("filtered")
      .setStopWords(stopWords.toArray)
    ```





[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216987188
  
    **[Test build #57785 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57785/consoleFull)** for PR 11871 at commit [`01471ec`](https://github.com/apache/spark/commit/01471ec2a74ff86dfaa417509d0f90e2db80b768).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200940271
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54058/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r56936977
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -98,13 +67,16 @@ class StopWordsRemover(override val uid: String)
     
       /**
        * the stop words set to be filtered out
    -   * Default: [[StopWords.English]]
    +   * Default: [[StopWords.languageMap("english")]]
        * @group param
        */
       val stopWords: StringArrayParam = new StringArrayParam(this, "stopWords", "stop words")
     
       /** @group setParam */
    -  def setStopWords(value: Array[String]): this.type = set(stopWords, value)
    +  def setStopWords(value: Array[String]): this.type = {
    +    set(stopWords, value)
    +    set(language, "unknown")
    --- End diff --
    
    We shouldn't set other parameters in the setters, since there are other ways to overwrite param values. Let's keep the default language as `english` and, if `stopWords` is set, ignore the language setting. This logic should happen in `transform`.
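
The rule described here (explicit stop words win, otherwise fall back to the language's list, resolved at transform time rather than in a setter) could look roughly like the sketch below. It is plain Scala with no Spark dependency. `StopWordResolution`, `resolve`, and the tiny placeholder word lists are assumptions for illustration, not code from the PR.

```scala
object StopWordResolution {
  // Hypothetical stand-in for loading a bundled list from resources;
  // the lists are tiny placeholders, not the real stop word files.
  private val bundled: Map[String, Array[String]] = Map(
    "english" -> Array("a", "the"),
    "danish" -> Array("og", "i")
  )

  // Explicitly set stopWords take precedence; otherwise use the language's list.
  def resolve(stopWords: Option[Array[String]], language: String = "english"): Set[String] =
    stopWords match {
      case Some(words) => words.toSet
      case None =>
        bundled.getOrElse(
          language,
          throw new IllegalArgumentException(s"$language is not in language list")
        ).toSet
    }
}
```

Because the setters stay independent, the outcome does not depend on the order in which the two values were set.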




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-201031303
  
    Shall we put a README.txt under `mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/` and mention the source of stop words?




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199638568
  
    @burakkose I made a quick pass. I just want to mention another option for the implementation. Instead of having `language`, `ignoredWords`, and `additionalWords`, we can separate the lists from `StopWordsRemover`:
    
    ~~~scala
    val stopWords = StopWordsRemover.loadStopWords("turkish").toSet ++ Set("a") -- Set("b")
    val swr = new StopWordsRemover()
      .setStopWords(stopWords.toArray)
    ...
    ~~~
    
    This makes the code more composable.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216692475
  
    **[Test build #57686 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57686/consoleFull)** for PR 11871 at commit [`bca7c01`](https://github.com/apache/spark/commit/bca7c0112dce12526d9539d7a6f96325a4c8cc39).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216987382
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199639913
  
    **[Test build #53745 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53745/consoleFull)** for PR 11871 at commit [`6d215b3`](https://github.com/apache/spark/commit/6d215b31a205c4a79e8cc0ef6963d239941e80ff).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200496200
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53946/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199549680
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-203040224
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54446/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216692727
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57686/
    Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200331277
  
    OK, the license for the stopwords is OK. It's the PostgreSQL license, which is BSD-like. You will want to add an entry for the PostgreSQL license in the style that you see other BSD licenses recorded in the `LICENSE` file, and then add a copy of http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/COPYRIGHT at `licenses/LICENSE-postgresql.txt`




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r56937022
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +95,74 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  /**
    +    * the language of stop words
    +    * Default: "english"
    +    * @group param
    +    */
    +  val language: Param[String] = new Param[String](this, "language", "stopwords language")
    +
    +  /** @group setParam */
    +  def setLanguage(value: String): this.type = {
    +    val lang = value.toLowerCase
    +    require(StopWords.languageMap.contains(lang), s"$lang is not in language list")
    +    set(language, lang)
    +    set(stopWords, StopWords.languageMap(lang))
    --- End diff --
    
    Ditto. Do not set the stop words here. Implement the logic in `transform` instead.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-217010289
  
    **[Test build #57806 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57806/consoleFull)** for PR 11871 at commit [`dec0634`](https://github.com/apache/spark/commit/dec0634a574124ab53c706b14982a6c81a282c97).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199634279
  
    ok to test




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r57397678
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -160,4 +147,11 @@ object StopWordsRemover extends DefaultParamsReadable[StopWordsRemover] {
     
       @Since("1.6.0")
       override def load(path: String): StopWordsRemover = super.load(path)
    +
    +  /**
    +   * Stop words for the language
    +   * Supported languages: Danish, Dutch, English, Finnish, French, German, Hungarian,
    --- End diff --
    
    We use lowercase in MLlib for string values, so `"danish"`, `"dutch"`, etc.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-201055799
  
    @burakkose We can merge `StopWords` into `StopWordsRemover` to simplify the implementation. I see that the `language` param provides a convenient way to set parameters, but eventually we would need to add it in Python as well and document the behavior when both `language` and `stopWords` are set, which means more maintenance cost. So I'd recommend removing it for now.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r57397665
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -31,51 +31,16 @@ import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
      */
     private[spark] object StopWords {
     
    -  /**
    -   * Use the same default stopwords list as scikit-learn.
    -   * The original list can be found from "Glasgow Information Retrieval Group"
    -   * [[http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words]]
    -   */
    -  val English = Array( "a", "about", "above", "across", "after", "afterwards", "again",
    -    "against", "all", "almost", "alone", "along", "already", "also", "although", "always",
    -    "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
    -    "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
    -    "around", "as", "at", "back", "be", "became", "because", "become",
    -    "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
    -    "below", "beside", "besides", "between", "beyond", "bill", "both",
    -    "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
    -    "could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
    -    "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    -    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
    -    "everything", "everywhere", "except", "few", "fifteen", "fify", "fill",
    -    "find", "fire", "first", "five", "for", "former", "formerly", "forty",
    -    "found", "four", "from", "front", "full", "further", "get", "give", "go",
    -    "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
    -    "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
    -    "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    -    "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
    -    "latterly", "least", "less", "ltd", "made", "many", "may", "me",
    -    "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
    -    "move", "much", "must", "my", "myself", "name", "namely", "neither",
    -    "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
    -    "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    -    "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
    -    "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
    -    "please", "put", "rather", "re", "same", "see", "seem", "seemed",
    -    "seeming", "seems", "serious", "several", "she", "should", "show", "side",
    -    "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
    -    "something", "sometime", "sometimes", "somewhere", "still", "such",
    -    "system", "take", "ten", "than", "that", "the", "their", "them",
    -    "themselves", "then", "thence", "there", "thereafter", "thereby",
    -    "therefore", "therein", "thereupon", "these", "they", "thick", "thin",
    -    "third", "this", "those", "though", "three", "through", "throughout",
    -    "thru", "thus", "to", "together", "too", "top", "toward", "towards",
    -    "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    -    "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
    -    "whence", "whenever", "where", "whereafter", "whereas", "whereby",
    -    "wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
    -    "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    -    "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves")
    +  /** Read stop words list from resources */
    +  def readStopWords(language: String): Array[String] = {
    --- End diff --
    
    `loadStopWords` might be a better name here.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r56936835
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -31,51 +31,20 @@ import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
      */
     private[spark] object StopWords {
     
    -  /**
    -   * Use the same default stopwords list as scikit-learn.
    -   * The original list can be found from "Glasgow Information Retrieval Group"
    -   * [[http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words]]
    -   */
    -  val English = Array( "a", "about", "above", "across", "after", "afterwards", "again",
    -    "against", "all", "almost", "alone", "along", "already", "also", "although", "always",
    -    "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
    -    "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
    -    "around", "as", "at", "back", "be", "became", "because", "become",
    -    "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
    -    "below", "beside", "besides", "between", "beyond", "bill", "both",
    -    "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
    -    "could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
    -    "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    -    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
    -    "everything", "everywhere", "except", "few", "fifteen", "fify", "fill",
    -    "find", "fire", "first", "five", "for", "former", "formerly", "forty",
    -    "found", "four", "from", "front", "full", "further", "get", "give", "go",
    -    "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
    -    "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
    -    "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    -    "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
    -    "latterly", "least", "less", "ltd", "made", "many", "may", "me",
    -    "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
    -    "move", "much", "must", "my", "myself", "name", "namely", "neither",
    -    "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
    -    "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    -    "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
    -    "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
    -    "please", "put", "rather", "re", "same", "see", "seem", "seemed",
    -    "seeming", "seems", "serious", "several", "she", "should", "show", "side",
    -    "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
    -    "something", "sometime", "sometimes", "somewhere", "still", "such",
    -    "system", "take", "ten", "than", "that", "the", "their", "them",
    -    "themselves", "then", "thence", "there", "thereafter", "thereby",
    -    "therefore", "therein", "thereupon", "these", "they", "thick", "thin",
    -    "third", "this", "those", "though", "three", "through", "throughout",
    -    "thru", "thus", "to", "together", "too", "top", "toward", "towards",
    -    "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    -    "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
    -    "whence", "whenever", "where", "whereafter", "whereas", "whereby",
    -    "wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
    -    "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    -    "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves")
    +  /** Read stop words list from resources */
    +  def readStopWords(language: String): Array[String] = {
    +    val is = getClass.getResourceAsStream(s"/org/apache/spark/ml/feature/stopwords/$language.txt")
    --- End diff --
    
    Validate language before loading resource file.
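
A minimal sketch of what validating first could look like, written in plain Scala outside Spark. The `StopWordsLoading` object, `supportedLanguages`, and `requireSupported` are hypothetical names used only for illustration; the language list is the one from the scaladoc in this PR (lowercased), and the resource path is the one from the diff above.

```scala
object StopWordsLoading {
  // Hypothetical list, copied from the languages named in the PR's scaladoc.
  val supportedLanguages: Set[String] = Set(
    "danish", "dutch", "english", "finnish", "french", "german", "hungarian",
    "italian", "norwegian", "portuguese", "russian", "spanish", "swedish", "turkish")

  /** Fails with a readable message before any resource lookup is attempted. */
  def requireSupported(language: String): Unit =
    require(supportedLanguages.contains(language),
      s"$language is not in the supported language list: ${supportedLanguages.mkString(", ")}.")

  /** Validates the language, then reads one stop word per line from the resource. */
  def loadStopWords(language: String): Array[String] = {
    requireSupported(language)
    val is = getClass.getResourceAsStream(s"/org/apache/spark/ml/feature/stopwords/$language.txt")
    // getResourceAsStream returns null when the resource is missing.
    require(is != null, s"Stop words resource for $language was not found on the classpath.")
    val source = scala.io.Source.fromInputStream(is)(scala.io.Codec.UTF8)
    try source.getLines().toArray finally source.close()
  }
}
```

Without the check, an unknown language would only surface later as a null stream from `getResourceAsStream`, which is harder to diagnose than an `IllegalArgumentException` naming the supported values.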


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199906533
  
    **[Test build #53784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53784/consoleFull)** for PR 11871 at commit [`41cd258`](https://github.com/apache/spark/commit/41cd25815af3baa8fe9ed9336812f436d7ed7bd5).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-201353993
  
    **[Test build #54184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54184/consoleFull)** for PR 11871 at commit [`789342f`](https://github.com/apache/spark/commit/789342f2d26759db180868a9f59b02c8f85cc835).


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216697313
  
    **[Test build #57689 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57689/consoleFull)** for PR 11871 at commit [`cb786ee`](https://github.com/apache/spark/commit/cb786eef0f75aa19d9416ed1c07b90510b4bf70b).


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216697662
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216692722
  
    **[Test build #57686 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57686/consoleFull)** for PR 11871 at commit [`bca7c01`](https://github.com/apache/spark/commit/bca7c0112dce12526d9539d7a6f96325a4c8cc39).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216987384
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57785/
    Test FAILed.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r58301366
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +71,26 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  setDefault(stopWords -> Array.empty[String], caseSensitive -> false)
     
       override def transform(dataset: DataFrame): DataFrame = {
    +    val stopWordsSet = if ($(stopWords).isEmpty) {
    +      StopWordsRemover.loadStopWords("english").toSet
    +    } else {
    +      $(stopWords).toSet
    +    }
    +
         val outputSchema = transformSchema(dataset.schema)
         val t = if ($(caseSensitive)) {
    -        val stopWordsSet = $(stopWords).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !stopWordsSet.contains(s))
    -        }
    -      } else {
    -        val toLower = (s: String) => if (s != null) s.toLowerCase else s
    -        val lowerStopWords = $(stopWords).map(toLower(_)).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !lowerStopWords.contains(toLower(s)))
    -        }
    +      udf { terms: Seq[String] =>
    +        terms.filter(s => !stopWordsSet.contains(s))
    --- End diff --
    
    Can you save a reference to the active set of stop words instead of converting the list into a set each time? It might be more natural to keep a defensive copy anyway.
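
One way to read this suggestion, sketched in plain Scala without Spark's `Param` machinery. The `StopWordFilter` class is hypothetical and is not part of the PR; it only shows the set being built once and doubling as a defensive copy.

```scala
// Hypothetical helper, not the PR's code: holds the stop words as an immutable set.
final class StopWordFilter(stopWords: Array[String]) {
  // Built once at construction. toSet copies the elements, so later mutation
  // of the caller's array does not change what gets filtered.
  private val stopWordsSet: Set[String] = stopWords.toSet

  /** Case-sensitive removal, matching the caseSensitive branch in the diff. */
  def filter(terms: Seq[String]): Seq[String] =
    terms.filter(s => !stopWordsSet.contains(s))
}
```

In the real transformer the stop words are a mutable param, so any cached set would have to be refreshed whenever the param is set; this sketch leaves that part out.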


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r60991630
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -98,7 +46,7 @@ class StopWordsRemover(override val uid: String)
     
       /**
        * the stop words set to be filtered out
    -   * Default: [[StopWords.English]]
    +   * Default: [[Array.empty]]
    --- End diff --
    
    The scaladoc could be a little clearer here. I think we should mention that `Array.empty` actually implies loading the English stop words. (Or we could just make the default the loaded English stop words, as is done in the PySpark code.)
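
To make the implicit behavior concrete, here is a small sketch of the fallback that `transform` performs in the diff. The `EffectiveStopWords` object and its three-word `englishPlaceholder` list are stand-ins for illustration; the real English list is loaded from a resource file.

```scala
object EffectiveStopWords {
  // Placeholder for the bundled English list (assumption: the real list is much longer).
  val englishPlaceholder: Array[String] = Array("a", "an", "the")

  /** An empty user-supplied list is treated as "use the English list". */
  def resolve(userStopWords: Array[String]): Set[String] =
    if (userStopWords.isEmpty) englishPlaceholder.toSet else userStopWords.toSet
}
```

One consequence of using the empty array as a sentinel is that a caller cannot request "no stop words at all" by passing an empty array, which may be another argument for making the loaded English list the explicit default.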


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200440068
  
    @srowen, thank you for your help. I added the license.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-201033906
  
    Do we need that? In my view, the stop words list will not be updated anymore, and we have already added the license.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-202979360
  
    Jenkins retest this please


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199945293
  
    **[Test build #53788 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53788/consoleFull)** for PR 11871 at commit [`4d1812a`](https://github.com/apache/spark/commit/4d1812aae64b0b15312940b1a6c42e19f9686480).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200496196
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r58302144
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +71,26 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  setDefault(stopWords -> Array.empty[String], caseSensitive -> false)
     
       override def transform(dataset: DataFrame): DataFrame = {
    +    val stopWordsSet = if ($(stopWords).isEmpty) {
    +      StopWordsRemover.loadStopWords("english").toSet
    +    } else {
    +      $(stopWords).toSet
    +    }
    +
         val outputSchema = transformSchema(dataset.schema)
         val t = if ($(caseSensitive)) {
    -        val stopWordsSet = $(stopWords).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !stopWordsSet.contains(s))
    -        }
    -      } else {
    -        val toLower = (s: String) => if (s != null) s.toLowerCase else s
    -        val lowerStopWords = $(stopWords).map(toLower(_)).toSet
    -        udf { terms: Seq[String] =>
    -          terms.filter(s => !lowerStopWords.contains(toLower(s)))
    -        }
    +      udf { terms: Seq[String] =>
    +        terms.filter(s => !stopWordsSet.contains(s))
    +      }
    +    } else {
    +      val toLower = (s: String) => if (s != null) s.toLowerCase else s
    +      val lowerStopWords = stopWordsSet.map(toLower(_)).toSet
    --- End diff --
    
    OK, if we don't treat that as an error, then null can be filtered when the stopwords set is created. In fact it can be lowercased at that time too.
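
A rough sketch of that idea in plain Scala: nulls are dropped and, for the case-insensitive mode, entries are lowercased once when the set is built, so the per-row filter only lowercases the incoming term. `StopWordSets`, `build`, and `remove` are hypothetical names, and the use of `Locale.ROOT` is an assumption made here to keep lowercasing independent of the JVM's default locale; the diff itself calls `toLowerCase` without a locale.

```scala
import java.util.Locale

object StopWordSets {
  /** Normalizes the stop words once: drop nulls, lowercase unless case-sensitive. */
  def build(stopWords: Array[String], caseSensitive: Boolean): Set[String] = {
    val nonNull = stopWords.filter(_ != null)
    if (caseSensitive) nonNull.toSet
    else nonNull.map(_.toLowerCase(Locale.ROOT)).toSet
  }

  /** Removes stop words from terms; null terms pass through unchanged (assumption). */
  def remove(terms: Seq[String], stopWordsSet: Set[String], caseSensitive: Boolean): Seq[String] =
    if (caseSensitive) terms.filter(s => !stopWordsSet.contains(s))
    else terms.filter(s => s == null || !stopWordsSet.contains(s.toLowerCase(Locale.ROOT)))
}
```

Note one behavioral difference from the diff: there, a null in the stop words list would cause null terms to be removed, whereas here nulls never enter the set, so null terms are always kept.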


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199639984
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53745/
    Test FAILed.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199639979
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-201501574
  
    @mengxr, you will like it this time :) I added a static method to Python and a README for the resources, and deleted `StopWords` and `language`. But we need to retest.


[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r57397675
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +88,43 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  /**
    +   * the language of stop words
    +   * Supported languages: Danish, Dutch, English, Finnish, French, German, Hungarian,
    +   * Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, Turkish
    +   * Default: "English"
    +   * @group param
    +   */
    +  val language: Param[String] = new Param[String](this, "language", "stopwords language")
    --- End diff --
    
    I guess we don't need `language` as a param anymore.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216961706
  
    @mengxr, can you check? I added locale support and applied your changes. I haven't opened a new pull request for the locale support.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r57397662
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -31,51 +31,16 @@ import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
      */
     private[spark] object StopWords {
     
    -  /**
    -   * Use the same default stopwords list as scikit-learn.
    -   * The original list can be found from "Glasgow Information Retrieval Group"
    -   * [[http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words]]
    -   */
    -  val English = Array( "a", "about", "above", "across", "after", "afterwards", "again",
    -    "against", "all", "almost", "alone", "along", "already", "also", "although", "always",
    -    "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
    -    "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
    -    "around", "as", "at", "back", "be", "became", "because", "become",
    -    "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
    -    "below", "beside", "besides", "between", "beyond", "bill", "both",
    -    "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
    -    "could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
    -    "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    -    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
    -    "everything", "everywhere", "except", "few", "fifteen", "fify", "fill",
    -    "find", "fire", "first", "five", "for", "former", "formerly", "forty",
    -    "found", "four", "from", "front", "full", "further", "get", "give", "go",
    -    "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
    -    "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
    -    "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    -    "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
    -    "latterly", "least", "less", "ltd", "made", "many", "may", "me",
    -    "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
    -    "move", "much", "must", "my", "myself", "name", "namely", "neither",
    -    "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
    -    "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    -    "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
    -    "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
    -    "please", "put", "rather", "re", "same", "see", "seem", "seemed",
    -    "seeming", "seems", "serious", "several", "she", "should", "show", "side",
    -    "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
    -    "something", "sometime", "sometimes", "somewhere", "still", "such",
    -    "system", "take", "ten", "than", "that", "the", "their", "them",
    -    "themselves", "then", "thence", "there", "thereafter", "thereby",
    -    "therefore", "therein", "thereupon", "these", "they", "thick", "thin",
    -    "third", "this", "those", "though", "three", "through", "throughout",
    -    "thru", "thus", "to", "together", "too", "top", "toward", "towards",
    -    "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    -    "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
    -    "whence", "whenever", "where", "whereafter", "whereas", "whereby",
    -    "wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
    -    "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    -    "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves")
    +  /** Read stop words list from resources */
    --- End diff --
    
    List supported languages. Or make `supportedLanguages` public and link to it.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216697664
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57689/




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199832509
  
    @srowen, I got them from the Stopwords Corpus at http://www.nltk.org/nltk_data/ , which notes:
    "
    They were obtained from:
    http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/
    The English list has been augmented
    https://github.com/nltk/nltk_data/issues/22
    "




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200495661
  
    **[Test build #53946 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53946/consoleFull)** for PR 11871 at commit [`7efda40`](https://github.com/apache/spark/commit/7efda40e39663deef0b0884a7bfca13b5d10d706).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199634270
  
    @burakkose I guess it is okay to copy the lists from NLTK since it is Apache licensed. Could you add a header to each stopword file and put a link there? Thanks!




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216919349
  
    @mengxr, can you check? I added locale support. 




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216279005
  
    @mengxr, sorry, I couldn't find free time. I have actually written the new code and was just waiting for tests. I am going to send a new PR.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199890483
  
    **[Test build #53784 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53784/consoleFull)** for PR 11871 at commit [`41cd258`](https://github.com/apache/spark/commit/41cd25815af3baa8fe9ed9336812f436d7ed7bd5).




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-199828612
  
    @mengxr, thank you for the quick review. I am working on your comments. This will be my first commit, so I would be happy if you could review my new changes.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r56936821
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -31,51 +31,20 @@ import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
      */
     private[spark] object StopWords {
     
    -  /**
    -   * Use the same default stopwords list as scikit-learn.
    -   * The original list can be found from "Glasgow Information Retrieval Group"
    -   * [[http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words]]
    -   */
    -  val English = Array( "a", "about", "above", "across", "after", "afterwards", "again",
    -    "against", "all", "almost", "alone", "along", "already", "also", "although", "always",
    -    "am", "among", "amongst", "amoungst", "amount", "an", "and", "another",
    -    "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are",
    -    "around", "as", "at", "back", "be", "became", "because", "become",
    -    "becomes", "becoming", "been", "before", "beforehand", "behind", "being",
    -    "below", "beside", "besides", "between", "beyond", "bill", "both",
    -    "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con",
    -    "could", "couldnt", "cry", "de", "describe", "detail", "do", "done",
    -    "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else",
    -    "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone",
    -    "everything", "everywhere", "except", "few", "fifteen", "fify", "fill",
    -    "find", "fire", "first", "five", "for", "former", "formerly", "forty",
    -    "found", "four", "from", "front", "full", "further", "get", "give", "go",
    -    "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter",
    -    "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his",
    -    "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed",
    -    "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter",
    -    "latterly", "least", "less", "ltd", "made", "many", "may", "me",
    -    "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly",
    -    "move", "much", "must", "my", "myself", "name", "namely", "neither",
    -    "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone",
    -    "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on",
    -    "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our",
    -    "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps",
    -    "please", "put", "rather", "re", "same", "see", "seem", "seemed",
    -    "seeming", "seems", "serious", "several", "she", "should", "show", "side",
    -    "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone",
    -    "something", "sometime", "sometimes", "somewhere", "still", "such",
    -    "system", "take", "ten", "than", "that", "the", "their", "them",
    -    "themselves", "then", "thence", "there", "thereafter", "thereby",
    -    "therefore", "therein", "thereupon", "these", "they", "thick", "thin",
    -    "third", "this", "those", "though", "three", "through", "throughout",
    -    "thru", "thus", "to", "together", "too", "top", "toward", "towards",
    -    "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us",
    -    "very", "via", "was", "we", "well", "were", "what", "whatever", "when",
    -    "whence", "whenever", "where", "whereafter", "whereas", "whereby",
    -    "wherein", "whereupon", "wherever", "whether", "which", "while", "whither",
    -    "who", "whoever", "whole", "whom", "whose", "why", "will", "with",
    -    "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves")
    +  /** Read stop words list from resources */
    +  def readStopWords(language: String): Array[String] = {
    +    val is = getClass.getResourceAsStream(s"/org/apache/spark/ml/feature/stopwords/$language.txt")
    +    scala.io.Source.fromInputStream(is).getLines().toArray
    +  }
    +
    +  /** Supported languages list must be lowercase */
    +  val supportedLanguages = Array("danish", "dutch", "english", "finnish", "french", "german",
    --- End diff --
    
    `private val supportedLanguages: Set[String] = Set(...)`. This should be a private val, and it should use a `Set` to test membership.
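
A minimal sketch of what this suggestion might look like, not the code that was eventually committed. The object name `StopWordsSketch` and the helper `isSupported` are hypothetical, and the set holds only the six language names visible in the diff excerpt above.

```scala
import java.util.Locale

object StopWordsSketch {
  // Hypothetical subset: only the names visible in the quoted diff line.
  // Entries are lowercase, as the diff's own comment requires.
  private val supportedLanguages: Set[String] =
    Set("danish", "dutch", "english", "finnish", "french", "german")

  /** Returns true if `language` (compared case-insensitively) is supported. */
  def isSupported(language: String): Boolean =
    supportedLanguages.contains(language.toLowerCase(Locale.ROOT))
}
```

Compared with an `Array`, a `Set` states the intent (membership testing) directly and avoids a linear scan, while `private` keeps the collection out of the public API.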




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by burakkose <gi...@git.apache.org>.
Github user burakkose commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200537833
  
    Jenkins, retest this please




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200137388
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-216692725
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11871#issuecomment-200057472
  
    **[Test build #53822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53822/consoleFull)** for PR 11871 at commit [`a308622`](https://github.com/apache/spark/commit/a30862231c3944c55c96cc94e162f61614aee6d5).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14050][ML] Add multiple languages suppo...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11871#discussion_r56937008
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
    @@ -123,21 +95,74 @@ class StopWordsRemover(override val uid: String)
       /** @group getParam */
       def getCaseSensitive: Boolean = $(caseSensitive)
     
    -  setDefault(stopWords -> StopWords.English, caseSensitive -> false)
    +  /**
    +    * the language of stop words
    +    * Default: "english"
    +    * @group param
    +    */
    +  val language: Param[String] = new Param[String](this, "language", "stopwords language")
    +
    +  /** @group setParam */
    +  def setLanguage(value: String): this.type = {
    +    val lang = value.toLowerCase
    +    require(StopWords.languageMap.contains(lang), s"$lang is not in language list")
    --- End diff --
    
    Use `ParamValidators`.
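
The idea behind this comment can be sketched without a Spark dependency: attach the validity check to the parameter itself rather than calling `require` inside the setter. Everything below (`LanguageParamSketch`, the local `inArray`, `ValidatedParam`, and the three-language list) is a hypothetical stand-in for illustration. In Spark itself, the analogous pieces would presumably be `ParamValidators.inArray` passed as the `isValid` argument of the `Param` constructor.

```scala
object LanguageParamSketch {
  // Hypothetical stand-in for a validator factory: builds a predicate
  // that accepts only values contained in `allowed`.
  def inArray[T](allowed: Array[T]): T => Boolean =
    (value: T) => allowed.contains(value)

  // Hypothetical, minimal stand-in for a parameter that carries its own validator.
  final case class ValidatedParam[T](name: String, doc: String, isValid: T => Boolean) {
    /** Returns `value` unchanged, or throws IllegalArgumentException if it is invalid. */
    def validate(value: T): T = {
      require(isValid(value), s"$name given invalid value $value.")
      value
    }
  }

  // The allowed values here are an illustrative subset, not the full list.
  val language: ValidatedParam[String] = ValidatedParam(
    "language",
    "stopwords language",
    inArray(Array("danish", "dutch", "english")))
}
```

With the check stored on the parameter, every code path that sets the value goes through the same validation, so the setter no longer needs its own `require`.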

