Posted to reviews@spark.apache.org by hhbyyh <gi...@git.apache.org> on 2015/12/12 04:21:44 UTC

[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

GitHub user hhbyyh opened a pull request:

    https://github.com/apache/spark/pull/10272

    [SPARK-9578] [ML] Stemmer feature transformer

    jira: https://issues.apache.org/jira/browse/SPARK-9578
    
    Classical Porter stemmer, implemented with reference to scalanlp/chalk:
    https://github.com/scalanlp/chalk/blob/master/src/main/scala/chalk/text/analyze
    @jasonbaldridge Let me know if you're interested. 
    
    I compared the following implementations:
    http://tartarus.org/martin/PorterStemmer/scala.txt
    https://github.com/ifesdjeen/jReadability/blob/master/src/scala/main/com/jreadability/main/Stemmer.scala
    https://github.com/aztek/porterstemmer/blob/master/src/main/scala/com/github/aztek/porterstemmer/PorterStemmer.scala
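
For readers comparing these implementations: most Porter rules are gated on a word's measure m, the number of vowel-consonant sequences when the word is written as [C](VC)^m[V]. The block below is a minimal standalone sketch of that computation. The names `MeasureSketch`, `isVowelAt`, and `measure` are illustrative and do not come from the PR, and 'y' is classified on the fly here rather than by rewriting consonantal 'y' to 'Y' up front as the PR does.

```scala
object MeasureSketch {
  // A letter is a vowel if it is a, e, i, o, u, or a 'y' that follows a consonant.
  // Assumes lower-case input.
  def isVowelAt(w: String, i: Int): Boolean = w(i) match {
    case 'a' | 'e' | 'i' | 'o' | 'u' => true
    case 'y' => i > 0 && !isVowelAt(w, i - 1)
    case _ => false
  }

  // Porter measure m: count the positions where a vowel is directly
  // followed by a consonant, i.e. the number of VC sequences.
  def measure(w: String): Int = {
    val vowelFlags = w.indices.map(isVowelAt(w, _))
    vowelFlags.zip(vowelFlags.drop(1)).count { case (cur, next) => cur && !next }
  }
}
```

With this definition, "tree" and "by" have measure 0, "trouble" and "ivy" have measure 1, and "private" and "orrery" have measure 2, which matches the examples in Porter's description of the algorithm.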
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hhbyyh/spark stem

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10272.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10272
    
----
commit 857f68d2fd21c97c780e7d04be4e0a666ec9d52d
Author: Yuhao Yang <hh...@gmail.com>
Date:   2015-12-12T02:38:30Z

    initial stemmer

commit 211ac04593afb79051287e159261ad06fa4b4464
Author: Yuhao Yang <hh...@gmail.com>
Date:   2015-12-12T02:51:45Z

    line fix

commit ddb26da2b3e15a2b433636bfa4c6d4e5877ea96a
Author: Yuhao Yang <hh...@gmail.com>
Date:   2015-12-12T03:13:45Z

    case fix

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by jasonbaldridge <gi...@git.apache.org>.

Github user jasonbaldridge commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-199631674
  
    @mengxr Chalk isn't under current development unfortunately, and I'm not sure whether I'll be getting back to it. Another option might be to add it to the lib-text library, which is being maintained and updated, and might have some other useful things for you:
    
    https://github.com/peoplepattern/lib-text
    
    cc @eponvert and @dlwh



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by dlwh <gi...@git.apache.org>.

Github user dlwh commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-199632765
  
    (Totally not paying attention to this issue.)
    
    Epic has a PorterStemmer as well. (Well, I re-took it from Chalk.)
    https://github.com/dlwh/epic
    
    I've been neglecting Epic to some extent of late, but it's there and
    available.
    
    I'm not sure it's worth adding a large dependency just for that, but just
    FYI.
    
    On Mon, Mar 21, 2016 at 9:21 PM, Jason Baldridge <no...@github.com>
    wrote:
    
    > @mengxr <https://github.com/mengxr> Chalk isn't under current development
    > unfortunately, and I'm not sure whether I'll be getting back to it. Another
    > option might be to add it to the lib-text library, which is being
    > maintained and updated, and might have some other useful things for you:
    >
    > https://github.com/peoplepattern/lib-text
    >
    > cc @eponvert <https://github.com/eponvert> and @dlwh
    > <https://github.com/dlwh>




[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by jasonbaldridge <gi...@git.apache.org>.

Github user jasonbaldridge commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-199884904
  
    @MLnick CoreNLP is GPL, so I would worry that some people would use it as though it were part of an ASL suite. Spark-CoreNLP is correctly licensed as GPL, but there should be some big warning flags in the README so that people don't inadvertently use it in a way that is inconsistent with the GPL.
    
    FWIW, I would strongly prefer an "NLP for Spark" package to be ASL, so Spark-CoreNLP isn't useful (though it's great stuff, objectively speaking).



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-164108452
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh closed the pull request at:

    https://github.com/apache/spark/pull/10272



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-164108453
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47606/



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-191888717
  
    @hhbyyh I would try to avoid maintaining a stemmer implementation in MLlib. Stemming is not a distributed algorithm, and several NLP libraries already implement it. The best option is to introduce a dependency and wrap its stemmer implementation. If we improve an existing stemmer implementation, we should consider contributing the changes upstream.



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by BenFradet <gi...@git.apache.org>.

Github user BenFradet commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-164438709
  
    There are a few quirks regarding formatting.
    
    Also, I'm wondering whether the different "step" methods should be documented or renamed, so that readers can tell what each one does without reading through the code.
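
One possible shape for that suggestion, sketched on step 1a only: keep the Porter step number in the doc comment and give the method a name that says what it does. The name `stripPluralSuffix` is hypothetical, and the body is a lightly condensed version of the PR's `step1a`, not the code as submitted.

```scala
object Step1aSketch {
  /**
   * Porter step 1a: reduce plural endings.
   *   sses -> ss   (caresses -> caress)
   *   ies  -> i    (ponies   -> poni)
   *   s    ->      (cats     -> cat), unless the word ends in "ss"
   */
  def stripPluralSuffix(w: String): String =
    if (w.endsWith("sses") || w.endsWith("ies")) w.dropRight(2)
    else if (w.endsWith("s") && !w.endsWith("ss")) w.dropRight(1)
    else w
}
```

Either documenting or renaming would work; the sketch does both so the call site reads naturally and the comment still maps back to Porter's step numbering.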



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by MLnick <gi...@git.apache.org>.

Github user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-199700457
  
    IMO more specific or complex domain-specific stuff should live outside of core, until such time as there is clear demand across a wider user base that justifies the maintenance cost of including it. Already Spark ML has a large maintenance & code review burden just with the algos and feature transformers that are already in there.
    
    The whole point of an API for pipelines is to enable external libraries for more specific use cases. This is doubly the case when well-known and robust libraries already provide the functionality. As you can see from your PR, implementing one's own stemmer transformer using one of the external NLP libs is a few lines of code.
    
    Things like NLP (and image, video and audio processing, for example) should start life as a Spark package. How about looking at contributing to https://github.com/mengxr/spark-corenlp and wrapping the CoreNLP stemmer functionality as a transformer?
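
To illustrate the "few lines of code" point, here is a rough sketch of a transformer that lives outside Spark and delegates to an external stemmer. It assumes a Spark 2.x or later `spark-mllib` artifact on the classpath. `HypotheticalStemmer` is a placeholder for whatever library would actually be wrapped (its suffix stripping is deliberately trivial), and `WrappedStemmer` is an illustrative name, not an existing class.

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// Placeholder for a stemmer provided by an external NLP library (assumption).
object HypotheticalStemmer {
  def stem(word: String): String = word.toLowerCase.stripSuffix("s")
}

class WrappedStemmer(override val uid: String)
    extends UnaryTransformer[Seq[String], Seq[String], WrappedStemmer] {

  def this() = this(Identifiable.randomUID("wrappedStemmer"))

  // Apply the external stemmer to every token in the input array.
  override protected def createTransformFunc: Seq[String] => Seq[String] =
    _.map(HypotheticalStemmer.stem)

  // Accept only array-of-string input columns.
  override protected def validateInputType(inputType: DataType): Unit = {
    val ok = inputType match {
      case ArrayType(StringType, _) => true
      case _ => false
    }
    require(ok, s"Input type must be ArrayType(StringType) but got $inputType.")
  }

  override protected def outputDataType: DataType =
    ArrayType(StringType, containsNull = true)
}
```

A pipeline would then use it like any other stage, for example `new WrappedStemmer().setInputCol("tokens").setOutputCol("stems")`. Persistence support is left out of the sketch.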


[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-165378488
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-194587735
  
    @mengxr 
    Thanks for taking a look.
    The comment from Joseph (https://issues.apache.org/jira/browse/SPARK-5571?focusedCommentId=14632052&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14632052) seems to favor adding the code directly. 
    We can put this on hold if necessary. 



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by BenFradet <gi...@git.apache.org>.

Github user BenFradet commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10272#discussion_r47496959
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Stemmer.scala ---
    @@ -0,0 +1,260 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.annotation.{Experimental, Since}
    +import org.apache.spark.ml.UnaryTransformer
    +import org.apache.spark.ml.param.ParamMap
    +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
    +import org.apache.spark.sql.types.{ArrayType, DataType, StringType}
    +
    +/**
    + * :: Experimental ::
    + * Stemmer removes the commoner morphological and inflexional endings from words in English
    + */
    +@Experimental
    +@Since("1.7.0")
    +class Stemmer (override val uid: String)
    +  extends UnaryTransformer[Seq[String], Seq[String], Stemmer] with DefaultParamsWritable {
    +  def this() = this(Identifiable.randomUID("stemmer"))
    +
    +  override protected def createTransformFunc: Seq[String] => Seq[String] = {
    +    terms => terms.map(t => PorterStemmer(t))
    +  }
    +
    +  override protected def validateInputType(inputType: DataType): Unit = {
    +    require(inputType.sameType(ArrayType(StringType)),
    +      s"Input type must be ArrayType(StringType) but got $inputType.")
    +  }
    +
    +  override protected def outputDataType: DataType = new ArrayType(StringType, true)
    +
    +  override def copy(extra: ParamMap): Stemmer = defaultCopy(extra)
    +}
    +
    +@Since("1.7.0")
    +object Stemmer extends DefaultParamsReadable[Stemmer] {
    +
    +  @Since("1.7.0")
    +  override def load(path: String): Stemmer = super.load(path)
    +}
    +
    +/**
    + * :: Experimental ::
    + * Classical Porter stemmer, which is implemented referring to scalanlp/chalk
    + * [[https://github.com/scalanlp/chalk/blob/master/src/main/scala/chalk/text/analyze]].
    + * The details of PorterStemmer can be found at
    + * [[http://snowball.tartarus.org/algorithms/porter/stemmer.html]].
    + */
    +private[feature] object PorterStemmer {
    +
    +  def apply(w: String): String = {
    +    if (w.length < 3) w.toLowerCase
    +    else {
    +      val ret = w.toLowerCase.replaceAll("([aeiou])y", "$1Y").replaceAll("^y", "Y")
    +      step5(step4(step3(step2(step1(ret))))).toLowerCase
    +    }
    +  }
    +
    +  private def step1(w: String): String = step1c(step1b(step1a(w)))
    +
    +  private def step1a(w: String): String = {
    +    if (w.endsWith("sses") || w.endsWith("ies")) {
    +      w.substring(0, w.length - 2)
    +    }
    +    else if (w.endsWith("s") && w.charAt(w.length - 2) != 's') {
    +      w.substring(0, w.length - 1)
    +    }
    +    else w
    +  }
    +
    +  private def step1b(w: String): String = {
    +    def extra(w: String) = {
    +      if (w.endsWith("at") || w.endsWith("bl") || w.endsWith("iz")) w + 'e'
    +      else if (doublec(w) && !"lsz".contains(w.last)) w.substring(0, w.length - 1)
    +      else if (m(w) == 1 && cvc(w)) w + "e"
    +      else w
    +    }
    +
    +    if (w.endsWith("eed")) {
    +      if (m(w.substring(0, w.length - 3)) > 0) w.substring(0, w.length - 1) else w
    +    } else if (w.endsWith("ed")) {
    +      if (w.indexWhere(isVowel) < (w.length - 2)) extra(w.substring(0, w.length - 2))
    +      else w
    +    } else if (w.endsWith("ing")) {
    +      if (w.indexWhere(isVowel) < (w.length - 3)) extra(w.substring(0, w.length - 3))
    +      else w
    +    } else w
    +  }
    +
    +  private def step1c(w: String): String = {
    +    if ((w.last == 'y' || w.last == 'Y') && w.indexWhere(isVowel) < w.length - 1) {
    +      w.substring(0, w.length - 1) + 'i'
    +    } else w
    +  }
    +
    +  private def step2(w: String): String = {
    +    if (w.length < 3) w
    +    else {
    +      val opt = w(w.length - 2) match {
    +        case 'a' => replaceSuffix(w, "ational", "ate").orElse(replaceSuffix(w, "tional", "tion"))
    +        case 'c' =>
    +          replaceSuffix(w, "enci", "ence").orElse(replaceSuffix(w, "anci", "ance"))
    +        case 'e' => replaceSuffix(w, "izer", "ize")
    +        case 'g' => replaceSuffix(w, "logi", "log")
    +        case 'l' => replaceSuffix(w, "bli", "ble")
    +          .orElse(replaceSuffix(w, "alli", "al"))
    +          .orElse ( replaceSuffix(w, "entli", "ent"))
    --- End diff --
    
    The `orElse` chains should be formatted consistently.
    The same goes for `step4`.
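
A sketch of what a uniform layout could look like for the `'l'` branch quoted above: one `.orElse(...)` per line, no space before the parenthesis, no padding inside it. The body of the PR's `replaceSuffix` is not shown in the diff, so the version here is a simplified stand-in that ignores the Porter measure condition; `OrElseSketch` and `step2L` are illustrative names.

```scala
object OrElseSketch {
  // Simplified stand-in (assumption): returns the rewritten word if it ends
  // with `suffix`, otherwise None. The real helper presumably also checks
  // the measure of the remaining stem.
  def replaceSuffix(w: String, suffix: String, replacement: String): Option[String] =
    if (w.endsWith(suffix)) Some(w.dropRight(suffix.length) + replacement) else None

  // Each alternative sits on its own line with identical formatting.
  def step2L(w: String): Option[String] =
    replaceSuffix(w, "bli", "ble")
      .orElse(replaceSuffix(w, "alli", "al"))
      .orElse(replaceSuffix(w, "entli", "ent"))
}
```

Because `orElse` takes its argument by name, later alternatives are only evaluated when the earlier ones return `None`, so the first matching suffix wins regardless of layout.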



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-165356666
  
    **[Test build #47903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47903/consoleFull)** for PR 10272 at commit [`ff03152`](https://github.com/apache/spark/commit/ff03152daa3d710dbb54b244488f9eb4b4a80378).



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-164105692
  
    **[Test build #47606 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47606/consoleFull)** for PR 10272 at commit [`ddb26da`](https://github.com/apache/spark/commit/ddb26da2b3e15a2b433636bfa4c6d4e5877ea96a).



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by BenFradet <gi...@git.apache.org>.

Github user BenFradet commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10272#discussion_r47496999
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StemmerSuite.scala ---
    @@ -0,0 +1,154 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.ml.feature
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.param.ParamsSuite
    +import org.apache.spark.ml.util.DefaultReadWriteTest
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +import org.apache.spark.sql.{DataFrame, Row}
    +
    +class StemmerSuite extends SparkFunSuite with MLlibTestSparkContext  with DefaultReadWriteTest {
    +
    +  test("params") {
    +    ParamsSuite.checkParams(new Stemmer)
    +  }
    +
    +  test("read/write") {
    +    val t = new Stemmer()
    +      .setInputCol("myInputCol")
    +      .setOutputCol("myOutputCol")
    +    testDefaultReadWrite(t)
    +  }
    +
    +  test("stem plurals") {
    +    val plurals = Map(
    +      "caresses" -> "caress",
    +      "ponies" -> "poni",
    +      "ties" -> "ti",
    +      "caress" -> "caress",
    +      "cats" -> "cat")
    +    val dataset = sqlContext.createDataFrame(
    +      plurals.map(kv => Array(kv._1) -> Array(kv._2)).toSeq).toDF("tokens", "expected")
    +    StemmerSuite.testStemmer(dataset)
    +  }
    +
    +  test("stem past participles") {
    +    val participles = Map(
    +      "plastered" -> "plaster",
    +      "bled" -> "bled",
    +      "motoring" -> "motor",
    +      "sing" -> "sing",
    +
    +      "conflated" -> "conflat",
    +      "troubled" -> "troubl",
    +      "sized" -> "size",
    +      "hopping" -> "hop",
    +      "tanned" -> "tan",
    +      "falling" -> "fall",
    +      "hissing" -> "hiss",
    +      "fizzed" -> "fizz",
    +      "failing" -> "fail",
    +      "filing" -> "file",
    +
    +      "happy" -> "happi",
    +      "sky" -> "sky"
    +    )
    +    val dataset = sqlContext.createDataFrame(
    +      participles.map(kv => Array(kv._1) -> Array(kv._2)).toSeq).toDF("tokens", "expected")
    +    StemmerSuite.testStemmer(dataset)
    +  }
    +
    +  test("change suffixes") {
    +    val changes = Map(
    +      "relational" -> "relat",
    +      "conditional" -> "condit",
    +      "rational" -> "ration",
    +      "valenci" -> "valenc",
    +      "hesitanci" -> "hesit",
    +      "digitizer" -> "digit",
    +      "conformabli" -> "conform",
    +      "radicalli" -> "radic",
    +      "differentli" -> "differ",
    +      "vileli" -> "vile",
    +      "analogousli" -> "analog",
    +      "vietnamization" -> "vietnam",
    +      "predication" -> "predic",
    +      "operator" -> "oper",
    +      "feudalism" -> "feudal",
    +      "decisiveness" -> "decis",
    +      "hopefulness" -> "hope",
    +      "callousness" -> "callous",
    +      "formaliti" -> "formal",
    +      "sensitiviti" -> "sensit",
    +      "sensibiliti" -> "sensibl",
    +
    +      "triplicate" -> "triplic",
    +      "formative" -> "form",
    +      "formalize" -> "formal",
    +      "electriciti" -> "electr",
    +      "electrical" -> "electr",
    +      "hopeful" -> "hope",
    +      "goodness" -> "good",
    +
    +      "revival" -> "reviv",
    +      "allowance" -> "allow",
    +      "inference" -> "infer",
    +      "airliner" -> "airlin",
    +      "gyroscopic" -> "gyroscop",
    +      "adjustable" -> "adjust",
    +      "defensible" -> "defens",
    +      "irritant" -> "irrit",
    +      "replacement" -> "replac",
    +      "adjustment" -> "adjust",
    +      "dependent" -> "depend",
    +      "adoption" -> "adopt",
    +      "homologou" -> "homolog",
    +      "communism" -> "commun",
    +      "activate" -> "activ",
    +      "angulariti" -> "angular",
    +      "homologous" -> "homolog",
    +      "effective" -> "effect",
    +      "bowdlerize" -> "bowdler",
    +
    +      "probate" -> "probat",
    +      "rate" -> "rate",
    +      "cease" -> "ceas",
    +      "controll" -> "control",
    +      "roll" -> "roll"
    +    )
    +    val dataset = sqlContext.createDataFrame(
    +      changes.map(kv => Array(kv._1) -> Array(kv._2)).toSeq).toDF("tokens", "expected")
    +    StemmerSuite.testStemmer(dataset)
    +  }
    +}
    +
    +private object StemmerSuite extends SparkFunSuite {
    +
    +  def testStemmer(dataset: DataFrame): Unit = {
    +
    +    val stemmer = new Stemmer()
    +      .setInputCol("tokens")
    +      .setOutputCol("stemmed")
    +
    +    stemmer.transform(dataset).select("expected", "stemmed")
    +      .collect()
    +      .foreach { case Row(expected, stemmed) =>
    +      assert(stemmed === expected)
    --- End diff --
    
    The indent seems to be broken.
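
For comparison, a conventional layout puts the `case` on its own line inside the braces and indents the body one level beneath it. The sketch below assumes `spark-sql` on the classpath and rows with exactly two columns in the order produced by `select("expected", "stemmed")`. `RowCheckSketch` and `assertStemsMatch` are illustrative names, and plain `assert` stands in for the ScalaTest `===` assertion used in the suite.

```scala
import org.apache.spark.sql.Row

object RowCheckSketch {
  // Fails on the first row whose stemmed tokens differ from the expected ones.
  def assertStemsMatch(rows: Seq[Row]): Unit =
    rows.foreach {
      case Row(expected, stemmed) =>
        assert(expected == stemmed, s"expected $expected but got $stemmed")
    }
}
```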



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-165378340
  
    **[Test build #47903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47903/consoleFull)** for PR 10272 at commit [`ff03152`](https://github.com/apache/spark/commit/ff03152daa3d710dbb54b244488f9eb4b4a80378).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `class Stemmer (override val uid: String)`



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-165378491
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47903/



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by jasonbaldridge <gi...@git.apache.org>.

Github user jasonbaldridge commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-165439054
  
    FWIW, I did a Scala adaptation of a Java PorterStemmer here too: https://github.com/utcompling/Scalabha/blob/master/src/main/scala/opennlp/scalabha/lang/eng/PorterStemmer.scala




[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-199535165
  
    @hhbyyh Joseph's comment was about carefully introducing new dependencies. If we pick a library that doesn't depend on many others, it should be safe for us. There are several packages containing a Porter stemmer, e.g., Lucene, CoreNLP, and chalk. Lucene is a lightweight library, but it is used by many other systems, so it is not a safe choice. CoreNLP is licensed under LGPL, so it is not an option here. chalk seems okay to me, judging by its dependencies.
    
    I'm a little worried about the cost if we maintain our own implementation in MLlib. We cannot leverage other NLP projects (where the experts are) for possible improvements. So could you take a look at chalk?
    
    @jasonbaldridge To add chalk as a dependency, we need chalk releases for both Scala 2.10 and 2.11, but I only see 2.10 releases on Maven Central. Do you have plans for publishing new releases for both 2.10 and 2.11?
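
For readers unfamiliar with cross-publishing, here is a minimal sketch of what releasing for both Scala 2.10 and 2.11 usually involves in an sbt build. It assumes the library is built with sbt; the version numbers are illustrative assumptions and are not taken from chalk's actual build.

```scala
// build.sbt (sketch; the exact version numbers below are assumptions)
scalaVersion := "2.11.7"

// List every Scala version the library should be built and published for.
// sbt appends the binary version to the artifact name, e.g. chalk_2.10 and chalk_2.11.
crossScalaVersions := Seq("2.10.6", "2.11.7")
```

With this setting, prefixing a task with `+` (for example `sbt +test` or `sbt +publish`) runs it once per listed Scala version, which is how one release ends up with an artifact for each.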


[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-164108430
  
    **[Test build #47606 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47606/consoleFull)** for PR 10272 at commit [`ddb26da`](https://github.com/apache/spark/commit/ddb26da2b3e15a2b433636bfa4c6d4e5877ea96a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
       * `class Stemmer (override val uid: String)`



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-199823007
  
    I'm fine with putting this on hold, as previously stated. Thank you all for the discussion. It should have provided enough information for anyone interested in using Stemmer with Spark.
    
    I plan to put this in the topic modeling Spark package for now. Will send a link here afterwards.



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by jasonbaldridge <gi...@git.apache.org>.

Github user jasonbaldridge commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-165810877
  
    @hhbyyh: I'm pretty slammed right now with other things, so if you'd like to go ahead and compare and choose whichever, that's totally fine with me. Thanks!



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-165354198
  
    @BenFradet, Thanks for helping review. I added some comments, yet listing all conditions may not be necessary. 



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-167059047
  
    The current implementation still outperforms the one from https://github.com/utcompling/Scalabha/blob/master/src/main/scala/opennlp/scalabha/lang/eng/PorterStemmer.scala by about 70%.
    
    This is ready for review now.
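
A comparison like the one above can be reproduced with a small timing harness. The following is only a sketch under stated assumptions: `stripS` and `stripSRegex` are hypothetical stand-ins, not the PR's stemmer or the Scalabha one, and the word list and round counts are arbitrary. It does not reproduce the measurement reported here.

```scala
object StemmerTiming {
  // Hypothetical stand-ins; in practice, plug in the two Porter stemmers being compared.
  val stripS: String => String = w => if (w.endsWith("s")) w.dropRight(1) else w
  val stripSRegex: String => String = w => w.replaceAll("s$", "")

  /** Total wall-clock nanoseconds to apply `stem` to every word, `rounds` times. */
  def timeNanos(stem: String => String, words: Seq[String], rounds: Int): Long = {
    var checksum = 0L // consume the results so the work cannot be optimized away
    val start = System.nanoTime()
    for (_ <- 0 until rounds; w <- words) checksum += stem(w).length
    val elapsed = System.nanoTime() - start
    require(checksum >= 0)
    elapsed
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("cats", "running", "stems", "spark", "transformers")
    // Warm up both candidates so the JIT has compiled them before measuring.
    timeNanos(stripS, words, 10000)
    timeNanos(stripSRegex, words, 10000)
    val a = timeNanos(stripS, words, 100000)
    val b = timeNanos(stripSRegex, words, 100000)
    println(f"candidate A: ${a / 1e6}%.1f ms, candidate B: ${b / 1e6}%.1f ms")
  }
}
```

Both candidates should be checked for identical output on the same vocabulary before their timings are compared, and a dedicated benchmarking tool gives more trustworthy numbers than a hand-rolled loop.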



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by eponvert <gi...@git.apache.org>.

Github user eponvert commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-199815213
  
    @jasonbaldridge @mengxr I went ahead and opened https://github.com/peoplepattern/lib-text/issues/21, though I've not scheduled it for a target release yet



[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by hhbyyh <gi...@git.apache.org>.

Github user hhbyyh commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-165656076
  
    Thanks @jasonbaldridge for taking a look. If you're interested, I'll close this PR so that you can send a new one. I can help review and compare the performance.
    
    Thanks @BenFradet for helping review.
    




[GitHub] spark pull request: [SPARK-9578] [ML] Stemmer feature transformer

Posted by BenFradet <gi...@git.apache.org>.

Github user BenFradet commented on the pull request:

    https://github.com/apache/spark/pull/10272#issuecomment-165378793
  
    LGTM

