You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by keypointt <gi...@git.apache.org> on 2016/06/10 00:50:10 UTC

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

GitHub user keypointt opened a pull request:

    https://github.com/apache/spark/pull/13584

    [SPARK-15509][ML][SparkR] R MLlib algorithms should support input columns "features" and "label"

    https://issues.apache.org/jira/browse/SPARK-15509
    
    ## What changes were proposed in this pull request?
    
    Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers will fail since they will try to create a "features" column, which conflicts with the existing "features" column from the LibSVM loader. E.g., using the "mnist" dataset from LibSVM:
    `training <- loadDF(sqlContext, ".../mnist", "libsvm")`
    `model <- naiveBayes(label ~ features, training)`
    This fails with:
    ```
    16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed
    Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
      java.lang.IllegalArgumentException: Output column features already exists.
    	at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120)
    	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
    	at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179)
    	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
    	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
    	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
    	at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179)
    	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
    	at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131)
    	at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169)
    	at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62)
    	at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca
    The same issue appears for the "label" column once you rename the "features" column.
    ```
    The cause is, when using `loadDF()` to generate dataframes, sometimes it’s with default column name `“label”` and `“features”`, and these two name will conflict with default column names `setDefault(labelCol, "label")` and ` setDefault(featuresCol, "features")` of `SharedParams.scala`
    
    
    
    
    ## How was this patch tested?
    
    Test on my local machine.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/keypointt/spark SPARK-15509

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13584.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13584
    
----
commit cfed8844cbadbd760f73c2f906a1591806001a93
Author: Xin Ren <ia...@126.com>
Date:   2016-06-08T23:39:28Z

    [SPARK-15509] remove duplicate of intercept[IllegalArgumentException]

commit 77886fe59463027f24c6ca909638731145b46ee2
Author: Xin Ren <ia...@126.com>
Date:   2016-06-09T20:59:38Z

    [SPARK-15509] no column exists error for naivebayes. expand to other wrappers

commit e112ac0c0685f399f72e9ed60be00964ec4fcdc4
Author: Xin Ren <ia...@126.com>
Date:   2016-06-09T21:04:56Z

    [SPARK-15509] add a util function for all wrappers

commit ef3702ee5beefad1ee51fe15cb01e1716aeda362
Author: Xin Ren <ia...@126.com>
Date:   2016-06-09T22:27:37Z

    [SPARK-15509] expand column check to other wrappers

commit aab3a12fe09cf3039708468a80837fa421739c69
Author: Xin Ren <ia...@126.com>
Date:   2016-06-09T23:05:51Z

    [SPARK-15509] add unit test

commit f68ac34907f3a7d1d66e98572ada34d47df3eab9
Author: Xin Ren <ia...@126.com>
Date:   2016-06-10T00:01:44Z

    [SPARK-15509] some clean up

commit c8e30e9452031908fc829e527ab82a8e93598302
Author: Xin Ren <ia...@126.com>
Date:   2016-06-10T00:45:53Z

    [SPARK-15509] fix path

commit 43b2f8c5fb9e0d74579b948b1d52cad4faa76b66
Author: Xin Ren <ia...@126.com>
Date:   2016-06-10T00:48:36Z

    [SPARK-15509] fix path

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Merged. I could't change the assignee in the JIRA, somehow - @shivaram could you please do that?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    **[Test build #64764 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64764/consoleFull)** for PR 13584 at commit [`caa4183`](https://github.com/apache/spark/commit/caa41833c5d6dcfac046e5a804b1498258121d94).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    **[Test build #64811 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64811/consoleFull)** for PR 13584 at commit [`d9e3be5`](https://github.com/apache/spark/commit/d9e3be5f45db5731e74613507047380d3f6f40f3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64811/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    **[Test build #64811 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64811/consoleFull)** for PR 13584 at commit [`d9e3be5`](https://github.com/apache/spark/commit/d9e3be5f45db5731e74613507047380d3f6f40f3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by keypointt <gi...@git.apache.org>.

Github user keypointt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r77272735
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
    @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
        */
       def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
         if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    -      logWarning("data containing 'label' column, so change its name to avoid conflict")
    -      rFormula.setLabelCol(rFormula.getLabelCol + "_output")
    +      val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
    --- End diff --
    
    ok, it's a better way


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by keypointt <gi...@git.apache.org>.

Github user keypointt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r77271690
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
    @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
        */
       def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
         if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    -      logWarning("data containing 'label' column, so change its name to avoid conflict")
    -      rFormula.setLabelCol(rFormula.getLabelCol + "_output")
    +      val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
    +      logWarning(
    +        s"data containing ${rFormula.getLabelCol} column, changing its name to $newLabelName")
    --- End diff --
    
    sure, I'll change it


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by junyangq <gi...@git.apache.org>.

Github user junyangq commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    @shivaram Does it sound reasonable to you? Just discussed this with @jkbradley.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    **[Test build #64578 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64578/consoleFull)** for PR 13584 at commit [`1bc150f`](https://github.com/apache/spark/commit/1bc150f8af93f0e5d35e40fd39e33176c974d8cf).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Yeah I was going to say that we need to handle cases where `labels_output` is also used. We can just add a numeric suffix maybe ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    @keypointt Is this PR still relevant ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64578/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r76940367
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
    @@ -0,0 +1,47 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.r
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.ml.feature.RFormula
    +import org.apache.spark.sql.Dataset
    +
    +object RWrapperUtils extends Logging {
    +
    +  /**
    +   * DataFrame column check.
    +   * When loading data, default columns "features" and "label" will be added. And these two names
    +   * would conflict with RFormula default feature and label column names.
    +   * Here is to change the column name to avoid "column already exists" error.
    +   *
    +   * @param rFormula RFormula instance
    +   * @param data Input dataset
    +   * @return Unit
    +   */
    +  def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
    +    if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    +      logWarning("data containing 'label' column, so change its name to avoid conflict")
    --- End diff --
    
    is it possible to include the featurecol name in logging?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    LGTM - @felixcheung Feel free to merge when its ready


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by junyangq <gi...@git.apache.org>.

Github user junyangq commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    @keypointt Can we keep searching (in random or sequential way) until an unused column name has been found?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r77271384
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
    @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
        */
       def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
         if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    -      logWarning("data containing 'label' column, so change its name to avoid conflict")
    -      rFormula.setLabelCol(rFormula.getLabelCol + "_output")
    +      val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
    +      logWarning(
    +        s"data containing ${rFormula.getLabelCol} column, changing its name to $newLabelName")
    --- End diff --
    
    this sounds a bit like we are renaming the existing `label` column?
    perhaps just change to `s"data containing ${rFormula.getLabelCol} column, using new name to $newLabelName instead"`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64813/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    **[Test build #64578 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64578/consoleFull)** for PR 13584 at commit [`1bc150f`](https://github.com/apache/spark/commit/1bc150f8af93f0e5d35e40fd39e33176c974d8cf).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    @jkbradley Is this important for 2.0 ? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60261/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r77272395
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
    @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
        */
       def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
         if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    -      logWarning("data containing 'label' column, so change its name to avoid conflict")
    -      rFormula.setLabelCol(rFormula.getLabelCol + "_output")
    +      val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
    --- End diff --
    
    nit: i think we end up checking for `label_output` twice, once in `if (data.schema.fieldNames.contains(rFormula.getFeaturesCol))` and second time within `convertToUniqueName`? Perhaps we merge them?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by keypointt <gi...@git.apache.org>.

Github user keypointt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r76543182
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala ---
    @@ -54,9 +54,6 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul
         intercept[IllegalArgumentException] {
           formula.fit(original)
         }
    -    intercept[IllegalArgumentException] {
    --- End diff --
    
    here is just a duplication of above


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by keypointt <gi...@git.apache.org>.

Github user keypointt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r76543133
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/r/RWrapperUtilsSuite.scala ---
    @@ -0,0 +1,47 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.r
    +
    +import org.apache.spark.SparkFunSuite
    +import org.apache.spark.ml.feature.{RFormula, RFormulaModel}
    +import org.apache.spark.mllib.util.MLlibTestSparkContext
    +
    +class RWrapperUtilsSuite extends SparkFunSuite with MLlibTestSparkContext {
    +
    +  test("avoid column name conflicting") {
    +    val rFormula = new RFormula().setFormula("label ~ features")
    +    val data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")
    --- End diff --
    
    Here I used `"../data/"`, I'm not sure if there is a better way to do it, something like `$current_directory/data/mllib/sample_libsvm_data.txt`?
    
    All I found is like this `val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")` https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/NaiveBayesExample.scala#L36


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13584


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Hmm - the problem still seems to be relevant. @mengxr @junyangq Would one of you be able to look at this ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by keypointt <gi...@git.apache.org>.

Github user keypointt commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    sure I'll try to scan through all the mllib algorithms 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by junyangq <gi...@git.apache.org>.

Github user junyangq commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by keypointt <gi...@git.apache.org>.

Github user keypointt commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    I'm not sure, I guess this one is skipped and not important anymore?
    
    I can close it if it's not going to be merged


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by junyangq <gi...@git.apache.org>.

Github user junyangq commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Sounds good. That's also what we meant. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r77272270
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
    @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
        */
       def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
         if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    -      logWarning("data containing 'label' column, so change its name to avoid conflict")
    -      rFormula.setLabelCol(rFormula.getLabelCol + "_output")
    +      val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
    +      logWarning(
    +        s"data containing ${rFormula.getLabelCol} column, changing its name to $newLabelName")
    +      rFormula.setLabelCol(newLabelName)
         }
     
         if (data.schema.fieldNames.contains(rFormula.getFeaturesCol)) {
    -      logWarning("data containing 'features' column, so change its name to avoid conflict")
    -      rFormula.setFeaturesCol(rFormula.getFeaturesCol + "_output")
    +      val newFeaturesName = convertToUniqueName(rFormula.getFeaturesCol, data.schema.fieldNames)
    +      logWarning(
    +        s"data containing ${rFormula.getFeaturesCol} column, changing its name to $newFeaturesName")
    --- End diff --
    
    same here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    **[Test build #64764 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64764/consoleFull)** for PR 13584 at commit [`caa4183`](https://github.com/apache/spark/commit/caa41833c5d6dcfac046e5a804b1498258121d94).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by keypointt <gi...@git.apache.org>.

Github user keypointt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r77098961
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
    @@ -0,0 +1,47 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.r
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.ml.feature.RFormula
    +import org.apache.spark.sql.Dataset
    +
    +object RWrapperUtils extends Logging {
    +
    +  /**
    +   * DataFrame column check.
    +   * When loading data, default columns "features" and "label" will be added. And these two names
    +   * would conflict with RFormula default feature and label column names.
    +   * Here is to change the column name to avoid "column already exists" error.
    +   *
    +   * @param rFormula RFormula instance
    +   * @param data Input dataset
    +   * @return Unit
    +   */
    +  def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
    +    if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    +      logWarning("data containing 'label' column, so change its name to avoid conflict")
    +      rFormula.setLabelCol(rFormula.getLabelCol + "_output")
    --- End diff --
    
    sure I'll add this logic, incrementing until no conflict


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    **[Test build #60261 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60261/consoleFull)** for PR 13584 at commit [`43b2f8c`](https://github.com/apache/spark/commit/43b2f8c5fb9e0d74579b948b1d52cad4faa76b66).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    **[Test build #64813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64813/consoleFull)** for PR 13584 at commit [`8bb370e`](https://github.com/apache/spark/commit/8bb370ec0408ae10fe0c1c359b0f1b68066cbf87).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r77290933
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
    @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
        */
       def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
         if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    -      logWarning("data containing 'label' column, so change its name to avoid conflict")
    -      rFormula.setLabelCol(rFormula.getLabelCol + "_output")
    +      val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
    --- End diff --
    
    fair enough. that makes sense, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    LGTM. @shivaram do you have any other comment?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    **[Test build #64765 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64765/consoleFull)** for PR 13584 at commit [`1701252`](https://github.com/apache/spark/commit/1701252cf86a615874126215d956fd32d8eab0d0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    **[Test build #64813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64813/consoleFull)** for PR 13584 at commit [`8bb370e`](https://github.com/apache/spark/commit/8bb370ec0408ae10fe0c1c359b0f1b68066cbf87).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    **[Test build #60261 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60261/consoleFull)** for PR 13584 at commit [`43b2f8c`](https://github.com/apache/spark/commit/43b2f8c5fb9e0d74579b948b1d52cad4faa76b66).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r77272532
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
    @@ -0,0 +1,71 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.r
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.ml.feature.RFormula
    +import org.apache.spark.sql.Dataset
    +
    +object RWrapperUtils extends Logging {
    +
    +  /**
    +   * DataFrame column check.
    +   * When loading data, default columns "features" and "label" will be added. And these two names
    +   * would conflict with RFormula default feature and label column names.
    +   * Here is to change the column name to avoid "column already exists" error.
    +   *
    +   * @param rFormula RFormula instance
    +   * @param data Input dataset
    +   * @return Unit
    +   */
    +  def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
    +    if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    +      val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
    +      logWarning(
    +        s"data containing ${rFormula.getLabelCol} column, using new name $newLabelName instead")
    +      rFormula.setLabelCol(newLabelName)
    +    }
    +
    +    if (data.schema.fieldNames.contains(rFormula.getFeaturesCol)) {
    +      val newFeaturesName = convertToUniqueName(rFormula.getFeaturesCol, data.schema.fieldNames)
    +      logWarning(
    +        s"data containing ${rFormula.getFeaturesCol} column, using new name $newFeaturesName")
    --- End diff --
    
    let's make this consistent with the message above in L40?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64764/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    **[Test build #64765 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64765/consoleFull)** for PR 13584 at commit [`1701252`](https://github.com/apache/spark/commit/1701252cf86a615874126215d956fd32d8eab0d0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by keypointt <gi...@git.apache.org>.

Github user keypointt commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    hi @jkbradley do you mind have a look on this one? thanks a lot :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64765/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by keypointt <gi...@git.apache.org>.

Github user keypointt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r77273428
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
    @@ -35,13 +35,37 @@ object RWrapperUtils extends Logging {
        */
       def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
         if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    -      logWarning("data containing 'label' column, so change its name to avoid conflict")
    -      rFormula.setLabelCol(rFormula.getLabelCol + "_output")
    +      val newLabelName = convertToUniqueName(rFormula.getLabelCol, data.schema.fieldNames)
    --- End diff --
    
    I think in `if (data.schema.fieldNames.contains(rFormula.getFeaturesCol))`, it's checking `label` only
    
    and in `convertToUniqueName ()`,  `_output` will be appended resulting in `label_output `: `var newName = originalName + "_output"`, and then `label_output ` is checked at `while (fieldNames.contains(newName))`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by keypointt <gi...@git.apache.org>.

Github user keypointt commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    hi @jkbradley do you mind have a look on this one? thanks a lot :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    cc @mengxr 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms should supp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13584
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #13584: [SPARK-15509][ML][SparkR] R MLlib algorithms shou...

Posted by felixcheung <gi...@git.apache.org>.

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13584#discussion_r76940615
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/RWrapperUtils.scala ---
    @@ -0,0 +1,47 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.r
    +
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.ml.feature.RFormula
    +import org.apache.spark.sql.Dataset
    +
    +object RWrapperUtils extends Logging {
    +
    +  /**
    +   * DataFrame column check.
    +   * When loading data, default columns "features" and "label" will be added. And these two names
    +   * would conflict with RFormula default feature and label column names.
    +   * Here is to change the column name to avoid "column already exists" error.
    +   *
    +   * @param rFormula RFormula instance
    +   * @param data Input dataset
    +   * @return Unit
    +   */
    +  def checkDataColumns(rFormula: RFormula, data: Dataset[_]): Unit = {
    +    if (data.schema.fieldNames.contains(rFormula.getLabelCol)) {
    +      logWarning("data containing 'label' column, so change its name to avoid conflict")
    +      rFormula.setLabelCol(rFormula.getLabelCol + "_output")
    --- End diff --
    
    what if `something_output` is also already in the DataFrame?
    should we check for it?
    I thought the earlier discussion calls for appending a sequence number, like `something_output1` and incrementing util it is not already there?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org