Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/02 08:41:51 UTC

[GitHub] [spark] viirya opened a new pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

viirya opened a new pull request #28704:
URL: https://github.com/apache/spark/pull/28704


   
   ### What changes were proposed in this pull request?
   
   This patch adds support for a user-specified fold column to `CrossValidator`. Users can assign a fold number to each row of the dataset instead of letting Spark split it randomly.
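
To make the idea concrete, here is a minimal plain-Python sketch of splitting by a fold column. It is not the Spark implementation: the helper name `k_fold_by_column`, the row layout, and the column name `"fold"` are assumptions made for illustration. For each fold number, rows carrying that number form the validation set and all other rows form the training set.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def k_fold_by_column(
    rows: List[Row], num_folds: int, fold_col: str
) -> List[Tuple[List[Row], List[Row]]]:
    """Return one (training, validation) pair per fold number.

    Hypothetical helper: it only illustrates the semantics of a
    user-specified fold column, not Spark's actual code path.
    """
    splits = []
    for fold in range(num_folds):
        # Rows tagged with the current fold number are held out for validation.
        validation = [r for r in rows if r[fold_col] == fold]
        # Every other row is used for training.
        training = [r for r in rows if r[fold_col] != fold]
        splits.append((training, validation))
    return splits


# Illustrative data: the user, not a random splitter, decides the fold of each row.
rows = [
    {"label": 1.0, "feature": 0.5, "fold": 0},
    {"label": 0.0, "feature": 1.5, "fold": 1},
    {"label": 1.0, "feature": 2.5, "fold": 2},
    {"label": 0.0, "feature": 3.5, "fold": 0},
    {"label": 1.0, "feature": 4.5, "fold": 1},
    {"label": 0.0, "feature": 5.5, "fold": 2},
]

splits = k_fold_by_column(rows, num_folds=3, fold_col="fold")
for fold, (training, validation) in enumerate(splits):
    print(fold, len(training), len(validation))
```

Because the assignment comes from the data, related rows (for example, rows from the same user or time window) can be kept in the same fold, which a random split cannot guarantee.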
   
   ### Why are the changes needed?
   
   This gives `CrossValidator` users more flexibility in splitting folds.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, a new `foldCol` param is added to `CrossValidator`. Users can set it to specify a custom fold assignment.
   
   ### How was this patch tested?
   
   Added unit tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638022253








[GitHub] [spark] SparkQA removed a comment on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637691431


   **[Test build #123442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123442/testReport)** for PR 28704 at commit [`f34ab0d`](https://github.com/apache/spark/commit/f34ab0d3245ec0609be081a0def47deb38b6498f).




[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641655170


   Added a fold number check in both Scala and Python.
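
The thread does not show the check itself, so the following is only a sketch of what such a validation could look like in plain Python. The function name `check_fold_numbers`, the choice of `ValueError`, and the message wording are assumptions; the valid range `[0, numFolds)` is the one stated in the quoted param documentation.

```python
from typing import Iterable


def check_fold_numbers(fold_values: Iterable[int], num_folds: int) -> None:
    """Raise if any fold number falls outside [0, num_folds).

    Hypothetical helper for illustration; the actual check added in the
    pull request may differ in name, error type, and message.
    """
    for value in fold_values:
        if not 0 <= value < num_folds:
            raise ValueError(
                f"Fold number must be in range [0, {num_folds}), but got {value}."
            )


# All values are valid fold numbers for three folds, so nothing is raised.
check_fold_numbers([0, 1, 2, 1, 0], num_folds=3)

# A value equal to num_folds is already out of range.
message = ""
try:
    check_fold_numbers([0, 1, 3], num_folds=3)
except ValueError as err:
    message = str(err)
print(message)
```

Failing fast here turns a silent data loss (rows that never land in a validation fold) into an explicit error.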




[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637695722








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638001274


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123464/
   Test FAILed.




[GitHub] [spark] srowen commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-640962985


   @mengxr you opened the JIRA - any comments? Looks reasonable to me.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637990461


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123463/
   Test FAILed.




[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-645062852


   Merging to master!




[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641662756








[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641656471


   **[Test build #123712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123712/testReport)** for PR 28704 at commit [`aa7c8d0`](https://github.com/apache/spark/commit/aa7c8d07826d31219acd617fbd8392181256957e).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637391122








[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637720304


   **[Test build #123442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123442/testReport)** for PR 28704 at commit [`f34ab0d`](https://github.com/apache/spark/commit/f34ab0d3245ec0609be081a0def47deb38b6498f).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637654846


   cc @mengxr @srowen 




[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-644852913


   Thanks! I will merge this in one day if there are no other opinions.




[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-643937918


   @srowen @huaxingao @zhengruifeng thanks for the previous review. Do you have any more comments on this change?




[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638020051


   retest this please




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637990448


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641656471


   **[Test build #123712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123712/testReport)** for PR 28704 at commit [`aa7c8d0`](https://github.com/apache/spark/commit/aa7c8d07826d31219acd617fbd8392181256957e).




[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641681041








[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637695099


   **[Test build #123444 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123444/testReport)** for PR 28704 at commit [`786abaa`](https://github.com/apache/spark/commit/786abaa44376697e862e01744a246f1bfb53d497).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641655007


   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/28336/
   Test PASSed.




[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641662475


   **[Test build #123713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123713/testReport)** for PR 28704 at commit [`218db10`](https://github.com/apache/spark/commit/218db10533bb803807e453a9288402b2415eaa30).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641681041








[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641656994








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637720744


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123442/
   Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638068044


   Merged build finished. Test FAILed.




[GitHub] [spark] huaxingao commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
huaxingao commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-644213530


   LGTM




[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637989881








[GitHub] [spark] viirya commented on a change in pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28704:
URL: https://github.com/apache/spark/pull/28704#discussion_r434020686



##########
File path: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala
##########
@@ -40,10 +41,22 @@ class CrossValidatorSuite
   import testImplicits._
 
   @transient var dataset: Dataset[_] = _
+  @transient var datasetWithFold: Dataset[_] = _
 
   override def beforeAll(): Unit = {
     super.beforeAll()
     dataset = sc.parallelize(generateLogisticInput(1.0, 1.0, 100, 42), 2).toDF()
+    val foldCol = udf { () =>
+      val r = Math.random()

Review comment:
       Good idea.






[GitHub] [spark] srowen commented on a change in pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #28704:
URL: https://github.com/apache/spark/pull/28704#discussion_r434008701



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala
##########
@@ -56,6 +56,19 @@ private[ml] trait CrossValidatorParams extends ValidatorParams {
   def getNumFolds: Int = $(numFolds)
 
   setDefault(numFolds -> 3)
+
+  /**
+   * Param for the column name of user specified fold number. Once this is specified,
+   * `CrossValidator` won't do random k-fold split. Note that this column should be
+   * integer type with range [0, numFolds) and Spark won't do sanity-check for this

Review comment:
       I guess we could, perhaps, say that we'll take the value mod numFolds? That might open up a few more usages and avoid puzzling errors where some data isn't in any fold. (Or else, do sanity-check the range.)
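
The trade-off raised in this comment can be illustrated with a small plain-Python sketch. It is not Spark code: `validation_fold_sizes` and its `use_mod` flag are hypothetical names. Without any mapping, rows whose fold value lies outside `[0, numFolds)` never match a fold and drop out of every validation set; taking the value mod `numFolds` assigns every row to some fold.

```python
from typing import List


def validation_fold_sizes(
    fold_values: List[int], num_folds: int, use_mod: bool = False
) -> List[int]:
    """Count how many rows end up in each validation fold.

    Hypothetical helper for illustration only.
    """
    sizes = [0] * num_folds
    for value in fold_values:
        # Python's % is non-negative for a positive modulus. On the JVM,
        # -1 % 3 evaluates to -1, so a Scala or Java version of this idea
        # would need something like Math.floorMod instead.
        fold = value % num_folds if use_mod else value
        if 0 <= fold < num_folds:
            sizes[fold] += 1
    return sizes


fold_values = [0, 1, 2, 3, 4, 7, -1]

# Equality-based matching: the rows tagged 3, 4, 7 and -1 are in no fold.
strict = validation_fold_sizes(fold_values, num_folds=3)

# Mod mapping: every row lands in exactly one fold.
wrapped = validation_fold_sizes(fold_values, num_folds=3, use_mod=True)

print(strict, sum(strict))
print(wrapped, sum(wrapped))
```

The mod mapping is more permissive (any integer key, such as a hash or a group id, works as a fold value), while a strict range check is easier to reason about because a typo in the fold column fails loudly.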

##########
File path: mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala
##########
@@ -40,10 +41,22 @@ class CrossValidatorSuite
   import testImplicits._
 
   @transient var dataset: Dataset[_] = _
+  @transient var datasetWithFold: Dataset[_] = _
 
   override def beforeAll(): Unit = {
     super.beforeAll()
     dataset = sc.parallelize(generateLogisticInput(1.0, 1.0, 100, 42), 2).toDF()
+    val foldCol = udf { () =>
+      val r = Math.random()

Review comment:
       Not sure if it matters but do you want to create this list ahead of time from a seeded RNG, and parallelize it, to ensure it doesn't vary?
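
One way to read the seeded-RNG suggestion, as a plain-Python sketch rather than the test-suite code (the function name `seeded_fold_assignments` and the seed value are assumptions made for illustration): build the fold labels up front from a seeded generator, so the same list is produced on every run.

```python
import random


def seeded_fold_assignments(n_rows: int, num_folds: int, seed: int = 42) -> list:
    """Return one fold label per row, reproducible for a given seed."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [rng.randrange(num_folds) for _ in range(n_rows)]


first = seeded_fold_assignments(100, 3)
second = seeded_fold_assignments(100, 3)
```

Because the labels are fixed before the data is distributed, they do not depend on how often or where a per-row function is evaluated.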






[GitHub] [spark] SparkQA removed a comment on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637390451


   **[Test build #123426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123426/testReport)** for PR 28704 at commit [`baec279`](https://github.com/apache/spark/commit/baec279c45b4ac9782e4d1c3286063fc04146eb1).




[GitHub] [spark] zhengruifeng commented on a change in pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #28704:
URL: https://github.com/apache/spark/pull/28704#discussion_r437107792



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
##########
@@ -248,6 +248,19 @@ object MLUtils extends Logging {
     }.toArray
   }
 
+  /**
+   * Version of `kFold()` taking a fold column name.
+   */
+  @Since("3.1.0")
+  def kFold(df: DataFrame, numFolds: Int, foldColName: String): Array[(RDD[Row], RDD[Row])] = {
+    val foldCol = df.col(foldColName)
+    val dfWithMod = df.withColumn(foldColName, pmod(foldCol, lit(numFolds)))
+    (0 until numFolds).map { fold =>
+      (dfWithMod.filter(col(foldColName) =!= fold).drop(foldColName).rdd,
+        dfWithMod.filter(col(foldColName) === fold).drop(foldColName).rdd)

Review comment:
       I also prefer adding a numFolds check here, but not strongly.
   Since ML impls tend to transform the input dataframe to `RDD[Vector]` and then cache it, this check may be cheap compared with the training.
   

##########
File path: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
##########
@@ -248,6 +248,19 @@ object MLUtils extends Logging {
     }.toArray
   }
 
+  /**
+   * Version of `kFold()` taking a fold column name.
+   */
+  @Since("3.1.0")
+  def kFold(df: DataFrame, numFolds: Int, foldColName: String): Array[(RDD[Row], RDD[Row])] = {
+    val foldCol = df.col(foldColName)
+    val dfWithMod = df.withColumn(foldColName, pmod(foldCol, lit(numFolds)))
+    (0 until numFolds).map { fold =>
+      (dfWithMod.filter(col(foldColName) =!= fold).drop(foldColName).rdd,
+        dfWithMod.filter(col(foldColName) === fold).drop(foldColName).rdd)

Review comment:
       Both: 1. check that the `foldCol` values are in [0, numFolds); 2. for each fold, check that both the train RDD and the validation RDD are non-empty.
   
   I am afraid that if `numFolds` is set wrongly, for example numFolds=3 while the `foldCol` values are in {0, 1, 2, 3}, there may be task skew.
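
The two checks described above could look roughly like this in plain Python. This is a sketch under stated assumptions, not the check that was later added to the PR; `check_folds` is a hypothetical helper operating on an in-memory list of labels.

```python
from collections import Counter


def check_folds(fold_values, num_folds):
    """Validate user-supplied fold labels before running k-fold splits."""
    # 1. Every label must lie in [0, num_folds).
    out_of_range = sorted({v for v in fold_values if not 0 <= v < num_folds})
    if out_of_range:
        raise ValueError(f"fold values outside [0, {num_folds}): {out_of_range}")

    # 2. Every fold needs a non-empty validation set and a non-empty training set.
    counts = Counter(fold_values)
    total = len(fold_values)
    for fold in range(num_folds):
        if counts[fold] == 0:
            raise ValueError(f"validation set for fold {fold} is empty")
        if total - counts[fold] == 0:
            raise ValueError(f"training set for fold {fold} is empty")
    return dict(counts)


sizes = check_folds([0, 1, 2, 0, 1, 2, 0], num_folds=3)
```

With numFolds=3 and labels in {0, 1, 2, 3}, the first check rejects the input instead of letting a modulo mapping merge label 3 into fold 0.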
   






[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641680743


   **[Test build #123713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123713/testReport)** for PR 28704 at commit [`218db10`](https://github.com/apache/spark/commit/218db10533bb803807e453a9288402b2415eaa30).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638021723


   **[Test build #123470 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123470/testReport)** for PR 28704 at commit [`9c8658b`](https://github.com/apache/spark/commit/9c8658b9a8201bfbfa5140b1a5a8b79aa2ae400b).




[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637996673








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641655003


   Merged build finished. Test PASSed.




[GitHub] [spark] SparkQA removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638280800


   **[Test build #123489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123489/testReport)** for PR 28704 at commit [`9c8658b`](https://github.com/apache/spark/commit/9c8658b9a8201bfbfa5140b1a5a8b79aa2ae400b).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637996673








[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637990448








[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638067129


   **[Test build #123470 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123470/testReport)** for PR 28704 at commit [`9c8658b`](https://github.com/apache/spark/commit/9c8658b9a8201bfbfa5140b1a5a8b79aa2ae400b).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641656999


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123712/
   Test FAILed.




[GitHub] [spark] viirya edited a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641655170


   Added a fold number check and test cases in both Scala and Python.




[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637989205


   **[Test build #123463 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123463/testReport)** for PR 28704 at commit [`244befc`](https://github.com/apache/spark/commit/244befc91c1138495ac36b6d85c5a8a238cf5172).




[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637432380


   **[Test build #123426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123426/testReport)** for PR 28704 at commit [`baec279`](https://github.com/apache/spark/commit/baec279c45b4ac9782e4d1c3286063fc04146eb1).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638022253








[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637994040


   Python change was added.




[GitHub] [spark] SparkQA removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637996117


   **[Test build #123464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123464/testReport)** for PR 28704 at commit [`9c8658b`](https://github.com/apache/spark/commit/9c8658b9a8201bfbfa5140b1a5a8b79aa2ae400b).




[GitHub] [spark] huaxingao commented on a change in pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
huaxingao commented on a change in pull request #28704:
URL: https://github.com/apache/spark/pull/28704#discussion_r434665819



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
##########
@@ -248,6 +248,19 @@ object MLUtils extends Logging {
     }.toArray
   }
 
+  /**
+   * Version of `kFold()` taking a fold column name.
+   */
+  @Since("3.1.0")
+  def kFold(df: DataFrame, numFolds: Int, foldColName: String): Array[(RDD[Row], RDD[Row])] = {
+    val foldCol = df.col(foldColName)
+    val dfWithMod = df.withColumn(foldColName, pmod(foldCol, lit(numFolds)))
+    (0 until numFolds).map { fold =>
+      (dfWithMod.filter(col(foldColName) =!= fold).drop(foldColName).rdd,
+        dfWithMod.filter(col(foldColName) === fold).drop(foldColName).rdd)

Review comment:
       I have a question here: 
   If foldCol doesn't contain a certain specific fold num, the validation dataset for that iteration will be empty. For example, if numFolds is 3 and foldCol only contains 0 and 2, the validation dataset is empty for fold 1. Should we check for empty splits and remove them?
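
The situation in this question can be reproduced with a small plain-Python analogue of the filter-based split in the diff. It is a sketch only: `k_fold_by_column` and the row layout (value, fold label) are assumptions, and lists stand in for RDDs.

```python
def k_fold_by_column(rows, num_folds):
    """Return (train, validation) pairs, one per fold, from (value, fold) rows."""
    splits = []
    for fold in range(num_folds):
        # Mirror the diff: take the label modulo num_folds, then filter.
        train = [r for r in rows if r[1] % num_folds != fold]
        validation = [r for r in rows if r[1] % num_folds == fold]
        splits.append((train, validation))
    return splits


# numFolds is 3, but the fold column only contains 0 and 2.
rows = [("a", 0), ("b", 2), ("c", 0)]
splits = k_fold_by_column(rows, 3)
train_1, validation_1 = splits[1]
```

Here `validation_1` is empty while `train_1` contains every row, so the model for fold 1 would be evaluated on no data.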






[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637691431


   **[Test build #123442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123442/testReport)** for PR 28704 at commit [`f34ab0d`](https://github.com/apache/spark/commit/f34ab0d3245ec0609be081a0def47deb38b6498f).




[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641656985


   **[Test build #123712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123712/testReport)** for PR 28704 at commit [`aa7c8d0`](https://github.com/apache/spark/commit/aa7c8d07826d31219acd617fbd8392181256957e).
    * This patch **fails Python style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637692073








[GitHub] [spark] SparkQA removed a comment on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637695099


   **[Test build #123444 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123444/testReport)** for PR 28704 at commit [`786abaa`](https://github.com/apache/spark/commit/786abaa44376697e862e01744a246f1bfb53d497).




[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637674507


   For the Python side, I think I'll add it in a follow-up.




[GitHub] [spark] viirya commented on a change in pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28704:
URL: https://github.com/apache/spark/pull/28704#discussion_r434754145



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
##########
@@ -248,6 +248,19 @@ object MLUtils extends Logging {
     }.toArray
   }
 
+  /**
+   * Version of `kFold()` taking a fold column name.
+   */
+  @Since("3.1.0")
+  def kFold(df: DataFrame, numFolds: Int, foldColName: String): Array[(RDD[Row], RDD[Row])] = {
+    val foldCol = df.col(foldColName)
+    val dfWithMod = df.withColumn(foldColName, pmod(foldCol, lit(numFolds)))
+    (0 until numFolds).map { fold =>
+      (dfWithMod.filter(col(foldColName) =!= fold).drop(foldColName).rdd,
+        dfWithMod.filter(col(foldColName) === fold).drop(foldColName).rdd)

Review comment:
       Generally, I think valid user-specified folds should not miss any fold number. I mentioned in the discussion above that adding a fold number check is possible; however, I don't want to do it because of performance concerns. I lean more towards letting users be responsible for it if they want to specify fold numbers themselves.






[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637742975








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637989881








[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638281390








[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637391122








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637433018








[GitHub] [spark] srowen commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637687355


   I'm not strongly against handling Python separately, but it should be a small change, and it logically belongs here. I know we've separated them in the past, and I know we usually get around to adding it, but I'd slightly prefer doing it all in one go.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637742975








[GitHub] [spark] viirya closed pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya closed pull request #28704:
URL: https://github.com/apache/spark/pull/28704


   




[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638001266








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637695722








[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-640955747


   @srowen @huaxingao Are there any more comments on this change that I need to address?




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638068063


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/123470/
   Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637720737


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637390451


   **[Test build #123426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123426/testReport)** for PR 28704 at commit [`baec279`](https://github.com/apache/spark/commit/baec279c45b4ac9782e4d1c3286063fc04146eb1).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641656994


   Merged build finished. Test FAILed.




[GitHub] [spark] SparkQA removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641662475


   **[Test build #123713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123713/testReport)** for PR 28704 at commit [`218db10`](https://github.com/apache/spark/commit/218db10533bb803807e453a9288402b2415eaa30).




[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638001137


   **[Test build #123464 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123464/testReport)** for PR 28704 at commit [`9c8658b`](https://github.com/apache/spark/commit/9c8658b9a8201bfbfa5140b1a5a8b79aa2ae400b).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] viirya commented on a change in pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28704:
URL: https://github.com/apache/spark/pull/28704#discussion_r437150737



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
##########
@@ -248,6 +248,19 @@ object MLUtils extends Logging {
     }.toArray
   }
 
+  /**
+   * Version of `kFold()` taking a fold column name.
+   */
+  @Since("3.1.0")
+  def kFold(df: DataFrame, numFolds: Int, foldColName: String): Array[(RDD[Row], RDD[Row])] = {
+    val foldCol = df.col(foldColName)
+    val dfWithMod = df.withColumn(foldColName, pmod(foldCol, lit(numFolds)))
+    (0 until numFolds).map { fold =>
+      (dfWithMod.filter(col(foldColName) =!= fold).drop(foldColName).rdd,
+        dfWithMod.filter(col(foldColName) === fold).drop(foldColName).rdd)

Review comment:
       @zhengruifeng Is the check you suggested the same as @huaxingao's, i.e., checking for empty datasets? Or do you mean checking that user-specified fold numbers are in the range [0, numFolds)?
   
   For the latter, I now take the value mod numFolds (via `pmod`), which I think is enough to guarantee valid fold numbers.
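
The modulo approach described in this comment can be sketched with plain Scala collections, without Spark. This is only an illustration under stated assumptions: `Record`, `FoldSplitSketch`, and `kFoldByColumn` are hypothetical names, and `Math.floorMod` stands in for the positive modulo that the patch applies to the fold column. It is not the code in the pull request.

```scala
object FoldSplitSketch {
  // Hypothetical row type: a payload plus a user-assigned fold number.
  final case class Record(value: String, fold: Int)

  // Returns one (training, validation) pair per fold.
  def kFoldByColumn(data: Seq[Record], numFolds: Int): Seq[(Seq[String], Seq[String])] = {
    // Assumption for this sketch: at least two folds are requested.
    require(numFolds >= 2, "numFolds must be at least 2")
    // floorMod yields a value in [0, numFolds) even for negative input.
    val normalized = data.map(r => (r.value, Math.floorMod(r.fold, numFolds)))
    (0 until numFolds).map { fold =>
      // Rows outside the current fold are used for training.
      val training = normalized.collect { case (v, f) if f != fold => v }
      // Rows inside the current fold are used for validation.
      val validation = normalized.collect { case (v, f) if f == fold => v }
      (training, validation)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Record("a", 0), Record("b", 1), Record("c", 2),
      Record("d", 4),  // out of range: 4 mod 3 lands in fold 1
      Record("e", -1)  // negative: floorMod(-1, 3) lands in fold 2
    )
    kFoldByColumn(data, 3).zipWithIndex.foreach { case ((train, valid), fold) =>
      println(s"fold $fold: training=$train validation=$valid")
    }
  }
}
```

Note that normalizing this way silently reassigns out-of-range fold numbers instead of reporting them, which is the trade-off discussed in this thread.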

##########
File path: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
##########
@@ -248,6 +248,19 @@ object MLUtils extends Logging {
     }.toArray
   }
 
+  /**
+   * Version of `kFold()` taking a fold column name.
+   */
+  @Since("3.1.0")
+  def kFold(df: DataFrame, numFolds: Int, foldColName: String): Array[(RDD[Row], RDD[Row])] = {
+    val foldCol = df.col(foldColName)
+    val dfWithMod = df.withColumn(foldColName, pmod(foldCol, lit(numFolds)))
+    (0 until numFolds).map { fold =>
+      (dfWithMod.filter(col(foldColName) =!= fold).drop(foldColName).rdd,
+        dfWithMod.filter(col(foldColName) === fold).drop(foldColName).rdd)

Review comment:
       Though I think users who supply their own fold numbers should be confident in them. :) But adding checks also sounds good to me; I will add some in the next commit.
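
One kind of check mentioned in this discussion is whether any fold would end up with an empty training or validation set. A minimal sketch of that idea in plain Scala follows. The names `FoldCheckSketch` and `emptyFolds` are hypothetical, and the thread does not say how the check was eventually implemented.

```scala
object FoldCheckSketch {
  // Returns the fold indices in [0, numFolds) whose training or validation
  // set would be empty. An empty result means every fold is usable.
  def emptyFolds(foldNumbers: Seq[Int], numFolds: Int): Seq[Int] = {
    // Count how many rows were assigned to each fold number.
    val counts = foldNumbers.groupBy(identity).map { case (f, xs) => f -> xs.size }
    val total = foldNumbers.size
    (0 until numFolds).filter { fold =>
      val validationSize = counts.getOrElse(fold, 0)
      // Training rows are all rows not assigned to this fold.
      val trainingSize = total - validationSize
      validationSize == 0 || trainingSize == 0
    }
  }

  def main(args: Array[String]): Unit = {
    // Fold 1 has no rows, so it is reported.
    println(emptyFolds(Seq(0, 0, 2), numFolds = 3))
  }
}
```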






[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637742163


   **[Test build #123444 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123444/testReport)** for PR 28704 at commit [`786abaa`](https://github.com/apache/spark/commit/786abaa44376697e862e01744a246f1bfb53d497).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638330429


   **[Test build #123489 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123489/testReport)** for PR 28704 at commit [`9c8658b`](https://github.com/apache/spark/commit/9c8658b9a8201bfbfa5140b1a5a8b79aa2ae400b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638331121








[GitHub] [spark] SparkQA removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637989205


   **[Test build #123463 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123463/testReport)** for PR 28704 at commit [`244befc`](https://github.com/apache/spark/commit/244befc91c1138495ac36b6d85c5a8a238cf5172).




[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637720737








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641662756








[GitHub] [spark] viirya commented on a change in pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28704:
URL: https://github.com/apache/spark/pull/28704#discussion_r434017713



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala
##########
@@ -56,6 +56,19 @@ private[ml] trait CrossValidatorParams extends ValidatorParams {
   def getNumFolds: Int = $(numFolds)
 
   setDefault(numFolds -> 3)
+
+  /**
+   * Param for the column name of user specified fold number. Once this is specified,
+   * `CrossValidator` won't do random k-fold split. Note that this column should be
+   * integer type with range [0, numFolds) and Spark won't do sanity-check for this

Review comment:
       To do a sanity check, it seems we would need a UDF that performs the check and throws an error, which could be bad for performance. I think taking the value mod numFolds sounds like the better option.
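
The two options weighed in this comment, strict validation versus modulo normalization, can be contrasted in a small plain-Scala sketch. `FoldRangeSketch`, `validateRange`, and `normalize` are hypothetical names for illustration. The sketch makes no claim about how either option performs in Spark.

```scala
object FoldRangeSketch {
  // Option 1: strict validation. Scan for the minimum and maximum fold
  // number and report an error if either falls outside [0, numFolds).
  def validateRange(foldNumbers: Seq[Int], numFolds: Int): Either[String, Seq[Int]] =
    if (foldNumbers.isEmpty) Right(foldNumbers)
    else {
      val lo = foldNumbers.min
      val hi = foldNumbers.max
      if (lo < 0 || hi >= numFolds)
        Left(s"fold numbers span [$lo, $hi], expected [0, $numFolds)")
      else Right(foldNumbers)
    }

  // Option 2: normalization. Map every value into [0, numFolds) with a
  // positive modulo, so no input is rejected.
  def normalize(foldNumbers: Seq[Int], numFolds: Int): Seq[Int] =
    foldNumbers.map(Math.floorMod(_, numFolds))

  def main(args: Array[String]): Unit = {
    println(validateRange(Seq(0, 3), numFolds = 3)) // rejected: 3 is out of range
    println(normalize(Seq(-1, 3, 4), numFolds = 3)) // accepted and remapped
  }
}
```

Strict validation surfaces mistakes in the fold column, while normalization accepts any integer and may hide them. Which behavior is preferable depends on how much the caller is trusted.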






[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638331121








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638001266


   Merged build finished. Test FAILed.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638281390








[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638068044








[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637996117


   **[Test build #123464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123464/testReport)** for PR 28704 at commit [`9c8658b`](https://github.com/apache/spark/commit/9c8658b9a8201bfbfa5140b1a5a8b79aa2ae400b).




[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-641655003








[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637990434


   **[Test build #123463 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123463/testReport)** for PR 28704 at commit [`244befc`](https://github.com/apache/spark/commit/244befc91c1138495ac36b6d85c5a8a238cf5172).
    * This patch **fails Python style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638278883


   retest this please




[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637692073








[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638021723


   **[Test build #123470 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123470/testReport)** for PR 28704 at commit [`9c8658b`](https://github.com/apache/spark/commit/9c8658b9a8201bfbfa5140b1a5a8b79aa2ae400b).




[GitHub] [spark] AmplabJenkins commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637433018








[GitHub] [spark] SparkQA commented on pull request #28704: [SPARK-31777][ML][PySpark] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-638280800


   **[Test build #123489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123489/testReport)** for PR 28704 at commit [`9c8658b`](https://github.com/apache/spark/commit/9c8658b9a8201bfbfa5140b1a5a8b79aa2ae400b).




[GitHub] [spark] viirya commented on pull request #28704: [SPARK-31777][ML] Add user-specified fold column to CrossValidator

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28704:
URL: https://github.com/apache/spark/pull/28704#issuecomment-637690356


   OK, sounds good to me. I'm adding a new commit to address the comments above. Once you think the Scala side is good, I will add the Python change.

