Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/09 11:43:57 UTC

[GitHub] [spark] PhillHenry opened a new pull request #31535: Param random builder

PhillHenry opened a new pull request #31535:
URL: https://github.com/apache/spark/pull/31535


   ### What changes were proposed in this pull request?
   
   Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:
   
   http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html
   
   All code is entirely my own work and I license the work to the project under the project’s open source license.
   
   ### Why are the changes needed?
   
   Randomization can be a more effective technique than a grid search, since minima/maxima can fall between grid points and never be found. Randomization is not so restricted, although the probability of finding minima/maxima depends on the number of attempts.
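   
   As a rough illustration of both points (not code from this PR), the sketch below computes the chance that at least one of `n` independent uniform draws lands in a target region, and scores a coarse grid against a made-up objective whose narrow peak sits between grid points. The object name `SearchOdds`, the `score` function and the peak location are assumptions chosen only for the example. For a region covering 5% of the space and 60 draws the probability comes out at roughly 95%.
   
   ```scala
   object SearchOdds {
     // Chance that at least one of n independent uniform draws lands in a
     // target region covering `fraction` of the search space.
     def hitProbability(fraction: Double, n: Int): Double =
       1.0 - math.pow(1.0 - fraction, n)
   
     // Hypothetical objective with a narrow peak at x = 0.37 (an assumption).
     def score(x: Double): Double = math.exp(-math.pow((x - 0.37) / 0.02, 2))
   
     // Five evenly spaced grid points over [0, 1]: 0.0, 0.25, 0.5, 0.75, 1.0.
     val grid: Seq[Double] = (0 to 4).map(_ / 4.0)
   
     // Every grid point is far from the peak, so the best grid score is tiny.
     val bestGridScore: Double = grid.map(score).max
   
     def main(args: Array[String]): Unit = {
       println(f"P(hit top 5%% in 60 draws) = ${hitProbability(0.05, 60)}%.3f")
       println(s"best grid score = $bestGridScore")
     }
   }
   ```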
   
   Alice Zheng has an accessible description of how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
   
   Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python. 
   
   ### Does this PR introduce _any_ user-facing change?
   
   A new class (`ParamRandomBuilder.scala`) and its tests have been added, but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that takes a range from which values of a hyperparameter are drawn at random.
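   
   The "extends the grid builder and adds one method" idea can be sketched without Spark as follows. `GridSketch` and `RandomSketch` are hypothetical stand-ins written for this example, not the classes in the PR, and the method names and signatures are assumptions; parameters are plain strings and values plain doubles to keep the sketch self-contained.
   
   ```scala
   import scala.util.Random
   
   // Hypothetical stand-in for a grid builder: collects candidate values per parameter.
   class GridSketch {
     protected var grid: Map[String, Seq[Double]] = Map.empty
   
     def addGrid(name: String, values: Seq[Double]): this.type = {
       grid += name -> values
       this
     }
   
     // Cartesian product of all value lists, one Map per combination.
     def build(): Seq[Map[String, Double]] =
       grid.foldLeft(Seq(Map.empty[String, Double])) { case (acc, (name, values)) =>
         for (m <- acc; v <- values) yield m + (name -> v)
       }
   }
   
   // Hypothetical random variant: same interface plus one extra method.
   class RandomSketch(rnd: Random) extends GridSketch {
     // Draw n uniform values in [lower, upper) and register them like grid values.
     def addRandom(name: String, lower: Double, upper: Double, n: Int): this.type =
       addGrid(name, Seq.fill(n)(lower + rnd.nextDouble() * (upper - lower)))
   }
   
   object RandomSketchDemo {
     // Random and fixed values can be mixed because addGrid is inherited.
     val maps: Seq[Map[String, Double]] =
       new RandomSketch(new Random(1L))
         .addRandom("regParam", 0.0, 1.0, 3)
         .addGrid("maxIter", Seq(10.0, 20.0))
         .build()
   }
   ```
   
   Because the random variant only feeds sampled values into the inherited grid, anything that consumes the output of `build()` needs no change.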
   
   ### How was this patch tested?
   
   Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.
   
   `ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.
   
   `RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.
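   
   The kind of property such a suite might check can be sketched in plain Scala, without ScalaCheck: values sampled uniformly in log space should stay within the requested bounds. `LogRangeSketch`, its fixed seed and the chosen bounds are assumptions for illustration, and the small tolerance only allows for floating-point rounding at the edges.
   
   ```scala
   import scala.util.Random
   
   object LogRangeSketch {
     def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
   
     // Sample uniformly between log(lower) and log(upper), then map back.
     // Assumes 0 < lower <= upper and base > 1.
     def randomLog(lower: Double, upper: Double, base: Int, rnd: Random): Double = {
       val lo = logN(lower, base)
       val hi = logN(upper, base)
       math.pow(base, lo + rnd.nextDouble() * (hi - lo))
     }
   
     val rnd = new Random(0L)
     val samples: Seq[Double] = Seq.fill(1000)(randomLog(0.001, 10.0, 10, rnd))
   
     // Property: every sample lies inside the range, up to rounding error.
     val allInRange: Boolean =
       samples.forall(x => x >= 0.001 * (1 - 1e-9) && x <= 10.0 * (1 + 1e-9))
   }
   ```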


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575104665



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {
+    def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+      import limits._
+      val lower: Double = math.min(x, y)
+      val upper: Double = math.max(x, y)
+
+      override def randomTLog(n: Int): Double =
+        RandomRanges.randomLog(lower, upper, n)
+
+      override def randomT(): Double =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+    }
+  }
+
+  implicit object FloatGenerator extends Generator[Float] {
+    def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+      import limits._
+      val lower: Float = math.min(x, y)
+      val upper: Float = math.max(x, y)
+
+      override def randomTLog(n: Int): Float =
+        RandomRanges.randomLog(lower, upper, n).toFloat
+
+      override def randomT(): Float =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+    }
+  }
+
+  implicit object IntGenerator extends Generator[Int] {
+    def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+      import limits._
+      val lower: Int = math.min(x, y)
+      val upper: Int = math.max(x, y)
+
+      override def randomTLog(n: Int): Int =
+        RandomRanges.randomLog(lower, upper, n).toInt
+
+      override def randomT(): Int =
+        bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+    }
+  }
+
+  implicit object LongGenerator extends Generator[Long] {

Review comment:
       I included it for completeness. I can remove it if you like.

##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {
+    def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+      import limits._
+      val lower: Double = math.min(x, y)
+      val upper: Double = math.max(x, y)
+
+      override def randomTLog(n: Int): Double =
+        RandomRanges.randomLog(lower, upper, n)
+
+      override def randomT(): Double =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+    }
+  }
+
+  implicit object FloatGenerator extends Generator[Float] {
+    def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+      import limits._
+      val lower: Float = math.min(x, y)
+      val upper: Float = math.max(x, y)
+
+      override def randomTLog(n: Int): Float =
+        RandomRanges.randomLog(lower, upper, n).toFloat
+
+      override def randomT(): Float =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+    }
+  }
+
+  implicit object IntGenerator extends Generator[Int] {
+    def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+      import limits._
+      val lower: Int = math.min(x, y)
+      val upper: Int = math.max(x, y)
+
+      override def randomTLog(n: Int): Int =
+        RandomRanges.randomLog(lower, upper, n).toInt
+
+      override def randomT(): Int =
+        bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+    }
+  }
+
+  implicit object LongGenerator extends Generator[Long] {
+    def apply(limits: Limits[Long]): RandomT[Long] = new RandomT[Long] {
+      import limits._
+      val lower: Long = math.min(x, y)
+      val upper: Long = math.max(x, y)
+
+      override def randomTLog(n: Int): Long =
+        RandomRanges.randomLog(lower, upper, n).toLong
+
+      override def randomT(): Long =
+        bigIntBetween(BigInt(lower), BigInt(upper)).longValue
+    }
+  }
+
+  def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
+
+  def randomLog(lower: Double, upper: Double, n: Int): Double = {
+    val logLower: Double = logN(lower, n)
+    val logUpper: Double = logN(upper, n)
+    val logLimits: Limits[Double] = Limits(logLower, logUpper)
+    val rndLogged: RandomT[Double] = RandomRanges(logLimits)
+    math.pow(n, rndLogged.randomT())
+  }
+
+  def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
+
+}
+
+/**
+ * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
+ * observations lies within the top 5% of the true maximum, with 95% probability"
+ * - Evaluating Machine Learning Models by Alice Zheng
+ * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
+ *
+ * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
+ * such as Hyperopt.
+ */
+@Since("3.1.0")

Review comment:
       Done.






[GitHub] [spark] AmplabJenkins commented on pull request #31535: Param random builder

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-775898793


   Can one of the admins verify this patch?




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786824822


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135518/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784534776


   **[Test build #135389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135389/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).




[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781996174


   It's not an optimizer like hyperopt, right.
   It's a lot like what https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html does, really.




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786061733


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40053/
   




[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575653003



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {
+    def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+      import limits._
+      val lower: Double = math.min(x, y)
+      val upper: Double = math.max(x, y)
+
+      override def randomTLog(n: Int): Double =
+        RandomRanges.randomLog(lower, upper, n)
+
+      override def randomT(): Double =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+    }
+  }
+
+  implicit object FloatGenerator extends Generator[Float] {
+    def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+      import limits._
+      val lower: Float = math.min(x, y)
+      val upper: Float = math.max(x, y)
+
+      override def randomTLog(n: Int): Float =
+        RandomRanges.randomLog(lower, upper, n).toFloat
+
+      override def randomT(): Float =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+    }
+  }
+
+  implicit object IntGenerator extends Generator[Int] {
+    def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+      import limits._
+      val lower: Int = math.min(x, y)
+      val upper: Int = math.max(x, y)
+
+      override def randomTLog(n: Int): Int =
+        RandomRanges.randomLog(lower, upper, n).toInt
+
+      override def randomT(): Int =
+        bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+    }
+  }
+
+  implicit object LongGenerator extends Generator[Long] {

Review comment:
       Removed.
   The implicits need to be imported by people using the Scala API.
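   
   Why the import is needed can be shown with a simplified, self-contained sketch. The names below (`LimitsSketch`, `GeneratorSketch`, `GeneratorsSketch`, `SamplerSketch`) are stand-ins invented for this example and only loosely mirror the PR. Type class instances defined in an object other than the type class's companion are not in implicit scope automatically, so the caller has to import them.
   
   ```scala
   import scala.util.Random
   
   final case class LimitsSketch[T](x: T, y: T)
   
   trait GeneratorSketch[T] {
     def sample(lim: LimitsSketch[T], rnd: Random): T
   }
   
   // Instances live in a separate object (not the trait's companion),
   // so callers must import them to bring them into implicit scope.
   object GeneratorsSketch {
     implicit val doubleGenerator: GeneratorSketch[Double] = new GeneratorSketch[Double] {
       def sample(lim: LimitsSketch[Double], rnd: Random): Double = {
         val lower = math.min(lim.x, lim.y)
         val upper = math.max(lim.x, lim.y)
         lower + rnd.nextDouble() * (upper - lower)
       }
     }
   
     implicit val intGenerator: GeneratorSketch[Int] = new GeneratorSketch[Int] {
       def sample(lim: LimitsSketch[Int], rnd: Random): Int = {
         val lower = math.min(lim.x, lim.y)
         val upper = math.max(lim.x, lim.y)
         // Inclusive upper bound; assumes upper - lower + 1 does not overflow Int.
         lower + rnd.nextInt(upper - lower + 1)
       }
     }
   }
   
   object SamplerSketch {
     // The compiler picks the instance matching T from implicit scope.
     def draw[T](lim: LimitsSketch[T], rnd: Random)(implicit g: GeneratorSketch[T]): T =
       g.sample(lim, rnd)
   }
   
   object ImplicitsDemo {
     import GeneratorsSketch._ // without this import the draw calls do not compile
   
     val rnd = new Random(7L)
     val d: Double = SamplerSketch.draw(LimitsSketch(1.0, 0.0), rnd)
     val i: Int = SamplerSketch.draw(LimitsSketch(3, 5), rnd)
   }
   ```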






[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781460422


   **[Test build #135234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135234/testReport)** for PR 31535 at commit [`e88f907`](https://github.com/apache/spark/commit/e88f907585cef0c44eba27910274d2152811c6ea).




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786828438


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40098/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784585918


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39969/
   




[GitHub] [spark] HyukjinKwon commented on pull request #31535: SPARK-34415 mllib Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-776550856


   @PhillHenry would you mind formatting PR title as guided in http://spark.apache.org/contributing.html?




[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-787081667


   Merged to master




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786064893


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40053/
   




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786789319


   **[Test build #135518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135518/testReport)** for PR 31535 at commit [`ddfe4a9`](https://github.com/apache/spark/commit/ddfe4a9ba7c7d100f5a0d3287a9001cd7fb4e325).




[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783503322


   Jenkins test this please




[GitHub] [spark] srowen commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r574568879



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)

Review comment:
       I think these can be private[ml] or at least private[spark]? Best not to expose whatever we don't have to.

##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {

Review comment:
       I doubt BigDecimal or BigInt is worth supporting? I have not seen that used as a hyperparam

##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {
+    def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+      import limits._
+      val lower: Double = math.min(x, y)
+      val upper: Double = math.max(x, y)
+
+      override def randomTLog(n: Int): Double =
+        RandomRanges.randomLog(lower, upper, n)
+
+      override def randomT(): Double =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+    }
+  }
+
+  implicit object FloatGenerator extends Generator[Float] {
+    def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+      import limits._
+      val lower: Float = math.min(x, y)
+      val upper: Float = math.max(x, y)
+
+      override def randomTLog(n: Int): Float =
+        RandomRanges.randomLog(lower, upper, n).toFloat
+
+      override def randomT(): Float =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+    }
+  }
+
+  implicit object IntGenerator extends Generator[Int] {
+    def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+      import limits._
+      val lower: Int = math.min(x, y)
+      val upper: Int = math.max(x, y)
+
+      override def randomTLog(n: Int): Int =
+        RandomRanges.randomLog(lower, upper, n).toInt
+
+      override def randomT(): Int =
+        bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+    }
+  }
+
+  implicit object LongGenerator extends Generator[Long] {

Review comment:
       Likewise not sure Long is even needed; is there a use case?

##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {
+    def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+      import limits._
+      val lower: Double = math.min(x, y)
+      val upper: Double = math.max(x, y)
+
+      override def randomTLog(n: Int): Double =
+        RandomRanges.randomLog(lower, upper, n)
+
+      override def randomT(): Double =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+    }
+  }
+
+  implicit object FloatGenerator extends Generator[Float] {
+    def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+      import limits._
+      val lower: Float = math.min(x, y)
+      val upper: Float = math.max(x, y)
+
+      override def randomTLog(n: Int): Float =
+        RandomRanges.randomLog(lower, upper, n).toFloat
+
+      override def randomT(): Float =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+    }
+  }
+
+  implicit object IntGenerator extends Generator[Int] {
+    def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+      import limits._
+      val lower: Int = math.min(x, y)
+      val upper: Int = math.max(x, y)
+
+      override def randomTLog(n: Int): Int =
+        RandomRanges.randomLog(lower, upper, n).toInt
+
+      override def randomT(): Int =
+        bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+    }
+  }
+
+  implicit object LongGenerator extends Generator[Long] {
+    def apply(limits: Limits[Long]): RandomT[Long] = new RandomT[Long] {
+      import limits._
+      val lower: Long = math.min(x, y)
+      val upper: Long = math.max(x, y)
+
+      override def randomTLog(n: Int): Long =
+        RandomRanges.randomLog(lower, upper, n).toLong
+
+      override def randomT(): Long =
+        bigIntBetween(BigInt(lower), BigInt(upper)).longValue
+    }
+  }
+
+  def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
+
+  def randomLog(lower: Double, upper: Double, n: Int): Double = {
+    val logLower: Double = logN(lower, n)
+    val logUpper: Double = logN(upper, n)
+    val logLimits: Limits[Double] = Limits(logLower, logUpper)
+    val rndLogged: RandomT[Double] = RandomRanges(logLimits)
+    math.pow(n, rndLogged.randomT())
+  }
+
+  def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
+
+}
+
+/**
+ * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
+ * observations lies within the top 5% of the true maximum, with 95% probability"
+ * - Evaluating Machine Learning Models by Alice Zheng
+ * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
+ *
+ * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
+ * such as Hyperopt.
+ */
+@Since("3.1.0")

Review comment:
       3.2.0
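
The scaladoc quoted in the hunk above cites a rule of thumb about 60 random observations. One common reading (an assumption here, not something stated in this thread) is that each independent draw lands in the top 5% of outcomes with probability 0.05, so `n` draws all miss it with probability `0.95^n`. The sketch below, using the hypothetical names `RandomSearchOdds`, `hitProbability` and `minTrials`, checks that arithmetic.

```scala
// Hypothetical helper for illustration; not part of the PR.
object RandomSearchOdds {
  // Probability that at least one of n independent draws lands in the
  // top `topFraction` of outcomes.
  def hitProbability(n: Int, topFraction: Double): Double =
    1.0 - math.pow(1.0 - topFraction, n)

  // Smallest n for which hitProbability(n, topFraction) >= confidence.
  def minTrials(topFraction: Double, confidence: Double): Int =
    math.ceil(math.log(1.0 - confidence) / math.log(1.0 - topFraction)).toInt
}
```

Under that reading, 60 draws give a hit probability of roughly 0.954, and 59 draws are the fewest that reach 0.95.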




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781632635


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135234/
   




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784670739


   **[Test build #135389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135389/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-775898793


   Can one of the admins verify this patch?




[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781530439


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39814/
   




[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575103983



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {

Review comment:
       They're not exposed as part of the public facing interface. They're just used internally to support the primitives.






[GitHub] [spark] SparkQA removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781460422


   **[Test build #135234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135234/testReport)** for PR 31535 at commit [`e88f907`](https://github.com/apache/spark/commit/e88f907585cef0c44eba27910274d2152811c6ea).




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781487434


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39814/
   




[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786028771


   Jenkins retest this please




[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784417993


   I get it, but the base ParamGridBuilder is also exposed in Python, and it would definitely make sense to expose this in Python too, for the same reasons: it may be useful to users who don't want to go to Hyperopt. I realize it's non-trivial additional work, and I apologize, but if it's not crazy hard, I think it's worth it. We could say "we'll do it later", but I expect it won't get done otherwise.




[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575653066



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,210 @@
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  private val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {
+    def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+      import limits._
+      val lower: Double = math.min(x, y)
+      val upper: Double = math.max(x, y)
+
+      override def randomTLog(n: Int): Double =
+        RandomRanges.randomLog(lower, upper, n)
+
+      override def randomT(): Double =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+    }
+  }
+
+  implicit object FloatGenerator extends Generator[Float] {
+    def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+      import limits._
+      val lower: Float = math.min(x, y)
+      val upper: Float = math.max(x, y)
+
+      override def randomTLog(n: Int): Float =
+        RandomRanges.randomLog(lower, upper, n).toFloat
+
+      override def randomT(): Float =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+    }
+  }
+
+  implicit object IntGenerator extends Generator[Int] {
+    def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+      import limits._
+      val lower: Int = math.min(x, y)
+      val upper: Int = math.max(x, y)
+
+      override def randomTLog(n: Int): Int =
+        RandomRanges.randomLog(lower, upper, n).toInt
+
+      override def randomT(): Int =
+        bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+    }
+  }
+
+  implicit object LongGenerator extends Generator[Long] {
+    def apply(limits: Limits[Long]): RandomT[Long] = new RandomT[Long] {
+      import limits._
+      val lower: Long = math.min(x, y)
+      val upper: Long = math.max(x, y)
+
+      override def randomTLog(n: Int): Long =
+        RandomRanges.randomLog(lower, upper, n).toLong
+
+      override def randomT(): Long =
+        bigIntBetween(BigInt(lower), BigInt(upper)).longValue
+    }
+  }
+
+  private[ml] def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
+
+  private[ml] def randomLog(lower: Double, upper: Double, n: Int): Double = {
+    val logLower: Double = logN(lower, n)
+    val logUpper: Double = logN(upper, n)
+    val logLimits: Limits[Double] = Limits(logLower, logUpper)
+    val rndLogged: RandomT[Double] = RandomRanges(logLimits)
+    math.pow(n, rndLogged.randomT())
+  }
+
+  private[ml] def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
+
+}
+
+/**
+ * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
+ * observations lies within the top 5% of the true maximum, with 95% probability"
+ * - Evaluating Machine Learning Models by Alice Zheng
+ * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
+ *
+ * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
+ * such as Hyperopt.
+ */
+@Since("3.2.0")
+class ParamRandomBuilder extends ParamGridBuilder {
+  @Since("3.2.0")

Review comment:
       Done.
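
The `randomLog` helper in the hunk above draws uniformly between the logarithms of the limits and then exponentiates, which spreads samples evenly across orders of magnitude. A minimal standalone sketch of that idea follows; `LogUniform` and `sample` are hypothetical names, and it assumes `0 < lower <= upper` and a base greater than 1.

```scala
import scala.util.Random

// Hypothetical standalone sketch of log-scale sampling; not the PR's code.
object LogUniform {
  private val rnd = new Random(7L) // fixed seed so runs are repeatable

  private def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)

  def sample(lower: Double, upper: Double, base: Int): Double = {
    require(lower > 0 && lower <= upper && base > 1,
      "need 0 < lower <= upper and base > 1")
    val lo = logN(lower, base)
    val hi = logN(upper, base)
    // Draw the exponent uniformly in log space, then map back with base^exponent.
    val exponent = lo + rnd.nextDouble() * (hi - lo)
    math.pow(base, exponent)
  }
}
```

With limits of 0.001 and 1000 in base 10, the exponent is uniform over -3 to 3, so roughly half of the draws should fall below 1. A plain uniform draw over the same limits would almost never produce a value that small.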






[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784699670


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135389/
   




[GitHub] [spark] srowen commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r580310114



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,160 @@
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  private val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {

Review comment:
       OK, that's fine. I guess I mean: do we really need implicits to manage a little logic reused twice internally? Then again, I'm kind of primitive when it comes to using implicits. If it's user-visible, is it a problem for Java callers, as an additional arg they need to pass? No, right? That's why I was wondering how public it needs to be. No big deal.
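
For readers less familiar with the pattern under discussion, the sketch below shows a stripped-down type class resolved through an implicit parameter, next to a plain alternative with one named method per type. All names here (`Sampler`, `between`, `betweenDouble`, `betweenInt`) are hypothetical and simplified; this is not the PR's API.

```scala
import scala.util.Random

// Hypothetical type class: one instance per supported numeric type.
trait Sampler[T] {
  def between(lower: T, upper: T, rnd: Random): T
}

object Sampler {
  // Instances live in the companion object so they are found without an import.
  implicit val doubleSampler: Sampler[Double] = new Sampler[Double] {
    def between(lower: Double, upper: Double, rnd: Random): Double =
      lower + rnd.nextDouble() * (upper - lower)
  }

  implicit val intSampler: Sampler[Int] = new Sampler[Int] {
    // Both ends included; assumes upper - lower + 1 fits in an Int.
    def between(lower: Int, upper: Int, rnd: Random): Int =
      lower + rnd.nextInt(upper - lower + 1)
  }

  // Implicit style: the compiler picks the instance from T.
  def between[T](lower: T, upper: T, rnd: Random)(implicit s: Sampler[T]): T =
    s.between(lower, upper, rnd)

  // Plain style: no implicit parameter, one named method per type.
  def betweenDouble(lower: Double, upper: Double, rnd: Random): Double =
    doubleSampler.between(lower, upper, rnd)

  def betweenInt(lower: Int, upper: Int, rnd: Random): Int =
    intSampler.between(lower, upper, rnd)
}
```

From Scala, `Sampler.between(1, 10, rnd)` and `Sampler.between(0.0, 1.0, rnd)` resolve their instances automatically. Implicit parameters compile to ordinary trailing parameters, so a Java caller of the generic method would have to pass the instance by hand. That is one argument for keeping such helpers non-public, or for offering the plain per-type methods.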






[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784699670


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135389/
   




[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-787503491


   Oh OK let me figure that out - can probably fix forward with a patch. Er, where can I see the output? I don't see it here.




[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786770395


   Jenkins retest this please




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781505904


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39814/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786789319


   **[Test build #135518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135518/testReport)** for PR 31535 at commit [`ddfe4a9`](https://github.com/apache/spark/commit/ddfe4a9ba7c7d100f5a0d3287a9001cd7fb4e325).




[GitHub] [spark] dongjoon-hyun commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-787508976


   Thanks for the analysis, @srowen !




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784534776


   **[Test build #135389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135389/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).




[GitHub] [spark] PhillHenry commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-778223415


   @srowen Added a Java API (and tests). Needed to make `Limits` available for those using the Scala API. Please let me know what you think. Also, let me know what you think about `LongGenerator` per your previous review comment. Happy to go either way.




[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r576770476



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  private val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {
+    def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+      import limits._
+      val lower: Double = math.min(x, y)
+      val upper: Double = math.max(x, y)
+
+      override def randomTLog(n: Int): Double =
+        RandomRanges.randomLog(lower, upper, n)
+
+      override def randomT(): Double =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+    }
+  }
+
+  implicit object FloatGenerator extends Generator[Float] {
+    def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+      import limits._
+      val lower: Float = math.min(x, y)
+      val upper: Float = math.max(x, y)
+
+      override def randomTLog(n: Int): Float =
+        RandomRanges.randomLog(lower, upper, n).toFloat
+
+      override def randomT(): Float =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+    }
+  }
+
+  implicit object IntGenerator extends Generator[Int] {
+    def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+      import limits._
+      val lower: Int = math.min(x, y)
+      val upper: Int = math.max(x, y)
+
+      override def randomTLog(n: Int): Int =
+        RandomRanges.randomLog(lower, upper, n).toInt
+
+      override def randomT(): Int =
+        bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+    }
+  }
+
+  implicit object LongGenerator extends Generator[Long] {
+    def apply(limits: Limits[Long]): RandomT[Long] = new RandomT[Long] {
+      import limits._
+      val lower: Long = math.min(x, y)
+      val upper: Long = math.max(x, y)
+
+      override def randomTLog(n: Int): Long =
+        RandomRanges.randomLog(lower, upper, n).toLong
+
+      override def randomT(): Long =
+        bigIntBetween(BigInt(lower), BigInt(upper)).longValue
+    }
+  }
+
+  private[ml] def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
+
+  private[ml] def randomLog(lower: Double, upper: Double, n: Int): Double = {
+    val logLower: Double = logN(lower, n)
+    val logUpper: Double = logN(upper, n)
+    val logLimits: Limits[Double] = Limits(logLower, logUpper)
+    val rndLogged: RandomT[Double] = RandomRanges(logLimits)
+    math.pow(n, rndLogged.randomT())
+  }
+
+  private[ml] def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
+
+}
+
+/**
+ * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
+ * observations lies within the top 5% of the true maximum, with 95% probability"
+ * - Evaluating Machine Learning Models by Alice Zheng
+ * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
+ *
+ * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
+ * such as Hyperopt.
+ */
+@Since("3.2.0")
+class ParamRandomBuilder extends ParamGridBuilder {
+  @Since("3.2.0")
+  def addRandom[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type = {
+    val gen: RandomT[T] = RandomRanges(lim)
+    addGrid(param, (1 to n).map { _: Int => gen.randomT() })
+  }
+
+  @Since("3.2.0")
+  def addLog10Random[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type =
+    addLogRandom(param, lim, n, 10)
+
+  @Since("3.2.0")
+  def addLog2Random[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type =

Review comment:
       Removed.






[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786812969


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40098/
   




[GitHub] [spark] srowen commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r579843276



##########
File path: mllib/src/test/scala/org/apache/spark/ml/tuning/RandomRangesSuite.scala
##########
@@ -0,0 +1,167 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import scala.reflect.runtime.universe.TypeTag
+
+import org.scalacheck.{Arbitrary, Gen}
+import org.scalacheck.Arbitrary._
+import org.scalacheck.Gen.Choose
+import org.scalatest.{Assertion, Succeeded}
+import org.scalatest.matchers.must.Matchers
+import org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks
+
+import org.apache.spark.SparkFunSuite
+
+class RandomRangesSuite extends SparkFunSuite with ScalaCheckDrivenPropertyChecks with Matchers {
+
+  import RandomRanges._
+
+  test("log of any base") {
+    assert(logN(16, 4) == 2d)
+    assert(logN(1000, 10) === (3d +- 0.000001))
+    assert(logN(256, 2) == 8d)
+  }
+
+  test("random doubles in log space") {
+    val gen: Gen[(Double, Double, Int)] = for {
+      x <- Gen.choose(0d, Double.MaxValue)
+      y <- Gen.choose(0d, Double.MaxValue)
+      n <- Gen.choose(0, Int.MaxValue)
+    } yield (x, y, n)
+    forAll(gen) { case (x, y, n) =>
+      val lower = math.min(x, y)
+      val upper = math.max(x, y)
+      val result = randomLog(x, y, n)
+      assert(result >= lower && result <= upper)
+    }
+  }
+
+  test("random BigInt generation does not go into infinite loop") {
+    assert(randomBigInt0To(0) == BigInt(0))
+  }
+
+  test("random ints") {
+    checkRange(Linear[Int])
+  }
+
+  test("random log ints") {
+    checkRange(Log10[Int])
+  }
+
+  test("random int distribution") {
+    checkDistributionOf(1000)
+  }
+
+  test("random doubles") {
+    checkRange(Linear[Double])
+  }
+
+  test("random log doubles") {
+    checkRange(Log10[Double])
+  }
+
+  test("random double distribution") {
+    checkDistributionOf(1000d)
+  }
+
+  test("random floats") {
+    checkRange(Linear[Float])
+  }
+
+  test("random log floats") {
+    checkRange(Log10[Float])
+  }
+
+  test("random float distribution") {
+    checkDistributionOf(1000f)
+  }
+
+  abstract class RandomFn[T: Numeric: Generator] {

Review comment:
       These defs below can probably be private; it's not a big deal.

##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  private val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {

Review comment:
       Can the implicits be private? It probably doesn't matter, but if they can be hidden, good.
   Do we need implicits here, vs. just inlining this or reusing logic?






[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784585918


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39969/
   




[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r580197772



##########
File path: mllib/src/test/scala/org/apache/spark/ml/tuning/RandomRangesSuite.scala
##########
@@ -0,0 +1,167 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import scala.reflect.runtime.universe.TypeTag
+
+import org.scalacheck.{Arbitrary, Gen}
+import org.scalacheck.Arbitrary._
+import org.scalacheck.Gen.Choose
+import org.scalatest.{Assertion, Succeeded}
+import org.scalatest.matchers.must.Matchers
+import org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks
+
+import org.apache.spark.SparkFunSuite
+
+class RandomRangesSuite extends SparkFunSuite with ScalaCheckDrivenPropertyChecks with Matchers {
+
+  import RandomRanges._
+
+  test("log of any base") {
+    assert(logN(16, 4) == 2d)
+    assert(logN(1000, 10) === (3d +- 0.000001))
+    assert(logN(256, 2) == 8d)
+  }
+
+  test("random doubles in log space") {
+    val gen: Gen[(Double, Double, Int)] = for {
+      x <- Gen.choose(0d, Double.MaxValue)
+      y <- Gen.choose(0d, Double.MaxValue)
+      n <- Gen.choose(0, Int.MaxValue)
+    } yield (x, y, n)
+    forAll(gen) { case (x, y, n) =>
+      val lower = math.min(x, y)
+      val upper = math.max(x, y)
+      val result = randomLog(x, y, n)
+      assert(result >= lower && result <= upper)
+    }
+  }
+
+  test("random BigInt generation does not go into infinite loop") {
+    assert(randomBigInt0To(0) == BigInt(0))
+  }
+
+  test("random ints") {
+    checkRange(Linear[Int])
+  }
+
+  test("random log ints") {
+    checkRange(Log10[Int])
+  }
+
+  test("random int distribution") {
+    checkDistributionOf(1000)
+  }
+
+  test("random doubles") {
+    checkRange(Linear[Double])
+  }
+
+  test("random log doubles") {
+    checkRange(Log10[Double])
+  }
+
+  test("random double distribution") {
+    checkDistributionOf(1000d)
+  }
+
+  test("random floats") {
+    checkRange(Linear[Float])
+  }
+
+  test("random log floats") {
+    checkRange(Log10[Float])
+  }
+
+  test("random float distribution") {
+    checkDistributionOf(1000f)
+  }
+
+  abstract class RandomFn[T: Numeric: Generator] {

Review comment:
       Done.






[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786046756


   **[Test build #135473 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135473/testReport)** for PR 31535 at commit [`183c2cd`](https://github.com/apache/spark/commit/183c2cd5911d2f6d72020674f59fd74b5d983b38).




[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786828467


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40098/
   




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784441835


   **[Test build #135384 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135384/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781530439


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39814/
   




[GitHub] [spark] SparkQA removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783533793


   **[Test build #135354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135354/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).




[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r580190282



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  private val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {

Review comment:
       The implicits cannot be private as the user's call sites need access to them.
   Implicits are needed so the user does not need to write verbose code when defining the Limits. 
   Not sure how inlining or reusing logic applies... Will resolve the conversation as (I think) I've addressed your main concerns but let me know if this is not satisfactory.
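
   For readers following the implicits discussion, here is a minimal sketch of the type-class pattern being described. It is not the PR's code: `Bounds`, `Sampler` and `Sketch.draw` are hypothetical, simplified stand-ins for `Limits`, `Generator` and the builder method, and the sampling is a plain uniform draw. It only illustrates why the instances must be visible at the user's call site: the compiler picks the instance from the element type of the bounds, so the caller never names a generator.

   ```scala
   import scala.util.Random

   // Hypothetical stand-in for the PR's Limits: the two ends of a range, in either order.
   final case class Bounds[T](x: T, y: T)

   // Hypothetical stand-in for the PR's Generator: knows how to sample a T within bounds.
   trait Sampler[T] {
     def sample(bounds: Bounds[T], rnd: Random): T
   }

   object Sampler {
     // Instances in the companion object are in implicit scope at every call site.
     // Making them private would leave user code unable to resolve them.
     implicit val doubleSampler: Sampler[Double] = new Sampler[Double] {
       def sample(b: Bounds[Double], rnd: Random): Double = {
         val lower = math.min(b.x, b.y)
         val upper = math.max(b.x, b.y)
         lower + rnd.nextDouble() * (upper - lower)
       }
     }

     implicit val intSampler: Sampler[Int] = new Sampler[Int] {
       def sample(b: Bounds[Int], rnd: Random): Int = {
         val lower = math.min(b.x, b.y)
         val upper = math.max(b.x, b.y)
         // Long arithmetic so that wide ranges do not overflow Int.
         val span = upper.toLong - lower.toLong + 1
         (lower + (rnd.nextDouble() * span).toLong).toInt
       }
     }
   }

   object Sketch {
     // The implicit parameter is resolved from T, so callers pass only the bounds.
     def draw[T](bounds: Bounds[T], n: Int, rnd: Random = new Random(42))(
         implicit s: Sampler[T]): Seq[T] =
       Seq.fill(n)(s.sample(bounds, rnd))
   }
   ```

   With this in place, a call such as `Sketch.draw(Bounds(0.0, 1.0), 5)` or `Sketch.draw(Bounds(3, 7), 5)` compiles without the caller mentioning `doubleSampler` or `intSampler`, which is the brevity the implicits are meant to buy.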






[GitHub] [spark] dongjoon-hyun commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-787503304


   Hi, @PhillHenry and @srowen .
   This seems to break the GitHub Actions linter job.
   Could you check the doc generation part?




[GitHub] [spark] PhillHenry commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-779800220


   @srowen Regarding documentation, I've added a paragraph to `ml-tuning.md` plus links to Scala and Java examples.




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783533793


   **[Test build #135354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135354/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783588401


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39934/
   




[GitHub] [spark] shaneknapp commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
shaneknapp commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784496621


   test this please




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786071414


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40053/
   




[GitHub] [spark] PhillHenry commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-787531158


   @srowen Odd. The file does not seem to have been pushed (multiple IntelliJ instances open on the same codebase, one for Python and one for Scala?). My bad. I've created a new PR at:
   https://github.com/apache/spark/pull/31687




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781632635


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135234/
   




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783576688


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39934/
   




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783573819


   **[Test build #135354 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135354/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786828467


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40098/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783574270


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135354/
   




[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781631516


   **[Test build #135234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135234/testReport)** for PR 31535 at commit [`e88f907`](https://github.com/apache/spark/commit/e88f907585cef0c44eba27910274d2152811c6ea).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] PhillHenry commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784896833


   @srowen Cool. Will do.




[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575103983



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {

Review comment:
       They're not supported as part of the public facing interface. They're just used internally to support the primitives.
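
   As a rough illustration of why the internals go through `BigInt`, the sketch below mirrors the rejection-sampling idea from `randomBigInt0To` in the diff. The names `BigIntSampling`, `uniformBigInt0To` and `uniformLongBetween` are hypothetical, and passing the `Random` explicitly is an assumption made here for testability, not something the PR does.

```scala
import scala.util.Random

object BigIntSampling {
  // Uniform draw in [0, x]: take x.bitLength random bits and retry if the
  // candidate exceeds x. Since x >= 2^(bitLength - 1) for x > 0, each attempt
  // is rejected with probability below one half.
  def uniformBigInt0To(x: BigInt, rnd: Random): BigInt = {
    require(x >= 0, "upper bound must be non-negative")
    var candidate = BigInt(x.bitLength, rnd)
    while (candidate > x) {
      candidate = BigInt(x.bitLength, rnd)
    }
    candidate
  }

  // Uniform draw in [lower, upper] for Longs. The difference is computed as a
  // BigInt so that wide ranges (e.g. Long.MinValue to Long.MaxValue) do not overflow.
  def uniformLongBetween(a: Long, b: Long, rnd: Random): Long = {
    val lower = BigInt(math.min(a, b))
    val upper = BigInt(math.max(a, b))
    (uniformBigInt0To(upper - lower, rnd) + lower).toLong
  }
}
```

   Only the `Long` (or `Int`) result reaches the caller, so the arbitrary-precision type stays an implementation detail.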






[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786824413


   **[Test build #135518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135518/testReport)** for PR 31535 at commit [`ddfe4a9`](https://github.com/apache/spark/commit/ddfe4a9ba7c7d100f5a0d3287a9001cd7fb4e325).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] srowen closed pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen closed pull request #31535:
URL: https://github.com/apache/spark/pull/31535


   




[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781447549


   Jenkins test this please




[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783574270


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135354/
   




[GitHub] [spark] zhengruifeng commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781946987


   I have two general questions:
   1. Is this PR a random param generator rather than a hyperparameter optimizer like `hyperopt`?
   2. Is there a counterpart implementation in other existing libraries?




[GitHub] [spark] PhillHenry commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784411506


   @srowen I thought this was going to be JVM-only, since Python users have Hyperopt and we didn't want to reinvent the wheel? I put in the documentation: "Python users are recommended to look at Python libraries that are specifically for hyperparameter tuning such as Hyperopt" (`ml-tuning.md`).
   But happy to take your steer.




[GitHub] [spark] srowen commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575287498



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  private val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {
+    def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+      import limits._
+      val lower: Double = math.min(x, y)
+      val upper: Double = math.max(x, y)
+
+      override def randomTLog(n: Int): Double =
+        RandomRanges.randomLog(lower, upper, n)
+
+      override def randomT(): Double =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+    }
+  }
+
+  implicit object FloatGenerator extends Generator[Float] {
+    def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+      import limits._
+      val lower: Float = math.min(x, y)
+      val upper: Float = math.max(x, y)
+
+      override def randomTLog(n: Int): Float =
+        RandomRanges.randomLog(lower, upper, n).toFloat
+
+      override def randomT(): Float =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+    }
+  }
+
+  implicit object IntGenerator extends Generator[Int] {
+    def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+      import limits._
+      val lower: Int = math.min(x, y)
+      val upper: Int = math.max(x, y)
+
+      override def randomTLog(n: Int): Int =
+        RandomRanges.randomLog(lower, upper, n).toInt
+
+      override def randomT(): Int =
+        bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+    }
+  }
+
+  implicit object LongGenerator extends Generator[Long] {
+    def apply(limits: Limits[Long]): RandomT[Long] = new RandomT[Long] {
+      import limits._
+      val lower: Long = math.min(x, y)
+      val upper: Long = math.max(x, y)
+
+      override def randomTLog(n: Int): Long =
+        RandomRanges.randomLog(lower, upper, n).toLong
+
+      override def randomT(): Long =
+        bigIntBetween(BigInt(lower), BigInt(upper)).longValue
+    }
+  }
+
+  private[ml] def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
+
+  private[ml] def randomLog(lower: Double, upper: Double, n: Int): Double = {
+    val logLower: Double = logN(lower, n)
+    val logUpper: Double = logN(upper, n)
+    val logLimits: Limits[Double] = Limits(logLower, logUpper)
+    val rndLogged: RandomT[Double] = RandomRanges(logLimits)
+    math.pow(n, rndLogged.randomT())
+  }
+
+  private[ml] def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
+
+}
+
+/**
+ * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
+ * observations lies within the top 5% of the true maximum, with 95% probability"
+ * - Evaluating Machine Learning Models by Alice Zheng
+ * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
+ *
+ * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
+ * such as Hyperopt.
+ */
+@Since("3.2.0")
+class ParamRandomBuilder extends ParamGridBuilder {
+  @Since("3.2.0")

Review comment:
       You don't have to annotate all the methods - the class annotation implies it's 'since' 3.2.0 already. OK to remove to keep it simpler.

##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  private val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {
+    def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+      import limits._
+      val lower: Double = math.min(x, y)
+      val upper: Double = math.max(x, y)
+
+      override def randomTLog(n: Int): Double =
+        RandomRanges.randomLog(lower, upper, n)
+
+      override def randomT(): Double =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+    }
+  }
+
+  implicit object FloatGenerator extends Generator[Float] {
+    def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+      import limits._
+      val lower: Float = math.min(x, y)
+      val upper: Float = math.max(x, y)
+
+      override def randomTLog(n: Int): Float =
+        RandomRanges.randomLog(lower, upper, n).toFloat
+
+      override def randomT(): Float =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+    }
+  }
+
+  implicit object IntGenerator extends Generator[Int] {
+    def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+      import limits._
+      val lower: Int = math.min(x, y)
+      val upper: Int = math.max(x, y)
+
+      override def randomTLog(n: Int): Int =
+        RandomRanges.randomLog(lower, upper, n).toInt
+
+      override def randomT(): Int =
+        bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+    }
+  }
+
+  implicit object LongGenerator extends Generator[Long] {
+    def apply(limits: Limits[Long]): RandomT[Long] = new RandomT[Long] {
+      import limits._
+      val lower: Long = math.min(x, y)
+      val upper: Long = math.max(x, y)
+
+      override def randomTLog(n: Int): Long =
+        RandomRanges.randomLog(lower, upper, n).toLong
+
+      override def randomT(): Long =
+        bigIntBetween(BigInt(lower), BigInt(upper)).longValue
+    }
+  }
+
+  private[ml] def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
+
+  private[ml] def randomLog(lower: Double, upper: Double, n: Int): Double = {
+    val logLower: Double = logN(lower, n)
+    val logUpper: Double = logN(upper, n)
+    val logLimits: Limits[Double] = Limits(logLower, logUpper)
+    val rndLogged: RandomT[Double] = RandomRanges(logLimits)
+    math.pow(n, rndLogged.randomT())
+  }
+
+  private[ml] def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
+
+}
+
+/**
+ * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
+ * observations lies within the top 5% of the true maximum, with 95% probability"
+ * - Evaluating Machine Learning Models by Alice Zheng
+ * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
+ *
+ * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
+ * such as Hyperopt.
+ */
+@Since("3.2.0")
+class ParamRandomBuilder extends ParamGridBuilder {
+  @Since("3.2.0")
+  def addRandom[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type = {
+    val gen: RandomT[T] = RandomRanges(lim)
+    addGrid(param, (1 to n).map { _: Int => gen.randomT() })
+  }
+
+  @Since("3.2.0")
+  def addLog10Random[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type =
+    addLogRandom(param, lim, n, 10)
+
+  @Since("3.2.0")
+  def addLog2Random[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type =

Review comment:
       I was going to say just go with one, natural log, but I kind of like base 10 as more useful. Base 2 I'm not as sure about, but I get it: batch sizes of 16, 32, 64, etc.
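
   For readers following the base-10 versus base-2 discussion, here is a minimal sketch of log-scale sampling. It is not the PR's code: `LogUniformSketch` and `sampleLog` are hypothetical names, and the sketch assumes strictly positive bounds and a base greater than 1. It draws an exponent uniformly between the logs of the two bounds and raises the base to that exponent.

```scala
import scala.util.Random

// Hypothetical helper, not part of the PR: samples a value whose logarithm
// is uniformly distributed between log(lower) and log(upper).
object LogUniformSketch {
  // Logarithm of x in an arbitrary base.
  def logN(x: Double, base: Double): Double = math.log(x) / math.log(base)

  // Assumes 0 < lower <= upper and base > 1.
  def sampleLog(lower: Double, upper: Double, base: Double, rnd: Random): Double = {
    require(lower > 0 && lower <= upper, "need 0 < lower <= upper")
    require(base > 1, "base must exceed 1")
    val lo = logN(lower, base)
    val hi = logN(upper, base)
    // Uniform exponent in [lo, hi), then map back to the original scale.
    val exponent = lo + rnd.nextDouble() * (hi - lo)
    math.pow(base, exponent)
  }
}
```

   For continuous values the base only rescales the exponent, so this sketch yields the same distribution over the range for base 2, base 10 or natural log. The choice of base matters mostly for readability and for how results are rounded to integers.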

##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {
+    def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+      import limits._
+      val lower: Double = math.min(x, y)
+      val upper: Double = math.max(x, y)
+
+      override def randomTLog(n: Int): Double =
+        RandomRanges.randomLog(lower, upper, n)
+
+      override def randomT(): Double =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+    }
+  }
+
+  implicit object FloatGenerator extends Generator[Float] {
+    def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+      import limits._
+      val lower: Float = math.min(x, y)
+      val upper: Float = math.max(x, y)
+
+      override def randomTLog(n: Int): Float =
+        RandomRanges.randomLog(lower, upper, n).toFloat
+
+      override def randomT(): Float =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+    }
+  }
+
+  implicit object IntGenerator extends Generator[Int] {
+    def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+      import limits._
+      val lower: Int = math.min(x, y)
+      val upper: Int = math.max(x, y)
+
+      override def randomTLog(n: Int): Int =
+        RandomRanges.randomLog(lower, upper, n).toInt
+
+      override def randomT(): Int =
+        bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+    }
+  }
+
+  implicit object LongGenerator extends Generator[Long] {

Review comment:
       It's no big deal either way, but yeah I'd remove it to keep this simple. Do the implicits need to be public?
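
   On the visibility question, one possible arrangement is sketched below. It is an assumption for illustration, not what the PR does: `Bounds`, `RangeSampler` and `Sampler` are hypothetical names. Placing the instances in the companion object of the type class puts them in implicit scope, so callers need no wildcard import. The instances still have to be accessible wherever the implicit is resolved, so a public method that demands one cannot hide them completely.

```scala
import scala.util.Random

// Hypothetical stand-ins for the PR's Limits/Generator, for illustration only.
final case class Bounds[T](x: T, y: T)

trait RangeSampler[T] {
  def sample(b: Bounds[T], rnd: Random): T
}

object RangeSampler {
  // Instances in the companion object are found through implicit scope,
  // so call sites do not need to import them explicitly.
  implicit val doubleSampler: RangeSampler[Double] = new RangeSampler[Double] {
    def sample(b: Bounds[Double], rnd: Random): Double = {
      val lower = math.min(b.x, b.y)
      val upper = math.max(b.x, b.y)
      lower + rnd.nextDouble() * (upper - lower)
    }
  }

  // Assumes the range is small enough that (upper - lower + 1) fits in an Int.
  implicit val intSampler: RangeSampler[Int] = new RangeSampler[Int] {
    def sample(b: Bounds[Int], rnd: Random): Int = {
      val lower = math.min(b.x, b.y)
      val upper = math.max(b.x, b.y)
      lower + rnd.nextInt(upper - lower + 1)
    }
  }
}

object Sampler {
  // Resolves the instance for T from RangeSampler's companion object.
  def draw[T](b: Bounds[T], rnd: Random)(implicit s: RangeSampler[T]): T =
    s.sample(b, rnd)
}
```

   With this layout `Sampler.draw(Bounds(3, 9), new Random)` compiles without importing any instance by name.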




[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783609171


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39934/
   


[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786071414


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40053/
   


[GitHub] [spark] PhillHenry commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-776560091


   @HyukjinKwon Sorry. Didn't realise the square brackets in the documentation were literal (first time contributor).
   Hope that's better now.


[GitHub] [spark] PhillHenry commented on pull request #31535: SPARK-34415 mllib Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-776547181


   > Would you mind fixing the PR title and filing a JIRA? See also http://spark.apache.org/contributing.html.
   
   Done. 
   HTH.


[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786824822


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135518/
   


[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783609171


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39934/
   


[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-787507179


   Ah I see it: 
   https://github.com/apache/spark/pull/31681/checks?check_run_id=1998307491
   `No such file or directory @ rb_sysopen - /__w/spark/spark/docs/../examples/src/main/python/ml/model_selection_random_hyperparameters_example.py`
   
   @PhillHenry was there an additional example file that was meant to be included in the PR? If so, just open another PR and I'll add it. If it's necessary to restore the linter soon, I can temporarily remove the reference to this example.


[GitHub] [spark] HyukjinKwon commented on pull request #31535: Param random builder

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-776320229


   Would you mind fixing the PR title and filing a JIRA? See also http://spark.apache.org/contributing.html.


[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784264677


   OK, looking good. One thing I forgot to consider is the PySpark version of this. It should be fairly easy to provide the same API in Python, but it may require some different classes on the Python side to represent ranges and so on. Could you have a look at ParamGridBuilder in Python and see what it would take to adapt it to this new class?


[GitHub] [spark] shaneknapp commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
shaneknapp commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784385517


   test this please


[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization

Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575100930



##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)

Review comment:
       Good point. Anything not public-facing that wasn't private before is now.



