Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/09 11:43:57 UTC
[GitHub] [spark] PhillHenry opened a new pull request #31535: Param random builder
PhillHenry opened a new pull request #31535:
URL: https://github.com/apache/spark/pull/31535
### What changes were proposed in this pull request?
Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html
All code is entirely my own work and I license the work to the project under the project’s open source license.
### Why are the changes needed?
Randomization can be a more effective technique than a grid search, since the minima or maxima can fall between grid points and never be found. Randomization is not so restricted, although the probability of finding a minimum or maximum depends on the number of attempts.
Alice Zheng has an accessible description of how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.
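The dependence on the number of attempts mentioned above can be made concrete. As a rough sketch (the object and method names are illustrative and not part of this PR): if each random draw independently lands in the best 5% of the search space with probability 0.05, the chance that at least one of `n` draws does so is `1 - 0.95^n`.

```scala
object RandomSearchOdds {
  /** Probability that at least one of `n` independent draws falls in the best `topFraction` of the space. */
  def hitProbability(n: Int, topFraction: Double): Double = {
    require(n >= 0, "n must be non-negative")
    require(topFraction >= 0.0 && topFraction <= 1.0, "topFraction must be in [0, 1]")
    1.0 - math.pow(1.0 - topFraction, n)
  }

  /** Smallest number of draws whose hit probability reaches `confidence`. */
  def drawsNeeded(topFraction: Double, confidence: Double): Int = {
    require(topFraction > 0.0 && topFraction < 1.0, "topFraction must be in (0, 1)")
    require(confidence > 0.0 && confidence < 1.0, "confidence must be in (0, 1)")
    math.ceil(math.log(1.0 - confidence) / math.log(1.0 - topFraction)).toInt
  }
}
```

Under these assumptions, around 60 draws give roughly a 95% chance of at least one draw landing in the top 5%, regardless of how many hyperparameters are being tuned.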
### Does this PR introduce _any_ user-facing change?
A new class (`ParamRandomBuilder.scala`) and its tests have been created but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.
### How was this patch tested?
Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.
`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.
`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.
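As an illustration of the kind of property such a suite might check, here is a minimal, self-contained sketch. It does not use ScalaCheck or the classes in this PR; `RangePropertySketch` and `staysWithinLimits` are hypothetical names, and the hand-rolled loop stands in for a property-based test so that the example needs only the standard library.

```scala
import scala.util.Random

object RangePropertySketch {
  /** Checks that `sample(x, y)` stays within [min(x, y), max(x, y)] for many random limit pairs. */
  def staysWithinLimits(sample: (Double, Double) => Double, trials: Int, seed: Long): Boolean = {
    val rnd = new Random(seed)
    (1 to trials).forall { _ =>
      // Limits are drawn independently, so x may be larger than y.
      val x = rnd.nextDouble() * 2000.0 - 1000.0
      val y = rnd.nextDouble() * 2000.0 - 1000.0
      val v = sample(x, y)
      v >= math.min(x, y) && v <= math.max(x, y)
    }
  }
}
```

For example, `staysWithinLimits((x, y) => (x + y) / 2, 1000, 42L)` holds, while a sampler that returns `math.max(x, y) + 1.0` fails the check.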
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575104665
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+ def randomT(): T
+ def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+ def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+ val rnd = new scala.util.Random
+
+ private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+ var randVal = BigInt(x.bitLength, rnd)
+ while (randVal > x) {
+ randVal = BigInt(x.bitLength, rnd)
+ }
+ randVal
+ }
+
+ def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+ val diff: BigInt = upper - lower
+ randomBigInt0To(diff) + lower
+ }
+
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+ val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+ val range: BigDecimal = upper - lower
+ val halfWay: BigDecimal = lower + range / 2
+ (zeroCenteredRnd * range) + halfWay
+ }
+
+ implicit object DoubleGenerator extends Generator[Double] {
+ def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+ import limits._
+ val lower: Double = math.min(x, y)
+ val upper: Double = math.max(x, y)
+
+ override def randomTLog(n: Int): Double =
+ RandomRanges.randomLog(lower, upper, n)
+
+ override def randomT(): Double =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+ }
+ }
+
+ implicit object FloatGenerator extends Generator[Float] {
+ def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+ import limits._
+ val lower: Float = math.min(x, y)
+ val upper: Float = math.max(x, y)
+
+ override def randomTLog(n: Int): Float =
+ RandomRanges.randomLog(lower, upper, n).toFloat
+
+ override def randomT(): Float =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+ }
+ }
+
+ implicit object IntGenerator extends Generator[Int] {
+ def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+ import limits._
+ val lower: Int = math.min(x, y)
+ val upper: Int = math.max(x, y)
+
+ override def randomTLog(n: Int): Int =
+ RandomRanges.randomLog(lower, upper, n).toInt
+
+ override def randomT(): Int =
+ bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+ }
+ }
+
+ implicit object LongGenerator extends Generator[Long] {
Review comment:
I included it for completeness. I can remove it if you like.
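For context, the `randomBigInt0To` helper in the diff above is a form of rejection sampling: it draws a value with as many random bits as the upper bound and retries whenever the draw overshoots. A minimal standalone sketch of that idea follows; the object name, the `require` guards and the explicit `Random` parameter are illustrative additions, not this PR's API.

```scala
import scala.util.Random

object RejectionSamplingSketch {
  /** Uniform BigInt in [0, x], inclusive, drawn by rejection sampling. */
  def randomBigInt0To(x: BigInt, rnd: Random): BigInt = {
    require(x >= BigInt(0), "upper bound must be non-negative")
    // BigInt(numBits, rnd) is uniform over [0, 2^numBits - 1].
    var candidate = BigInt(x.bitLength, rnd)
    while (candidate > x) candidate = BigInt(x.bitLength, rnd)
    candidate
  }

  /** Uniform BigInt in [lower, upper], inclusive. */
  def bigIntBetween(lower: BigInt, upper: BigInt, rnd: Random): BigInt = {
    require(lower <= upper, "lower must not exceed upper")
    randomBigInt0To(upper - lower, rnd) + lower
  }
}
```

Because the number of bits matches the bound's bit length, each draw is accepted with probability greater than one half, so on average fewer than one retry is needed.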
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+ implicit object LongGenerator extends Generator[Long] {
+ def apply(limits: Limits[Long]): RandomT[Long] = new RandomT[Long] {
+ import limits._
+ val lower: Long = math.min(x, y)
+ val upper: Long = math.max(x, y)
+
+ override def randomTLog(n: Int): Long =
+ RandomRanges.randomLog(lower, upper, n).toLong
+
+ override def randomT(): Long =
+ bigIntBetween(BigInt(lower), BigInt(upper)).longValue
+ }
+ }
+
+ def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
+
+ def randomLog(lower: Double, upper: Double, n: Int): Double = {
+ val logLower: Double = logN(lower, n)
+ val logUpper: Double = logN(upper, n)
+ val logLimits: Limits[Double] = Limits(logLower, logUpper)
+ val rndLogged: RandomT[Double] = RandomRanges(logLimits)
+ math.pow(n, rndLogged.randomT())
+ }
+
+ def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
+
+}
+
+/**
+ * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
+ * observations lies within the top 5% of the true maximum, with 95% probability"
+ * - Evaluating Machine Learning Models by Alice Zheng
+ * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
+ *
+ * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
+ * such as Hyperopt.
+ */
+@Since("3.1.0")
Review comment:
Done.
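The `randomLog` helper in the diff above samples on a logarithmic scale: it takes logarithms of both limits, draws uniformly between them, and exponentiates the result, so each order of magnitude is equally likely. A minimal standalone sketch of that idea, assuming strictly positive limits (the `require` guards, the object name and the explicit `Random` parameter are additions for illustration, not part of this PR):

```scala
import scala.util.Random

object LogUniformSketch {
  def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)

  /** Draws a value in [lower, upper] whose logarithm is uniformly distributed. */
  def randomLog(lower: Double, upper: Double, base: Int, rnd: Random): Double = {
    require(lower > 0.0 && upper >= lower, "limits must satisfy 0 < lower <= upper")
    require(base > 1, "base must be greater than 1")
    val logLower = logN(lower, base)
    val logUpper = logN(upper, base)
    // Uniform in log space, then mapped back to the original scale.
    val exponent = logLower + rnd.nextDouble() * (logUpper - logLower)
    math.pow(base, exponent)
  }
}
```

Without such a guard, a zero or negative limit would yield `-Infinity` or `NaN` from `math.log`, which is worth keeping in mind for ranges that start at 0.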
[GitHub] [spark] AmplabJenkins commented on pull request #31535: Param random builder
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-775898793
Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786824822
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135518/
[GitHub] [spark] SparkQA removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784534776
**[Test build #135389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135389/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).
[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781996174
It's not an optimizer like hyperopt, right.
It's a lot like what https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html does, really.
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786061733
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40053/
[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575653003
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+ implicit object LongGenerator extends Generator[Long] {
Review comment:
Removed.
The implicits need to be imported by people using the Scala API.
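To illustrate why the import matters, here is a small standalone sketch of the type-class pattern the diff relies on. The names (`Sampler`, `Samplers`, `SamplerDemo`, `sample`) are hypothetical and simplified; they are not this PR's classes. An implicit instance defined inside an object that is not the companion of the type class or of the element type is only found when it has been imported into scope.

```scala
import scala.util.Random

trait Sampler[T] {
  def between(lower: T, upper: T, rnd: Random): T
}

object Samplers {
  implicit object DoubleSampler extends Sampler[Double] {
    def between(lower: Double, upper: Double, rnd: Random): Double =
      lower + rnd.nextDouble() * (upper - lower)
  }

  implicit object IntSampler extends Sampler[Int] {
    // Inclusive of both ends; assumes upper - lower + 1 fits in an Int.
    def between(lower: Int, upper: Int, rnd: Random): Int =
      lower + rnd.nextInt(upper - lower + 1)
  }
}

object SamplerDemo {
  // Without this import, the calls in `demo` would not compile:
  // no implicit Sampler[Int] or Sampler[Double] would be in scope.
  import Samplers._

  def sample[T](lower: T, upper: T, rnd: Random)(implicit s: Sampler[T]): T =
    s.between(lower, upper, rnd)

  def demo(seed: Long): (Int, Double) = {
    val rnd = new Random(seed)
    (sample(1, 10, rnd), sample(0.0, 1.0, rnd))
  }
}
```

Placing the instances in the companion object of the type class would be one way to avoid the import, at the cost of moving them out of a dedicated object.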
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781460422
**[Test build #135234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135234/testReport)** for PR 31535 at commit [`e88f907`](https://github.com/apache/spark/commit/e88f907585cef0c44eba27910274d2152811c6ea).
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786828438
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40098/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784585918
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39969/
[GitHub] [spark] HyukjinKwon commented on pull request #31535: SPARK-34415 mllib Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-776550856
@PhillHenry would you mind formatting PR title as guided in http://spark.apache.org/contributing.html?
[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-787081667
Merged to master
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786064893
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40053/
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786789319
**[Test build #135518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135518/testReport)** for PR 31535 at commit [`ddfe4a9`](https://github.com/apache/spark/commit/ddfe4a9ba7c7d100f5a0d3287a9001cd7fb4e325).
[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783503322
Jenkins test this please
[GitHub] [spark] srowen commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r574568879
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+case class Limits[T: Numeric](x: T, y: T)
Review comment:
I think these can be private[ml] or at least private[spark]? Best not to expose whatever we don't have to.
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
Review comment:
I doubt BigDecimal or BigInt is worth supporting? I have not seen that used as a hyperparam
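If the arbitrary-precision types were dropped, uniform sampling between two limits could be done with primitive arithmetic. This is a sketch of one possible simplification, not necessarily what the PR ended up doing; the object and method names are hypothetical.

```scala
import scala.util.Random

object PrimitiveRangeSketch {
  /** Uniform Double between the two limits, given in either order. */
  def doubleBetween(x: Double, y: Double, rnd: Random): Double = {
    val lower = math.min(x, y)
    val upper = math.max(x, y)
    lower + rnd.nextDouble() * (upper - lower)
  }

  /** Int between the two limits, inclusive, given in either order. */
  def intBetween(x: Int, y: Int, rnd: Random): Int = {
    val lower = math.min(x, y).toLong
    val upper = math.max(x, y).toLong
    // Work in Long so that upper - lower + 1 cannot overflow an Int.
    val span = upper - lower + 1
    (lower + (rnd.nextDouble() * span).toLong).toInt
  }
}
```

One trade-off to note: `upper - lower` can overflow to infinity for `Double` limits near the extremes of the type, which is the kind of edge case the `BigDecimal` version sidesteps.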
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+ def randomT(): T
+ def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+ def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+ val rnd = new scala.util.Random
+
+ private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+ var randVal = BigInt(x.bitLength, rnd)
+ while (randVal > x) {
+ randVal = BigInt(x.bitLength, rnd)
+ }
+ randVal
+ }
+
+ def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+ val diff: BigInt = upper - lower
+ randomBigInt0To(diff) + lower
+ }
+
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+ val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+ val range: BigDecimal = upper - lower
+ val halfWay: BigDecimal = lower + range / 2
+ (zeroCenteredRnd * range) + halfWay
+ }
+
+ implicit object DoubleGenerator extends Generator[Double] {
+ def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+ import limits._
+ val lower: Double = math.min(x, y)
+ val upper: Double = math.max(x, y)
+
+ override def randomTLog(n: Int): Double =
+ RandomRanges.randomLog(lower, upper, n)
+
+ override def randomT(): Double =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+ }
+ }
+
+ implicit object FloatGenerator extends Generator[Float] {
+ def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+ import limits._
+ val lower: Float = math.min(x, y)
+ val upper: Float = math.max(x, y)
+
+ override def randomTLog(n: Int): Float =
+ RandomRanges.randomLog(lower, upper, n).toFloat
+
+ override def randomT(): Float =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+ }
+ }
+
+ implicit object IntGenerator extends Generator[Int] {
+ def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+ import limits._
+ val lower: Int = math.min(x, y)
+ val upper: Int = math.max(x, y)
+
+ override def randomTLog(n: Int): Int =
+ RandomRanges.randomLog(lower, upper, n).toInt
+
+ override def randomT(): Int =
+ bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+ }
+ }
+
+ implicit object LongGenerator extends Generator[Long] {
Review comment:
Likewise, I'm not sure Long is even needed; is there a use case?
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+ def randomT(): T
+ def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+ def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+ val rnd = new scala.util.Random
+
+ private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+ var randVal = BigInt(x.bitLength, rnd)
+ while (randVal > x) {
+ randVal = BigInt(x.bitLength, rnd)
+ }
+ randVal
+ }
+
+ def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+ val diff: BigInt = upper - lower
+ randomBigInt0To(diff) + lower
+ }
+
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+ val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+ val range: BigDecimal = upper - lower
+ val halfWay: BigDecimal = lower + range / 2
+ (zeroCenteredRnd * range) + halfWay
+ }
+
+ implicit object DoubleGenerator extends Generator[Double] {
+ def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+ import limits._
+ val lower: Double = math.min(x, y)
+ val upper: Double = math.max(x, y)
+
+ override def randomTLog(n: Int): Double =
+ RandomRanges.randomLog(lower, upper, n)
+
+ override def randomT(): Double =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+ }
+ }
+
+ implicit object FloatGenerator extends Generator[Float] {
+ def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+ import limits._
+ val lower: Float = math.min(x, y)
+ val upper: Float = math.max(x, y)
+
+ override def randomTLog(n: Int): Float =
+ RandomRanges.randomLog(lower, upper, n).toFloat
+
+ override def randomT(): Float =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+ }
+ }
+
+ implicit object IntGenerator extends Generator[Int] {
+ def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+ import limits._
+ val lower: Int = math.min(x, y)
+ val upper: Int = math.max(x, y)
+
+ override def randomTLog(n: Int): Int =
+ RandomRanges.randomLog(lower, upper, n).toInt
+
+ override def randomT(): Int =
+ bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+ }
+ }
+
+ implicit object LongGenerator extends Generator[Long] {
+ def apply(limits: Limits[Long]): RandomT[Long] = new RandomT[Long] {
+ import limits._
+ val lower: Long = math.min(x, y)
+ val upper: Long = math.max(x, y)
+
+ override def randomTLog(n: Int): Long =
+ RandomRanges.randomLog(lower, upper, n).toLong
+
+ override def randomT(): Long =
+ bigIntBetween(BigInt(lower), BigInt(upper)).longValue
+ }
+ }
+
+ def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
+
+ def randomLog(lower: Double, upper: Double, n: Int): Double = {
+ val logLower: Double = logN(lower, n)
+ val logUpper: Double = logN(upper, n)
+ val logLimits: Limits[Double] = Limits(logLower, logUpper)
+ val rndLogged: RandomT[Double] = RandomRanges(logLimits)
+ math.pow(n, rndLogged.randomT())
+ }
+
+ def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
+
+}
+
+/**
+ * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
+ * observations lies within the top 5% of the true maximum, with 95% probability"
+ * - Evaluating Machine Learning Models by Alice Zheng
+ * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
+ *
+ * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
+ * such as Hyperopt.
+ */
+@Since("3.1.0")
Review comment:
3.2.0
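The Scaladoc quoted in this diff cites the "60 random observations" rule. As a rough check of that figure, assuming the draws are independent and that "top 5%" means the best 5% of the search space by probability mass (both are my reading, not something stated in the PR), the chance that at least one of n draws lands in that region works out as follows:

```latex
P(\text{at least one of } n \text{ draws in the top } 5\%) = 1 - (1 - 0.05)^{n}

n = 60: \quad 1 - 0.95^{60} \approx 1 - 0.046 = 0.954 \ge 0.95

1 - 0.95^{n} \ge 0.95 \iff n \ge \frac{\ln 0.05}{\ln 0.95} \approx 58.4
```

Under these assumptions 59 independent draws would already suffice, and 60 is a convenient round figure.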
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781632635
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135234/
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784670739
**[Test build #135389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135389/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-775898793
Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781530439
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39814/
[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575103983
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+ def randomT(): T
+ def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+ def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+ val rnd = new scala.util.Random
+
+ private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+ var randVal = BigInt(x.bitLength, rnd)
+ while (randVal > x) {
+ randVal = BigInt(x.bitLength, rnd)
+ }
+ randVal
+ }
+
+ def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+ val diff: BigInt = upper - lower
+ randomBigInt0To(diff) + lower
+ }
+
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
Review comment:
They're not exposed as part of the public facing interface. They're just used internally to support the primitives.
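To illustrate the point about BigInt being an internal detail: one plausible benefit (an assumption on my part, not something stated in the PR) is that widening Int limits to BigInt keeps `upper - lower` from overflowing. The sketch below mirrors the rejection-sampling helper from the diff; `BigIntRangeSketch`, `intBetween`, and the fixed seed are hypothetical names and choices made for illustration only.

```scala
import scala.util.Random

object BigIntRangeSketch {
  // Fixed seed only so the sketch is repeatable; the diff uses an unseeded Random.
  private val rnd = new Random(42L)

  // Uniform BigInt in [0, x] by rejection: draw x.bitLength random bits
  // and retry whenever the draw exceeds x.
  def randomBigInt0To(x: BigInt): BigInt = {
    var v = BigInt(x.bitLength, rnd)
    while (v > x) v = BigInt(x.bitLength, rnd)
    v
  }

  // Callers only ever see Int; BigInt stays inside the helper.
  def intBetween(lower: Int, upper: Int): Int = {
    val lo = BigInt(math.min(lower, upper))
    val hi = BigInt(math.max(lower, upper))
    (randomBigInt0To(hi - lo) + lo).intValue
  }
}
```

With plain Int arithmetic, `Int.MaxValue - Int.MinValue` wraps around to -1, whereas `BigIntRangeSketch.intBetween(Int.MinValue, Int.MaxValue)` computes the width exactly and still returns an Int.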
[GitHub] [spark] SparkQA removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781460422
**[Test build #135234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135234/testReport)** for PR 31535 at commit [`e88f907`](https://github.com/apache/spark/commit/e88f907585cef0c44eba27910274d2152811c6ea).
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781487434
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39814/
[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786028771
Jenkins retest this please
[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784417993
I get it, but the base ParamGridBuilder is also exposed in Python, so it would definitely make sense to expose this in Python too, for the same reason: it may be useful to users who don't want to go to Hyperopt. I think it's non-trivial additional work, and I apologize, but if it's not crazy hard, I think it's worth it. We can say "we'll do it later", but I expect it won't get done otherwise.
[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575653066
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,210 @@
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+ def randomT(): T
+ def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+ def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+ private val rnd = new scala.util.Random
+
+ private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+ var randVal = BigInt(x.bitLength, rnd)
+ while (randVal > x) {
+ randVal = BigInt(x.bitLength, rnd)
+ }
+ randVal
+ }
+
+ private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+ val diff: BigInt = upper - lower
+ randomBigInt0To(diff) + lower
+ }
+
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+ val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+ val range: BigDecimal = upper - lower
+ val halfWay: BigDecimal = lower + range / 2
+ (zeroCenteredRnd * range) + halfWay
+ }
+
+ implicit object DoubleGenerator extends Generator[Double] {
+ def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+ import limits._
+ val lower: Double = math.min(x, y)
+ val upper: Double = math.max(x, y)
+
+ override def randomTLog(n: Int): Double =
+ RandomRanges.randomLog(lower, upper, n)
+
+ override def randomT(): Double =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+ }
+ }
+
+ implicit object FloatGenerator extends Generator[Float] {
+ def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+ import limits._
+ val lower: Float = math.min(x, y)
+ val upper: Float = math.max(x, y)
+
+ override def randomTLog(n: Int): Float =
+ RandomRanges.randomLog(lower, upper, n).toFloat
+
+ override def randomT(): Float =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+ }
+ }
+
+ implicit object IntGenerator extends Generator[Int] {
+ def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+ import limits._
+ val lower: Int = math.min(x, y)
+ val upper: Int = math.max(x, y)
+
+ override def randomTLog(n: Int): Int =
+ RandomRanges.randomLog(lower, upper, n).toInt
+
+ override def randomT(): Int =
+ bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+ }
+ }
+
+ implicit object LongGenerator extends Generator[Long] {
+ def apply(limits: Limits[Long]): RandomT[Long] = new RandomT[Long] {
+ import limits._
+ val lower: Long = math.min(x, y)
+ val upper: Long = math.max(x, y)
+
+ override def randomTLog(n: Int): Long =
+ RandomRanges.randomLog(lower, upper, n).toLong
+
+ override def randomT(): Long =
+ bigIntBetween(BigInt(lower), BigInt(upper)).longValue
+ }
+ }
+
+ private[ml] def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
+
+ private[ml] def randomLog(lower: Double, upper: Double, n: Int): Double = {
+ val logLower: Double = logN(lower, n)
+ val logUpper: Double = logN(upper, n)
+ val logLimits: Limits[Double] = Limits(logLower, logUpper)
+ val rndLogged: RandomT[Double] = RandomRanges(logLimits)
+ math.pow(n, rndLogged.randomT())
+ }
+
+ private[ml] def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
+
+}
+
+/**
+ * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
+ * observations lies within the top 5% of the true maximum, with 95% probability"
+ * - Evaluating Machine Learning Models by Alice Zheng
+ * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
+ *
+ * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
+ * such as Hyperopt.
+ */
+@Since("3.2.0")
+class ParamRandomBuilder extends ParamGridBuilder {
+ @Since("3.2.0")
Review comment:
Done.
[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784699670
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135389/
[GitHub] [spark] srowen commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r580310114
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,160 @@
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+ def randomT(): T
+ def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+ def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+ private val rnd = new scala.util.Random
+
+ private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+ var randVal = BigInt(x.bitLength, rnd)
+ while (randVal > x) {
+ randVal = BigInt(x.bitLength, rnd)
+ }
+ randVal
+ }
+
+ private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+ val diff: BigInt = upper - lower
+ randomBigInt0To(diff) + lower
+ }
+
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+ val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+ val range: BigDecimal = upper - lower
+ val halfWay: BigDecimal = lower + range / 2
+ (zeroCenteredRnd * range) + halfWay
+ }
+
+ implicit object DoubleGenerator extends Generator[Double] {
Review comment:
OK, that's fine. I guess what I mean is: do we really need implicits to manage a little logic that is reused twice internally? Then again, I'm kind of primitive when it comes to using implicits. If it's user-visible, is it a problem for Java callers, as an additional arg they need to pass? No, right? That's why I was wondering how public it needs to be. No big deal.
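As a sketch of the trade-off raised here, the two styles can be put side by side. Everything below is hypothetical (`Gen`, `sample`, `sampleInt`, `sampleDouble`, and the explicit `u` argument standing in for a random draw in [0, 1)); it is not the PR's API.

```scala
object ImplicitVsOverloadSketch {
  // Style 1: a typeclass whose instances are resolved through implicits.
  trait Gen[T] { def between(lo: T, hi: T, u: Double): T }

  object Gen {
    implicit val doubleGen: Gen[Double] = new Gen[Double] {
      def between(lo: Double, hi: Double, u: Double): Double = lo + u * (hi - lo)
    }

    implicit val intGen: Gen[Int] = new Gen[Int] {
      // Widen to Long so that hi - lo + 1 cannot overflow.
      def between(lo: Int, hi: Int, u: Double): Int =
        (lo.toLong + (u * (hi.toLong - lo.toLong + 1)).toLong).toInt
    }
  }

  // One generic entry point; the compiler supplies the Gen instance.
  def sample[T](lo: T, hi: T, u: Double)(implicit g: Gen[T]): T = g.between(lo, hi, u)

  // Style 2: plain methods with no implicit parameter at all.
  def sampleDouble(lo: Double, hi: Double, u: Double): Double = Gen.doubleGen.between(lo, hi, u)
  def sampleInt(lo: Int, hi: Int, u: Double): Int = Gen.intGen.between(lo, hi, u)
}
```

From Scala, `sample(0, 9, 0.5)` picks `Gen.intGen` automatically. The compiler turns an implicit parameter list into ordinary method parameters, so a Java caller of `sample` would have to pass the `Gen` instance explicitly, while `sampleInt` and `sampleDouble` need no extra argument.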
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784699670
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135389/
[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-787503491
Oh OK, let me figure that out; I can probably fix forward with a patch. Where can I see the output? I don't see it here.
[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786770395
Jenkins retest this please
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781505904
Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39814/
[GitHub] [spark] SparkQA removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786789319
**[Test build #135518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135518/testReport)** for PR 31535 at commit [`ddfe4a9`](https://github.com/apache/spark/commit/ddfe4a9ba7c7d100f5a0d3287a9001cd7fb4e325).
[GitHub] [spark] dongjoon-hyun commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-787508976
Thanks for the analysis, @srowen !
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784534776
**[Test build #135389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135389/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).
[GitHub] [spark] PhillHenry commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-778223415
@srowen Added a Java API (and tests). I needed to make `Limits` available for those using the Scala API. Please let me know what you think. Also, let me know what you think about `LongGenerator` per your previous review comment. Happy to go either way.
[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r576770476
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+ def randomT(): T
+ def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+ def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+ private val rnd = new scala.util.Random
+
+ private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+ var randVal = BigInt(x.bitLength, rnd)
+ while (randVal > x) {
+ randVal = BigInt(x.bitLength, rnd)
+ }
+ randVal
+ }
+
+ private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+ val diff: BigInt = upper - lower
+ randomBigInt0To(diff) + lower
+ }
+
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+ val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+ val range: BigDecimal = upper - lower
+ val halfWay: BigDecimal = lower + range / 2
+ (zeroCenteredRnd * range) + halfWay
+ }
+
+ implicit object DoubleGenerator extends Generator[Double] {
+ def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+ import limits._
+ val lower: Double = math.min(x, y)
+ val upper: Double = math.max(x, y)
+
+ override def randomTLog(n: Int): Double =
+ RandomRanges.randomLog(lower, upper, n)
+
+ override def randomT(): Double =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+ }
+ }
+
+ implicit object FloatGenerator extends Generator[Float] {
+ def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+ import limits._
+ val lower: Float = math.min(x, y)
+ val upper: Float = math.max(x, y)
+
+ override def randomTLog(n: Int): Float =
+ RandomRanges.randomLog(lower, upper, n).toFloat
+
+ override def randomT(): Float =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+ }
+ }
+
+ implicit object IntGenerator extends Generator[Int] {
+ def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+ import limits._
+ val lower: Int = math.min(x, y)
+ val upper: Int = math.max(x, y)
+
+ override def randomTLog(n: Int): Int =
+ RandomRanges.randomLog(lower, upper, n).toInt
+
+ override def randomT(): Int =
+ bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+ }
+ }
+
+ implicit object LongGenerator extends Generator[Long] {
+ def apply(limits: Limits[Long]): RandomT[Long] = new RandomT[Long] {
+ import limits._
+ val lower: Long = math.min(x, y)
+ val upper: Long = math.max(x, y)
+
+ override def randomTLog(n: Int): Long =
+ RandomRanges.randomLog(lower, upper, n).toLong
+
+ override def randomT(): Long =
+ bigIntBetween(BigInt(lower), BigInt(upper)).longValue
+ }
+ }
+
+ private[ml] def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
+
+ private[ml] def randomLog(lower: Double, upper: Double, n: Int): Double = {
+ val logLower: Double = logN(lower, n)
+ val logUpper: Double = logN(upper, n)
+ val logLimits: Limits[Double] = Limits(logLower, logUpper)
+ val rndLogged: RandomT[Double] = RandomRanges(logLimits)
+ math.pow(n, rndLogged.randomT())
+ }
+
+ private[ml] def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
+
+}
+
+/**
+ * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
+ * observations lies within the top 5% of the true maximum, with 95% probability"
+ * - Evaluating Machine Learning Models by Alice Zheng
+ * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
+ *
+ * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
+ * such as Hyperopt.
+ */
+@Since("3.2.0")
+class ParamRandomBuilder extends ParamGridBuilder {
+ @Since("3.2.0")
+ def addRandom[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type = {
+ val gen: RandomT[T] = RandomRanges(lim)
+ addGrid(param, (1 to n).map { _: Int => gen.randomT() })
+ }
+
+ @Since("3.2.0")
+ def addLog10Random[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type =
+ addLogRandom(param, lim, n, 10)
+
+ @Since("3.2.0")
+ def addLog2Random[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type =
Review comment:
Removed.
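For readers following the quoted diff, the `logN` and `randomLog` helpers draw a value uniformly between the logarithms of the two limits and then exponentiate it. The following is a minimal standalone sketch of that idea, not the PR's code: the object name `LogUniformSketch`, the explicit `rnd` parameter, the `require` checks and the final clamp are all assumptions added for illustration.

```scala
object LogUniformSketch {
  // Logarithm of x in an arbitrary base, via the change-of-base identity.
  def logN(x: Double, base: Double): Double = math.log(x) / math.log(base)

  // Draw uniformly between log(lower) and log(upper), then map back.
  // Assumption for this sketch: 0 < lower <= upper and base > 1.
  def randomLog(
      lower: Double,
      upper: Double,
      base: Double,
      rnd: scala.util.Random): Double = {
    require(lower > 0 && lower <= upper, "need 0 < lower <= upper")
    require(base > 1, "base must exceed 1")
    val lo = logN(lower, base)
    val hi = logN(upper, base)
    val exponent = lo + rnd.nextDouble() * (hi - lo)
    // Clamp to absorb floating-point rounding at the edges of the range.
    math.min(upper, math.max(lower, math.pow(base, exponent)))
  }
}
```

Mathematically the base cancels out when the exponent is mapped back, so base 2 and base 10 describe the same distribution up to rounding.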
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786812969
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40098/
[GitHub] [spark] srowen commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r579843276
##########
File path: mllib/src/test/scala/org/apache/spark/ml/tuning/RandomRangesSuite.scala
##########
@@ -0,0 +1,167 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import scala.reflect.runtime.universe.TypeTag
+
+import org.scalacheck.{Arbitrary, Gen}
+import org.scalacheck.Arbitrary._
+import org.scalacheck.Gen.Choose
+import org.scalatest.{Assertion, Succeeded}
+import org.scalatest.matchers.must.Matchers
+import org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks
+
+import org.apache.spark.SparkFunSuite
+
+class RandomRangesSuite extends SparkFunSuite with ScalaCheckDrivenPropertyChecks with Matchers {
+
+ import RandomRanges._
+
+ test("log of any base") {
+ assert(logN(16, 4) == 2d)
+ assert(logN(1000, 10) === (3d +- 0.000001))
+ assert(logN(256, 2) == 8d)
+ }
+
+ test("random doubles in log space") {
+ val gen: Gen[(Double, Double, Int)] = for {
+ x <- Gen.choose(0d, Double.MaxValue)
+ y <- Gen.choose(0d, Double.MaxValue)
+ n <- Gen.choose(0, Int.MaxValue)
+ } yield (x, y, n)
+ forAll(gen) { case (x, y, n) =>
+ val lower = math.min(x, y)
+ val upper = math.max(x, y)
+ val result = randomLog(x, y, n)
+ assert(result >= lower && result <= upper)
+ }
+ }
+
+ test("random BigInt generation does not go into infinite loop") {
+ assert(randomBigInt0To(0) == BigInt(0))
+ }
+
+ test("random ints") {
+ checkRange(Linear[Int])
+ }
+
+ test("random log ints") {
+ checkRange(Log10[Int])
+ }
+
+ test("random int distribution") {
+ checkDistributionOf(1000)
+ }
+
+ test("random doubles") {
+ checkRange(Linear[Double])
+ }
+
+ test("random log doubles") {
+ checkRange(Log10[Double])
+ }
+
+ test("random double distribution") {
+ checkDistributionOf(1000d)
+ }
+
+ test("random floats") {
+ checkRange(Linear[Float])
+ }
+
+ test("random log floats") {
+ checkRange(Log10[Float])
+ }
+
+ test("random float distribution") {
+ checkDistributionOf(1000f)
+ }
+
+ abstract class RandomFn[T: Numeric: Generator] {
Review comment:
These defs below can probably be private; it's not a big deal
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+ def randomT(): T
+ def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+ def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+ private val rnd = new scala.util.Random
+
+ private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+ var randVal = BigInt(x.bitLength, rnd)
+ while (randVal > x) {
+ randVal = BigInt(x.bitLength, rnd)
+ }
+ randVal
+ }
+
+ private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+ val diff: BigInt = upper - lower
+ randomBigInt0To(diff) + lower
+ }
+
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+ val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+ val range: BigDecimal = upper - lower
+ val halfWay: BigDecimal = lower + range / 2
+ (zeroCenteredRnd * range) + halfWay
+ }
+
+ implicit object DoubleGenerator extends Generator[Double] {
Review comment:
Can the implicits be private? It probably doesn't matter, but if they can be hidden, good.
Do we need implicits here, versus just inlining this or reusing logic?
[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784585918
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39969/
[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r580197772
##########
File path: mllib/src/test/scala/org/apache/spark/ml/tuning/RandomRangesSuite.scala
##########
@@ -0,0 +1,167 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import scala.reflect.runtime.universe.TypeTag
+
+import org.scalacheck.{Arbitrary, Gen}
+import org.scalacheck.Arbitrary._
+import org.scalacheck.Gen.Choose
+import org.scalatest.{Assertion, Succeeded}
+import org.scalatest.matchers.must.Matchers
+import org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks
+
+import org.apache.spark.SparkFunSuite
+
+class RandomRangesSuite extends SparkFunSuite with ScalaCheckDrivenPropertyChecks with Matchers {
+
+ import RandomRanges._
+
+ test("log of any base") {
+ assert(logN(16, 4) == 2d)
+ assert(logN(1000, 10) === (3d +- 0.000001))
+ assert(logN(256, 2) == 8d)
+ }
+
+ test("random doubles in log space") {
+ val gen: Gen[(Double, Double, Int)] = for {
+ x <- Gen.choose(0d, Double.MaxValue)
+ y <- Gen.choose(0d, Double.MaxValue)
+ n <- Gen.choose(0, Int.MaxValue)
+ } yield (x, y, n)
+ forAll(gen) { case (x, y, n) =>
+ val lower = math.min(x, y)
+ val upper = math.max(x, y)
+ val result = randomLog(x, y, n)
+ assert(result >= lower && result <= upper)
+ }
+ }
+
+ test("random BigInt generation does not go into infinite loop") {
+ assert(randomBigInt0To(0) == BigInt(0))
+ }
+
+ test("random ints") {
+ checkRange(Linear[Int])
+ }
+
+ test("random log ints") {
+ checkRange(Log10[Int])
+ }
+
+ test("random int distribution") {
+ checkDistributionOf(1000)
+ }
+
+ test("random doubles") {
+ checkRange(Linear[Double])
+ }
+
+ test("random log doubles") {
+ checkRange(Log10[Double])
+ }
+
+ test("random double distribution") {
+ checkDistributionOf(1000d)
+ }
+
+ test("random floats") {
+ checkRange(Linear[Float])
+ }
+
+ test("random log floats") {
+ checkRange(Log10[Float])
+ }
+
+ test("random float distribution") {
+ checkDistributionOf(1000f)
+ }
+
+ abstract class RandomFn[T: Numeric: Generator] {
Review comment:
Done.
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786046756
**[Test build #135473 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135473/testReport)** for PR 31535 at commit [`183c2cd`](https://github.com/apache/spark/commit/183c2cd5911d2f6d72020674f59fd74b5d983b38).
[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786828467
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40098/
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784441835
**[Test build #135384 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135384/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781530439
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39814/
[GitHub] [spark] SparkQA removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783533793
**[Test build #135354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135354/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).
[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r580190282
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+ def randomT(): T
+ def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+ def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+ private val rnd = new scala.util.Random
+
+ private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+ var randVal = BigInt(x.bitLength, rnd)
+ while (randVal > x) {
+ randVal = BigInt(x.bitLength, rnd)
+ }
+ randVal
+ }
+
+ private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+ val diff: BigInt = upper - lower
+ randomBigInt0To(diff) + lower
+ }
+
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+ val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+ val range: BigDecimal = upper - lower
+ val halfWay: BigDecimal = lower + range / 2
+ (zeroCenteredRnd * range) + halfWay
+ }
+
+ implicit object DoubleGenerator extends Generator[Double] {
Review comment:
The implicits cannot be private as the user's call sites need access to them.
Implicits are needed so the user does not need to write verbose code when defining the Limits.
Not sure how inlining or reusing logic applies... Will resolve the conversation as (I think) I've addressed your main concerns but let me know if this is not satisfactory.
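To illustrate the point about call-site brevity, here is a minimal sketch of the type-class pattern with implicit instances. It is not the PR's code: `Bounds`, `Sampler`, `ImplicitDemo` and `draw` are hypothetical names, and the instances sit in the companion object (so no import is needed), whereas the quoted diff keeps its generators inside `RandomRanges`. The `Int` instance assumes a range small enough that `hi - lo + 1` does not overflow.

```scala
// Hypothetical, simplified stand-ins for the PR's Limits and Generator.
final case class Bounds[T](lo: T, hi: T)

trait Sampler[T] {
  def sample(b: Bounds[T], rnd: scala.util.Random): T
}

object Sampler {
  // Instances in the companion object are part of the implicit scope,
  // so call sites need no import. They must not be private, or code
  // outside this object could not resolve them.
  implicit val doubleSampler: Sampler[Double] = new Sampler[Double] {
    def sample(b: Bounds[Double], rnd: scala.util.Random): Double =
      b.lo + rnd.nextDouble() * (b.hi - b.lo)
  }

  implicit val intSampler: Sampler[Int] = new Sampler[Int] {
    // Inclusive of both ends; assumes hi - lo + 1 fits in an Int.
    def sample(b: Bounds[Int], rnd: scala.util.Random): Int =
      b.lo + rnd.nextInt(b.hi - b.lo + 1)
  }
}

object ImplicitDemo {
  // The context bound asks the compiler to supply the matching Sampler[T].
  def draw[T: Sampler](b: Bounds[T], n: Int, rnd: scala.util.Random): Seq[T] =
    Seq.fill(n)(implicitly[Sampler[T]].sample(b, rnd))
}
```

With this shape a caller writes `ImplicitDemo.draw(Bounds(0.5, 2.0), 10, rnd)` and the compiler picks the `Double` instance from the argument type, which is the verbosity saving described above.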
[GitHub] [spark] dongjoon-hyun commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-787503304
Hi, @PhillHenry and @srowen .
This seems to break the GitHub Actions linter job.
Could you check the doc generation part?
[GitHub] [spark] PhillHenry commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-779800220
@srowen Regarding documentation, I've added a paragraph to `ml-tuning.md` plus links to Scala and Java examples.
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783533793
**[Test build #135354 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135354/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783588401
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39934/
[GitHub] [spark] shaneknapp commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
shaneknapp commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784496621
test this please
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786071414
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40053/
[GitHub] [spark] PhillHenry commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-787531158
@srowen Odd. The file does not seem to have been pushed (multiple IntelliJ instances open on the same codebase, one for Python and one for Scala?). My bad. I've created a new PR at:
https://github.com/apache/spark/pull/31687
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781632635
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135234/
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783576688
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39934/
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783573819
**[Test build #135354 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135354/testReport)** for PR 31535 at commit [`259edfe`](https://github.com/apache/spark/commit/259edfe1a78d0bb63c7b2cff03dc3a46746ecef6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786828467
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40098/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783574270
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135354/
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781631516
**[Test build #135234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135234/testReport)** for PR 31535 at commit [`e88f907`](https://github.com/apache/spark/commit/e88f907585cef0c44eba27910274d2152811c6ea).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] PhillHenry commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784896833
@srowen Cool. Will do.
[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575103983
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+ def randomT(): T
+ def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+ def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+ val rnd = new scala.util.Random
+
+ private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+ var randVal = BigInt(x.bitLength, rnd)
+ while (randVal > x) {
+ randVal = BigInt(x.bitLength, rnd)
+ }
+ randVal
+ }
+
+ def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+ val diff: BigInt = upper - lower
+ randomBigInt0To(diff) + lower
+ }
+
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
Review comment:
They're not supported as part of the public facing interface. They're just used internally to support the primitives.
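To illustrate why an arbitrary-precision type can be handy internally even when only primitives are exposed, here is a minimal sketch (not code from this PR; `BigIntRangeSketch` and `uniformLong` are hypothetical names). It assumes the goal is a uniform draw from an inclusive `Long` range, where `upper - lower` could overflow if computed with `Long` arithmetic:

```scala
import scala.util.Random

object BigIntRangeSketch {
  // Hypothetical helper: uniform draw from the inclusive range [lower, upper].
  def uniformLong(lower: Long, upper: Long, rnd: Random): Long = {
    require(lower <= upper, "lower must not exceed upper")
    // Computing the span as a BigInt avoids the overflow that
    // (upper - lower) would hit for e.g. Long.MinValue to Long.MaxValue.
    val span = BigInt(upper) - BigInt(lower)
    // Rejection sampling: draw bitLength random bits and retry until the
    // candidate falls inside [0, span].
    var candidate = BigInt(span.bitLength, rnd)
    while (candidate > span) {
      candidate = BigInt(span.bitLength, rnd)
    }
    (candidate + BigInt(lower)).toLong
  }
}
```

Callers only ever see `Long` values here; the `BigInt` arithmetic stays an implementation detail.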
[GitHub] [spark] SparkQA commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786824413
**[Test build #135518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135518/testReport)** for PR 31535 at commit [`ddfe4a9`](https://github.com/apache/spark/commit/ddfe4a9ba7c7d100f5a0d3287a9001cd7fb4e325).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] srowen closed pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen closed pull request #31535:
URL: https://github.com/apache/spark/pull/31535
[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781447549
Jenkins test this please
[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783574270
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135354/
[GitHub] [spark] zhengruifeng commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-781946987
I have two general questions:
1. Is this PR a random param generator rather than a hyperparameter optimizer like `hyperopt`?
2. Is there a counterpart implementation in other existing libraries?
[GitHub] [spark] PhillHenry commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784411506
@srowen I thought this was going to be JVM-only, since Python users have Hyperopt and we didn't want to reinvent the wheel? I put this in the documentation: "Python users are recommended to look at Python libraries that are specifically for hyperparameter tuning such as Hyperopt" (`ml-tuning.md`).
But happy to take your steer.
[GitHub] [spark] srowen commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575287498
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+ def randomT(): T
+ def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+ def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+ private val rnd = new scala.util.Random
+
+ private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+ var randVal = BigInt(x.bitLength, rnd)
+ while (randVal > x) {
+ randVal = BigInt(x.bitLength, rnd)
+ }
+ randVal
+ }
+
+ private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+ val diff: BigInt = upper - lower
+ randomBigInt0To(diff) + lower
+ }
+
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+ val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+ val range: BigDecimal = upper - lower
+ val halfWay: BigDecimal = lower + range / 2
+ (zeroCenteredRnd * range) + halfWay
+ }
+
+ implicit object DoubleGenerator extends Generator[Double] {
+ def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+ import limits._
+ val lower: Double = math.min(x, y)
+ val upper: Double = math.max(x, y)
+
+ override def randomTLog(n: Int): Double =
+ RandomRanges.randomLog(lower, upper, n)
+
+ override def randomT(): Double =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+ }
+ }
+
+ implicit object FloatGenerator extends Generator[Float] {
+ def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+ import limits._
+ val lower: Float = math.min(x, y)
+ val upper: Float = math.max(x, y)
+
+ override def randomTLog(n: Int): Float =
+ RandomRanges.randomLog(lower, upper, n).toFloat
+
+ override def randomT(): Float =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+ }
+ }
+
+ implicit object IntGenerator extends Generator[Int] {
+ def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+ import limits._
+ val lower: Int = math.min(x, y)
+ val upper: Int = math.max(x, y)
+
+ override def randomTLog(n: Int): Int =
+ RandomRanges.randomLog(lower, upper, n).toInt
+
+ override def randomT(): Int =
+ bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+ }
+ }
+
+ implicit object LongGenerator extends Generator[Long] {
+ def apply(limits: Limits[Long]): RandomT[Long] = new RandomT[Long] {
+ import limits._
+ val lower: Long = math.min(x, y)
+ val upper: Long = math.max(x, y)
+
+ override def randomTLog(n: Int): Long =
+ RandomRanges.randomLog(lower, upper, n).toLong
+
+ override def randomT(): Long =
+ bigIntBetween(BigInt(lower), BigInt(upper)).longValue
+ }
+ }
+
+ private[ml] def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
+
+ private[ml] def randomLog(lower: Double, upper: Double, n: Int): Double = {
+ val logLower: Double = logN(lower, n)
+ val logUpper: Double = logN(upper, n)
+ val logLimits: Limits[Double] = Limits(logLower, logUpper)
+ val rndLogged: RandomT[Double] = RandomRanges(logLimits)
+ math.pow(n, rndLogged.randomT())
+ }
+
+ private[ml] def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
+
+}
+
+/**
+ * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
+ * observations lies within the top 5% of the true maximum, with 95% probability"
+ * - Evaluating Machine Learning Models by Alice Zheng
+ * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
+ *
+ * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
+ * such as Hyperopt.
+ */
+@Since("3.2.0")
+class ParamRandomBuilder extends ParamGridBuilder {
+ @Since("3.2.0")
Review comment:
You don't have to annotate all the methods - the class annotation implies it's 'since' 3.2.0 already. OK to remove to keep it simpler.
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,210 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.tuning.RandomRanges._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+private[ml] abstract class RandomT[T: Numeric] {
+ def randomT(): T
+ def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+ def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+ private val rnd = new scala.util.Random
+
+ private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+ var randVal = BigInt(x.bitLength, rnd)
+ while (randVal > x) {
+ randVal = BigInt(x.bitLength, rnd)
+ }
+ randVal
+ }
+
+ private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+ val diff: BigInt = upper - lower
+ randomBigInt0To(diff) + lower
+ }
+
+ private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+ val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+ val range: BigDecimal = upper - lower
+ val halfWay: BigDecimal = lower + range / 2
+ (zeroCenteredRnd * range) + halfWay
+ }
+
+ implicit object DoubleGenerator extends Generator[Double] {
+ def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+ import limits._
+ val lower: Double = math.min(x, y)
+ val upper: Double = math.max(x, y)
+
+ override def randomTLog(n: Int): Double =
+ RandomRanges.randomLog(lower, upper, n)
+
+ override def randomT(): Double =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+ }
+ }
+
+ implicit object FloatGenerator extends Generator[Float] {
+ def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+ import limits._
+ val lower: Float = math.min(x, y)
+ val upper: Float = math.max(x, y)
+
+ override def randomTLog(n: Int): Float =
+ RandomRanges.randomLog(lower, upper, n).toFloat
+
+ override def randomT(): Float =
+ randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+ }
+ }
+
+ implicit object IntGenerator extends Generator[Int] {
+ def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+ import limits._
+ val lower: Int = math.min(x, y)
+ val upper: Int = math.max(x, y)
+
+ override def randomTLog(n: Int): Int =
+ RandomRanges.randomLog(lower, upper, n).toInt
+
+ override def randomT(): Int =
+ bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+ }
+ }
+
+ implicit object LongGenerator extends Generator[Long] {
+ def apply(limits: Limits[Long]): RandomT[Long] = new RandomT[Long] {
+ import limits._
+ val lower: Long = math.min(x, y)
+ val upper: Long = math.max(x, y)
+
+ override def randomTLog(n: Int): Long =
+ RandomRanges.randomLog(lower, upper, n).toLong
+
+ override def randomT(): Long =
+ bigIntBetween(BigInt(lower), BigInt(upper)).longValue
+ }
+ }
+
+ private[ml] def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
+
+ private[ml] def randomLog(lower: Double, upper: Double, n: Int): Double = {
+ val logLower: Double = logN(lower, n)
+ val logUpper: Double = logN(upper, n)
+ val logLimits: Limits[Double] = Limits(logLower, logUpper)
+ val rndLogged: RandomT[Double] = RandomRanges(logLimits)
+ math.pow(n, rndLogged.randomT())
+ }
+
+ private[ml] def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
+
+}
+
+/**
+ * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
+ * observations lies within the top 5% of the true maximum, with 95% probability"
+ * - Evaluating Machine Learning Models by Alice Zheng
+ * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
+ *
+ * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
+ * such as Hyperopt.
+ */
+@Since("3.2.0")
+class ParamRandomBuilder extends ParamGridBuilder {
+ @Since("3.2.0")
+ def addRandom[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type = {
+ val gen: RandomT[T] = RandomRanges(lim)
+ addGrid(param, (1 to n).map { _: Int => gen.randomT() })
+ }
+
+ @Since("3.2.0")
+ def addLog10Random[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type =
+ addLogRandom(param, lim, n, 10)
+
+ @Since("3.2.0")
+ def addLog2Random[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type =
Review comment:
I was going to say just go with one, natural log, but I kind of like base 10 as more useful. I'm not as sure about base 2, but I get it: batch sizes of 16, 32, 64, etc.
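As a rough illustration of what a log-scaled draw does, here is a minimal standalone sketch (not the PR's implementation; `LogUniformSketch` and `logUniform` are hypothetical names). It assumes strictly positive bounds and samples the exponent uniformly before raising the base to it:

```scala
import scala.util.Random

object LogUniformSketch {
  // Hypothetical helper: returns x such that log_base(x) is uniform
  // between log_base(lower) and log_base(upper).
  def logUniform(lower: Double, upper: Double, base: Int, rnd: Random): Double = {
    require(lower > 0 && upper >= lower && base > 1, "need 0 < lower <= upper and base > 1")
    val logLower = math.log(lower) / math.log(base)
    val logUpper = math.log(upper) / math.log(base)
    // Uniform exponent in [logLower, logUpper), then map back to the original scale.
    val exponent = logLower + rnd.nextDouble() * (logUpper - logLower)
    math.pow(base, exponent)
  }
}
```

For example, `LogUniformSketch.logUniform(0.001, 1.0, 10, new Random())` should land in each decade (0.001 to 0.01, 0.01 to 0.1, 0.1 to 1) about equally often. For continuous draws like this, the resulting distribution should not depend on the base, since changing the base only rescales the exponent interval; the base matters mainly for how the range reads to the user.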
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
+
+abstract class RandomT[T: Numeric] {
+  def randomT(): T
+  def randomTLog(n: Int): T
+}
+
+abstract class Generator[T: Numeric] {
+  def apply(lim: Limits[T]): RandomT[T]
+}
+
+object RandomRanges {
+
+  val rnd = new scala.util.Random
+
+  private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
+    var randVal = BigInt(x.bitLength, rnd)
+    while (randVal > x) {
+      randVal = BigInt(x.bitLength, rnd)
+    }
+    randVal
+  }
+
+  def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
+    val diff: BigInt = upper - lower
+    randomBigInt0To(diff) + lower
+  }
+
+  private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
+    val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
+    val range: BigDecimal = upper - lower
+    val halfWay: BigDecimal = lower + range / 2
+    (zeroCenteredRnd * range) + halfWay
+  }
+
+  implicit object DoubleGenerator extends Generator[Double] {
+    def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
+      import limits._
+      val lower: Double = math.min(x, y)
+      val upper: Double = math.max(x, y)
+
+      override def randomTLog(n: Int): Double =
+        RandomRanges.randomLog(lower, upper, n)
+
+      override def randomT(): Double =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
+    }
+  }
+
+  implicit object FloatGenerator extends Generator[Float] {
+    def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
+      import limits._
+      val lower: Float = math.min(x, y)
+      val upper: Float = math.max(x, y)
+
+      override def randomTLog(n: Int): Float =
+        RandomRanges.randomLog(lower, upper, n).toFloat
+
+      override def randomT(): Float =
+        randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
+    }
+  }
+
+  implicit object IntGenerator extends Generator[Int] {
+    def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
+      import limits._
+      val lower: Int = math.min(x, y)
+      val upper: Int = math.max(x, y)
+
+      override def randomTLog(n: Int): Int =
+        RandomRanges.randomLog(lower, upper, n).toInt
+
+      override def randomT(): Int =
+        bigIntBetween(BigInt(lower), BigInt(upper)).intValue
+    }
+  }
+
+  implicit object LongGenerator extends Generator[Long] {
Review comment:
It's no big deal either way, but yeah I'd remove it to keep this simple. Do the implicits need to be public?
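For illustration only, here is a minimal sketch of how implicit generators could be kept out of the public API while type-specific entry points stay public. It is not the code from this PR: the enclosing object `tuning`, the simplified `Generator` trait, and the `uniformDouble` method are all hypothetical names chosen so the example compiles without Spark.

```scala
import scala.util.Random

// Hypothetical stand-in for the package; `private[tuning]` below refers to it.
object tuning {
  // Simplified, hypothetical versions of the Limits / Generator types.
  final case class Limits[T](x: T, y: T)

  trait Generator[T] {
    def sample(lim: Limits[T], rnd: Random): T
  }

  object Generator {
    // The implicit is found through the companion object, but it is only
    // accessible to code inside `tuning`.
    private[tuning] implicit object DoubleGenerator extends Generator[Double] {
      def sample(lim: Limits[Double], rnd: Random): Double = {
        // Accept the bounds in either order.
        val lower = math.min(lim.x, lim.y)
        val upper = math.max(lim.x, lim.y)
        lower + rnd.nextDouble() * (upper - lower)
      }
    }
  }

  // Generic helper: resolves the implicit internally, so it is private too.
  private def sample[T](lim: Limits[T], rnd: Random)(implicit gen: Generator[T]): T =
    gen.sample(lim, rnd)

  // Public, type-specific entry point; callers never see the implicit.
  def uniformDouble(x: Double, y: Double, rnd: Random): Double =
    sample(Limits(x, y), rnd)
}
```

With this shape, a caller would write something like `tuning.uniformDouble(0.01, 1.0, new Random(42))`. The trade-off is one public method per supported numeric type rather than a single generic method.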
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783609171
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39934/
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786071414
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/40053/
----------------------------------------------------------------
[GitHub] [spark] PhillHenry commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-776560091
@HyukjinKwon Sorry, I didn't realise the square brackets in the documentation were literal (first-time contributor).
Hope that's better now.
----------------------------------------------------------------
[GitHub] [spark] PhillHenry commented on pull request #31535: SPARK-34415 mllib Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-776547181
> Would you mind fixing the PR title and filing a JIRA? See also http://spark.apache.org/contributing.html.
Done.
HTH.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-786824822
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135518/
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-783609171
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39934/
----------------------------------------------------------------
[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-787507179
Ah I see it:
https://github.com/apache/spark/pull/31681/checks?check_run_id=1998307491
`No such file or directory @ rb_sysopen - /__w/spark/spark/docs/../examples/src/main/python/ml/model_selection_random_hyperparameters_example.py`
@PhillHenry Was there an additional example file that was meant to be included in the PR? If so, just open another PR and I'll add it. If the linter needs to be restored soon, I can temporarily remove the reference to this example.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on pull request #31535: Param random builder
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-776320229
Would you mind fixing the PR title and filing a JIRA? See also http://spark.apache.org/contributing.html.
----------------------------------------------------------------
[GitHub] [spark] srowen commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
srowen commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784264677
OK, looking good. One thing I forgot to consider is the PySpark version of this. It should be fairly easy to offer the same API in Python, but it may require some different classes on the Python side to represent ranges and so on. Could you have a look at ParamGridBuilder in Python and see what it would take to adapt it to this new class?
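To make "the same API" concrete, the following is a rough, Spark-free sketch of the shape a port would need to mirror: a grid-style builder with one extra method that draws values from a range. Everything here is an assumption for illustration. The class name `RandomGridSketch`, the method names, and the use of plain strings in place of Spark `Param` objects are hypothetical and are not taken from the PR.

```scala
import scala.util.Random

// Hypothetical stand-in for a random parameter builder; parameters are
// identified by name only, and all values are Doubles for simplicity.
final class RandomGridSketch(rnd: Random) {
  private var grid = Map.empty[String, Seq[Double]]

  // Fixed candidate values, as in a plain grid search.
  def addGrid(name: String, values: Seq[Double]): this.type = {
    grid += (name -> values)
    this
  }

  // n uniform draws between x and y; the bounds may be given in either order.
  def addRandom(name: String, x: Double, y: Double, n: Int): this.type = {
    val lower = math.min(x, y)
    val upper = math.max(x, y)
    addGrid(name, Seq.fill(n)(lower + rnd.nextDouble() * (upper - lower)))
  }

  // Cartesian product of all candidate values, one Map per combination.
  def build(): Seq[Map[String, Double]] =
    grid.foldLeft(Seq(Map.empty[String, Double])) { case (acc, (name, values)) =>
      for (m <- acc; v <- values) yield m + (name -> v)
    }
}
```

A call such as `new RandomGridSketch(new Random(1)).addRandom("regParam", 0.01, 1.0, 3).addGrid("maxIter", Seq(10.0, 20.0)).build()` would yield six combinations. A Python version would need an equivalent way to express the range and the number of draws.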
----------------------------------------------------------------
[GitHub] [spark] shaneknapp commented on pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
shaneknapp commented on pull request #31535:
URL: https://github.com/apache/spark/pull/31535#issuecomment-784385517
test this please
----------------------------------------------------------------
[GitHub] [spark] PhillHenry commented on a change in pull request #31535: [SPARK-34415][ML] Randomization in hyperparameter optimization
Posted by GitBox <gi...@apache.org>.
PhillHenry commented on a change in pull request #31535:
URL: https://github.com/apache/spark/pull/31535#discussion_r575100930
##########
File path: mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.tuning
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.param._
+
+case class Limits[T: Numeric](x: T, y: T)
Review comment:
Good point. Anything not public-facing that wasn't private before is now.
----------------------------------------------------------------