Posted to commits@spark.apache.org by do...@apache.org on 2021/08/24 20:39:11 UTC
[spark] branch master updated: Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization"
This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new de932f5 Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization"
de932f5 is described below
commit de932f51ceb8b9805c26c7bd13c1cfb628d8128d
Author: Gengliang Wang <ge...@apache.org>
AuthorDate: Tue Aug 24 13:38:14 2021 -0700
Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization"
### What changes were proposed in this pull request?
Revert https://github.com/apache/spark/commit/397b843890db974a0534394b1907d33d62c2b888 and https://github.com/apache/spark/commit/5a48eb8d00faee3a7c8f023c0699296e22edb893
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch 3.2 and fix it on the master branch later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI tests
Closes #33819 from gengliangwang/revert-SPARK-34415.
Authored-by: Gengliang Wang <ge...@apache.org>
Signed-off-by: Dongjoon Hyun <do...@apache.org>
---
docs/ml-tuning.md | 36 +----
...elSelectionViaRandomHyperparametersExample.java | 83 ----------
...del_selection_random_hyperparameters_example.py | 66 --------
...lSelectionViaRandomHyperparametersExample.scala | 79 ----------
.../spark/ml/tuning/ParamRandomBuilder.scala | 160 --------------------
.../spark/ml/tuning/ParamRandomBuilderSuite.scala | 123 ---------------
.../apache/spark/ml/tuning/RandomRangesSuite.scala | 168 ---------------------
python/docs/source/reference/pyspark.ml.rst | 1 -
python/pyspark/ml/tests/test_tuning.py | 105 +------------
python/pyspark/ml/tuning.py | 48 +-----
python/pyspark/ml/tuning.pyi | 5 -
11 files changed, 3 insertions(+), 871 deletions(-)
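
Note on the sampling claims in the removed text: the reverted docs and Scaladoc say that about 60 random samples are enough to land near the optimum, and the reverted examples sample `regParam` "logarithmically" over [0.01, 1.0]. The sketch below is a standalone illustration of both ideas in plain Python, not the reverted implementation. The function names (`prob_hit_top_fraction`, `min_samples_for`, `log10_uniform`) are hypothetical, and it assumes independent draws where each draw has a 5% chance of falling in the target region.

```python
import math
import random


def prob_hit_top_fraction(n_samples: int, fraction: float = 0.05) -> float:
    """Probability that at least one of n independent draws lands in a
    region that each draw hits with probability `fraction`."""
    return 1.0 - (1.0 - fraction) ** n_samples


def min_samples_for(confidence: float = 0.95, fraction: float = 0.05) -> int:
    """Smallest n for which prob_hit_top_fraction(n, fraction) >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))


def log10_uniform(lower: float, upper: float, rng: random.Random) -> float:
    """Draw a value whose base-10 logarithm is uniform between the limits.

    Both limits must be positive, because log10 is undefined at zero and below.
    """
    if lower <= 0 or upper <= 0:
        raise ValueError("both limits must be greater than zero")
    lo, hi = sorted((math.log10(lower), math.log10(upper)))
    return 10.0 ** rng.uniform(lo, hi)


if __name__ == "__main__":
    # Chance that 60 draws include at least one hit in a 5% region.
    print(prob_hit_top_fraction(60))
    # Smallest sample count that reaches 95% confidence.
    print(min_samples_for(0.95, 0.05))
    # Five log-uniform draws over [0.01, 1.0], seeded for repeatability.
    rng = random.Random(42)
    print([log10_uniform(0.01, 1.0, rng) for _ in range(5)])
```

Under these assumptions, 60 draws give roughly a 95% chance of at least one hit, which matches the "with 95% probability" wording quoted in the Scaladoc. The mean number of draws until the first hit would be 1 / 0.05 = 20, so the "expected number of samples" phrasing in the removed `ml-tuning.md` text is looser than the quoted statement.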
diff --git a/docs/ml-tuning.md b/docs/ml-tuning.md
index e7940a3..3ddd185 100644
--- a/docs/ml-tuning.md
+++ b/docs/ml-tuning.md
@@ -71,44 +71,10 @@ for multiclass problems, a [`MultilabelClassificationEvaluator`](api/scala/org/a
[`RankingEvaluator`](api/scala/org/apache/spark/ml/evaluation/RankingEvaluator.html) for ranking problems. The default metric used to
choose the best `ParamMap` can be overridden by the `setMetricName` method in each of these evaluators.
-To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html) utility (see the *Cross-Validation* section below for an example).
+To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html) utility.
By default, sets of parameters from the parameter grid are evaluated in serial. Parameter evaluation can be done in parallel by setting `parallelism` with a value of 2 or more (a value of 1 will be serial) before running model selection with `CrossValidator` or `TrainValidationSplit`.
The value of `parallelism` should be chosen carefully to maximize parallelism without exceeding cluster resources, and larger values may not always lead to improved performance. Generally speaking, a value up to 10 should be sufficient for most clusters.
-Alternatively, users can use the [`ParamRandomBuilder`](api/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.html) utility.
-This has the same properties of `ParamGridBuilder` mentioned above, but hyperparameters are chosen at random within a user-defined range.
-The mathematical principle behind this is that given enough samples, the probability of at least one sample *not* being near the optimum within a range tends to zero.
-Irrespective of machine learning model, the expected number of samples needed to have at least one within 5% of the optimum is about 60.
-If this 5% volume lies between the parameters defined in a grid search, it will *never* be found by `ParamGridBuilder`.
-
-<div class="codetabs">
-
-<div data-lang="scala" markdown="1">
-
-Refer to the [`ParamRandomBuilder` Scala docs](api/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.html) for details on the API.
-
-{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaRandomHyperparametersExample.scala %}
-</div>
-
-<div data-lang="java" markdown="1">
-
-Refer to the [`ParamRandomBuilder` Java docs](api/java/org/apache/spark/ml/tuning/ParamRandomBuilder.html) for details on the API.
-
-{% include_example java/org/apache/spark/examples/ml/JavaModelSelectionViaRandomHyperparametersExample.java %}
-</div>
-
-<div data-lang="python" markdown="1">
-
-Python users are recommended to look at Python libraries that are specifically for hyperparameter tuning such as Hyperopt.
-
-Refer to the [`ParamRandomBuilder` Java docs](api/python/reference/api/pyspark.ml.tuning.ParamRandomBuilder.html) for details on the API.
-
-{% include_example python/ml/model_selection_random_hyperparameters_example.py %}
-
-</div>
-
-</div>
-
# Cross-Validation
`CrossValidator` begins by splitting the dataset into a set of *folds* which are used as separate training and test datasets. E.g., with `$k=3$` folds, `CrossValidator` will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular `ParamMap`, `CrossValidator` computes the average evaluation metric for the 3 `Model`s produced by fitting the `Estimator` on the 3 different (training, test) dataset pairs.
diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaModelSelectionViaRandomHyperparametersExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaModelSelectionViaRandomHyperparametersExample.java
deleted file mode 100644
index 086920f..0000000
--- a/examples/src/main/java/org/apache/spark/examples/ml/JavaModelSelectionViaRandomHyperparametersExample.java
+++ /dev/null
@@ -1,83 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.examples.ml;
-
-// $example on$
-import org.apache.spark.ml.evaluation.RegressionEvaluator;
-import org.apache.spark.ml.param.ParamMap;
-import org.apache.spark.ml.regression.LinearRegression;
-import org.apache.spark.ml.tuning.*;
-import org.apache.spark.sql.Dataset;
-import org.apache.spark.sql.Row;
-import org.apache.spark.sql.SparkSession;
-// $example off$
-
-/**
- * A simple example demonstrating model selection using ParamRandomBuilder.
- *
- * Run with
- * {{{
- * bin/run-example ml.JavaModelSelectionViaRandomHyperparametersExample
- * }}}
- */
-public class JavaModelSelectionViaRandomHyperparametersExample {
-
- public static void main(String[] args) {
- SparkSession spark = SparkSession
- .builder()
- .appName("JavaModelSelectionViaTrainValidationSplitExample")
- .getOrCreate();
-
- // $example on$
- Dataset<Row> data = spark.read().format("libsvm")
- .load("data/mllib/sample_linear_regression_data.txt");
-
- LinearRegression lr = new LinearRegression();
-
- // We sample the regularization parameter logarithmically over the range [0.01, 1.0].
- // This means that values around 0.01, 0.1 and 1.0 are roughly equally likely.
- // Note that both parameters must be greater than zero as otherwise we'll get an infinity.
- // We sample the the ElasticNet mixing parameter uniformly over the range [0, 1]
- // Note that in real life, you'd choose more than the 5 samples we see below.
- ParamMap[] hyperparameters = new ParamRandomBuilder()
- .addLog10Random(lr.regParam(), 0.01, 1.0, 5)
- .addRandom(lr.elasticNetParam(), 0.0, 1.0, 5)
- .addGrid(lr.fitIntercept())
- .build();
-
- System.out.println("hyperparameters:");
- for (ParamMap param : hyperparameters) {
- System.out.println(param);
- }
-
- CrossValidator cv = new CrossValidator()
- .setEstimator(lr)
- .setEstimatorParamMaps(hyperparameters)
- .setEvaluator(new RegressionEvaluator())
- .setNumFolds(3);
- CrossValidatorModel cvModel = cv.fit(data);
- LinearRegression parent = (LinearRegression)cvModel.bestModel().parent();
-
- System.out.println("Optimal model has\n" + lr.regParam() + " = " + parent.getRegParam()
- + "\n" + lr.elasticNetParam() + " = "+ parent.getElasticNetParam()
- + "\n" + lr.fitIntercept() + " = " + parent.getFitIntercept());
- // $example off$
-
- spark.stop();
- }
-}
diff --git a/examples/src/main/python/ml/model_selection_random_hyperparameters_example.py b/examples/src/main/python/ml/model_selection_random_hyperparameters_example.py
deleted file mode 100644
index b436341..0000000
--- a/examples/src/main/python/ml/model_selection_random_hyperparameters_example.py
+++ /dev/null
@@ -1,66 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-This example uses random hyperparameters to perform model selection.
-Run with:
-
- bin/spark-submit examples/src/main/python/ml/model_selection_random_hyperparameters_example.py
-"""
-# $example on$
-from pyspark.ml.evaluation import RegressionEvaluator
-from pyspark.ml.regression import LinearRegression
-from pyspark.ml.tuning import ParamRandomBuilder, CrossValidator
-# $example off$
-from pyspark.sql import SparkSession
-
-if __name__ == "__main__":
- spark = SparkSession \
- .builder \
- .appName("TrainValidationSplit") \
- .getOrCreate()
-
- # $example on$
- data = spark.read.format("libsvm") \
- .load("data/mllib/sample_linear_regression_data.txt")
-
- lr = LinearRegression(maxIter=10)
-
- # We sample the regularization parameter logarithmically over the range [0.01, 1.0].
- # This means that values around 0.01, 0.1 and 1.0 are roughly equally likely.
- # Note that both parameters must be greater than zero as otherwise we'll get an infinity.
- # We sample the the ElasticNet mixing parameter uniformly over the range [0, 1]
- # Note that in real life, you'd choose more than the 5 samples we see below.
- hyperparameters = ParamRandomBuilder() \
- .addLog10Random(lr.regParam, 0.01, 1.0, 5) \
- .addRandom(lr.elasticNetParam, 0.0, 1.0, 5) \
- .addGrid(lr.fitIntercept, [False, True]) \
- .build()
-
- cv = CrossValidator(estimator=lr,
- estimatorParamMaps=hyperparameters,
- evaluator=RegressionEvaluator(),
- numFolds=2)
-
- model = cv.fit(data)
- bestModel = model.bestModel
- print("Optimal model has regParam = {}, elasticNetParam = {}, fitIntercept = {}"
- .format(bestModel.getRegParam(), bestModel.getElasticNetParam(),
- bestModel.getFitIntercept()))
-
- # $example off$
- spark.stop()
diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaRandomHyperparametersExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaRandomHyperparametersExample.scala
deleted file mode 100644
index 9d2c58bb..0000000
--- a/examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaRandomHyperparametersExample.scala
+++ /dev/null
@@ -1,79 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.examples.ml
-
-// $example on$
-import org.apache.spark.ml.evaluation.RegressionEvaluator
-import org.apache.spark.ml.regression.LinearRegression
-import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, Limits, ParamRandomBuilder}
-import org.apache.spark.ml.tuning.RandomRanges._
-// $example off$
-import org.apache.spark.sql.SparkSession
-
-/**
- * A simple example demonstrating model selection using ParamRandomBuilder.
- *
- * Run with
- * {{{
- * bin/run-example ml.ModelSelectionViaRandomHyperparametersExample
- * }}}
- */
-object ModelSelectionViaRandomHyperparametersExample {
- def main(args: Array[String]): Unit = {
- val spark = SparkSession
- .builder
- .appName("ModelSelectionViaTrainValidationSplitExample")
- .getOrCreate()
- // scalastyle:off println
- // $example on$
- // Prepare training and test data.
- val data = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
-
- val lr = new LinearRegression().setMaxIter(10)
-
- // We sample the regularization parameter logarithmically over the range [0.01, 1.0].
- // This means that values around 0.01, 0.1 and 1.0 are roughly equally likely.
- // Note that both parameters must be greater than zero as otherwise we'll get an infinity.
- // We sample the the ElasticNet mixing parameter uniformly over the range [0, 1]
- // Note that in real life, you'd choose more than the 5 samples we see below.
- val hyperparameters = new ParamRandomBuilder()
- .addLog10Random(lr.regParam, Limits(0.01, 1.0), 5)
- .addGrid(lr.fitIntercept)
- .addRandom(lr.elasticNetParam, Limits(0.0, 1.0), 5)
- .build()
-
- println(s"hyperparameters:\n${hyperparameters.mkString("\n")}")
-
- val cv: CrossValidator = new CrossValidator()
- .setEstimator(lr)
- .setEstimatorParamMaps(hyperparameters)
- .setEvaluator(new RegressionEvaluator)
- .setNumFolds(3)
- val cvModel: CrossValidatorModel = cv.fit(data)
- val parent: LinearRegression = cvModel.bestModel.parent.asInstanceOf[LinearRegression]
-
- println(s"""Optimal model has:
- |${lr.regParam} = ${parent.getRegParam}
- |${lr.elasticNetParam} = ${parent.getElasticNetParam}
- |${lr.fitIntercept} = ${parent.getFitIntercept}""".stripMargin)
- // $example off$
-
- spark.stop()
- }
- // scalastyle:on println
-}
diff --git a/mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala b/mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
deleted file mode 100644
index 9c296bb..0000000
--- a/mllib/src/main/scala/org/apache/spark/ml/tuning/ParamRandomBuilder.scala
+++ /dev/null
@@ -1,160 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.ml.tuning
-
-import org.apache.spark.annotation.Since
-import org.apache.spark.ml.param._
-import org.apache.spark.ml.tuning.RandomRanges._
-
-case class Limits[T: Numeric](x: T, y: T)
-
-private[ml] abstract class RandomT[T: Numeric] {
- def randomT(): T
- def randomTLog(n: Int): T
-}
-
-abstract class Generator[T: Numeric] {
- def apply(lim: Limits[T]): RandomT[T]
-}
-
-object RandomRanges {
-
- private val rnd = new scala.util.Random
-
- private[tuning] def randomBigInt0To(x: BigInt): BigInt = {
- var randVal = BigInt(x.bitLength, rnd)
- while (randVal > x) {
- randVal = BigInt(x.bitLength, rnd)
- }
- randVal
- }
-
- private[ml] def bigIntBetween(lower: BigInt, upper: BigInt): BigInt = {
- val diff: BigInt = upper - lower
- randomBigInt0To(diff) + lower
- }
-
- private def randomBigDecimalBetween(lower: BigDecimal, upper: BigDecimal): BigDecimal = {
- val zeroCenteredRnd: BigDecimal = BigDecimal(rnd.nextDouble() - 0.5)
- val range: BigDecimal = upper - lower
- val halfWay: BigDecimal = lower + range / 2
- (zeroCenteredRnd * range) + halfWay
- }
-
- implicit object DoubleGenerator extends Generator[Double] {
- def apply(limits: Limits[Double]): RandomT[Double] = new RandomT[Double] {
- import limits._
- val lower: Double = math.min(x, y)
- val upper: Double = math.max(x, y)
-
- override def randomTLog(n: Int): Double =
- RandomRanges.randomLog(lower, upper, n)
-
- override def randomT(): Double =
- randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).doubleValue
- }
- }
-
- implicit object FloatGenerator extends Generator[Float] {
- def apply(limits: Limits[Float]): RandomT[Float] = new RandomT[Float] {
- import limits._
- val lower: Float = math.min(x, y)
- val upper: Float = math.max(x, y)
-
- override def randomTLog(n: Int): Float =
- RandomRanges.randomLog(lower, upper, n).toFloat
-
- override def randomT(): Float =
- randomBigDecimalBetween(BigDecimal(lower), BigDecimal(upper)).floatValue
- }
- }
-
- implicit object IntGenerator extends Generator[Int] {
- def apply(limits: Limits[Int]): RandomT[Int] = new RandomT[Int] {
- import limits._
- val lower: Int = math.min(x, y)
- val upper: Int = math.max(x, y)
-
- override def randomTLog(n: Int): Int =
- RandomRanges.randomLog(lower, upper, n).toInt
-
- override def randomT(): Int =
- bigIntBetween(BigInt(lower), BigInt(upper)).intValue
- }
- }
-
- private[ml] def logN(x: Double, base: Int): Double = math.log(x) / math.log(base)
-
- private[ml] def randomLog(lower: Double, upper: Double, n: Int): Double = {
- val logLower: Double = logN(lower, n)
- val logUpper: Double = logN(upper, n)
- val logLimits: Limits[Double] = Limits(logLower, logUpper)
- val rndLogged: RandomT[Double] = RandomRanges(logLimits)
- math.pow(n, rndLogged.randomT())
- }
-
- private[ml] def apply[T: Generator](lim: Limits[T])(implicit t: Generator[T]): RandomT[T] = t(lim)
-
-}
-
-/**
- * "For any distribution over a sample space with a finite maximum, the maximum of 60 random
- * observations lies within the top 5% of the true maximum, with 95% probability"
- * - Evaluating Machine Learning Models by Alice Zheng
- * https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
- *
- * Note: if you want more sophisticated hyperparameter tuning, consider Python libraries
- * such as Hyperopt.
- */
-@Since("3.2.0")
-class ParamRandomBuilder extends ParamGridBuilder {
- def addRandom[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type = {
- val gen: RandomT[T] = RandomRanges(lim)
- addGrid(param, (1 to n).map { _: Int => gen.randomT() })
- }
-
- def addLog10Random[T: Generator](param: Param[T], lim: Limits[T], n: Int): this.type =
- addLogRandom(param, lim, n, 10)
-
- private def addLogRandom[T: Generator](param: Param[T], lim: Limits[T],
- n: Int, base: Int): this.type = {
- val gen: RandomT[T] = RandomRanges(lim)
- addGrid(param, (1 to n).map { _: Int => gen.randomTLog(base) })
- }
-
- // specialized versions for Java.
-
- def addRandom(param: DoubleParam, x: Double, y: Double, n: Int): this.type =
- addRandom(param, Limits(x, y), n)(DoubleGenerator)
-
- def addLog10Random(param: DoubleParam, x: Double, y: Double, n: Int): this.type =
- addLogRandom(param, Limits(x, y), n, 10)(DoubleGenerator)
-
- def addRandom(param: FloatParam, x: Float, y: Float, n: Int): this.type =
- addRandom(param, Limits(x, y), n)(FloatGenerator)
-
- def addLog10Random(param: FloatParam, x: Float, y: Float, n: Int): this.type =
- addLogRandom(param, Limits(x, y), n, 10)(FloatGenerator)
-
- def addRandom(param: IntParam, x: Int, y: Int, n: Int): this.type =
- addRandom(param, Limits(x, y), n)(IntGenerator)
-
- def addLog10Random(param: IntParam, x: Int, y: Int, n: Int): this.type =
- addLogRandom(param, Limits(x, y), n, 10)(IntGenerator)
-
-}
diff --git a/mllib/src/test/scala/org/apache/spark/ml/tuning/ParamRandomBuilderSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/tuning/ParamRandomBuilderSuite.scala
deleted file mode 100644
index e17c48e..0000000
--- a/mllib/src/test/scala/org/apache/spark/ml/tuning/ParamRandomBuilderSuite.scala
+++ /dev/null
@@ -1,123 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.ml.tuning
-
-import org.scalatest.matchers.must.Matchers
-import org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks
-
-import org.apache.spark.SparkFunSuite
-import org.apache.spark.ml.param._
-
-class ParamRandomBuilderSuite extends SparkFunSuite with ScalaCheckDrivenPropertyChecks
- with Matchers {
-
- val solver = new TestParams() {
- private val randomColName = "randomVal"
- val DummyDoubleParam = new DoubleParam(this, randomColName, "doc")
- val DummyFloatParam = new FloatParam(this, randomColName, "doc")
- val DummyIntParam = new IntParam(this, randomColName, "doc")
- }
- import solver._
-
- val DoubleLimits: Limits[Double] = Limits(1d, 100d)
- val FloatLimits: Limits[Float] = Limits(1f, 100f)
- val IntLimits: Limits[Int] = Limits(1, 100)
- val nRandoms: Int = 5
-
- // Java API
-
- test("Java API random Double linear params mixed with fixed values") {
- checkRangeAndCardinality(
- _.addRandom(DummyDoubleParam, DoubleLimits.x, DoubleLimits.y, nRandoms),
- DoubleLimits,
- DummyDoubleParam)
- }
-
- test("Java API random Double log10 params mixed with fixed values") {
- checkRangeAndCardinality(
- _.addLog10Random(DummyDoubleParam, DoubleLimits.x, DoubleLimits.y, nRandoms),
- DoubleLimits,
- DummyDoubleParam)
- }
-
- test("Java API random Float linear params mixed with fixed values") {
- checkRangeAndCardinality(
- _.addRandom(DummyFloatParam, FloatLimits.x, FloatLimits.y, nRandoms),
- FloatLimits,
- DummyFloatParam)
- }
-
- test("Java API random Float log10 params mixed with fixed values") {
- checkRangeAndCardinality(
- _.addLog10Random(DummyFloatParam, FloatLimits.x, FloatLimits.y, nRandoms),
- FloatLimits,
- DummyFloatParam)
- }
-
- test("Java API random Int linear params mixed with fixed values") {
- checkRangeAndCardinality(
- _.addRandom(DummyIntParam, IntLimits.x, IntLimits.y, nRandoms),
- IntLimits,
- DummyIntParam)
- }
-
- test("Java API random Int log10 params mixed with fixed values") {
- checkRangeAndCardinality(
- _.addLog10Random(DummyIntParam, IntLimits.x, IntLimits.y, nRandoms),
- IntLimits,
- DummyIntParam)
- }
-
- // Scala API
-
- test("random linear params mixed with fixed values") {
- import RandomRanges._
- checkRangeAndCardinality(_.addRandom(DummyDoubleParam, DoubleLimits, nRandoms),
- DoubleLimits,
- DummyDoubleParam)
- }
-
- test("random log10 params mixed with fixed values") {
- import RandomRanges._
- checkRangeAndCardinality(_.addLog10Random(DummyDoubleParam, DoubleLimits, nRandoms),
- DoubleLimits,
- DummyDoubleParam)
- }
-
- def checkRangeAndCardinality[T: Numeric](addFn: ParamRandomBuilder => ParamRandomBuilder,
- lim: Limits[T],
- randomCol: Param[T]): Unit = {
- val maxIterations: Int = 10
- val basedOn: Array[ParamPair[_]] = Array(maxIter -> maxIterations)
- val inputCols: Array[String] = Array("input0", "input1")
- val ops: Numeric[T] = implicitly[Numeric[T]]
-
- val builder: ParamRandomBuilder = new ParamRandomBuilder()
- .baseOn(basedOn: _*)
- .addGrid(inputCol, inputCols)
- val paramMap: Array[ParamMap] = addFn(builder).build()
- assert(paramMap.length == inputCols.length * nRandoms * basedOn.length)
- paramMap.foreach { m: ParamMap =>
- assert(m(maxIter) == maxIterations)
- assert(inputCols contains m(inputCol))
- assert(ops.gteq(m(randomCol), lim.x))
- assert(ops.lteq(m(randomCol), lim.y))
- }
- }
-
-}
diff --git a/mllib/src/test/scala/org/apache/spark/ml/tuning/RandomRangesSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/tuning/RandomRangesSuite.scala
deleted file mode 100644
index afcbc03..0000000
--- a/mllib/src/test/scala/org/apache/spark/ml/tuning/RandomRangesSuite.scala
+++ /dev/null
@@ -1,168 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.ml.tuning
-
-import scala.reflect.runtime.universe.TypeTag
-
-import org.scalacheck.{Arbitrary, Gen}
-import org.scalacheck.Arbitrary._
-import org.scalacheck.Gen.Choose
-import org.scalatest.{Assertion, Succeeded}
-import org.scalatest.matchers.must.Matchers
-import org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks
-
-import org.apache.spark.SparkFunSuite
-
-class RandomRangesSuite extends SparkFunSuite with ScalaCheckDrivenPropertyChecks with Matchers {
-
- import RandomRanges._
-
- test("log of any base") {
- assert(logN(16, 4) == 2d)
- assert(logN(1000, 10) === (3d +- 0.000001))
- assert(logN(256, 2) == 8d)
- }
-
- test("random doubles in log space") {
- val gen: Gen[(Double, Double, Int)] = for {
- x <- Gen.choose(0d, Double.MaxValue)
- y <- Gen.choose(0d, Double.MaxValue)
- n <- Gen.choose(0, Int.MaxValue)
- } yield (x, y, n)
- forAll(gen) { case (x, y, n) =>
- val lower = math.min(x, y)
- val upper = math.max(x, y)
- val result = randomLog(x, y, n)
- assert(result >= lower && result <= upper)
- }
- }
-
- test("random BigInt generation does not go into infinite loop") {
- assert(randomBigInt0To(0) == BigInt(0))
- }
-
- test("random ints") {
- checkRange(Linear[Int])
- }
-
- test("random log ints") {
- checkRange(Log10[Int])
- }
-
- test("random int distribution") {
- checkDistributionOf(1000)
- }
-
- test("random doubles") {
- checkRange(Linear[Double])
- }
-
- test("random log doubles") {
- checkRange(Log10[Double])
- }
-
- test("random double distribution") {
- checkDistributionOf(1000d)
- }
-
- test("random floats") {
- checkRange(Linear[Float])
- }
-
- test("random log floats") {
- checkRange(Log10[Float])
- }
-
- test("random float distribution") {
- checkDistributionOf(1000f)
- }
-
- private abstract class RandomFn[T: Numeric: Generator] {
- def apply(genRandom: RandomT[T]): T = genRandom.randomT()
- def appropriate(x: T, y: T): Boolean
- }
-
- private def Linear[T: Numeric: Generator]: RandomFn[T] = new RandomFn {
- override def apply(genRandom: RandomT[T]): T = genRandom.randomT()
- override def appropriate(x: T, y: T): Boolean = true
- }
-
- private def Log10[T: Numeric: Generator]: RandomFn[T] = new RandomFn {
- override def apply(genRandom: RandomT[T]): T = genRandom.randomTLog(10)
- val ops: Numeric[T] = implicitly[Numeric[T]]
- override def appropriate(x: T, y: T): Boolean = {
- ops.gt(x, ops.zero) && ops.gt(y, ops.zero) && x != y
- }
- }
-
- private def checkRange[T: Numeric: Generator: Choose: TypeTag: Arbitrary]
- (rand: RandomFn[T]): Assertion =
- forAll { (x: T, y: T) =>
- if (rand.appropriate(x, y)) {
- val ops: Numeric[T] = implicitly[Numeric[T]]
- val limit: Limits[T] = Limits(x, y)
- val gen: RandomT[T] = RandomRanges(limit)
- val result: T = rand(gen)
- val ordered: (T, T) = lowerUpper(x, y)
- assert(ops.gteq(result, ordered._1) && ops.lteq(result, ordered._2))
- } else Succeeded
- }
-
- private def checkDistributionOf[T: Numeric: Generator: Choose](range: T): Unit = {
- val ops: Numeric[T] = implicitly[Numeric[T]]
- import ops._
- val gen: Gen[(T, T)] = for {
- x <- Gen.choose(negate(range), range)
- y <- Gen.choose(range, times(range, plus(one, one)))
- } yield (x, y)
- forAll(gen) { case (x, y) =>
- assertEvenDistribution(10000, Limits(x, y))
- }
- }
-
- private def meanAndStandardDeviation[T: Numeric](xs: Seq[T]): (Double, Double) = {
- val ops: Numeric[T] = implicitly[Numeric[T]]
- val n: Int = xs.length
- val mean: Double = ops.toDouble(xs.sum) / n
- val squaredDiff: Seq[Double] = xs.map { x: T => math.pow(ops.toDouble(x) - mean, 2) }
- val stdDev: Double = math.pow(squaredDiff.sum / n - 1, 0.5)
- (mean, stdDev)
- }
-
- private def lowerUpper[T: Numeric](x: T, y: T): (T, T) = {
- val ops: Numeric[T] = implicitly[Numeric[T]]
- (ops.min(x, y), ops.max(x, y))
- }
-
- private def midPointOf[T: Numeric : Generator](lim: Limits[T]): Double = {
- val ordered: (T, T) = lowerUpper(lim.x, lim.y)
- val ops: Numeric[T] = implicitly[Numeric[T]]
- val range: T = ops.minus(ordered._2, ordered._1)
- (ops.toDouble(range) / 2) + ops.toDouble(ordered._1)
- }
-
- private def assertEvenDistribution[T: Numeric: Generator](n: Int, lim: Limits[T]): Assertion = {
- val gen: RandomT[T] = RandomRanges(lim)
- val xs: Seq[T] = (0 to n).map { _: Int => gen.randomT() }
- val (mean, stdDev) = meanAndStandardDeviation(xs)
- val tolerance: Double = 4 * stdDev
- val halfWay: Double = midPointOf(lim)
- assert(mean > halfWay - tolerance && mean < halfWay + tolerance)
- }
-
-}
diff --git a/python/docs/source/reference/pyspark.ml.rst b/python/docs/source/reference/pyspark.ml.rst
index fc6060c..7837d60 100644
--- a/python/docs/source/reference/pyspark.ml.rst
+++ b/python/docs/source/reference/pyspark.ml.rst
@@ -288,7 +288,6 @@ Tuning
:toctree: api/
ParamGridBuilder
- ParamRandomBuilder
CrossValidator
CrossValidatorModel
TrainValidationSplit
diff --git a/python/pyspark/ml/tests/test_tuning.py b/python/pyspark/ml/tests/test_tuning.py
index 21551fd..83baf3e 100644
--- a/python/pyspark/ml/tests/test_tuning.py
+++ b/python/pyspark/ml/tests/test_tuning.py
@@ -16,7 +16,6 @@
#
import tempfile
-import math
import unittest
import numpy as np
@@ -28,7 +27,7 @@ from pyspark.ml.evaluation import BinaryClassificationEvaluator, \
from pyspark.ml.linalg import Vectors
from pyspark.ml.param import Param, Params
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder, \
- TrainValidationSplit, TrainValidationSplitModel, ParamRandomBuilder
+ TrainValidationSplit, TrainValidationSplitModel
from pyspark.sql.functions import rand
from pyspark.testing.mlutils import DummyEvaluator, DummyLogisticRegression, \
DummyLogisticRegressionModel, SparkSessionTestCase
@@ -67,108 +66,6 @@ class InducedErrorEstimator(Estimator, HasInducedError):
return model
-class DummyParams(Params):
-
- def __init__(self):
- super(DummyParams, self).__init__()
- self.test_param = Param(self, "test_param", "dummy parameter for testing")
- self.another_test_param = Param(self, "another_test_param", "second parameter for testing")
-
-
-class ParamRandomBuilderTests(unittest.TestCase):
-
- def __init__(self, methodName):
- super(ParamRandomBuilderTests, self).__init__(methodName=methodName)
- self.dummy_params = DummyParams()
- self.to_test = ParamRandomBuilder()
- self.n = 100
-
- def check_ranges(self, params, lowest, highest, expected_type):
- self.assertEqual(self.n, len(params))
- for param in params:
- for v in param.values():
- self.assertGreaterEqual(v, lowest)
- self.assertLessEqual(v, highest)
- self.assertEqual(type(v), expected_type)
-
- def check_addRandom_ranges(self, x, y, expected_type):
- params = self.to_test.addRandom(self.dummy_params.test_param, x, y, self.n).build()
- self.check_ranges(params, x, y, expected_type)
-
- def check_addLog10Random_ranges(self, x, y, expected_type):
- params = self.to_test.addLog10Random(self.dummy_params.test_param, x, y, self.n).build()
- self.check_ranges(params, x, y, expected_type)
-
- @staticmethod
- def counts(xs):
- key_to_count = {}
- for v in xs:
- k = int(v)
- if key_to_count.get(k) is None:
- key_to_count[k] = 1
- else:
- key_to_count[k] = key_to_count[k] + 1
- return key_to_count
-
- @staticmethod
- def raw_values_of(params):
- values = []
- for param in params:
- for v in param.values():
- values.append(v)
- return values
-
- def check_even_distribution(self, vs, bin_function):
- binned = map(lambda x: bin_function(x), vs)
- histogram = self.counts(binned)
- values = list(histogram.values())
- sd = np.std(values)
- mu = np.mean(values)
- for k, v in histogram.items():
- self.assertLess(abs(v - mu), 5 * sd, "{} values for bucket {} is unlikely "
- "when the mean is {} and standard deviation {}"
- .format(v, k, mu, sd))
-
- def test_distribution(self):
- params = self.to_test.addRandom(self.dummy_params.test_param, 0, 20000, 10000).build()
- values = self.raw_values_of(params)
- self.check_even_distribution(values, lambda x: x // 1000)
-
- def test_logarithmic_distribution(self):
- params = self.to_test.addLog10Random(self.dummy_params.test_param, 1, 1e10, 10000).build()
- values = self.raw_values_of(params)
- self.check_even_distribution(values, lambda x: math.log10(x))
-
- def test_param_cardinality(self):
- num_random_params = 7
- values = [1, 2, 3]
- self.to_test.addRandom(self.dummy_params.test_param, 1, 10, num_random_params)
- self.to_test.addGrid(self.dummy_params.another_test_param, values)
- self.assertEqual(len(self.to_test.build()), num_random_params * len(values))
-
- def test_add_random_integer_logarithmic_range(self):
- self.check_addLog10Random_ranges(100, 200, int)
-
- def test_add_logarithmic_random_float_and_integer_yields_floats(self):
- self.check_addLog10Random_ranges(100, 200., float)
-
- def test_add_random_float_logarithmic_range(self):
- self.check_addLog10Random_ranges(100., 200., float)
-
- def test_add_random_integer_range(self):
- self.check_addRandom_ranges(100, 200, int)
-
- def test_add_random_float_and_integer_yields_floats(self):
- self.check_addRandom_ranges(100, 200., float)
-
- def test_add_random_float_range(self):
- self.check_addRandom_ranges(100., 200., float)
-
- def test_unexpected_type(self):
- with self.assertRaises(TypeError):
- self.to_test.addRandom(self.dummy_params.test_param, 1, "wrong type", 1).build()
-
-
class ParamGridBuilderTests(SparkSessionTestCase):
def test_addGrid(self):
diff --git a/python/pyspark/ml/tuning.py b/python/pyspark/ml/tuning.py
index 2c8b9d8..2436abb 100644
--- a/python/pyspark/ml/tuning.py
+++ b/python/pyspark/ml/tuning.py
@@ -18,8 +18,6 @@
import os
import sys
import itertools
-import random
-import math
from multiprocessing.pool import ThreadPool
import numpy as np
@@ -37,7 +35,7 @@ from pyspark.sql.functions import col, lit, rand, UserDefinedFunction
from pyspark.sql.types import BooleanType
__all__ = ['ParamGridBuilder', 'CrossValidator', 'CrossValidatorModel', 'TrainValidationSplit',
- 'TrainValidationSplitModel', 'ParamRandomBuilder']
+ 'TrainValidationSplitModel']
def _parallelFitTasks(est, train, eva, validation, epm, collectSubModel):
@@ -154,50 +152,6 @@ class ParamGridBuilder(object):
return [dict(to_key_value_pairs(keys, prod)) for prod in itertools.product(*grid_values)]
-class ParamRandomBuilder(ParamGridBuilder):
- r"""
- Builder for random value parameters used in search-based model selection.
-
-
- .. versionadded:: 3.2.0
- """
-
- @since("3.2.0")
- def addRandom(self, param, x, y, n):
- """
- Adds n random values between x and y.
- The arguments x and y can be integers, floats or a combination of the two. If either
- x or y is a float, the domain of the random value will be float.
- """
- if type(x) == int and type(y) == int:
- values = map(lambda _: random.randrange(x, y), range(n))
- elif type(x) == float or type(y) == float:
- values = map(lambda _: random.uniform(x, y), range(n))
- else:
- raise TypeError("unable to make range for types %s and %s" % type(x) % type(y))
- self.addGrid(param, values)
- return self
-
- @since("3.2.0")
- def addLog10Random(self, param, x, y, n):
- """
- Adds n random values scaled logarithmically (base 10) between x and y.
- For instance, a distribution for x=1.0, y=10000.0 and n=5 might reasonably look like
- [1.6, 65.3, 221.9, 1024.3, 8997.5]
- """
- def logarithmic_random():
- rand = random.uniform(math.log10(x), math.log10(y))
- value = 10 ** rand
- if type(x) == int and type(y) == int:
- value = int(value)
- return value
-
- values = map(lambda _: logarithmic_random(), range(n))
- self.addGrid(param, values)
-
- return self
-
-
class _ValidatorParams(HasSeed):
"""
Common params for TrainValidationSplit and CrossValidator.
diff --git a/python/pyspark/ml/tuning.pyi b/python/pyspark/ml/tuning.pyi
index 028cebd..912abd4 100644
--- a/python/pyspark/ml/tuning.pyi
+++ b/python/pyspark/ml/tuning.pyi
@@ -35,11 +35,6 @@ class ParamGridBuilder:
def baseOn(self, *args: Tuple[Param, Any]) -> ParamGridBuilder: ...
def build(self) -> List[ParamMap]: ...
-class ParamRandomBuilder(ParamGridBuilder):
- def __init__(self) -> None: ...
- def addRandom(self, param: Param, x: Any, y: Any, n: int) -> ParamRandomBuilder: ...
- def addLog10Random(self, param: Param, x: Any, y: Any, n: int) -> ParamRandomBuilder: ...
-
class _ValidatorParams(HasSeed):
estimator: Param[Estimator]
estimatorParamMaps: Param[List[ParamMap]]
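
For readers following the revert, the sampling that the removed `addRandom` and `addLog10Random` docstrings describe can be sketched in plain Python without Spark. This is a minimal illustration, not the fix that later landed on master: the function names `random_values` and `log10_random_values` and the `rng` parameter are hypothetical. It deliberately differs from the removed code in three ways: it returns lists instead of one-shot `map` iterators, it uses `randint` so that integer ranges include the upper bound (`randrange` excludes it), and it passes the `%` arguments for the error message as a tuple.

```python
import math
import random


def random_values(x, y, n, rng=random):
    """Return a list of n values drawn uniformly between x and y (x <= y).

    Two ints give ints with both bounds inclusive; if either bound is a
    float, the values are floats. Anything else raises TypeError.
    """
    if isinstance(x, int) and isinstance(y, int):
        return [rng.randint(x, y) for _ in range(n)]
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return [rng.uniform(x, y) for _ in range(n)]
    raise TypeError(
        "unable to make range for types %s and %s" % (type(x), type(y)))


def log10_random_values(x, y, n, rng=random):
    """Return a list of n values between x and y, uniform in log10 space.

    Both bounds must be positive with x <= y. Two ints give ints
    (truncated); otherwise floats.
    """
    both_int = isinstance(x, int) and isinstance(y, int)
    low, high = math.log10(x), math.log10(y)
    values = []
    for _ in range(n):
        value = 10 ** rng.uniform(low, high)
        # Clamp to guard against floating-point rounding at the edges.
        value = min(max(value, x), y)
        values.append(int(value) if both_int else value)
    return values
```

Passing a seeded generator, for example `random_values(100, 200, 5, random.Random(42))`, makes the draw reproducible. Because the result is a list, it can be iterated more than once, which matters if a parameter grid is built repeatedly from the same values.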