Posted to user@spark.apache.org by Aakash Basu <aa...@gmail.com> on 2018/04/16 14:52:38 UTC
PySpark ML: Get best set of parameters from TrainValidationSplit
Hi,
I am running a Random Forest model on a dataset, tuning hyperparameters
with Spark's ParamGridBuilder and TrainValidationSplit.
Can anyone tell me how to get the best values for all four parameters?
I used:
model.bestModel()
model.metrics()
But neither of them seems to work.
Below is the code chunk:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# rf, pipeline, trainingData and testData are defined earlier (not shown).
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [50, 100, 150, 200]) \
    .addGrid(rf.maxDepth, [5, 10, 15, 20]) \
    .addGrid(rf.minInfoGain, [0.001, 0.01, 0.1, 0.6]) \
    .addGrid(rf.minInstancesPerNode, [5, 15, 30, 50, 100]) \
    .build()

tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)

model = tvs.fit(trainingData)
predictions = model.transform(testData)

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))
Any help?
Thanks,
Aakash.
Re: PySpark ML: Get best set of parameters from TrainValidationSplit
Posted by Bryan Cutler <cu...@gmail.com>.
Hi Aakash,
First you will want to get the random forest model stage from the best
pipeline model result. For example, if RF is the first stage:
rfModel = model.bestModel.stages[0]
Then you can check the values of the params you tuned like this:
rfModel.getNumTrees
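To see all four tuned values at once, one option is to pair each entry of
paramGrid with its validation score and pick the best one. The sketch below
assumes a PySpark version in which the fitted TrainValidationSplitModel
exposes validationMetrics in the same order as the param maps.
best_param_map is a hypothetical helper name, and the demo values are made
up purely to show the call.

```python
def best_param_map(param_maps, metrics, larger_is_better=True):
    """Return (params, metric) for the best-scoring param map.

    param_maps: list of dicts, e.g. the paramGrid built above
    metrics:    list of floats in the same order
                (assumed to come from model.validationMetrics)
    """
    if len(param_maps) != len(metrics):
        raise ValueError("param_maps and metrics must have the same length")
    if not metrics:
        raise ValueError("no metrics to compare")

    # Accuracy-like metrics are maximized; error-like metrics are minimized.
    pick = max if larger_is_better else min
    idx, score = pick(enumerate(metrics), key=lambda pair: pair[1])

    # Spark Param objects carry a .name attribute; plain keys fall back to str().
    named = {getattr(p, "name", str(p)): v for p, v in param_maps[idx].items()}
    return named, score


# Stand-in data (hypothetical values, not results from the thread).
demo_maps = [
    {"numTrees": 50, "maxDepth": 5},
    {"numTrees": 100, "maxDepth": 10},
]
demo_metrics = [0.81, 0.86]

best, score = best_param_map(demo_maps, demo_metrics)
print(best, score)

# With the objects from the original post, the call would look like:
#   best, score = best_param_map(paramGrid, model.validationMetrics,
#                                evaluator.isLargerBetter())
```

Passing the evaluator's isLargerBetter() result keeps the helper correct for
metrics where a smaller value is better.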