Posted to dev@spark.apache.org by Dirceu Semighini Filho <di...@gmail.com> on 2016/04/27 14:29:19 UTC

Duplicated fit into TrainValidationSplit

Hi guys, I was testing a pipeline here and found what looks like a duplicated
call to the fit method in the
org.apache.spark.ml.tuning.TrainValidationSplit
<https://github.com/apache/spark/blob/18c2c92580bdc27aa5129d9e7abda418a3633ea6/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala>
class.
On line 110 there is a call to est.fit, which fits a model for every
parameter combination that we have set up.
Further down, on line 128, after discovering which is the best model, we call
fit again using bestIndex. Wouldn't it be better to just use the result of
the earlier fit call, which is already stored in the models val?

Kind regards,
Dirceu

Re: Duplicated fit into TrainValidationSplit

Posted by Dirceu Semighini Filho <di...@gmail.com>.

Ok, thank you.

2016-04-27 11:37 GMT-03:00 Nick Pentreath <ni...@gmail.com>:

> You should find that the first set of fits is run on the training set,
> and the resulting models are evaluated on the validation set. The final best
> model is then retrained on the entire dataset. This is standard practice:
> usually the dataset passed to the train validation split is itself the
> training portion of an earlier train/test split, and the final best model is
> evaluated against the held-out test set.

Re: Duplicated fit into TrainValidationSplit

Posted by Nick Pentreath <ni...@gmail.com>.

You should find that the first set of fits is run on the training set,
and the resulting models are evaluated on the validation set. The final best
model is then retrained on the entire dataset. This is standard practice:
usually the dataset passed to the train validation split is itself the
training portion of an earlier train/test split, and the final best model is
evaluated against the held-out test set.
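
To make the two rounds of fitting concrete, here is a minimal, dependency-free
Python sketch of the control flow described above. It is not the Spark
implementation: the estimator (`fit`), the metric (`mse`), the function name
`train_validation_split`, and the synthetic data are all hypothetical
stand-ins chosen for illustration.

```python
import random


def fit(rows, reg):
    """Hypothetical estimator: 1-D ridge regression through the origin."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + reg)  # the "model" is just a single weight


def mse(weight, rows):
    """Hypothetical evaluator: mean squared error, lower is better."""
    return sum((weight * x - y) ** 2 for x, y in rows) / len(rows)


def train_validation_split(rows, param_grid, train_ratio=0.75, seed=0):
    # Shuffle a copy, then cut it into a training and a validation subset.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    training, validation = shuffled[:cut], shuffled[cut:]

    # First round of fits: one model per parameter value, training subset only.
    models = [fit(training, reg) for reg in param_grid]
    metrics = [mse(weight, validation) for weight in models]
    best_index = min(range(len(metrics)), key=metrics.__getitem__)

    # Second fit: the winning parameter value, refit on ALL rows passed in
    # (training + validation), so it is not the same model as models[best_index].
    best_model = fit(rows, param_grid[best_index])
    return best_model, models[best_index], param_grid[best_index]


# Synthetic data: y = 2x plus a little noise (illustrative only).
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(200)]
data = [(x, 2.0 * x + rng.gauss(0, 0.1)) for x in xs]

final_model, subset_model, best_reg = train_validation_split(
    data, [0.0, 1.0, 100.0]
)
print("best reg:", best_reg)
print("model from first round (training subset):", subset_model)
print("model after refit (all rows):", final_model)
```

The two printed models generally differ, because the refit model has seen the
validation rows as well. In this sketch that is why the second fit is not a
duplicate of the first: reusing `models[best_index]` would return a model
trained on less data.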