Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2019/03/02 20:05:00 UTC
[jira] [Commented] (SPARK-26166) CrossValidator.fit() bug, training and validation dataset may overlap
[ https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782502#comment-16782502 ]
Sean Owen commented on SPARK-26166:
-----------------------------------
I agree there's a problem there. I don't think checkpoint() is appropriate, as it would try to save the whole RDD to a local file. Although you would generally get the same order of evaluation for the same dataset, that isn't guaranteed. cache()-ing the whole thing takes memory and, as you say, even that isn't guaranteed to work.
The Scala implementation handles this differently, and more correctly, in MLUtils.kFold. I think the solution is to call that from PySpark to get the training/validation splits. Are you open to trying a fix?
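To illustrate the idea behind a deterministic split (a minimal pure-Python sketch, not the actual MLUtils.kFold implementation; fold_of is a hypothetical helper): if the fold assignment is derived from a row's content and the seed, rather than from the row's position in a recomputed DataFrame, every recomputation yields the same folds.

```python
import hashlib

def fold_of(row_key, seed, k):
    """Hypothetical helper: assign a row to one of k folds deterministically
    from its content and the seed. Unlike a positional rand(seed), the result
    does not change if the rows are recomputed in a different order."""
    h = hashlib.sha256(f"{seed}:{row_key}".encode()).hexdigest()
    return int(h, 16) % k

rows = ["a", "b", "c", "d", "e"]
folds = {r: fold_of(r, seed=42, k=3) for r in rows}

# Same assignment even if the rows arrive in a different order:
shuffled = {r: fold_of(r, seed=42, k=3) for r in reversed(rows)}
assert folds == shuffled
```

The same property is what the Scala path gets by doing the fold sampling once, with a fixed seed, inside MLUtils.kFold rather than re-deriving it from a recomputable random column.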
> CrossValidator.fit() bug, training and validation dataset may overlap
> ---------------------------------------------------------------------
>
> Key: SPARK-26166
> URL: https://issues.apache.org/jira/browse/SPARK-26166
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Xinyong Tian
> Priority: Major
>
> In the code pyspark.ml.tuning.CrossValidator.fit(), after adding random column
> df = dataset.select("*", rand(seed).alias(randCol))
> Should add
> df.checkpoint()
> If df is not checkpointed, it will be recomputed each time the training and validation DataFrames need to be created. The order of rows in df, which rand(seed) depends on, is not deterministic. Thus the random column value could differ for a given row between recomputations, even with a fixed seed. Note that checkpoint() cannot be replaced with cache(), because when a node fails, a cached table must be recomputed, so the random numbers could differ.
> This might especially be a problem when the input 'dataset' DataFrame results from a query including a 'where' clause. See below.
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit]
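The failure mode the report describes can be reproduced without Spark (a minimal sketch; positional_rand is a hypothetical stand-in for a random column keyed to row position, as rand(seed) effectively is when the DataFrame is recomputed):

```python
import random

def positional_rand(rows, seed):
    """Hypothetical stand-in for rand(seed): the value a row receives
    depends on the row's *position* in the sequence, not its content."""
    rng = random.Random(seed)
    return {row: rng.random() for row in rows}

rows = ["a", "b", "c", "d"]
first = positional_rand(rows, seed=7)
second = positional_rand(list(reversed(rows)), seed=7)

# The same row gets a different value when the underlying order changes,
# so a row below the fold threshold in one pass (training split) can be
# above it in the next pass (validation split): the two sets overlap.
assert first != second
```

This is why the fix needs the fold assignment to be computed once and reused (or derived deterministically from row content), rather than recomputed from an order-dependent random column.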
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org