Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2019/03/02 20:05:00 UTC

[jira] [Commented] (SPARK-26166) CrossValidator.fit() bug, training and validation datasets may overlap

    [ https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782502#comment-16782502 ] 

Sean Owen commented on SPARK-26166:
-----------------------------------

I agree there's a problem there. I don't think checkpoint() is appropriate, as it would save the whole dataset out to the checkpoint directory. Although you would generally get the same order of evaluation for the same dataset, it's not guaranteed. cache()-ing the whole thing takes memory, and as you say even that isn't guaranteed to work.
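
For context, the current Python implementation derives both sides of every fold by filtering the same lazily evaluated rand column; roughly this (simplified from pyspark/ml/tuning.py, with model fitting and evaluation omitted):

    h = 1.0 / nFolds
    df = dataset.select("*", rand(seed).alias(randCol))
    for i in range(nFolds):
        # two separate filters over the same lazy plan
        condition = (df[randCol] >= i * h) & (df[randCol] < (i + 1) * h)
        validation = df.filter(condition).cache()
        train = df.filter(~condition).cache()

The two filters run as separate jobs, so if re-evaluating df assigns a row a different rand value between them, that row can land in both splits, or in neither.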

The Scala implementation does this differently, and more correctly, in MLUtils.kFold. I think the solution is to call that from PySpark to get the training/validation splits. Are you open to trying a fix?
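
For anyone picking this up, the gist of MLUtils.kFold is per-partition cell sampling, where the training side is the exact complement of the validation side. Here is a rough Python sketch of the idea (hypothetical helper names, not the actual Scala code; it operates on an RDD rather than a DataFrame):

    import random

    def k_fold(rdd, num_folds, seed):
        # For fold i, validation takes rows whose uniform draw falls in
        # [i/k, (i+1)/k); training takes the complement of the same draw,
        # so a single draw sequence can never put a row on both sides.
        def sample(lo, hi, complement):
            def f(part_index, rows):
                rng = random.Random(seed + part_index)  # one seeded RNG per partition
                for row in rows:
                    if (lo <= rng.random() < hi) != complement:
                        yield row
            return f
        splits = []
        for i in range(num_folds):
            lo, hi = i / float(num_folds), (i + 1) / float(num_folds)
            validation = rdd.mapPartitionsWithIndex(sample(lo, hi, False))
            train = rdd.mapPartitionsWithIndex(sample(lo, hi, True))
            splits.append((train, validation))
        return splits

Like the Scala sampler, this still assumes each partition reproduces its rows in the same order when the train and validation jobs recompute it; persisting the input first removes even that assumption.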

> CrossValidator.fit() bug, training and validation datasets may overlap
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26166
>                 URL: https://issues.apache.org/jira/browse/SPARK-26166
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Xinyong Tian
>            Priority: Major
>
> In the code of pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with
> df = dataset.select("*", rand(seed).alias(randCol))
> the code should add
> df = df.checkpoint()
> (checkpoint() returns the checkpointed DataFrame; it does not modify df in place.) If df is not checkpointed, it is recomputed each time the training and validation DataFrames are created. The order of rows in df, on which rand(seed) depends, is not deterministic, so the random column value for a given row can differ between recomputations even with the same seed. Note that checkpoint() cannot be replaced with cache(): when a node fails, the cached table must be recomputed, and the recomputed random numbers can differ.
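> For example (a sketch, assuming a SparkSession named spark; checkpoint() requires a checkpoint directory to be configured first):
>
> spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # any reliable path
> df = dataset.select("*", rand(seed).alias(randCol))
> df = df.checkpoint()  # rand values are now materialized and fixed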
> This can especially be a problem when the input 'dataset' DataFrame is produced by a query with a 'where' clause, which can return rows in a non-deterministic order; see:
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org