Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2019/10/26 23:48:00 UTC

[jira] [Resolved] (SPARK-26166) CrossValidator.fit() bug,training and validation dataset may overlap

     [ https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-26166.
----------------------------------
    Resolution: Not A Problem

I would not consider that a bug, but rather a subtle side effect of using random values (which depend on data order and partitioning) in Spark in general. Indeed, the result can't be guaranteed unless you materialize the DataFrame, as you point out.
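
The effect can be illustrated outside Spark with a small plain-Python simulation. This is only a sketch under stated assumptions: rand_column() is a hypothetical, simplified stand-in for rand(seed) that hands the i-th draw of a seeded generator to the i-th row in the current order (Spark's actual generator also depends on partitioning), and the row ids, the reversed order, and the two-fold split are made up for illustration. It compares splitting on a column that is regenerated for each filter against splitting on a column that is computed once.

```python
import random


def rand_column(rows, seed):
    """Simplified stand-in for rand(seed) (assumption, not Spark's implementation):
    the i-th row in the current order receives the i-th draw of a seeded generator."""
    rng = random.Random(seed)
    return {row: rng.random() for row in rows}


def in_validation(value, lo=0.0, hi=0.5):
    """Fold membership test on the random column (first of two folds)."""
    return lo <= value < hi


rows = list(range(10))             # hypothetical row ids
reordered = list(reversed(rows))   # row order seen on a hypothetical recomputation
seed = 42

# Case 1: not materialized. The training filter and the validation filter each
# re-evaluate the plan, and the second evaluation sees the rows in another order.
rand_for_train = rand_column(rows, seed)
rand_for_validation = rand_column(reordered, seed)
train = {r for r in rows if not in_validation(rand_for_train[r])}
validation = {r for r in rows if in_validation(rand_for_validation[r])}
overlap_recomputed = train & validation

# Case 2: materialized. The random column is computed once and reused by both filters.
materialized = rand_column(rows, seed)
train_m = {r for r in rows if not in_validation(materialized[r])}
validation_m = {r for r in rows if in_validation(materialized[r])}
overlap_materialized = train_m & validation_m

print("overlap when recomputed:  ", sorted(overlap_recomputed))
print("overlap when materialized:", sorted(overlap_materialized))
```

In the materialized case the two sets are complementary by construction, so they can never overlap. In the recomputed case nothing ties the two evaluations together, so a row may land in both sets or in neither.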

> CrossValidator.fit() bug,training and validation dataset may overlap
> --------------------------------------------------------------------
>
>                 Key: SPARK-26166
>                 URL: https://issues.apache.org/jira/browse/SPARK-26166
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Xinyong Tian
>            Priority: Major
>
> In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with
> df = dataset.select("*", rand(seed).alias(randCol))
> the following call should be added:
> df.checkpoint()
> If df is not checkpointed, it is recomputed each time a training or validation DataFrame needs to be created. The order of rows in df, which rand(seed) depends on, is not deterministic, so the random column value for a given row can differ between computations even with a fixed seed. Note that checkpoint() cannot be replaced with cache(), because when a node fails the cached table needs to be recomputed, and the random numbers could then be different.
> This might especially be a problem when the input 'dataset' DataFrame is the result of a query that includes a 'where' clause. See below.
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
