Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2017/09/11 21:12:00 UTC

[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

    [ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162004#comment-16162004 ] 

Joseph K. Bradley commented on SPARK-18608:
-------------------------------------------

Hi all, it looks like there has been confusion about what has been agreed on.  This is my current understanding:

There are 2 issues:
1. This JIRA [SPARK-18608], which covers the bug of double-caching caused by misusing {{dataset.rdd.getStorageLevel}} (see the sketch after this list).  Note that [SPARK-21799] is just a special case of this bug.
2. [SPARK-21972], which discusses adding a parameter handlePersistence to allow user control over whether to cache the input data.
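To make the bug and the fix concrete, here is a minimal sketch of the check pattern.  The method name and the chosen storage level are just for illustration; the correct check via {{dataset.storageLevel}} is exactly what SPARK-16063 enabled:

{code}
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.storage.StorageLevel

def maybePersist(dataset: Dataset[Row]): Boolean = {
  // Buggy check: dataset.rdd builds a new, uncached RDD from the query plan,
  // so this condition is always true and a cached input gets cached twice:
  //   dataset.rdd.getStorageLevel == StorageLevel.NONE

  // Correct check (possible since SPARK-16063): ask the Dataset itself.
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  if (handlePersistence) {
    dataset.persist(StorageLevel.MEMORY_AND_DISK)
  }
  handlePersistence  // caller should unpersist later iff this is true
}
{code}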

I recommend:
1. We should fix the current double-caching bug in master and branch-2.2.  Going from Spark 2.1 to 2.2, I've only seen a performance regression with K-Means, but I recommend we fix the bug for all algorithms.  The fix would follow [~podongfeng]'s original version of https://github.com/apache/spark/pull/17014 (before handlePersistence was added to it).
2. We can work on adding handlePersistence to master; no backporting there, of course.  A rough sketch of what such a Param might look like follows below.  Note that [SPARK-19422] is also related, and it may be blocked by decisions on [SPARK-21972].
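For concreteness, a handlePersistence Param would presumably follow the usual shared-Param pattern in spark.ml.  The sketch below is purely illustrative: the parameter name comes from [SPARK-21972], but the trait name, doc string, and default are my assumptions, not an agreed API:

{code}
import org.apache.spark.ml.param.{BooleanParam, Params}

// Hypothetical shared Param following the SPARK-21972 idea.  Everything
// except the name "handlePersistence" is an illustrative guess.
trait HasHandlePersistence extends Params {
  final val handlePersistence: BooleanParam = new BooleanParam(this,
    "handlePersistence",
    "whether the algorithm should cache the input dataset internally")

  final def getHandlePersistence: Boolean = $(handlePersistence)

  setDefault(handlePersistence -> true)
}
{code}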

> Spark ML algorithms that check RDD cache level for internal caching double-cache data
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-18608
>                 URL: https://issues.apache.org/jira/browse/SPARK-18608
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence internally. They check whether the input dataset is cached, and if not they cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == StorageLevel.NONE}}. This check is always true: even if the dataset itself is cached, {{dataset.rdd}} lazily builds a new RDD from the Dataset's query plan, and that new RDD carries no storage level.
> Hence if the input dataset is cached, the data will end up being cached twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input {{Dataset}}, but now we can, so these checks should be migrated to use {{dataset.storageLevel}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org