Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/06/20 15:52:05 UTC

[jira] [Resolved] (SPARK-16069) rdd.map(identity).cache very slow

     [ https://issues.apache.org/jira/browse/SPARK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-16069.
-------------------------------
          Resolution: Not A Problem
       Fix Version/s:     (was: 1.6.0)
    Target Version/s:   (was: 1.6.0)

Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to first.

Questions should go to user@. I don't think this is a mystery though. You didn't show the bit here where you call rdd.cache and rdd.count before this starts. That call to .count takes time because rdd is computed and serialized into memory.
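
For reference, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), so the first action on a freshly cached RDD has to compute every partition and store its elements; a serialized level such as MEMORY_ONLY_SER pays a serialization cost on top of that. A rough sketch of the difference, assuming a live SparkContext `sc` as in spark-shell (the toy RDD is only for illustration):

    import org.apache.spark.storage.StorageLevel

    // Toy RDD just for illustration.
    val nums = sc.parallelize(1 to 1000, 4)

    val asObjects = nums.cache()                                              // same as nums.persist(StorageLevel.MEMORY_ONLY)
    val asBytes   = nums.map(identity).persist(StorageLevel.MEMORY_ONLY_SER)  // stored as serialized bytes instead of objects

    asObjects.count()   // first action: partitions are computed and stored in memory
    asBytes.count()     // first action: computed, serialized, then stored

    println(asObjects.getStorageLevel)   // inspect which level is actually in effect
    println(asBytes.getStorageLevel)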

Calling .count on the cached rdd is fast of course, as is counting a simple identity mapping of the in-memory rdd. The third line is the same as the first; rdd is already cached. In the fourth instance you cache a new RDD and wait for it to serialize. It's not the reading that costs time but the overhead of sending data around to executors and serializing it.
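
To make the four timings concrete, here is a minimal sketch for spark-shell; the `profile` helper and the toy RDD are stand-ins for the code in the report, so the absolute numbers will differ, but the pattern should match:

    // Simple timing helper (stand-in for the profile function in the report).
    def profile[T](body: => T): T = {
      val t0 = System.nanoTime()
      val result = body
      println(s"took ${(System.nanoTime() - t0) / 1e6} ms")
      result
    }

    // An RDD of "big" items: 200 one-MiB byte arrays (adjust sizes to taste).
    val rdd = sc.parallelize(1 to 200, 4).map(i => Array.fill(1 << 20)(i.toByte))
    rdd.cache().count()                         // materialize the cache up front, as in the report

    profile(rdd.count())                        // fast: data is already in memory
    profile(rdd.map(identity).count())          // fast: a cheap map over the cached data
    profile(rdd.cache().count())                // fast: rdd is already cached, so cache() is a no-op here
    profile(rdd.map(identity).cache().count())  // slow: a *new* RDD whose cache must be populated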

> rdd.map(identity).cache very slow
> ---------------------------------
>
>                 Key: SPARK-16069
>                 URL: https://issues.apache.org/jira/browse/SPARK-16069
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: ubuntu
>            Reporter: Julien Diener
>              Labels: performance
>
> I found out that calling .map( identity ).cache on an rdd becomes very slow if the items are big, while it is pretty much instantaneous otherwise.
> I would really appreciate knowing why (it is potentially critical for an application I am currently developing if I don't find a workaround).
> I posted the question on SO but did not get an answer:
> http://stackoverflow.com/q/37859386/1206998
> Basically, from an in-memory cached rdd containing big items, `map(identity).cache` is very slow. E.g.:
>     profile( rdd.count )                 // around 12 ms
>     profile( rdd.map(identity).count )   // same
>     profile( rdd.cache.count )           // same
>     profile( rdd.map(identity).cache.count ) // 5700 ms !!!
> If the rdd content is small, however, this is very fast, so the creation of the rdd itself is not the cause.
> I don't understand why this would take time. In my understanding, an in-memory cache should "simply" keep a reference to the data: no copy, no serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org