
Posted to issues@spark.apache.org by "Julien Diener (JIRA)" <ji...@apache.org> on 2016/06/21 07:22:57 UTC

[jira] [Comment Edited] (SPARK-16069) rdd.map(identity).cache very slow

    [ https://issues.apache.org/jira/browse/SPARK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341284#comment-15341284 ] 

Julien Diener edited comment on SPARK-16069 at 6/21/16 7:22 AM:
----------------------------------------------------------------

Why would data be sent to executors? I understood that caching means keeping intermediate results in memory for later use. There should be no need to move data around (?)


was (Author: juh):
Why would data be sent to executors? I understood that caching means keeping intermediate results in memory for later use

> rdd.map(identity).cache very slow
> ---------------------------------
>
>                 Key: SPARK-16069
>                 URL: https://issues.apache.org/jira/browse/SPARK-16069
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: ubuntu
>            Reporter: Julien Diener
>              Labels: performance
>
> I found that calling .map(identity).cache on an RDD becomes very slow when the items are big, while it is pretty much instantaneous otherwise.
> I would really appreciate knowing why. (It is potentially critical for an application I am currently developing if I don't find a workaround.)
> I posted the question on SO but did not get an answer:
> http://stackoverflow.com/q/37859386/1206998
> Basically, starting from an in-memory cached RDD containing big items, `map(identity).cache` is very slow. E.g.:
>     profile( rdd.count )                     // around 12 ms
>     profile( rdd.map(identity).count )       // same
>     profile( rdd.cache.count )               // same
>     profile( rdd.map(identity).cache.count ) // 5700 ms !!!
> If the RDD content is small, however, this is very fast, so the creation of the RDD is not the cause.
> I don't understand why this would take time. As I understand it, an in-memory cache should "simply" keep a reference to the data, with no copy and no serialization.


