Posted to issues@spark.apache.org by "Julien Diener (JIRA)" <ji...@apache.org> on 2016/06/20 15:40:05 UTC
[jira] [Updated] (SPARK-16069) rdd.map(identity).cache very slow

     [ https://issues.apache.org/jira/browse/SPARK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Diener updated SPARK-16069:
----------------------------------
    Description: 
I found that calling .map( identity ).cache on an rdd becomes very slow when the items are big, while it is nearly instantaneous otherwise.
I would really appreciate knowing why. (This is potentially critical for an application I am currently developing if I cannot find a workaround.)

I posted the question on SO but did not get an answer:
http://stackoverflow.com/q/37859386/1206998

Basically, starting from an in-memory cached rdd containing big items, `map(identity).cache` is very slow. E.g.:

    profile( rdd.count )                 // around 12 ms
    profile( rdd.map(identity).count )   // same
    profile( rdd.cache.count )           // same
    profile( rdd.map(identity).cache.count ) // 5700 ms !!!

By contrast, if the rdd items are small, this is very fast, so the creation of the rdd is not the cause.
I don't understand why this would take time. In my understanding, an in-memory cache should "simply" keep a reference to the data: no copy, no serialization.
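
The `profile` helper used in the timings above is not defined in this report. A minimal sketch of what such a helper could look like is given here, assuming it simply measures wall-clock time around a by-name block; the object name `Profile` and the stand-in workload are hypothetical and are not taken from the original code:

```scala
object Profile {
  /** Hypothetical timing helper: evaluates `block` once, prints the
    * elapsed wall-clock time in milliseconds, and returns the result. */
  def profile[R](block: => R): R = {
    val start = System.nanoTime()
    val result = block // the by-name argument is evaluated here, exactly once
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$elapsedMs%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in the report the argument would be e.g. `rdd.count`.
    val total = profile { (1 to 1000000).map(_.toLong).sum }
    println(total)
  }
}
```

Because a single measurement of this kind includes JVM warm-up and, for an rdd, any lazy computation triggered by the action, it is only a rough indication; repeating each call a few times gives a more reliable comparison.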

  was:
I found that calling .map( identity ).cache on an rdd becomes very slow when the items are big, while it is nearly instantaneous otherwise.

I posted the question on SO but did not get an answer:
http://stackoverflow.com/q/37859386/1206998

Basically, starting from an in-memory cached rdd with big items, `map(identity).cache` is very slow:

    profile( rdd.count )                 // around 12 ms
    profile( rdd.map(identity).count )   // same
    profile( rdd.cache.count )           // same
    profile( rdd.map(identity).cache.count ) // 5700 ms !!!

By contrast, if the rdd items are small, this is very fast. I don't understand why this would take time. In my understanding, an in-memory cache should "simply" keep a reference to the data: no copy, no serialization.


> rdd.map(identity).cache very slow
> ---------------------------------
>
>                 Key: SPARK-16069
>                 URL: https://issues.apache.org/jira/browse/SPARK-16069
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: ubuntu
>            Reporter: Julien Diener
>              Labels: performance
>             Fix For: 1.6.0


