Posted to user@spark.apache.org by Io...@nomura.com on 2016/11/25 14:23:17 UTC

RDD persist() not honoured

Hi,

I have run into a weird caching problem (Using Spark 1.3.1 + Java 1.8.0) that I can only explain as a bug.

In summary, I source the RDD from an Avro file, apply a mapToPair function, then cache and count. However, the RDD is not cached and does not appear in the Spark UI Storage tab. (It is not cached at all, not even partially.)
    JavaSparkContext ctx = …;
    JavaRDD a = ….;
    JavaPairRDD b = a.mapToPair(..).cache();
    b.count(); // RDD is not cached.

I looked around but could not find any known bugs around this.

I inspected the b RDD in a debugger and it is marked as cached:
(80) MapPartitionsRDD[31] at mapToPair at ABC.java:684 [Memory Deserialized 1x Replicated]
|   RDD1 MapPartitionsRDD[22] at map at XXXAvroDao.java:xx [Memory Deserialized 1x Replicated]
|   MapPartitionsRDD[21] at keys at XXXAvroDao.java:xx [Memory Deserialized 1x Replicated]
|   maprfs:/mapr/XXX NewHadoopRDD[20] at newAPIHadoopFile at XXXAvroDao.java:xx [Memory Deserialized 1x Replicated]

I also checked the b RDD storage level using a debugger and it seems correctly set as well.
StorageLevel(false, true, false, true, 1)
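
For reference, this is how I read those five values. It is a minimal plain-Java sketch, not Spark's own StorageLevel class. It assumes the field order is (useDisk, useMemory, useOffHeap, deserialized, replication). The class name StorageLevelFlags and the "OffHeap" label are my own, for illustration only.

```java
// Sketch only: a plain-Java decoder, not Spark's own StorageLevel class.
// Assumed field order: (useDisk, useMemory, useOffHeap, deserialized, replication).
final class StorageLevelFlags {
    private StorageLevelFlags() {}

    /** Builds a readable label from the five values shown in the debugger. */
    static String describe(boolean useDisk, boolean useMemory, boolean useOffHeap,
                           boolean deserialized, int replication) {
        StringBuilder sb = new StringBuilder();
        if (useDisk) sb.append("Disk ");
        if (useMemory) sb.append("Memory ");
        if (useOffHeap) sb.append("OffHeap "); // hypothetical label, chosen for this sketch
        sb.append(deserialized ? "Deserialized " : "Serialized ");
        sb.append(replication).append("x Replicated");
        return sb.toString();
    }

    public static void main(String[] args) {
        // The values reported for b: StorageLevel(false, true, false, true, 1)
        System.out.println(describe(false, true, false, true, 1));
    }
}
```

Under that assumption the values decode to memory-only, deserialized, one replica. That matches the "Memory Deserialized 1x Replicated" label in the debug output above and is the level I would expect cache() to set.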

Now things get more interesting, as the following does result in a cached RDD:
    a.cache().count();

Also the following works:
    ctx.parallelize(b.take(1000)).cache().count();

However, any attempt to “fool” b.cache() fails as well (the action completes, but the data is not cached at all). E.g.
    b.repartition(150).cache().count();
    b.values().cache().count();
    b.keys().cache().count();
    b.persist(StorageLevel.DISK_ONLY()).count();
    b.persist(StorageLevel.MEMORY_ONLY()).count();
    b.persist(StorageLevel.MEMORY_ONLY_SER()).count();
    b.unpersist().cache().count();


I have not managed to reproduce the issue without the exact data, so I cannot provide a reproducible example: the same code works fine with every other data type I have and with every example I tried.

Any ideas on where I should look?

Thanks.

