Posted to user@spark.apache.org by Wei Tan <wt...@us.ibm.com> on 2014/06/18 05:36:28 UTC

rdd.cache() is not faster?

Hi, I have a 40 GB file which is a concatenation of multiple documents. I 
want to extract two features (title and tables) from each doc, so the 
program looks like this:

-------------------------------------------------------------
val file = sc.textFile("/path/to/40G/file")
//file.cache()   // toggle to enable or disable caching

// reduce_1: extract the title per document; getTitle() is a text utility
// function written in Java
val titles = file.map(line => (doc_key, getTitle()))
                 .reduceByKey(_ + _, 1)

// reduce_2: extract table titles per document; getTableTitle() is also a
// Java text utility function
val tables = file.flatMap(line => {
  for (table <- all_tables)
    yield (doc_key, getTableTitle())
}).reduceByKey(_ + _, 1)

titles.saveAsTextFile("titles.out")   // save_1, triggers reduce_1
tables.saveAsTextFile("tables.out")   // save_2, triggers reduce_2
-------------------------------------------------------------

I expect that with file.cache(), (the later) reduce_2 should be faster 
since it will read from cached data. However, the results repeatedly show 
that reduce_2 takes 3 min with cache and 1.4 min without cache. Why does 
reading from the cache not help in this case?

The stage page in the web UI shows that, with cache, reduce_2 always has a 
wave of "outlier tasks", where the median task duration is 2 s but the max 
is 1.7 min.

Metric                      Min     25th percentile  Median  75th percentile  Max
Result serialization time   0 ms    0 ms             0 ms    0 ms             1 ms
Duration                    0.6 s   2 s              2 s     2 s              1.7 min

But these outlier tasks do not have a long GC pause (only 26 ms, as shown below):

Index  ID    Status   Locality Level  Executor   Launch Time          Duration  GC Time  Shuffle Write
173    1210  SUCCESS  PROCESS_LOCAL   localhost  2014/06/17 17:49:43  1.7 min   26 ms    9.4 KB


BTW: this is a single machine with 32 cores, 192 GB RAM, and an SSD, with 
these lines in spark-env.sh:

SPARK_WORKER_MEMORY=180g
SPARK_MEM=180g
SPARK_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=500 -XX:MaxPermSize=256m"


Thanks,

Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan

Re: rdd.cache() is not faster?

Posted by Gaurav Jain <ja...@student.ethz.ch>.
" if I do have big data (40GB, cached size is 60GB) and even big memory (192
GB), I cannot benefit from RDD cache, and should persist on disk and
leverage filesystem cache?"

Whether to persist (spill over) data to disk is not always immediately clear,
because the functions to compute RDD partitions are generally not as expensive
as retrieving a saved partition from disk. That's why the default storage level
never stores RDD partitions on disk and instead recomputes them on the fly.
Also, you can try using Kryo serialization (if you are not using it already) to
reduce memory usage. Playing around with different storage levels
(MEMORY_ONLY_SER, for example) might also help.
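
For example, a minimal sketch of both suggestions (the app name and file path
are placeholders; otherwise this is the standard Spark 1.x API):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Enable Kryo serialization and cache the RDD in serialized form.
val conf = new SparkConf()
  .setAppName("feature-extraction")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val file = sc.textFile("/path/to/40G/file")
// MEMORY_ONLY_SER keeps each cached partition as a serialized byte array:
// smaller and more GC-friendly than the default deserialized MEMORY_ONLY,
// at the cost of deserializing on every read.
file.persist(StorageLevel.MEMORY_ONLY_SER)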

Best
Gaurav Jain
Master's Student, D-INFK
ETH Zurich
Email: jaing at student dot ethz dot ch




Re: rdd.cache() is not faster?

Posted by Wei Tan <wt...@us.ibm.com>.
Hi Gaurav, thanks for your pointer. The observation in the link is (at 
least qualitatively) similar to mine.

Now the question is: if I do have big data (40 GB raw, 60 GB when cached) and 
even bigger memory (192 GB), can I still not benefit from the RDD cache, and 
should I instead persist on disk and leverage the filesystem cache?
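
(For concreteness, the disk variant would be something like the sketch below,
reusing the `file` RDD from my first message; DISK_ONLY is just illustrative,
not a recommendation from this thread:)

import org.apache.spark.storage.StorageLevel

// DISK_ONLY keeps materialized partitions off the JVM heap entirely;
// repeated reads would then mostly be served by the OS filesystem cache.
file.persist(StorageLevel.DISK_ONLY)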

I will try more workers so that each JVM has a smaller heap.
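
(Roughly like this in spark-env.sh for standalone mode; the instance count and
per-worker sizes below are illustrative, not tested values:)

SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=8
SPARK_WORKER_MEMORY=45g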

Best regards,
Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan



From:   Gaurav Jain <ja...@student.ethz.ch>
To:     user@spark.incubator.apache.org, 
Date:   06/18/2014 06:30 AM
Subject:        Re: rdd.cache() is not faster?



You cannot assume that caching will always reduce the execution time,
especially if the data set is large. It appears that if too much memory is
used for caching, then less memory is left for the actual computation
itself. There has to be a balance between the two.

Page 33 of this thesis from KTH talks about this:
http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf

Best



-----
Gaurav Jain
Master's Student, D-INFK
ETH Zurich



Re: rdd.cache() is not faster?

Posted by Gaurav Jain <ja...@student.ethz.ch>.
You cannot assume that caching will always reduce the execution time,
especially if the data set is large. It appears that if too much memory is
used for caching, then less memory is left for the actual computation
itself. There has to be a balance between the two.

Page 33 of this thesis from KTH talks about this:
http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf
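
One related knob (a sketch based on the Spark 1.x defaults, not something
benchmarked in this thread) is spark.storage.memoryFraction, which caps how
much of the executor heap the cache may use (0.6 by default):

import org.apache.spark.{SparkConf, SparkContext}

// Reserve less of the heap for cached RDD partitions (default 0.6),
// leaving more memory for the computation itself.
val conf = new SparkConf()
  .setAppName("feature-extraction")
  .set("spark.storage.memoryFraction", "0.3")
val sc = new SparkContext(conf)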

Best



-----
Gaurav Jain
Master's Student, D-INFK
ETH Zurich