Posted to user@spark.apache.org by Nathan Kronenfeld <nk...@oculusinfo.com> on 2014/10/17 08:46:21 UTC

rdd caching and use thereof

I'm trying to understand two things about how Spark is working.

(1) When I try to cache an RDD that fits well within memory (about 60g of
data with about 600g of memory), I get seemingly random levels of caching,
from around 60% to 100%, given the same tuning parameters.  What governs
how much of an RDD gets cached when there is enough memory?
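
For concreteness, here is a minimal sketch of the kind of caching I mean,
from the spark-shell (the path, parsing, and names are just placeholders,
not my actual job):

    import org.apache.spark.storage.StorageLevel

    // Placeholder load + parse; the real data set is ~60g
    val data = sc.textFile("hdfs:///path/to/data").map(_.split("\t"))
    data.persist(StorageLevel.MEMORY_ONLY)
    data.count()  // force the cache to be populated

    // This is where I see anywhere from ~60% to 100% cached
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions} of ${info.numPartitions}" +
        s" partitions cached, ${info.memSize} bytes in memory")
    }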

(2) Even when the data is cached, when I run some tasks over it I get
varying locality levels.  Sometimes it works perfectly, with everything
PROCESS_LOCAL, and sometimes 10-20% of the tasks run at locality ANY (and
the job takes minutes instead of seconds); this often varies even between
two consecutive runs of the same task in the same shell.  Is there
anything I can do to affect this?  I tried caching with replication, but
that caused everything to run out of memory nearly instantly (with the
same 60g data set in 4-600g of memory).
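
(By "caching with replication" I mean switching to one of the replicated
storage levels, roughly as below - same shell session and placeholder path
as in the sketch above:)

    // Same load as above, but each cached partition replicated on two
    // nodes; this is the variant that ran out of memory almost instantly
    val replicated = sc.textFile("hdfs:///path/to/data").map(_.split("\t"))
    replicated.persist(StorageLevel.MEMORY_ONLY_2)
    replicated.count()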

Thanks for the help,

                -Nathan


-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenfeld@oculusinfo.com

Re: rdd caching and use thereof

Posted by Nathan Kronenfeld <nk...@oculusinfo.com>.
Oh, I forgot - I've set the following parameters at the moment (besides the
standard location, memory, and core setup):

spark.logConf                  true
spark.shuffle.consolidateFiles true
spark.ui.port                  4042
spark.io.compression.codec     org.apache.spark.io.SnappyCompressionCodec
spark.shuffle.file.buffer.kb   500
spark.speculation              true
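
In case it is easier to read, the same settings expressed through SparkConf
would look roughly like this (just a sketch of the equivalent driver code,
not necessarily how we actually set them):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.logConf", "true")
      .set("spark.shuffle.consolidateFiles", "true")
      .set("spark.ui.port", "4042")
      .set("spark.io.compression.codec",
           "org.apache.spark.io.SnappyCompressionCodec")
      .set("spark.shuffle.file.buffer.kb", "500")
      .set("spark.speculation", "true")
    val sc = new SparkContext(conf)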



-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenfeld@oculusinfo.com