Posted to user@spark.apache.org by Nathan Kronenfeld <nk...@oculusinfo.com> on 2014/10/17 08:46:21 UTC
rdd caching and use thereof
I'm trying to understand two things about how Spark is working.

(1) When I try to cache an RDD that fits well within memory (about 60g of
data in about 600g of memory), I get seemingly random levels of caching,
from around 60% to 100%, given the same tuning parameters. What governs how
much of an RDD gets cached when there is enough memory?

(2) Even when the RDD is cached, running tasks over the data gives varying
locality levels. Sometimes it works perfectly, with everything
PROCESS_LOCAL; other times 10-20% of the tasks run at locality ANY (and the
job takes minutes instead of seconds). This often varies between two
consecutive runs of the same task in the same shell. Is there anything I
can do to affect this? I tried caching with replication, but that caused
everything to run out of memory almost instantly (with the same 60g data
set in 4-600g of memory).
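[For context, a minimal sketch of the two standard levers for these symptoms, assuming a Spark 1.x-era context; the input path is a placeholder. Plain cache() uses MEMORY_ONLY, which silently skips partitions that don't fit at the moment of caching, while MEMORY_AND_DISK spills them instead; spark.locality.wait (in milliseconds in 1.x) controls how long the scheduler waits for a local slot before downgrading to ANY:]

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("caching-sketch")
      // Wait longer for PROCESS_LOCAL slots before falling back to ANY
      // (default is 3000 ms).
      .set("spark.locality.wait", "10000")
    val sc = new SparkContext(conf)

    // "hdfs:///path/to/data" is a placeholder, not from the original thread.
    val rdd = sc.textFile("hdfs:///path/to/data")
    // Spill partitions that don't fit in memory rather than dropping them,
    // which may avoid the partially-cached (60-100%) behavior of MEMORY_ONLY.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.count() // materialize the cache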
Thanks for the help,
-Nathan
--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenfeld@oculusinfo.com
Re: rdd caching and use thereof
Posted by Nathan Kronenfeld <nk...@oculusinfo.com>.
Oh, I forgot - I've set the following parameters at the moment (besides the
standard location, memory, and core setup):
spark.logConf true
spark.shuffle.consolidateFiles true
spark.ui.port 4042
spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
spark.shuffle.file.buffer.kb 500
spark.speculation true
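[For anyone reproducing this setup, the same spark-defaults entries can also be set programmatically; a sketch against the 1.x SparkConf API:]

    import org.apache.spark.SparkConf

    // Programmatic equivalent of the spark-defaults entries above (1.x keys).
    val conf = new SparkConf()
      .set("spark.logConf", "true")
      .set("spark.shuffle.consolidateFiles", "true")
      .set("spark.ui.port", "4042")
      .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
      .set("spark.shuffle.file.buffer.kb", "500")
      .set("spark.speculation", "true")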