Posted to user@spark.apache.org by Jim Carroll <ji...@gmail.com> on 2014/12/16 18:50:38 UTC

No disk single pass RDD aggregation

Okay,

I have an rdd that I want to run an aggregate over but it insists on
spilling to disk even though I structured the processing to only require a
single pass.

In other words, I can do all of my processing one entry in the rdd at a time
without persisting anything.

I set rdd.persist(StorageLevel.NONE) and it had no effect. When I run
locally, my /tmp directory fills with transient RDD data even though I
never need the data again after each row has been processed. Is there a way
to turn this off?
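For what it's worth, the single-pass pattern I mean can be mimicked with plain Scala collections. This is a Spark-free sketch of the seqOp contract of RDD.aggregate (the names and data here are illustrative, not from my actual job):

```scala
// Spark-free sketch: fold each row into a running state exactly once,
// the way RDD.aggregate's seqOp does within a partition.
object SinglePassAggregate {
  def aggregate[S, T](rows: Iterator[T], zero: S)(seqOp: (S, T) => S): S =
    rows.foldLeft(zero)(seqOp)

  def main(args: Array[String]): Unit = {
    val rows = Iterator("a,1", "b,2", "c,3")
    // State is (row count, sum of the numeric column); each row is seen once
    // and never needs to be revisited or persisted.
    val result = aggregate(rows, (0, 0)) { (acc, row) =>
      (acc._1 + 1, acc._2 + row.split(",")(1).toInt)
    }
    println(s"${result._1} ${result._2}") // 3 6
  }
}
```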

Thanks
Jim




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: No disk single pass RDD aggregation

Posted by Jim Carroll <ji...@gmail.com>.
Nvm. I'm going to post another question, since this has to do with the way
Spark handles sc.textFile with a file://.....gz



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20725.html



Re: No disk single pass RDD aggregation

Posted by Jim Carroll <ji...@gmail.com>.
In case a little more information is helpful:

the RDD is constructed using sc.textFile(fileUri), where fileUri points to a
".gz" file (one that's too big to fit on my disk).
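(An aside on the ".gz" input: gzip is not a splittable compression format, so sc.textFile on a single .gz file yields one partition and the whole file is decompressed as one sequential stream. A plain-JVM sketch of that kind of one-pass read, using made-up data rather than my real file:)

```scala
import java.io.{File, FileInputStream, FileOutputStream, PrintWriter}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.Source

// Sketch: a gzip stream can only be read front-to-back in one pass,
// which is why a .gz input ends up as a single partition.
object GzSinglePass {
  def main(args: Array[String]): Unit = {
    // Write a tiny gzipped file standing in for the real input.
    val f = File.createTempFile("demo", ".gz")
    val out = new PrintWriter(new GZIPOutputStream(new FileOutputStream(f)))
    (1 to 3).foreach(i => out.println(s"row$i"))
    out.close()

    // One sequential pass over the decompressed lines; nothing is materialized.
    val in = Source.fromInputStream(new GZIPInputStream(new FileInputStream(f)))
    val count = in.getLines().count(_.startsWith("row"))
    in.close()
    f.delete()
    println(count) // 3
  }
}
```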

I do an rdd.persist(StorageLevel.NONE) and it seems to have no effect.

This RDD is what I'm calling aggregate on, and I expect to use it only once;
each row in the RDD never has to be revisited. The aggregate's seqOp modifies
a "current state" and returns it, so there's no need to store the results of
the seqOp on a row-by-row basis, and given the fact that there's only one
partition, the combOp doesn't even need to be called (since there would be
nothing to combine across partitions).
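To make the shape of that concrete, here's a Spark-free sketch of a seqOp that mutates and returns the same state object, plus a combOp that (with one partition) would never be invoked. The Stats class and values are made up for illustration:

```scala
// Sketch: seqOp mutating and returning one shared state object.
// With a single partition, only seqOp runs; combOp is never called.
final class Stats { var n = 0L; var max = Double.MinValue }

object OnePartitionAggregate {
  val seqOp: (Stats, Double) => Stats = { (st, v) =>
    st.n += 1
    if (v > st.max) st.max = v
    st // return the same mutated object; nothing stored per row
  }

  // combOp would merge per-partition states, but with one partition
  // there is nothing to combine.
  val combOp: (Stats, Stats) => Stats = { (a, b) =>
    a.n += b.n
    if (b.max > a.max) a.max = b.max
    a
  }

  def main(args: Array[String]): Unit = {
    val st = List(1.0, 5.0, 3.0).foldLeft(new Stats)(seqOp)
    println(s"${st.n} ${st.max}") // 3 5.0
  }
}
```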

Thanks for any help.
Jim




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20724.html



Re: No disk single pass RDD aggregation

Posted by Jim Carroll <ji...@gmail.com>.
Hi,

This was all my fault. It turned out I had a line of code buried in a
library that did a "repartition." I used this library to wrap an RDD to
present it to legacy code through a different interface. That repartition
was what was causing the data to spill to disk.

The really stupid thing is that it took me the better part of a day to find,
and prompted several misguided emails to this list (including the one that
started this thread).

Sorry about that.

Jim




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20763.html



Re: No disk single pass RDD aggregation

Posted by thanhtien522 <th...@gmail.com>.
Jim Carroll wrote
> Okay,
> 
> I have an rdd that I want to run an aggregate over but it insists on
> spilling to disk even though I structured the processing to only require a
> single pass.
> 
> In other words, I can do all of my processing one entry in the rdd at a
> time without persisting anything.
> 
> I set rdd.persist(StorageLevel.NONE) and it had no effect. When I run
> locally I get my /tmp directory filled with transient rdd data even though
> I never need the data again after the row's been processed. Is there a way
> to turn this off?
> 
> Thanks
> Jim

Hi,
Do you have many input files?
If so, try setting:

conf.set("spark.shuffle.consolidateFiles", "true");
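That setting would typically go on the SparkConf before the context is created. A config fragment (assumes Spark on the classpath; note the option applied to the Spark 1.x shuffle and the app name here is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Config fragment only; not runnable outside a Spark application.
val conf = new SparkConf()
  .setAppName("my-app") // placeholder name
  .set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)
```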

Hope this helps.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20753.html
