Posted to user@spark.apache.org by Jasleen Kaur <ja...@gmail.com> on 2015/08/03 21:49:17 UTC
Writing to HDFS
I am executing a Spark job on a cluster in yarn-client mode (yarn-cluster is
not an option due to permission issues).
- num-executors 800
- spark.akka.frameSize=1024
- spark.default.parallelism=25600
- driver-memory=4G
- executor-memory=32G
- My input size is around 1.5TB.
My problem is that when I execute rdd.saveAsTextFile(outputPath,
classOf[org.apache.hadoop.io.compress.SnappyCodec]) I get a heap space
error (saving as Avro is also not an option; I have tried
saveAsSequenceFile with GZIP and saveAsNewAPIHadoopFile with the same
result). On the other hand, if I execute rdd.take(1) I get no such issue,
so I am assuming the issue is caused by the write.
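For reference, the failing call reduces to the following minimal local-mode
sketch. Everything here is illustrative: the small data set stands in for the
~1.5TB input, the output path is a placeholder, and GzipCodec is substituted
for SnappyCodec only because it works without the native Hadoop library; the
call shape is the same.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.GzipCodec

object SaveSketch {
  def main(args: Array[String]): Unit = {
    // The output path must not already exist; override it via the first argument.
    val out = if (args.nonEmpty) args(0) else "/tmp/save-sketch-out"
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("save-sketch"))
    try {
      // Small placeholder data set standing in for the real input.
      val rdd = sc.parallelize(1 to 100000, numSlices = 16).map(_.toString)
      // Same call shape as in the thread; each partition becomes one
      // compressed part file under `out`.
      rdd.saveAsTextFile(out, classOf[GzipCodec])
    } finally sc.stop()
  }
}
```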
Re: Writing to HDFS
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Just to add: rdd.take(1) won't trigger the entire computation, it will just
pull out the first record. You need to do an rdd.count() or an rdd.saveAs*File
call to trigger the complete pipeline. How many partitions do you see in the
last stage?
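The difference is easy to see in local mode: an accumulator incremented inside
mapPartitions counts how many partitions actually execute, and take(1) touches
only the first partition while count() forces all of them. A sketch
(illustrative names, Spark 2.x accumulator API, spark-core assumed on the
classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyDemo {
  /** Returns (partitions run by take(1), partitions run in total after count()). */
  def run(sc: SparkContext): (Long, Long) = {
    val touched = sc.longAccumulator("partitions touched")
    val rdd = sc.parallelize(1 to 1000, numSlices = 100).mapPartitions { it =>
      touched.add(1) // runs once per executed partition
      it
    }
    rdd.take(1)                   // scans only as many partitions as it needs
    val afterTake = touched.value // 1 here: partition 0 already holds a record
    rdd.count()                   // forces every one of the 100 partitions
    (afterTake, touched.value)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("lazy-demo"))
    try {
      val (afterTake, total) = run(sc)
      println(s"take(1) ran $afterTake partition(s); total after count(): $total")
    } finally sc.stop()
  }
}
```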
Thanks
Best Regards
On Tue, Aug 4, 2015 at 7:10 AM, ayan guha <gu...@gmail.com> wrote:
> Is your data skewed? What happens if you do rdd.count()?
Re: Writing to HDFS
Posted by ayan guha <gu...@gmail.com>.
Is your data skewed? What happens if you do rdd.count()?
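One way to answer the skew question is to count the records in each partition
and compare the minimum and maximum: a heavily lopsided distribution means a
few tasks carry most of the data. A minimal sketch (illustrative names,
spark-core assumed on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SkewCheck {
  /** Record count of every partition; a lopsided result signals skew. */
  def partitionSizes[T](rdd: RDD[T]): Array[Long] =
    rdd.mapPartitions(it => Iterator(it.size.toLong),
      preservesPartitioning = true).collect()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("skew-check"))
    try {
      val sizes = partitionSizes(sc.parallelize(1 to 1000, numSlices = 10))
      println(s"partitions=${sizes.length} min=${sizes.min} max=${sizes.max}")
    } finally sc.stop()
  }
}
```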