Posted to user@spark.apache.org by Jasleen Kaur <ja...@gmail.com> on 2015/08/03 21:49:17 UTC

Writing to HDFS

I am executing a Spark job on a cluster in yarn-client mode (yarn-cluster
mode is not an option due to permission issues).

   - num-executors 800
   - spark.akka.frameSize=1024
   - spark.default.parallelism=25600
   - driver-memory=4G
   - executor-memory=32G
   - My input size is around 1.5 TB.

My problem is that when I execute rdd.saveAsTextFile(outputPath,
classOf[org.apache.hadoop.io.compress.SnappyCodec]) I get a heap space
error (saving as Avro is also not an option; I have tried
saveAsSequenceFile with GZIP and saveAsNewAPIHadoopFile with the same
result). On the other hand, if I execute rdd.take(1) I get no such issue,
so I am assuming the issue is caused by the write.
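For concreteness, a minimal sketch of a job shaped like the one described
above (the object name, argument handling, and paths are placeholders; the
settings in the comments and the save call are the ones listed):

import org.apache.spark.{SparkConf, SparkContext}

object WriteJob {
  def main(args: Array[String]): Unit = {
    // Submitted with: --master yarn-client --num-executors 800
    //   --driver-memory 4G --executor-memory 32G
    //   --conf spark.akka.frameSize=1024
    //   --conf spark.default.parallelism=25600
    val sc = new SparkContext(new SparkConf().setAppName("WriteJob"))

    val rdd = sc.textFile(args(0))   // ~1.5 TB of input; path passed as an argument

    rdd.take(1)                      // completes without any problem

    rdd.saveAsTextFile(args(1),      // the heap space error shows up here
      classOf[org.apache.hadoop.io.compress.SnappyCodec])

    sc.stop()
  }
}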

Re: Writing to HDFS

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Just to add: rdd.take(1) won't trigger the entire computation, it will just
pull out the first record. You need to do an rdd.count() or one of the
rdd.saveAs*File calls to trigger the complete pipeline. How many partitions
do you see in the last stage?
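
To illustrate the difference (just a sketch; rdd stands for whatever RDD is
being saved):

rdd.take(1)                    // computes only enough partitions to return one record
rdd.count()                    // runs the full pipeline, same as saveAsTextFile would
println(rdd.partitions.size)   // number of partitions (= tasks) in that last stage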

Thanks
Best Regards

On Tue, Aug 4, 2015 at 7:10 AM, ayan guha <gu...@gmail.com> wrote:

> Is your data skewed? What happens if you do rdd.count()?
> On 4 Aug 2015 05:49, "Jasleen Kaur" <ja...@gmail.com> wrote:
>
>> I am executing a Spark job on a cluster in yarn-client mode (yarn-cluster
>> mode is not an option due to permission issues).
>>
>>    - num-executors 800
>>    - spark.akka.frameSize=1024
>>    - spark.default.parallelism=25600
>>    - driver-memory=4G
>>    - executor-memory=32G
>>    - My input size is around 1.5 TB.
>>
>> My problem is that when I execute rdd.saveAsTextFile(outputPath,
>> classOf[org.apache.hadoop.io.compress.SnappyCodec]) I get a heap space
>> error (saving as Avro is also not an option; I have tried
>> saveAsSequenceFile with GZIP and saveAsNewAPIHadoopFile with the same
>> result). On the other hand, if I execute rdd.take(1) I get no such issue,
>> so I am assuming the issue is caused by the write.
>>
>

Re: Writing to HDFS

Posted by ayan guha <gu...@gmail.com>.
Is your data skewed? What happens if you do rdd.count()?
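
One quick way to check (a sketch, assuming rdd is the RDD being written
out): count the records per partition and look for a few partitions that
are much larger than the rest.

// Record count per partition; heavy skew shows up as a few partitions
// holding far more records than the others.
val partitionSizes = rdd
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()

partitionSizes.sortBy(p => -p._2).take(10).foreach(println)   // ten largest partitions
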
On 4 Aug 2015 05:49, "Jasleen Kaur" <ja...@gmail.com> wrote:

> I am executing a Spark job on a cluster in yarn-client mode (yarn-cluster
> mode is not an option due to permission issues).
>
>    - num-executors 800
>    - spark.akka.frameSize=1024
>    - spark.default.parallelism=25600
>    - driver-memory=4G
>    - executor-memory=32G
>    - My input size is around 1.5 TB.
>
> My problem is that when I execute rdd.saveAsTextFile(outputPath,
> classOf[org.apache.hadoop.io.compress.SnappyCodec]) I get a heap space
> error (saving as Avro is also not an option; I have tried
> saveAsSequenceFile with GZIP and saveAsNewAPIHadoopFile with the same
> result). On the other hand, if I execute rdd.take(1) I get no such issue,
> so I am assuming the issue is caused by the write.
>