Posted to dev@spark.apache.org by Gary Malouf <ma...@gmail.com> on 2014/11/06 23:10:21 UTC

Wrong temp directory when compressing before sending text file to S3

We have some data that we are exporting from our HDFS cluster to S3 with
some help from Spark.  The final RDD command we run is:

csvData.saveAsTextFile("s3n://data/mess/2014/11/dump-oct-30-to-nov-5-gzip",
classOf[GzipCodec])


We have 'spark.local.dir' set to the large ephemeral partition on
each slave (on EC2), but with compression enabled, an intermediate
file seems to be written to /tmp/hadoop-root/s3 instead.  Is this a
bug in Spark, or are we missing a configuration property?


It's a problem for us because the root disks on our EC2 XL instances
are small (~5 GB).
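
A plausible explanation (an assumption, not confirmed in this thread)
is that the s3n filesystem stages each output file on local disk
before uploading it.  Hadoop's NativeS3FileSystem writes that staging
copy under 'fs.s3.buffer.dir', which defaults to ${hadoop.tmp.dir}/s3,
i.e. /tmp/hadoop-root/s3 when running as root, and which is
independent of 'spark.local.dir'.  A minimal sketch of redirecting it
to the ephemeral partition ("/mnt/ephemeral" below is a hypothetical
mount point):

import org.apache.hadoop.io.compress.GzipCodec

// Stage s3n uploads on the large ephemeral disk instead of the
// default ${hadoop.tmp.dir}/s3 on the small root volume.
// "/mnt/ephemeral" is a placeholder; substitute the actual mount
// point on the slaves.
sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/mnt/ephemeral/s3")

csvData.saveAsTextFile("s3n://data/mess/2014/11/dump-oct-30-to-nov-5-gzip",
  classOf[GzipCodec])

Setting the same property in core-site.xml on each slave would achieve
this cluster-wide rather than per-job.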

Re: Wrong temp directory when compressing before sending text file to S3

Posted by Josh Rosen <ro...@gmail.com>.
Hi Gary,

Could you create a Spark JIRA ticket for this so that it doesn't fall
through the cracks?  Thanks!
