Posted to user@spark.apache.org by Andrew Ash <an...@andrewash.com> on 2014/01/30 11:21:13 UTC

Stream RDD to local disk

Hi Spark users,

I'm often using Spark for ETL-type tasks, where the input is a large file
on disk and the output is another large file on disk.  I've loaded
everything into HDFS, but I still need to get files back out of HDFS on the
other side.

Right now I produce these processed files in a 2-step process:

1) in a single Spark job, read from HDFS location A, process, and write to
HDFS location B
2) run hadoop fs -cat hdfs:///path/to/* > /path/to/myfile to get it onto the
local disk (roughly as sketched below).
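
In code, that's roughly the following (paths and the processLine transform
are stand-ins for the real thing):

  // Step 1: the Spark job; processLine is a placeholder for the real transformation
  val processed = sc.textFile("hdfs:///path/to/A").map(processLine)
  processed.saveAsTextFile("hdfs:///path/to/B")

  // Step 2: outside of Spark, concatenate the part files down to local disk:
  //   hadoop fs -cat hdfs:///path/to/B/* > /path/to/myfile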

It would be great to get this down to a 1-step process.

If I run .saveAsTextFile("...") on my RDD, the shards of the file end up
scattered across the local disks of the cluster.  But if I .collect() on
the driver and then save to disk using normal Scala disk I/O utilities, I'll
certainly OOM the driver.

*So the question*: is there a way to get an iterator over an RDD's contents
that I can scan through on the driver and flush to disk?

I found the RDD.iterator() method, but it looks to be intended for use by
RDD subclasses rather than end users (it requires Partition and TaskContext
parameters).  The .foreach() method also executes on the workers rather
than on the driver, so saving from there would likewise scatter files across
the cluster.
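
To illustrate, here's a sketch of the kind of thing I'd like to be able to
write; I'm assuming some method like toLocalIterator that hands the driver
one partition's worth of rows at a time rather than the whole RDD at once:

  import java.io.{BufferedWriter, FileWriter}

  // Stream the RDD's rows through the driver partition by partition and
  // append them to a single local file, without collect()ing everything.
  val writer = new BufferedWriter(new FileWriter("/path/to/myfile"))
  try {
    processed.toLocalIterator.foreach { line =>
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writer.close()
  }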

Any suggestions?

Thanks!
Andrew

Re: Stream RDD to local disk

Posted by Andrew Ash <an...@andrewash.com>.
I hadn't thought of calling the hadoop process from within the Scala code,
but that is an improvement over my current process.  Thanks for the
suggestion, Chris!

It still requires saving to HDFS, dumping out to a local file, and then
cleaning that temp directory out of HDFS, though, so it isn't quite my ideal
process.
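
For concreteness, that flow would look something like this (paths are made
up, and processed stands in for the already-computed RDD):

  import org.apache.hadoop.fs.{FileSystem, Path}
  import scala.sys.process._

  // 1) still have to materialize the output in HDFS first
  processed.saveAsTextFile("hdfs:///tmp/myjob-output")

  // 2) shell out to pull the part files down to local disk
  "hadoop fs -copyToLocal hdfs:///tmp/myjob-output /local/path/myjob-output".!

  // 3) then clean the temp directory back out of HDFS
  val fs = FileSystem.get(sc.hadoopConfiguration)
  fs.delete(new Path("hdfs:///tmp/myjob-output"), true)  // recursive delete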

Sent from my mobile phone
On Jan 30, 2014 2:37 AM, "Christopher Nguyen" <ct...@adatao.com> wrote:

> Andrew, couldn't you do this in the Scala code:
>
>   scala.sys.process.Process("hadoop fs -copyToLocal ...")!
>
> or is that still considered a second step?
>
> "hadoop fs" is almost certainly going to be better at copying these files
> than some memory-to-disk-to-memory serdes within Spark.
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctnguyen

Re: Stream RDD to local disk

Posted by Christopher Nguyen <ct...@adatao.com>.
Andrew, couldn't you do this in the Scala code:

  scala.sys.process.Process("hadoop fs -copyToLocal ...")!

or is that still considered a second step?

"hadoop fs" is almost certainly going to be better at copying these files
than some memory-to-disk-to-memory serdes within Spark.
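
Spelled out a bit more, with an exit-code check (paths are just placeholders):

  import scala.sys.process._

  // Run the copy from the driver once the save has finished, and fail
  // loudly if the hadoop CLI reports a non-zero exit code.
  val exit = Process(Seq("hadoop", "fs", "-copyToLocal",
                         "hdfs:///path/to/B", "/local/path/to/output")).!
  require(exit == 0, s"hadoop fs -copyToLocal exited with code $exit")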

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen


