Posted to user@spark.apache.org by Shay Seng <sh...@1618labs.com> on 2013/10/31 02:34:31 UTC

Save RDDs as CSV

What's the recommended way to save an RDD as a CSV file on, say, HDFS?
Do I have to collect the RDD and save it from the master, or is there
some way I can write out the CSV file in parallel to HDFS?


tks
shay

Re: Save RDDs as CSV

Posted by Patrick Wendell <pw...@gmail.com>.
I don't think HDFS supports concurrent appends to a single file, so
I'm not sure if this is possible with any framework (Spark/MapReduce)
that creates new HDFS connections per reducer.


Re: Save RDDs as CSV

Posted by Stephen Haberman <st...@gmail.com>.
> Doing a coalesce will be kind of a problem... I was hoping there would
> be a utility or command option that could concat all the files
> together for me...

If you do rdd.coalesce(1, shuffle = true), then rdd itself will still
be processed in parallel (with each of its partitions' output getting
written to disk), and only the final saveAsTextFile task will be
non-parallel (it will sequentially pull in each upstream partition's
output and write it to the single output file).

In other words, coalesce(1, shuffle = true) for all intents and
purposes is concat.
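A plain-Python sketch of those semantics (illustration only, not Spark code; all names here are made up): the upstream partitions are produced independently, and only the final step sequentially concatenates their output.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_partition(rows):
    # Stand-in for the parallel upstream work: format one
    # partition's rows as CSV lines.
    return [",".join(map(str, r)) for r in rows]

def save_as_single_file(partitions):
    # The upstream partitions are processed in parallel...
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(compute_partition, partitions))
    # ...and a single final task pulls each partition's output
    # in order and concatenates it into one result.
    return "\n".join(line for part in outputs for line in part)
```

Only the last join is sequential, which is why coalesce(1, shuffle = true) behaves like a concat rather than forcing the whole pipeline through one task.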

Or is there a reason you would not find this sufficient?

- Stephen


Re: Save RDDs as CSV

Posted by Josh Rosen <ro...@gmail.com>.
It looks like you might be able to combine the output files using HDFS's
getmerge command:
http://stackoverflow.com/questions/5700068/merge-output-files-after-reduce-phase



Re: Save RDDs as CSV

Posted by Shay Seng <sh...@1618labs.com>.
Doing a coalesce will be kind of a problem... I was hoping there would be a
utility or command option that could concat all the files together for
me...

Thanks for the replies though!




Re: Save RDDs as CSV

Posted by Andre Schumacher <sc...@icsi.berkeley.edu>.
There is also the getmerge command of the HDFS shell, which lets you
merge and fetch the contents of a directory; that may be exactly what
you want. From the docs:

Usage: hdfs dfs -getmerge <src> <localdst> [addnl]

Takes a source directory and a destination file as input and
concatenates files in src into the destination local file. Optionally
addnl can be set to enable adding a newline character at the end of each
file.



Re: Save RDDs as CSV

Posted by Patrick Wendell <pw...@gmail.com>.
You can do this if you coalesce the data first. However, this will
put all of your final data through a single reduce task (so you get
no parallelism and may overload a node):

myrdd.coalesce(1).saveAsTextFile("hdfs://..../my.csv")

Basically you have to choose: either you do the write in parallel and
get a lot of files, or you do the write on one node/reducer and get a
single file.

- Patrick


Re: Save RDDs as CSV

Posted by Shay Seng <sh...@1618labs.com>.
Well, that almost works... when I call
myrdd.saveAsTextFile("hdfs://..../my.csv")

instead of getting a single my.csv file, as I expected, my.csv is a directory
with a bunch of parts, all of which are CSV.
Is there some way to have those files concatenated automatically?





Re: Save RDDs as CSV

Posted by Josh Rosen <ro...@gmail.com>.
saveAsTextFile() is implemented in terms of Hadoop's TextOutputFormat,
which writes one record per line:
https://github.com/apache/incubator-spark/blob/v0.8.0-incubating/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L816

You could map() each entry in your RDD into a comma-separated string, then
write those strings using saveAsTextFile().
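For example (a Python sketch; `to_csv_line` is a hypothetical helper name): the standard `csv` module handles quoting and escaping, which a bare `",".join(...)` would get wrong for fields that themselves contain commas or quotes.

```python
import csv
import io

def to_csv_line(record):
    # Format one record (a tuple or list of fields) as a single CSV
    # line, properly quoting fields that contain commas or quotes.
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(record)
    return buf.getvalue()
```

You would then map this over the RDD before writing, roughly `myrdd.map(to_csv_line).saveAsTextFile(...)`; in Scala you would write an equivalent formatting function and use it the same way.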





Re: Save RDDs as CSV

Posted by Andre Schumacher <sc...@icsi.berkeley.edu>.
Hi,

Can you use saveAsTextFile? See

http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD

I'm not sure what the default field separator is (probably tab), but if
you don't mind that, it may work. No need to collect it to the master.

Andre
