Posted to user@spark.apache.org by dmpour23 <dm...@gmail.com> on 2014/04/04 15:01:01 UTC

how to save RDD partitions in different folders?

Hi all,
Say I have an input file which I would like to partition using
HashPartitioner k times.

Calling rdd.saveAsTextFile("hdfs://"); will save k files, named part-00000
through part-0000k.
Is there a way to save each partition in specific folders?

i.e. src
      part0/part-00000
      part1/part-00001
      partk/part-0000k

thanks
Dimitri





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: how to save RDD partitions in different folders?

Posted by dmpour23 <dm...@gmail.com>.
I am not exactly sure how to use MultipleOutputs in Spark. I have been looking
into Apache Crunch; its user guide (http://crunch.apache.org/user-guide.html)
states that:

Multiple outputs: Spark doesn't have a concept of multiple outputs; when you
write a data set to disk, the pipeline that creates that data set runs
immediately. This means that you need to be a little bit clever about
caching intermediate stages so you don't end up re-running a big long
pipeline multiple times in order to write a couple of outputs. Crunch does
that for you, along with the same output format and parameter wrapping you
get for multiple inputs.

Is this correct or is there another way of solving the problem?




Re: how to save RDD partitions in different folders?

Posted by Konstantin Kudryavtsev <ku...@gmail.com>.
Hi Evan,

Could you please provide a code snippet? It is not clear to me: in Hadoop
you need to call the addNamedOutput method, and I am stuck on how to use
it from Spark.

Thank you,
Konstantin Kudryavtsev


On Fri, Apr 4, 2014 at 5:27 PM, Evan Sparks <ev...@gmail.com> wrote:

> Have a look at MultipleOutputs in the hadoop API. Spark can read and write
> to arbitrary hadoop formats.

Re: how to save RDD partitions in different folders?

Posted by dmpour23 <dm...@gmail.com>.
Can you provide an example?




Re: how to save RDD partitions in different folders?

Posted by Evan Sparks <ev...@gmail.com>.
Have a look at MultipleOutputs in the Hadoop API. Spark can read and write to arbitrary Hadoop formats.
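
Since two replies ask for an example, here is a rough sketch of the idea only. It is plain Python, not Spark or Hadoop code: it makes one pass over the records and routes each one to an output file chosen by its partition, which is roughly what MultipleOutputs lets a Hadoop job do. The function names, the crc32-based partitioner (a stand-in for HashPartitioner) and the part<i>/part-0000<i> naming are all assumptions taken from the layout in the question.

```python
import os
import tempfile
import zlib
from contextlib import ExitStack


def partition_index(key: str, k: int) -> int:
    """Hypothetical stand-in for a hash partitioner: stable index in [0, k)."""
    return zlib.crc32(key.encode("utf-8")) % k


def save_partitions_to_folders(records, base_dir: str, k: int) -> dict:
    """Write (key, value) records to base_dir/part<i>/part-<i as 5 digits>.

    Makes a single pass over the records and opens each output file lazily,
    so empty partitions produce no folder. Returns {partition index: path}.
    """
    paths = {}
    with ExitStack() as stack:  # closes every opened file on exit
        writers = {}
        for key, value in records:
            i = partition_index(key, k)
            if i not in writers:
                folder = os.path.join(base_dir, f"part{i}")
                os.makedirs(folder, exist_ok=True)
                paths[i] = os.path.join(folder, f"part-{i:05d}")
                writers[i] = stack.enter_context(
                    open(paths[i], "w", encoding="utf-8")
                )
            writers[i].write(f"{key}\t{value}\n")
    return paths


if __name__ == "__main__":
    sample = [("apple", 1), ("banana", 2), ("cherry", 3), ("apple", 4)]
    with tempfile.TemporaryDirectory() as out_dir:
        written = save_partitions_to_folders(sample, out_dir, k=3)
        for index, path in sorted(written.items()):
            print(index, os.path.relpath(path, out_dir))
```

In Spark itself, one commonly suggested route is saveAsHadoopFile on a pair RDD together with a subclass of Hadoop's MultipleTextOutputFormat (old mapred API) that derives the file name from the key. That is not shown here and would need checking against the Spark and Hadoop versions in use.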
