Posted to user@spark.apache.org by Kapil Malik <km...@adobe.com> on 2014/01/05 16:20:43 UTC

hdfs replication on saving RDD

Hi all,

I have a Spark cluster running on top of a 3-node HDFS cluster, with HDFS replication set to 2. So if I upload a file with "hadoop fs -put something.txt", it is replicated to 2 nodes.
However, when I call rdd.saveAsTextFile(..), the output is saved with replication factor 3 (i.e. on all nodes). How do I configure Spark to save a text file with the same replication factor as configured for Hadoop?

Thanks,

Kapil Malik | kmalik@adobe.com | 33430 / 8800836581
Go Corona : http://harrypotter:4502/CoronaClient.html


RE: hdfs replication on saving RDD

Posted by Kapil Malik <km...@adobe.com>.
Sending again without the intranet link. (Probably got into spam)

Hi all,

I have a Spark cluster running on top of a 3-node HDFS cluster, with HDFS replication set to 2. So if I upload a file with "hadoop fs -put something.txt", it is replicated to 2 nodes.
However, when I call rdd.saveAsTextFile(..), the output is saved with replication factor 3 (i.e. on all nodes). How do I configure Spark to save a text file with the same replication factor as configured for Hadoop?

Thanks,

Kapil Malik

Re: hdfs replication on saving RDD

Posted by Kan Zhang <kz...@apache.org>.
Andrew, there are overloaded versions of saveAsHadoopFile or
saveAsNewAPIHadoopFile that allow you to pass in a per-job Hadoop conf.
saveAsTextFile is just a convenience wrapper on top of saveAsHadoopFile.
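As an illustrative sketch of the overload Kan describes (untested; the output path is hypothetical and the exact key/value types you want may differ), a per-job conf can be cloned from the context's conf and overridden before the save:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Clone the cluster conf so other jobs on this SparkContext are
// unaffected, then override the replication factor for this job only.
val jobConf = new Configuration(sc.hadoopConfiguration)
jobConf.set("dfs.replication", "2")

// saveAsNewAPIHadoopFile needs a pair RDD, so wrap each line first.
rdd.map(line => (NullWritable.get(), new Text(line)))
   .saveAsNewAPIHadoopFile(
     "hdfs:///user/kapil/output",               // hypothetical path
     classOf[NullWritable],
     classOf[Text],
     classOf[TextOutputFormat[NullWritable, Text]],
     jobConf)
```

Because the conf is a per-job copy, this answers Andrew's question below: no changes to the Hadoop conf/ directory are needed between jobs.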


On Mon, Jul 14, 2014 at 11:22 PM, Andrew Ash <an...@andrewash.com> wrote:

> In general it would be nice to be able to configure replication on a
> per-job basis.  Is there a way to do that without changing the config
> values in the Hadoop conf/ directory between jobs?  Maybe by modifying
> OutputFormats or the JobConf?
>
>
> On Mon, Jul 14, 2014 at 11:12 PM, Matei Zaharia <ma...@gmail.com>
> wrote:
>
>> You can change this setting through SparkContext.hadoopConfiguration, or
>> put the conf/ directory of your Hadoop installation on the CLASSPATH when
>> you launch your app so that it reads the config values from there.
>>
>> Matei
>>
>> On Jul 14, 2014, at 8:06 PM, valgrind_girl <12...@qq.com> wrote:
>>
>> > Eager to know about this too. Does anyone know how?
>> >
>> >
>> >
>>
>>
>

Re: hdfs replication on saving RDD

Posted by Andrew Ash <an...@andrewash.com>.
In general it would be nice to be able to configure replication on a
per-job basis.  Is there a way to do that without changing the config
values in the Hadoop conf/ directory between jobs?  Maybe by modifying
OutputFormats or the JobConf?


On Mon, Jul 14, 2014 at 11:12 PM, Matei Zaharia <ma...@gmail.com>
wrote:

> You can change this setting through SparkContext.hadoopConfiguration, or
> put the conf/ directory of your Hadoop installation on the CLASSPATH when
> you launch your app so that it reads the config values from there.
>
> Matei
>
> On Jul 14, 2014, at 8:06 PM, valgrind_girl <12...@qq.com> wrote:
>
> > Eager to know about this too. Does anyone know how?
> >
> >
> >
>
>

Re: hdfs replication on saving RDD

Posted by Matei Zaharia <ma...@gmail.com>.
You can change this setting through SparkContext.hadoopConfiguration, or put the conf/ directory of your Hadoop installation on the CLASSPATH when you launch your app so that it reads the config values from there.

Matei
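A minimal sketch of the first approach Matei mentions, assuming sc is an existing SparkContext (the output path is hypothetical):

```scala
// dfs.replication is the standard HDFS config key for the
// replication factor; set it before the save action runs.
// Note: this conf is shared by every job on this SparkContext,
// so the setting applies to all subsequent saves as well.
sc.hadoopConfiguration.set("dfs.replication", "2")

rdd.saveAsTextFile("hdfs:///user/kapil/output")  // hypothetical path
```

If you need different replication factors for different jobs on the same context, the per-job conf overloads discussed elsewhere in this thread are the cleaner option.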

On Jul 14, 2014, at 8:06 PM, valgrind_girl <12...@qq.com> wrote:

> Eager to know about this too. Does anyone know how?
> 
> 
> 


Re: hdfs replication on saving RDD

Posted by valgrind_girl <12...@qq.com>.
Eager to know about this too. Does anyone know how?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/hdfs-replication-on-saving-RDD-tp289p9700.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.