Posted to dev@spark.apache.org by "Shao, Saisai" <sa...@intel.com> on 2014/12/25 07:52:41 UTC

Question on saveAsTextFile with overwrite option

Hi,

We have a requirement to save RDD output to HDFS with a saveAsTextFile-like API, but we need to overwrite the data if it already exists. I'm not sure whether current Spark supports this kind of operation, or whether I need to check for existing output manually.

There's a thread on the mailing list that discussed this (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html), but I'm not sure whether this feature is available, or whether it needs to be enabled with some configuration.

Appreciate your suggestions.

Thanks a lot
Jerry
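
A minimal sketch of the "check this manually" approach, assuming the output lives on HDFS and using the Hadoop FileSystem API; the object name, path, and record counts below are illustrative, not from the thread:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.{SparkConf, SparkContext}

    object ManualOverwriteSave {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("manual-overwrite"))
        val rdd = sc.parallelize(1 to 1000, numSlices = 10)

        // Illustrative output location on HDFS.
        val output = new Path("hdfs:///tmp/manual-overwrite-example")

        // saveAsTextFile refuses to write into an existing directory,
        // so remove any previous output first (recursive delete).
        val fs = FileSystem.get(output.toUri, sc.hadoopConfiguration)
        if (fs.exists(output)) {
          fs.delete(output, true)
        }

        rdd.map(i => s"record-$i").saveAsTextFile(output.toString)
        sc.stop()
      }
    }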

RE: Question on saveAsTextFile with overwrite option

Posted by "Shao, Saisai" <sa...@intel.com>.
Thanks Patrick for your detailed explanation.

BR
Jerry

-----Original Message-----
From: Patrick Wendell [mailto:pwendell@gmail.com] 
Sent: Thursday, December 25, 2014 3:43 PM
To: Cheng, Hao
Cc: Shao, Saisai; user@spark.apache.org; dev@spark.apache.org
Subject: Re: Question on saveAsTextFile with overwrite option

So the behavior of overwriting existing directories IMO is something we don't want to encourage. The reason why the Hadoop client has these checks is that it's very easy for users to do unsafe things without them. For instance, a user could overwrite an RDD that had 100 partitions with an RDD that has 10 partitions... and if they read back the RDD they would get a corrupted RDD that has a combination of data from the old and new RDD.

If users want to circumvent these safety checks, they should have to disable them explicitly. Given this, I think a config option is as reasonable as any alternative, and it's already pretty easy IMO.

- Patrick

On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao <ha...@intel.com> wrote:
> I am wondering if we can provide a friendlier API, rather than a configuration option, for this purpose. What do you think, Patrick?
>
> Cheng Hao
>
> -----Original Message-----
> From: Patrick Wendell [mailto:pwendell@gmail.com]
> Sent: Thursday, December 25, 2014 3:22 PM
> To: Shao, Saisai
> Cc: user@spark.apache.org; dev@spark.apache.org
> Subject: Re: Question on saveAsTextFile with overwrite option
>
> Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false?
>
> http://spark.apache.org/docs/latest/configuration.html
>
> - Patrick
>
> On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai <sa...@intel.com> wrote:
>> Hi,
>>
>>
>>
>> We have a requirement to save RDD output to HDFS with a
>> saveAsTextFile-like API, but we need to overwrite the data if it already exists.
>> I'm not sure whether current Spark supports this kind of operation, or whether I need to check for existing output manually.
>>
>>
>>
>> There's a thread on the mailing list that discussed this
>> (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
>> but I'm not sure whether this feature is available, or whether it needs to be enabled with some configuration.
>>
>>
>>
>> Appreciate your suggestions.
>>
>>
>>
>> Thanks a lot
>>
>> Jerry
>


Re: Question on saveAsTextFile with overwrite option

Posted by Patrick Wendell <pw...@gmail.com>.
So the behavior of overwriting existing directories IMO is something
we don't want to encourage. The reason why the Hadoop client has these
checks is that it's very easy for users to do unsafe things without
them. For instance, a user could overwrite an RDD that had 100
partitions with an RDD that has 10 partitions... and if they read back
the RDD they would get a corrupted RDD that has a combination of data
from the old and new RDD.

If users want to circumvent these safety checks, they should have to
disable them explicitly. Given this, I think a config option is as
reasonable as any alternative, and it's already pretty easy IMO.

- Patrick
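
To make the hazard concrete, here is a hedged sketch (path, record counts, and partition numbers are purely illustrative) of how part files from an earlier, wider write can survive a later, narrower write once the output check is bypassed:

    import org.apache.spark.{SparkConf, SparkContext}

    object StalePartFilesSketch {
      def main(args: Array[String]): Unit = {
        // Assumes the existing-output check is disabled via
        // spark.hadoop.validateOutputSpecs=false; otherwise the
        // second save below fails fast instead.
        val conf = new SparkConf()
          .setAppName("stale-part-files")
          .set("spark.hadoop.validateOutputSpecs", "false")
        val sc = new SparkContext(conf)

        val out = "hdfs:///tmp/overwrite-hazard"  // illustrative path

        // First job: 100 partitions produce part-00000 ... part-00099.
        sc.parallelize(1 to 1000000, 100).map(i => s"old-$i").saveAsTextFile(out)

        // Second job writes only 10 partitions into the same directory.
        // Whatever the committer does with part-00000 ... part-00009,
        // nothing removes part-00010 ... part-00099 from the first job,
        // so a later read of the directory mixes old and new records.
        sc.parallelize(1 to 1000, 10).map(i => s"new-$i").saveAsTextFile(out)

        sc.stop()
      }
    }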

On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao <ha...@intel.com> wrote:
> I am wondering if we can provide a friendlier API, rather than a configuration option, for this purpose. What do you think, Patrick?
>
> Cheng Hao
>
> -----Original Message-----
> From: Patrick Wendell [mailto:pwendell@gmail.com]
> Sent: Thursday, December 25, 2014 3:22 PM
> To: Shao, Saisai
> Cc: user@spark.apache.org; dev@spark.apache.org
> Subject: Re: Question on saveAsTextFile with overwrite option
>
> Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false?
>
> http://spark.apache.org/docs/latest/configuration.html
>
> - Patrick
>
> On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai <sa...@intel.com> wrote:
>> Hi,
>>
>>
>>
>> We have a requirement to save RDD output to HDFS with a
>> saveAsTextFile-like API, but we need to overwrite the data if it already exists.
>> I'm not sure whether current Spark supports this kind of operation, or whether I need to check for existing output manually.
>>
>>
>>
>> There's a thread on the mailing list that discussed this
>> (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
>> but I'm not sure whether this feature is available, or whether it needs to be enabled with some configuration.
>>
>>
>>
>> Appreciate your suggestions.
>>
>>
>>
>> Thanks a lot
>>
>> Jerry
>


RE: Question on saveAsTextFile with overwrite option

Posted by "Cheng, Hao" <ha...@intel.com>.
I am wondering if we can provide a friendlier API, rather than a configuration option, for this purpose. What do you think, Patrick?

Cheng Hao
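
One possible shape for such an API, sketched here as a hypothetical enrichment that is not part of Spark, would pair an explicitly named overwrite variant with the delete-then-save logic users currently write by hand:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.rdd.RDD

    // Hypothetical helper, NOT an existing Spark API.
    object OverwriteSyntax {
      implicit class OverwritableRDD[T](rdd: RDD[T]) {
        def saveAsTextFileOverwrite(path: String): Unit = {
          val p = new Path(path)
          val fs = FileSystem.get(p.toUri, rdd.sparkContext.hadoopConfiguration)
          if (fs.exists(p)) {
            fs.delete(p, true)  // recursive delete of the previous output
          }
          rdd.saveAsTextFile(path)
        }
      }
    }

    // Usage sketch:
    //   import OverwriteSyntax._
    //   myRdd.saveAsTextFileOverwrite("hdfs:///tmp/output")

Keeping the destructive behavior behind an explicitly named method (or a boolean overwrite parameter) would keep it opt-in at the call site, in line with the concern about making users disable the safety check deliberately.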

-----Original Message-----
From: Patrick Wendell [mailto:pwendell@gmail.com] 
Sent: Thursday, December 25, 2014 3:22 PM
To: Shao, Saisai
Cc: user@spark.apache.org; dev@spark.apache.org
Subject: Re: Question on saveAsTextFile with overwrite option

Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false?

http://spark.apache.org/docs/latest/configuration.html

- Patrick

On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai <sa...@intel.com> wrote:
> Hi,
>
>
>
> We have a requirement to save RDD output to HDFS with a
> saveAsTextFile-like API, but we need to overwrite the data if it already exists.
> I'm not sure whether current Spark supports this kind of operation, or whether I need to check for existing output manually.
>
>
>
> There's a thread on the mailing list that discussed this
> (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
> but I'm not sure whether this feature is available, or whether it needs to be enabled with some configuration.
>
>
>
> Appreciate your suggestions.
>
>
>
> Thanks a lot
>
> Jerry



Re: Question on saveAsTextFile with overwrite option

Posted by Patrick Wendell <pw...@gmail.com>.
Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false?

http://spark.apache.org/docs/latest/configuration.html

- Patrick
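
For reference, a minimal sketch of using that setting from application code; the app name and output path are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object OverwriteViaConfig {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("overwrite-via-config")
          // Disable Hadoop's output-spec validation so saving into an
          // existing directory is no longer rejected up front.
          .set("spark.hadoop.validateOutputSpecs", "false")
        val sc = new SparkContext(conf)

        // No longer fails if the directory already exists, but it does
        // not clean up part files left over from a previous job either.
        sc.parallelize(Seq("a", "b", "c")).saveAsTextFile("hdfs:///tmp/output")

        sc.stop()
      }
    }

    // The same setting can also be passed at submit time, for example:
    //   spark-submit --conf spark.hadoop.validateOutputSpecs=false ...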

On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai <sa...@intel.com> wrote:
> Hi,
>
>
>
> We have a requirement to save RDD output to HDFS with a saveAsTextFile-like
> API, but we need to overwrite the data if it already exists. I'm not sure whether
> current Spark supports this kind of operation, or whether I need to check for existing output manually.
>
>
>
> There's a thread on the mailing list that discussed this
> (http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-make-Spark-1-0-saveAsTextFile-to-overwrite-existing-file-td6696.html),
> but I'm not sure whether this feature is available, or whether it needs to be enabled with some configuration.
>
>
>
> Appreciate your suggestions.
>
>
>
> Thanks a lot
>
> Jerry


