Posted to user@spark.apache.org by "Chen, Kevin" <Ke...@neustar.biz> on 2016/09/15 18:37:01 UTC

Missing output partition file in S3

Hi,

Has anyone encountered an issue of a missing output partition file in S3? My Spark job writes output to an S3 location. Occasionally, I notice one partition file is missing, and as a result one chunk of data is lost. If I rerun the same job, the problem usually goes away. This has been happening pretty randomly; I have observed it once or twice a week on a daily job. I am using Spark 1.2.1.

Any input or suggested fix/workaround would be very much appreciated.
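For context, a minimal sketch of the kind of job being described, in Spark 1.2-era Scala (the app name, input path, and bucket here are illustrative, not from the report; S3 credential configuration is omitted):

    import org.apache.spark.{SparkConf, SparkContext}

    // A job that writes its output to S3 via the s3n:// client.
    val sc = new SparkContext(new SparkConf().setAppName("daily-export"))

    val records = sc.textFile("hdfs:///data/input")

    // saveAsTextFile writes one part-NNNNN file per RDD partition;
    // the problem reported here is that one of those files is
    // occasionally missing from the S3 listing after the job succeeds.
    records.saveAsTextFile("s3n://example-bucket/output/2016-09-15")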




Re: Missing output partition file in S3

Posted by Tracy Li <lx...@gmail.com>.

Sent from my iPhone

> On Sep 15, 2016, at 1:37 PM, Chen, Kevin <Ke...@neustar.biz> wrote:
> 
> Hi,
> 
> Has anyone encountered an issue of a missing output partition file in S3? My Spark job writes output to an S3 location. Occasionally, I notice one partition file is missing, and as a result one chunk of data is lost. If I rerun the same job, the problem usually goes away. This has been happening pretty randomly; I have observed it once or twice a week on a daily job. I am using Spark 1.2.1.
> 
> Any input or suggested fix/workaround would be very much appreciated.
> 
> 
> 

Fwd: Missing output partition file in S3

Posted by Richard Catlin <ri...@gmail.com>.

> Begin forwarded message:
> 
> From: "Chen, Kevin" <Ke...@neustar.biz>
> Subject: Re: Missing output partition file in S3
> Date: September 19, 2016 at 10:54:44 AM PDT
> To: Steve Loughran <st...@hortonworks.com>
> Cc: "user@spark.apache.org" <us...@spark.apache.org>
> 
> Hi Steve,
> 
> Our S3 is in US East, but this issue also occurred when we were using an S3 bucket in US West. We are using s3n. We use a Spark standalone deployment and run the job in EC2. The datasets are about 25 GB. We did not have speculative execution turned on. We did not use the DirectCommitter.
> 
> Thanks,
> Kevin
> 
> From: Steve Loughran <stevel@hortonworks.com>
> Date: Friday, September 16, 2016 at 3:46 AM
> To: Chen Kevin <kevin.chen@neustar.biz>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Missing output partition file in S3
> 
> 
>> On 15 Sep 2016, at 19:37, Chen, Kevin <Kevin.Chen@neustar.biz> wrote:
>> 
>> Hi,
>> 
>> Has anyone encountered an issue of a missing output partition file in S3? My Spark job writes output to an S3 location. Occasionally, I notice one partition file is missing, and as a result one chunk of data is lost. If I rerun the same job, the problem usually goes away. This has been happening pretty randomly; I have observed it once or twice a week on a daily job. I am using Spark 1.2.1.
>> 
>> Any input or suggested fix/workaround would be very much appreciated.
>> 
>> 
>> 
> 
> This doesn't sound good.
> 
> Without making any promises about being able to fix this, I would like to understand the setup to see if there is something that could be done to address it:
> 
>   1.  Which S3 installation? US East or elsewhere?
>   2.  Which S3 client: s3n or s3a? If on Hadoop 2.7+, can you switch to s3a if you haven't already? (Exception: if you are using AWS EMR you have to stick with their s3:// client.)
>   3.  Are you running in-EC2 or remotely?
>   4.  How big are the datasets being generated?
>   5.  Do you have speculative execution turned on?
>   6.  Which committer? Is it the external "DirectCommitter", or the classic Hadoop FileOutputCommitter? If the latter, and you are using Hadoop 2.7.x, can you try the v2 algorithm (mapreduce.fileoutputcommitter.algorithm.version = 2)?
> 
> I should warn that the stance of myself and my colleagues is "don't commit directly to S3": write to HDFS and do a distcp when you finally copy out the data. S3 itself doesn't have enough consistency for committing output to work in the presence of all race conditions and failure modes. At least here you've noticed the problem; the thing people fear is not noticing that a problem has arisen.
> 
> -Steve


Re: Missing output partition file in S3

Posted by Steve Loughran <st...@hortonworks.com>.
On 19 Sep 2016, at 18:54, Chen, Kevin <Ke...@neustar.biz> wrote:

Hi Steve,

Our S3 is in US East, but this issue also occurred when we were using an S3 bucket in US West. We are using s3n. We use a Spark standalone deployment and run the job in EC2. The datasets are about 25 GB. We did not have speculative execution turned on. We did not use the DirectCommitter.

Thanks,
Kevin

The closest thing I know of to that on the version of Spark you are using is https://issues.apache.org/jira/browse/SPARK-4879, but that assumes speculative execution is on.


From: Steve Loughran <st...@hortonworks.com>
Date: Friday, September 16, 2016 at 3:46 AM
To: Chen Kevin <ke...@neustar.biz>
Cc: "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: Missing output partition file in S3


On 15 Sep 2016, at 19:37, Chen, Kevin <Ke...@neustar.biz> wrote:

Hi,

Has anyone encountered an issue of a missing output partition file in S3? My Spark job writes output to an S3 location. Occasionally, I notice one partition file is missing, and as a result one chunk of data is lost. If I rerun the same job, the problem usually goes away. This has been happening pretty randomly; I have observed it once or twice a week on a daily job. I am using Spark 1.2.1.

Any input or suggested fix/workaround would be very much appreciated.




This doesn't sound good.

Without making any promises about being able to fix this, I would like to understand the setup to see if there is something that could be done to address it:

  1.  Which S3 installation? US East or elsewhere?
  2.  Which S3 client: s3n or s3a? If on Hadoop 2.7+, can you switch to s3a if you haven't already? (Exception: if you are using AWS EMR you have to stick with their s3:// client.)
  3.  Are you running in-EC2 or remotely?
  4.  How big are the datasets being generated?
  5.  Do you have speculative execution turned on?
  6.  Which committer? Is it the external "DirectCommitter", or the classic Hadoop FileOutputCommitter? If the latter, and you are using Hadoop 2.7.x, can you try the v2 algorithm (mapreduce.fileoutputcommitter.algorithm.version = 2)?

I should warn that the stance of myself and my colleagues is "don't commit directly to S3": write to HDFS and do a distcp when you finally copy out the data. S3 itself doesn't have enough consistency for committing output to work in the presence of all race conditions and failure modes. At least here you've noticed the problem; the thing people fear is not noticing that a problem has arisen.

-Steve


Re: Missing output partition file in S3

Posted by "Chen, Kevin" <Ke...@neustar.biz>.
Hi Steve,

Our S3 is in US East, but this issue also occurred when we were using an S3 bucket in US West. We are using s3n. We use a Spark standalone deployment and run the job in EC2. The datasets are about 25 GB. We did not have speculative execution turned on. We did not use the DirectCommitter.

Thanks,
Kevin

From: Steve Loughran <st...@hortonworks.com>
Date: Friday, September 16, 2016 at 3:46 AM
To: Chen Kevin <ke...@neustar.biz>
Cc: "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: Missing output partition file in S3


On 15 Sep 2016, at 19:37, Chen, Kevin <Ke...@neustar.biz> wrote:

Hi,

Has anyone encountered an issue of a missing output partition file in S3? My Spark job writes output to an S3 location. Occasionally, I notice one partition file is missing, and as a result one chunk of data is lost. If I rerun the same job, the problem usually goes away. This has been happening pretty randomly; I have observed it once or twice a week on a daily job. I am using Spark 1.2.1.

Any input or suggested fix/workaround would be very much appreciated.




This doesn't sound good.

Without making any promises about being able to fix this, I would like to understand the setup to see if there is something that could be done to address it:

  1.  Which S3 installation? US East or elsewhere?
  2.  Which S3 client: s3n or s3a? If on Hadoop 2.7+, can you switch to s3a if you haven't already? (Exception: if you are using AWS EMR you have to stick with their s3:// client.)
  3.  Are you running in-EC2 or remotely?
  4.  How big are the datasets being generated?
  5.  Do you have speculative execution turned on?
  6.  Which committer? Is it the external "DirectCommitter", or the classic Hadoop FileOutputCommitter? If the latter, and you are using Hadoop 2.7.x, can you try the v2 algorithm (mapreduce.fileoutputcommitter.algorithm.version = 2)?

I should warn that the stance of myself and my colleagues is "don't commit directly to S3": write to HDFS and do a distcp when you finally copy out the data. S3 itself doesn't have enough consistency for committing output to work in the presence of all race conditions and failure modes. At least here you've noticed the problem; the thing people fear is not noticing that a problem has arisen.

-Steve

Re: Missing output partition file in S3

Posted by Steve Loughran <st...@hortonworks.com>.
On 15 Sep 2016, at 19:37, Chen, Kevin <Ke...@neustar.biz> wrote:

Hi,

Has anyone encountered an issue of a missing output partition file in S3? My Spark job writes output to an S3 location. Occasionally, I notice one partition file is missing, and as a result one chunk of data is lost. If I rerun the same job, the problem usually goes away. This has been happening pretty randomly; I have observed it once or twice a week on a daily job. I am using Spark 1.2.1.

Any input or suggested fix/workaround would be very much appreciated.




This doesn't sound good.

Without making any promises about being able to fix this, I would like to understand the setup to see if there is something that could be done to address it:

  1.  Which S3 installation? US East or elsewhere?
  2.  Which S3 client: s3n or s3a? If on Hadoop 2.7+, can you switch to s3a if you haven't already? (Exception: if you are using AWS EMR you have to stick with their s3:// client.)
  3.  Are you running in-EC2 or remotely?
  4.  How big are the datasets being generated?
  5.  Do you have speculative execution turned on?
  6.  Which committer? Is it the external "DirectCommitter", or the classic Hadoop FileOutputCommitter? If the latter, and you are using Hadoop 2.7.x, can you try the v2 algorithm (mapreduce.fileoutputcommitter.algorithm.version = 2)? (See the sketch after this list.)
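A minimal sketch of trying points 2 and 6 together, assuming Spark on Hadoop 2.7.x (the paths and bucket are hypothetical; on AWS EMR, keep the s3:// client as noted above):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("daily-export"))

    // Ask the classic FileOutputCommitter for its v2 commit algorithm.
    // The property exists on Hadoop 2.7.x; older versions ignore it.
    sc.hadoopConfiguration.set(
      "mapreduce.fileoutputcommitter.algorithm.version", "2")

    // On Hadoop 2.7+, prefer the s3a:// client over s3n://.
    sc.textFile("hdfs:///data/input")
      .saveAsTextFile("s3a://example-bucket/output/2016-09-15")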

I should warn that the stance of myself and my colleagues is "don't commit directly to S3": write to HDFS and do a distcp when you finally copy out the data, as sketched below. S3 itself doesn't have enough consistency for committing output to work in the presence of all race conditions and failure modes. At least here you've noticed the problem; the thing people fear is not noticing that a problem has arisen.
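A minimal sketch of that workflow, with hypothetical paths (the distcp step runs only after the Spark job has committed its output to HDFS):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("daily-export"))

    // Commit the job's output to HDFS, where the rename-based commit
    // protocol is reliable, rather than directly to S3.
    sc.textFile("hdfs:///data/input")
      .saveAsTextFile("hdfs:///jobs/daily-export/2016-09-15")

    // Then copy the finished directory out to S3 from the command line:
    //   hadoop distcp hdfs:///jobs/daily-export/2016-09-15 \
    //     s3a://example-bucket/output/2016-09-15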

-Steve

Re: Missing output partition file in S3

Posted by Igor Berman <ig...@gmail.com>.
Are you using speculation?
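For reference, spark.speculation defaults to false; a minimal sketch of setting and confirming it explicitly (the app name is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // With speculation on, a speculative duplicate of a task can race the
    // original while output is being committed; keeping it off rules that
    // out as a cause of missing part files.
    val conf = new SparkConf()
      .setAppName("daily-export")
      .set("spark.speculation", "false") // the default, set explicitly

    val sc = new SparkContext(conf)
    println(sc.getConf.get("spark.speculation", "false"))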

On 15 September 2016 at 21:37, Chen, Kevin <Ke...@neustar.biz> wrote:

> Hi,
>
> Has anyone encountered an issue of a missing output partition file in
> S3? My Spark job writes output to an S3 location. Occasionally, I notice
> one partition file is missing, and as a result one chunk of data is
> lost. If I rerun the same job, the problem usually goes away. This has
> been happening pretty randomly; I have observed it once or twice a week
> on a daily job. I am using Spark 1.2.1.
>
> Any input or suggested fix/workaround would be very much appreciated.
>
>
>
>