Posted to dev@spark.apache.org by Gil Vernik <GI...@il.ibm.com> on 2015/04/01 11:58:24 UTC
Re: One corrupt gzip in a directory of 100s
I actually saw the same issue: we analyzed a container holding a few
hundred GBs of zip files, one of which was corrupted, and Spark exited the
entire job with an exception.
I like SPARK-6593, since it can also cover additional cases, not just
corrupted zip files.
From: Dale Richardson <da...@hotmail.com>
To: "dev@spark.apache.org" <de...@spark.apache.org>
Date: 29/03/2015 11:48 PM
Subject: One corrupt gzip in a directory of 100s
Recently had an incident reported to me where somebody was analysing a
directory of gzipped log files, and was struggling to load them into Spark
because one of the files was corrupted - calling
sc.textFile('hdfs:///logs/*.gz') caused an IOException on the particular
executor that was reading that file, which caused the entire job to be
cancelled after the retry count was exceeded, with no way of catching
and recovering from the error. While normally I think it is entirely
appropriate to stop execution if something is wrong with your input,
sometimes it is useful to analyse what you can get (as long as you are
aware that input has been skipped), and treat corrupt files as acceptable
losses.
To cater for this particular case I've added SPARK-6593 (PR at
https://github.com/apache/spark/pull/5250), which adds an option
(spark.hadoop.ignoreInputErrors) to log exceptions raised by the Hadoop
input format but continue on with the next task.
Ideally in this case you would want to report the corrupt file paths back
to the master so they could be dealt with in a particular way (e.g. moved
to a separate directory), but that would require a public API
change/addition. I was pondering an addition to Spark's Hadoop API that
could report processing status back to the master via an optional
accumulator that collects filepath/Option(exception message) tuples, so
the user has some idea of which files are being processed and which files
are being skipped.
Regards,
Dale.
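The skip-and-record behaviour described above can be sketched outside Spark with plain Python and the stdlib gzip module. This is a toy, single-process illustration of the idea (function name and structure are mine, not the SPARK-6593 implementation): read every matching gzip file, skip the corrupt ones, and collect (filepath, error message) tuples the way the proposed accumulator would.

```python
import glob
import gzip


def read_text_files_ignoring_errors(pattern):
    """Read all gzip files matching `pattern`, skipping corrupt ones.

    Returns (lines, skipped), where `skipped` is a list of
    (filepath, error message) tuples, mirroring the accumulator idea.
    """
    lines, skipped = [], []
    for path in sorted(glob.glob(pattern)):
        try:
            with gzip.open(path, "rt") as f:
                lines.extend(f.read().splitlines())
        except (OSError, EOFError) as exc:  # corrupt or truncated gzip
            skipped.append((path, str(exc)))
    return lines, skipped
```

With this approach a job over hundreds of files completes, and the caller can inspect `skipped` afterwards to decide what to do with the bad files (e.g. move them aside).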
Re: One corrupt gzip in a directory of 100s
Posted by Ted Yu <yu...@gmail.com>.
S3n is governed by the same config parameter.
Cheers
> On Apr 2, 2015, at 7:33 AM, Romi Kuntsman <ro...@totango.com> wrote:
>
> Hi Ted,
> Not sure what's the config value, I'm using s3n filesystem and not s3.
>
> The error that I get is the following:
> (so does that mean it's 4 retries?)
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 11, ip.ec2.internal): java.net.UnknownHostException: mybucket.s3.amazonaws.com
> at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:579)
> at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:618)
> at sun.security.ssl.SSLSocketImpl.<init>(SSLSocketImpl.java:451)
> at sun.security.ssl.SSLSocketFactoryImpl.createSocket(SSLSocketFactoryImpl.java:140)
> at org.apache.commons.httpclient.protocol.SSLProtocolSocketFactory.createSocket(SSLProtocolSocketFactory.java:82)
> at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
> at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158)
> at java.lang.Thread.run(Thread.java:745)
>
> Romi Kuntsman, Big Data Engineer
> http://www.totango.com
Re: One corrupt gzip in a directory of 100s
Posted by Romi Kuntsman <ro...@totango.com>.
Hi Ted,
Not sure what's the config value, I'm using s3n filesystem and not s3.
The error that I get is the following:
(so does that mean it's 4 retries?)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 11, ip.ec2.internal): java.net.UnknownHostException: mybucket.s3.amazonaws.com
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:618)
at sun.security.ssl.SSLSocketImpl.<init>(SSLSocketImpl.java:451)
at sun.security.ssl.SSLSocketFactoryImpl.createSocket(SSLSocketFactoryImpl.java:140)
at org.apache.commons.httpclient.protocol.SSLProtocolSocketFactory.createSocket(SSLProtocolSocketFactory.java:82)
at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158)
at java.lang.Thread.run(Thread.java:745)
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
Re: One corrupt gzip in a directory of 100s
Posted by Ted Yu <yu...@gmail.com>.
bq. writing the output (to Amazon S3) failed
What's the value of "fs.s3.maxRetries" ?
Increasing the value should help.
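For reference, a Hadoop filesystem property like this can be set in core-site.xml; a minimal sketch (the value 10 is an arbitrary choice for illustration, the default for this property is lower):

```xml
<property>
  <name>fs.s3.maxRetries</name>
  <value>10</value>
</property>
```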
Cheers
Re: One corrupt gzip in a directory of 100s
Posted by Romi Kuntsman <ro...@totango.com>.
What about communication errors and not corrupted files?
Both when reading input and when writing output.
We currently experience a failure of the entire process when the last
stage, writing the output to Amazon S3, fails because of a very temporary
DNS resolution issue (easily resolved by retrying).
*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com
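A transient failure like this can often be absorbed by a small retry wrapper around the flaky step, independent of any Spark or Hadoop setting. A generic sketch with exponential backoff (names and defaults are mine, not a Spark API):

```python
import time


def with_retries(action, attempts=4, base_delay=1.0):
    """Call `action`, retrying transient errors with exponential backoff.

    Tries up to `attempts` times; re-raises the last error if all fail.
    """
    for attempt in range(attempts):
        try:
            return action()
        except OSError as exc:  # e.g. a transient DNS resolution failure
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

For example, wrapping the S3 write step as `with_retries(lambda: write_output())` would ride out a DNS blip that resolves within a few seconds, instead of failing the whole job.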