Posted to user@pig.apache.org by Kim Vogt <ki...@simplegeo.com> on 2011/01/25 23:54:35 UTC

Skip Badly Compressed Input Files

Hi,

I'm processing gzipped compressed files in a directory, but some files are
corrupted and can't be decompressed.  Is there a way to skip the bad files
with a custom load func?

-Kim

Re: Skip Badly Compressed Input Files

Posted by Kim Vogt <ki...@simplegeo.com>.
sure :-)

On Tue, Jan 25, 2011 at 5:54 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> I do it pre-pig.
> I think this has to be handled at the RecordReader level if you wanted to
> do
> it in the framework.
>
> Hey want to contribute to the error handling design discussion? :) We
> haven't thought about LoadFuncs yet..
>
> http://wiki.apache.org/pig/PigErrorHandlingInScripts
>
>
> On Tue, Jan 25, 2011 at 4:51 PM, Kim Vogt <ki...@simplegeo.com> wrote:
>
> > Do you catch the error when you load with pig, or is that a pre-pig step?
> > If I wanted to catch the error in a pig load, is it possible?  Where would
> > that code go?
> >
> > -Kim
> >
> > On Tue, Jan 25, 2011 at 4:44 PM, Dmitriy Ryaboy <dv...@gmail.com>
> > wrote:
> >
> > > Yeah so the unexpected EOF is the most common one we get (lzo requires a
> > > footer, and sometimes filehandles are closed before a footer is written, if
> > > the network hiccups or something).
> > >
> > > Right now what we do is scan before moving to the DW, and if not
> > > successful, extract what's extractable, catch the error, log how much data
> > > is lost (what's left to read), and expose stats about this sort of thing to
> > > monitoring software so we can alert if stuff gets out of hand.
> > >
> > > D
> > >
> > > On Tue, Jan 25, 2011 at 3:49 PM, Kim Vogt <ki...@simplegeo.com> wrote:
> > >
> > > > This is the error I'm getting:
> > > >
> > > > java.io.EOFException: Unexpected end of input stream
> > > >        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
> > > >        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
> > > >        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
> > > >        at java.io.InputStream.read(InputStream.java:85)
> > > >        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> > > >        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
> > > >        at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
> > > >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
> > > >        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
> > > >        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> > > >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> > > >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
> > > >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
> > > >        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> > > >        at java.security.AccessController.doPrivileged(Native Method)
> > > >        at javax.security.auth.Subject.doAs(Subject.java:396)
> > > >        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> > > >        at org.apache.hadoop.mapred.Child.main(Child.java:211)
> > > >
> > > > I'm trying to drill down on what files it's bonking on, but I believe the
> > > > data is corrupt from when flume and amazon hated me and died.
> > > >
> > > > Maybe I can skip bad files and just log their names somewhere, and we
> > > > probably should add some correctness tests :-)
> > > >
> > > > -Kim
> > > >
> > > >
> > > > On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dv...@gmail.com>
> > > > wrote:
> > > >
> > > > > How badly compressed are they? Problems in the codec, or in the data
> > > > > that comes out of the codec?
> > > > >
> > > > > We've had some lzo corruption problems, and so far have simply been
> > > > > dealing with that by doing correctness tests in our log mover pipeline
> > > > > before moving into the "data warehouse" area.
> > > > >
> > > > > Skipping bad files silently seems like asking for trouble (at some point
> > > > > the problem quietly grows and you wind up skipping most of your data),
> > > > > so I've been avoiding putting something like that in so that when things
> > > > > are badly broken, we get some early pain rather than lots of late pain.
> > > > >
> > > > > D
> > > > >
> > > > > On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <ki...@simplegeo.com>
> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm processing gzipped compressed files in a directory, but some files
> > > > > > are corrupted and can't be decompressed.  Is there a way to skip the
> > > > > > bad files with a custom load func?
> > > > > >
> > > > > > -Kim
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Skip Badly Compressed Input Files

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I do it pre-pig.
I think this has to be handled at the RecordReader level if you wanted to do
it in the framework.

Hey want to contribute to the error handling design discussion? :) We
haven't thought about LoadFuncs yet..

http://wiki.apache.org/pig/PigErrorHandlingInScripts
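
A minimal, untested sketch of that RecordReader-level idea (the wrapper class below is hypothetical, not an existing Pig or Hadoop class): it delegates to Hadoop's LineRecordReader and turns an unexpected EOF from a truncated compressed file into "end of split" instead of a failed task.

import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical wrapper: same contract as LineRecordReader, but a corrupt
// (truncated) gzip/lzo stream ends the split quietly instead of killing the task.
public class SkipCorruptLineRecordReader extends RecordReader<LongWritable, Text> {
  private final LineRecordReader delegate = new LineRecordReader();
  private boolean corrupt = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    delegate.initialize(split, context);
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (corrupt) {
      return false;
    }
    try {
      return delegate.nextKeyValue();
    } catch (EOFException e) {
      // Unexpected end of the compressed stream: log it and stop reading this file.
      System.err.println("Skipping rest of corrupt split: " + e.getMessage());
      corrupt = true;
      return false;
    }
  }

  @Override
  public LongWritable getCurrentKey() throws IOException, InterruptedException {
    return delegate.getCurrentKey();
  }

  @Override
  public Text getCurrentValue() throws IOException, InterruptedException {
    return delegate.getCurrentValue();
  }

  @Override
  public float getProgress() throws IOException, InterruptedException {
    return delegate.getProgress();
  }

  @Override
  public void close() throws IOException {
    delegate.close();
  }
}

A custom LoadFunc would then return an InputFormat whose createRecordReader() hands back this wrapper in place of the plain LineRecordReader.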


On Tue, Jan 25, 2011 at 4:51 PM, Kim Vogt <ki...@simplegeo.com> wrote:

> Do you catch the error when you load with pig, or is that a pre-pig step?
> If I wanted to catch the error in a pig load, is it possible?  Where would
> that code go?
>
> -Kim
>
> On Tue, Jan 25, 2011 at 4:44 PM, Dmitriy Ryaboy <dv...@gmail.com>
> wrote:
>
> > Yeah so the unexpected EOF is the most common one we get (lzo requires a
> > footer, and sometimes filehandles are closed before a footer is written, if
> > the network hiccups or something).
> >
> > Right now what we do is scan before moving to the DW, and if not
> > successful,
> > extract what's extractable, catch the error, log how much data is lost
> > (what's left to read), and expose stats about this sort of thing to
> > monitoring software so we can alert if stuff gets out of hand.
> >
> > D
> >
> > On Tue, Jan 25, 2011 at 3:49 PM, Kim Vogt <ki...@simplegeo.com> wrote:
> >
> > > This is the error I'm getting:
> > >
> > > java.io.EOFException: Unexpected end of input stream
> > >        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
> > >        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
> > >        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
> > >        at java.io.InputStream.read(InputStream.java:85)
> > >        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> > >        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
> > >        at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
> > >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
> > >        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
> > >        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> > >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> > >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
> > >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
> > >        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> > >        at java.security.AccessController.doPrivileged(Native Method)
> > >        at javax.security.auth.Subject.doAs(Subject.java:396)
> > >        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> > >        at org.apache.hadoop.mapred.Child.main(Child.java:211)
> > >
> > > I'm trying to drill down on what files it's bonking on, but I believe the
> > > data is corrupt from when flume and amazon hated me and died.
> > >
> > > Maybe I can skip bad files and just log their names somewhere, and we
> > > probably should add some correctness tests :-)
> > >
> > > -Kim
> > >
> > >
> > > On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dv...@gmail.com>
> > > wrote:
> > >
> > > > How badly compressed are they? Problems in the codec, or in the data
> > > > that comes out of the codec?
> > > >
> > > > We've had some lzo corruption problems, and so far have simply been
> > > > dealing with that by doing correctness tests in our log mover pipeline
> > > > before moving into the "data warehouse" area.
> > > >
> > > > Skipping bad files silently seems like asking for trouble (at some point
> > > > the problem quietly grows and you wind up skipping most of your data),
> > > > so I've been avoiding putting something like that in so that when things
> > > > are badly broken, we get some early pain rather than lots of late pain.
> > > >
> > > > D
> > > >
> > > > On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <ki...@simplegeo.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm processing gzipped compressed files in a directory, but some files
> > > > > are corrupted and can't be decompressed.  Is there a way to skip the
> > > > > bad files with a custom load func?
> > > > >
> > > > > -Kim
> > > > >
> > > >
> > >
> >
>

Re: Skip Badly Compressed Input Files

Posted by Kim Vogt <ki...@simplegeo.com>.
Do you catch the error when you load with pig, or is that a pre-pig step?
If I wanted to catch the error in a pig load, is it possible?  Where would
that code go?

-Kim
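
As for where such code could go: one option is the loader's own getNext(), since that is where the underlying RecordReader gets consumed. A rough, untested sketch against the Pig 0.7+ LoadFunc API (the class name and logging below are invented for illustration, modeled on a PigStorage-style text loader):

import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical loader: reads text lines like PigStorage, but a truncated
// compressed file ends the split quietly (and is logged) instead of failing the task.
public class SkippingTextLoader extends LoadFunc {
  private RecordReader reader;
  private final TupleFactory tupleFactory = TupleFactory.getInstance();

  @Override
  public void setLocation(String location, Job job) throws IOException {
    FileInputFormat.setInputPaths(job, location);
  }

  @Override
  public InputFormat getInputFormat() throws IOException {
    return new TextInputFormat();
  }

  @Override
  public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
    this.reader = reader;
  }

  @Override
  public Tuple getNext() throws IOException {
    try {
      if (!reader.nextKeyValue()) {
        return null;                                   // normal end of split
      }
      Text line = (Text) reader.getCurrentValue();
      return tupleFactory.newTuple(line.toString());
    } catch (EOFException e) {
      // Corrupt/truncated stream: log it and pretend the file ended here.
      System.err.println("Truncated input, skipping rest of file: " + e.getMessage());
      return null;
    } catch (InterruptedException e) {
      throw new IOException("Interrupted while reading", e);
    }
  }
}

The caveat raised elsewhere in the thread still applies: everything after the truncation point in a bad file is silently dropped, so it is worth at least counting and logging how often this path fires.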

On Tue, Jan 25, 2011 at 4:44 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Yeah so the unexpected EOF is the most common one we get (lzo requires a
> footer, and sometimes filehandles are closed before a footer is written, if
> the network hiccups or something).
>
> Right now what we do is scan before moving to the DW, and if not
> successful,
> extract what's extractable, catch the error, log how much data is lost
> (what's left to read), and expose stats about this sort of thing to
> monitoring software so we can alert if stuff gets out of hand.
>
> D
>
> On Tue, Jan 25, 2011 at 3:49 PM, Kim Vogt <ki...@simplegeo.com> wrote:
>
> > This is the error I'm getting:
> >
> > java.io.EOFException: Unexpected end of input stream
> >        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
> >        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
> >        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
> >        at java.io.InputStream.read(InputStream.java:85)
> >        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> >        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
> >        at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
> >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
> >        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
> >        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
> >        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> >        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
> >        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
> >        at java.security.AccessController.doPrivileged(Native Method)
> >        at javax.security.auth.Subject.doAs(Subject.java:396)
> >        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
> >        at org.apache.hadoop.mapred.Child.main(Child.java:211)
> >
> > I'm trying to drill down on what files it's bonking on, but I believe the
> > data is corrupt from when flume and amazon hated me and died.
> >
> > Maybe I can skip bad files and just log their names somewhere, and we
> > probably should add some correctness tests :-)
> >
> > -Kim
> >
> >
> > On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dv...@gmail.com>
> > wrote:
> >
> > > How badly compressed are they? Problems in the codec, or in the data
> > > that comes out of the codec?
> > >
> > > We've had some lzo corruption problems, and so far have simply been
> > > dealing with that by doing correctness tests in our log mover pipeline
> > > before moving into the "data warehouse" area.
> > >
> > > Skipping bad files silently seems like asking for trouble (at some point
> > > the problem quietly grows and you wind up skipping most of your data),
> > > so I've been avoiding putting something like that in so that when things
> > > are badly broken, we get some early pain rather than lots of late pain.
> > >
> > > D
> > >
> > > On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <ki...@simplegeo.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm processing gzipped compressed files in a directory, but some files
> > > > are corrupted and can't be decompressed.  Is there a way to skip the
> > > > bad files with a custom load func?
> > > >
> > > > -Kim
> > > >
> > >
> >
>

Re: Skip Badly Compressed Input Files

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Yeah so the unexpected EOF is the most common one we get (lzo requires a
footer, and sometimes filehandles are closed before a footer is written, if
the network hiccups or something).

Right now what we do is scan before moving to the DW, and if not successful,
extract what's extractable, catch the error, log how much data is lost
(what's left to read), and expose stats about this sort of thing to
monitoring software so we can alert if stuff gets out of hand.

D
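
A rough sketch of that kind of pre-load scan for gzip input (a standalone utility built on the JDK's GZIPInputStream; the class name and output format are invented, this is not the actual log-mover code):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Streams each .gz file given on the command line, counts how many bytes
// decompress cleanly, and reports files that are truncated or otherwise corrupt.
public class GzipScan {
  public static void main(String[] args) {
    byte[] buf = new byte[64 * 1024];
    for (String path : args) {
      long good = 0;
      try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(path))) {
        int n;
        while ((n = in.read(buf)) != -1) {
          good += n;                       // bytes recovered so far
        }
        System.out.println("OK      " + path + " (" + good + " uncompressed bytes)");
      } catch (IOException e) {            // EOFException, CRC mismatch, bad header, ...
        System.out.println("CORRUPT " + path + " after " + good
            + " readable bytes: " + e.getMessage());
      }
    }
  }
}

Run over a directory before loading (for example "java GzipScan /logs/2011-01-25/*.gz"), the CORRUPT lines record roughly how much data was lost, and only the OK files need to be handed to Pig.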

On Tue, Jan 25, 2011 at 3:49 PM, Kim Vogt <ki...@simplegeo.com> wrote:

> This is the error I'm getting:
>
> java.io.EOFException: Unexpected end of input stream
>        at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
>        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
>        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
>        at java.io.InputStream.read(InputStream.java:85)
>        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>        at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
>        at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
>        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
>        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
>        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
>        at org.apache.hadoop.mapred.Child.main(Child.java:211)
>
> I'm trying to drill down on what files it's bonking on, but I believe the
> data is corrupt from when flume and amazon hated me and died.
>
> Maybe I can skip bad files and just log their names somewhere, and we
> probably should add some correctness tests :-)
>
> -Kim
>
>
> On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dv...@gmail.com>
> wrote:
>
> > How badly compressed are they? Problems in the codec, or in the data that
> > comes out of the codec?
> >
> > We've had some lzo corruption problems, and so far have simply been
> > dealing with that by doing correctness tests in our log mover pipeline
> > before moving into the "data warehouse" area.
> >
> > Skipping bad files silently seems like asking for trouble (at some point
> > the problem quietly grows and you wind up skipping most of your data),
> > so I've been avoiding putting something like that in so that when things
> > are badly broken, we get some early pain rather than lots of late pain.
> >
> > D
> >
> > On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <ki...@simplegeo.com> wrote:
> >
> > > Hi,
> > >
> > > I'm processing gzipped compressed files in a directory, but some files
> > > are corrupted and can't be decompressed.  Is there a way to skip the
> > > bad files with a custom load func?
> > >
> > > -Kim
> > >
> >
>

Re: Skip Badly Compressed Input Files

Posted by Kim Vogt <ki...@simplegeo.com>.
This is the error I'm getting:

java.io.EOFException: Unexpected end of input stream
	at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
	at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
	at java.io.InputStream.read(InputStream.java:85)
	at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
	at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:97)
	at com.simplegeo.elephantgeo.pig.load.PigJsonLoader.getNext(Unknown Source)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:142)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:448)
	at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:315)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
	at org.apache.hadoop.mapred.Child.main(Child.java:211)

I'm trying to drill down on what files it's bonking on, but I believe the
data is corrupt from when flume and amazon hated me and died.

Maybe I can skip bad files and just log their names somewhere, and we
probably should add some correctness tests :-)

-Kim


On Tue, Jan 25, 2011 at 3:02 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> How badly compressed are they? Problems in the codec, or in the data that
> comes out of the codec?
>
> We've had some lzo corruption problems, and so far have simply been dealing
> with that by doing correctness tests in our log mover pipeline before
> moving
> into the "data warehouse" area.
>
> Skipping bad files silently seems like asking for trouble (at some point
> the
> problem quietly grows and you wind up skipping most of your data), so I've
> been avoiding putting something like that in so that when things are badly
> broken, we get some early pain rather than lots of late pain.
>
> D
>
> On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <ki...@simplegeo.com> wrote:
>
> > Hi,
> >
> > I'm processing gzipped compressed files in a directory, but some files
> > are corrupted and can't be decompressed.  Is there a way to skip the
> > bad files with a custom load func?
> >
> > -Kim
> >
>

Re: Skip Badly Compressed Input Files

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
How badly compressed are they? Problems in the codec, or in the data that
comes out of the codec?

We've had some lzo corruption problems, and so far have simply been dealing
with that by doing correctness tests in our log mover pipeline before moving
into the "data warehouse" area.

Skipping bad files silently seems like asking for trouble (at some point the
problem quietly grows and you wind up skipping most of your data), so I've
been avoiding putting something like that in so that when things are badly
broken, we get some early pain rather than lots of late pain.

D

On Tue, Jan 25, 2011 at 2:54 PM, Kim Vogt <ki...@simplegeo.com> wrote:

> Hi,
>
> I'm processing gzipped compressed files in a directory, but some files are
> corrupted and can't be decompressed.  Is there a way to skip the bad files
> with a custom load func?
>
> -Kim
>