Posted to common-user@hadoop.apache.org by Vadim Zaliva <kr...@gmail.com> on 2008/01/29 19:33:40 UTC

broken gzip file

I have a bunch of gzip files which I am trying to process with a Hadoop
task. The task fails with this exception:
java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92)
    at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.read(GzipCodec.java:124)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:128)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:117)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:39)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2016)
I guess some of the files are invalid. However, I could not find the name
of the file causing this exception anywhere in the logs. Due to the huge
size of the dataset, I would rather not extract the files from DFS and
verify them with gzip one by one. Any suggestions? Thanks!
Sincerely,
Vadim
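
P.S. If nothing better turns up, one way to check the files in place,
without extracting them from DFS first, would be to stream each one
through GZIPInputStream and report the ones that fail to decompress.
A minimal sketch (GzipChecker is a hypothetical name; assumes all the
files sit in one directory):

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzipChecker
{
    public static void main(String[] args) throws IOException
    {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] buf = new byte[64 * 1024];
        for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
            Path p = stat.getPath();
            if (!p.getName().endsWith(".gz"))
                continue;
            InputStream in = null;
            try {
                in = new GZIPInputStream(fs.open(p));
                // Drain the stream; a truncated file throws EOFException here.
                while (in.read(buf) != -1) { }
                System.out.println("OK:     " + p);
            } catch (IOException e) {
                System.out.println("BROKEN: " + p + " (" + e + ")");
            } finally {
                if (in != null) {
                    try { in.close(); } catch (IOException ignored) { }
                }
            }
        }
    }
}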



Re: broken gzip file

Posted by Jason Venner <ja...@attributor.com>.
Our change for this is mixed up with some other code we have; I will
have to separate it out.

Arun C Murthy wrote:
>
> On Jan 29, 2008, at 1:30 PM, Jason Venner wrote:
>
>> We have overridden the base class (public class MapReduceBase extends
>> org.apache.hadoop.mapred.MapReduceBase) so that the configure method
>> logs the split name and split section (or, in the case of gzip'd
>> files, the file name).
>>
>> We find it very helpful to be able to map the job errors to the
>> section of the input file causing the problem.
>>
>
> Maybe we should just log it by default? Want to submit that patch?
>
> Arun
>
>>
>> Vadim Zaliva wrote:
>>> I have a bunch of gzip files which I am trying to process with a
>>> Hadoop task. The task fails with this exception:
>>> java.io.EOFException: Unexpected end of ZLIB input stream
>>>     at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
>>>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
>>>     at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92)
>>>     at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.read(GzipCodec.java:124)
>>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>>>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:128)
>>>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:117)
>>>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:39)
>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2016)
>>> I guess some of the files are invalid. However, I could not find the
>>> name of the file causing this exception anywhere in the logs. Due to
>>> the huge size of the dataset, I would rather not extract the files
>>> from DFS and verify them with gzip one by one. Any suggestions? Thanks!
>>> Sincerely,
>>> Vadim
>>>
>>>
>>
>> -- 
>> Jason Venner
>> Attributor - Publish with Confidence <http://www.attributor.com/>
>> Attributor is hiring Hadoop Wranglers, contact if interested
>

Re: broken gzip file

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
On Jan 29, 2008, at 1:30 PM, Jason Venner wrote:

> We have overridden the base class (public class MapReduceBase extends
> org.apache.hadoop.mapred.MapReduceBase) so that the configure method
> logs the split name and split section (or, in the case of gzip'd
> files, the file name).
>
> We find it very helpful to be able to map the job errors to the
> section of the input file causing the problem.
>

Maybe we should just log it by default? Want to submit that patch?

Arun

>
> Vadim Zaliva wrote:
>> I have a bunch of gzip files which I am trying to process with a
>> Hadoop task. The task fails with this exception:
>> java.io.EOFException: Unexpected end of ZLIB input stream
>>     at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
>>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
>>     at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92)
>>     at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.read(GzipCodec.java:124)
>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:128)
>>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:117)
>>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:39)
>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2016)
>> I guess some of the files are invalid. However, I could not find the
>> name of the file causing this exception anywhere in the logs. Due to
>> the huge size of the dataset, I would rather not extract the files
>> from DFS and verify them with gzip one by one. Any suggestions? Thanks!
>> Sincerely,
>> Vadim
>>
>>
>
> -- 
> Jason Venner
> Attributor - Publish with Confidence <http://www.attributor.com/>
> Attributor is hiring Hadoop Wranglers, contact if interested


Re: broken gzip file

Posted by Jason Venner <ja...@attributor.com>.
We have overridden the base class (public class MapReduceBase extends
org.apache.hadoop.mapred.MapReduceBase) so that the configure method logs
the split name and split section (or, in the case of gzip'd files, the
file name).

We find it very helpful to be able to map the job errors to the section
of the input file causing the problem.
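
Roughly, the idea looks like this (a minimal sketch rather than our
actual code; LoggingMapReduceBase is an illustrative name, and
map.input.start / map.input.length are the split properties the
framework sets alongside map.input.file):

import org.apache.hadoop.mapred.JobConf;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LoggingMapReduceBase extends org.apache.hadoop.mapred.MapReduceBase
{
    private static final Log LOG = LogFactory.getLog("LoggingMapReduceBase");

    // Log which file and byte range this task reads, so a failed task
    // can be traced back to the section of input that caused it.
    public void configure(JobConf job)
    {
        super.configure(job);
        LOG.info("Input file=" + job.get("map.input.file")
                 + " start=" + job.get("map.input.start")
                 + " length=" + job.get("map.input.length"));
    }
}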


Vadim Zaliva wrote:
> I have a bunch of gzip files which I am trying to process with a
> Hadoop task. The task fails with this exception:
> java.io.EOFException: Unexpected end of ZLIB input stream
>     at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
>     at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92)
>     at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.read(GzipCodec.java:124)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:128)
>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:117)
>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:39)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2016)
> I guess some of the files are invalid. However, I could not find the
> name of the file causing this exception anywhere in the logs. Due to
> the huge size of the dataset, I would rather not extract the files
> from DFS and verify them with gzip one by one. Any suggestions? Thanks!
> Sincerely,
> Vadim
>
>

-- 
Jason Venner
Attributor - Publish with Confidence <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers, contact if interested

Re: broken gzip file

Posted by Vadim Zaliva <kr...@gmail.com>.
On Jan 29, 2008 10:50 AM, Ted Dunning <td...@veoh.com> wrote:
> If you drill into the task using the job tracker's web interface, you can
> get to the task's XML configuration. That configuration will have the input
> file split specification in it.
>
> You may also be able to see the input file elsewhere, but the XML
> configuration is definitive.

In the task XML configuration I see only the split file name (the
'mapred.job.split.file' property, which has a value like
'/disk3/nutch/data/filesystem/mapreduce/system/job_200801212103_0067/job.split'),
but not the original file name. Is there any way to get more information
about the splits?

Also, I was under the impression that gzip input files are not split.
Does that mean that even though they are not split, copies of them are
made anyway? That could be a potential optimization point.

Vadim

Re: broken gzip file

Posted by Vadim Zaliva <kr...@gmail.com>.
On Jan 29, 2008, at 10:50, Ted Dunning wrote:

I was using the library class RegexMapper. I did the following to add
logging, which did the trick:


import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.RegexMapper;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LoggingRegexMapper extends RegexMapper
{
    public static final Log LOG = LogFactory.getLog("LoggingRegexMapper");

    // Log the input file for this task so a failure can be traced back
    // to the file that caused it.
    public void configure(JobConf job)
    {
        super.configure(job);
        LOG.info("Input file=" + job.get("map.input.file"));
    }
}
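
For completeness, the job setup looks roughly like this (a sketch;
LoggingRegexMapperJob is a hypothetical name, and the two regex property
names are the ones I believe RegexMapper's configure() reads):

import org.apache.hadoop.mapred.JobConf;

public class LoggingRegexMapperJob
{
    public static JobConf createConf()
    {
        JobConf conf = new JobConf(LoggingRegexMapperJob.class);
        conf.setMapperClass(LoggingRegexMapper.class);
        // Properties read by RegexMapper.configure():
        conf.set("mapred.mapper.regex", "ERROR.*");   // pattern matched against each line
        conf.setInt("mapred.mapper.regex.group", 0);  // capture group used for output
        return conf;
    }
}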

Vadim

>
> Vadim,
>
> If you drill into the task using the job tracker's web interface,
> you can get to the task's XML configuration. That configuration will
> have the input file split specification in it.
>
> You may also be able to see the input file elsewhere, but the XML
> configuration is definitive.
>
>
> On 1/29/08 10:33 AM, "Vadim Zaliva" <kr...@gmail.com> wrote:
>
>> I have a bunch of gzip files which I am trying to process with a
>> Hadoop task. The task fails with this exception:
>> java.io.EOFException: Unexpected end of ZLIB input stream
>>     at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
>>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
>>     at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92)
>>     at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.read(GzipCodec.java:124)
>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:128)
>>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:117)
>>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:39)
>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2016)
>> I guess some of the files are invalid. However, I could not find the
>> name of the file causing this exception anywhere in the logs. Due to
>> the huge size of the dataset, I would rather not extract the files
>> from DFS and verify them with gzip one by one. Any suggestions? Thanks!
>> Sincerely,
>> Vadim
>>
>>
>


Re: broken gzip file

Posted by Ted Dunning <td...@veoh.com>.
Vadim,

If you drill into the task using the job tracker's web interface, you can
get to the task's XML configuration. That configuration will have the input
file split specification in it.

You may also be able to see the input file elsewhere, but the XML
configuration is definitive.


On 1/29/08 10:33 AM, "Vadim Zaliva" <kr...@gmail.com> wrote:

> I have a bunch of gzip files which I am trying to process with a
> Hadoop task. The task fails with this exception:
> java.io.EOFException: Unexpected end of ZLIB input stream
>     at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
>     at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
>     at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:92)
>     at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream.read(GzipCodec.java:124)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:128)
>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:117)
>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:39)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:147)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2016)
> I guess some of the files are invalid. However, I could not find the
> name of the file causing this exception anywhere in the logs. Due to
> the huge size of the dataset, I would rather not extract the files
> from DFS and verify them with gzip one by one. Any suggestions? Thanks!
> Sincerely,
> Vadim
> 
>