Posted to common-user@hadoop.apache.org by Malcolm Matalka <mm...@millennialmedia.com> on 2009/05/07 21:05:20 UTC

.gz input files having less output than uncompressed version

Problem:

I am comparing two jobs.  They both have the same input content; however,
in one job the input file has been gzipped, and in the other it has not.
I get far fewer output rows in the gzipped result than I do in the
uncompressed version:

 

Lines in output:

Gzipped: 86851

Uncompressed: 6569303

 

The gzipped input file is 875MB in size, and the entire job runs in
about 30 seconds.  The uncompressed file takes around 5 minutes to run.

 

Hadoop version:

0.18.1, r694836

 

Here is the output of the map task of the compressed input:

2009-05-07 14:54:53,492 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2009-05-07 14:54:53,636 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 12
2009-05-07 14:54:53,663 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2009-05-07 14:54:53,994 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2009-05-07 14:54:54,005 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2009-05-07 14:55:05,026 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 45410962; bufvoid = 99614720
2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 87923; length = 327680
2009-05-07 14:55:08,624 INFO org.apache.hadoop.mapred.MapTask: Index: (0, 3786199, 3786199)
2009-05-07 14:55:08,969 INFO org.apache.hadoop.mapred.MapTask: Index: (3786199, 3789579, 3789579)
2009-05-07 14:55:09,292 INFO org.apache.hadoop.mapred.MapTask: Index: (7575778, 3859183, 3859183)
2009-05-07 14:55:09,610 INFO org.apache.hadoop.mapred.MapTask: Index: (11434961, 3792449, 3792449)
2009-05-07 14:55:09,929 INFO org.apache.hadoop.mapred.MapTask: Index: (15227410, 3818963, 3818963)
2009-05-07 14:55:10,241 INFO org.apache.hadoop.mapred.MapTask: Index: (19046373, 3780875, 3780875)
2009-05-07 14:55:10,559 INFO org.apache.hadoop.mapred.MapTask: Index: (22827248, 3814950, 3814950)
2009-05-07 14:55:10,882 INFO org.apache.hadoop.mapred.MapTask: Index: (26642198, 3871426, 3871426)
2009-05-07 14:55:11,197 INFO org.apache.hadoop.mapred.MapTask: Index: (30513624, 3799971, 3799971)
2009-05-07 14:55:11,513 INFO org.apache.hadoop.mapred.MapTask: Index: (34313595, 3813327, 3813327)
2009-05-07 14:55:11,834 INFO org.apache.hadoop.mapred.MapTask: Index: (38126922, 3835208, 3835208)
2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Index: (41962130, 3747048, 3747048)
2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2009-05-07 14:55:12,160 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200905071451_0001_m_000000_0: No outputs to promote from hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/kerry.common/_temporary/_attempt_200905071451_0001_m_000000_0
2009-05-07 14:55:12,162 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200905071451_0001_m_000000_0' done.
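As a reading aid for the log above, here is a minimal Python sketch that parses the twelve "Index:" entries and checks that they describe one contiguous spill file. It assumes each triple is (start offset, length, length); the thread does not define the fields, so that reading is a guess based on the numbers themselves. The function names are made up for illustration.

```python
import re

# "Index:" entries copied verbatim from the map task log in this message.
LOG = """
Index: (0, 3786199, 3786199)
Index: (3786199, 3789579, 3789579)
Index: (7575778, 3859183, 3859183)
Index: (11434961, 3792449, 3792449)
Index: (15227410, 3818963, 3818963)
Index: (19046373, 3780875, 3780875)
Index: (22827248, 3814950, 3814950)
Index: (26642198, 3871426, 3871426)
Index: (30513624, 3799971, 3799971)
Index: (34313595, 3813327, 3813327)
Index: (38126922, 3835208, 3835208)
Index: (41962130, 3747048, 3747048)
"""

INDEX_RE = re.compile(r"Index:\s*\((\d+),\s*(\d+),\s*(\d+)\)")


def parse_index_entries(text):
    """Return a list of (start, length, length) integer triples."""
    return [tuple(int(g) for g in m.groups()) for m in INDEX_RE.finditer(text)]


def is_contiguous(entries):
    """True if every entry starts where the previous one ended.

    Assumption: the first field is a byte offset and the second a byte length.
    """
    expected_start = 0
    for start, length, _ in entries:
        if start != expected_start:
            return False
        expected_start = start + length
    return True


entries = parse_index_entries(LOG)
total_bytes = sum(length for _, length, _ in entries)
print(len(entries), is_contiguous(entries), total_bytes)
# prints: 12 True 45709178
```

Twelve entries match "numReduceTasks: 12", so under this reading there is one segment per reducer, all from a single spill. The same log reports kvend = 87923 records for that spill, which is close to the 86851 output lines reported for the gzipped job.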

 

 

Am I doing something wrong?  Is there anything else I can do to debug
this?  Is it a known bug?

 

Let me know if you need anything else, thanks.
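On the question of what else can be checked: nothing in this thread identifies the cause, but one local test is to count how many lines the .gz file really decodes to, and whether it is made of several gzip members written back to back (the gzip format permits this). A reader that stopped after the first member would see only part of the data. Whether that is what happens in this job is an assumption, not something the thread confirms. A minimal Python sketch, with a hypothetical helper name:

```python
import gzip
import zlib


def lines_per_member(data):
    """Return the newline count of each gzip member found in `data`.

    Each member is decoded with its own decompressor; whatever bytes
    remain after a member ends are treated as the start of the next one.
    """
    counts = []
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
        text = decoder.decompress(data) + decoder.flush()
        counts.append(text.count(b"\n"))
        data = decoder.unused_data
    return counts


# Synthetic two-member file: 2 lines in the first member, 3 in the second.
sample = gzip.compress(b"a\nb\n") + gzip.compress(b"c\nd\ne\n")
counts = lines_per_member(sample)
print(counts, sum(counts))
# prints: [2, 3] 5
```

For a real check, the bytes would come from something like open("input.gz", "rb").read(), where "input.gz" is a placeholder name. This sketch holds the whole file in memory, which is worth keeping in mind for an 875MB input. If the list has more than one entry and the first entry is close to the short output count, that points at the reader rather than at the data.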


RE: .gz input files having less output than uncompressed version

Posted by Malcolm Matalka <mm...@millennialmedia.com>.
This is the result of running gzip on the input files.  Gzipped input appears to be supported, for two reasons:

1) I do get some output in my results.  There are 86851 lines in my output file, and they are valid results.

2) In the job task output I pasted it states: org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library, suggesting it has determined what compression codec to use.


-----Original Message-----
From: tim robertson [mailto:timrobertson100@gmail.com] 
Sent: Thursday, May 07, 2009 15:29
To: core-user@hadoop.apache.org
Subject: Re: .gz input files having less output than uncompressed version

Hi,

What input format are you using for the GZipped file?

I don't believe there is a GZip input format although some people have
 discussed whether it is feasible...

Cheers

Tim


Re: .gz input files having less output than uncompressed version

Posted by tim robertson <ti...@gmail.com>.
Hi,

What input format are you using for the GZipped file?

I don't believe there is a GZip input format although some people have
 discussed whether it is feasible...

Cheers

Tim
