Posted to common-user@hadoop.apache.org by Malcolm Matalka <mm...@millennialmedia.com> on 2009/05/07 21:05:20 UTC
.gz input files having less output than uncompressed version
Problem:
I am comparing two jobs. They both have the same input content, but
in one job the input file has been gzipped, and in the other it has not.
I get far fewer output rows in the gzipped result than I do in the
uncompressed version:
Lines in output:
Gzipped: 86851
Uncompressed: 6569303
The gzipped input file is 875MB in size, and the entire job runs in
about 30 seconds. The uncompressed file takes around 5 minutes to run.
Hadoop version:
0.18.1, r694836
Here is the output of the map task of the compressed input:
2009-05-07 14:54:53,492 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2009-05-07 14:54:53,636 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 12
2009-05-07 14:54:53,663 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2009-05-07 14:54:53,909 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
2009-05-07 14:54:53,994 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2009-05-07 14:54:54,005 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2009-05-07 14:55:05,026 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 45410962; bufvoid = 99614720
2009-05-07 14:55:05,027 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 87923; length = 327680
2009-05-07 14:55:08,624 INFO org.apache.hadoop.mapred.MapTask: Index: (0, 3786199, 3786199)
2009-05-07 14:55:08,969 INFO org.apache.hadoop.mapred.MapTask: Index: (3786199, 3789579, 3789579)
2009-05-07 14:55:09,292 INFO org.apache.hadoop.mapred.MapTask: Index: (7575778, 3859183, 3859183)
2009-05-07 14:55:09,610 INFO org.apache.hadoop.mapred.MapTask: Index: (11434961, 3792449, 3792449)
2009-05-07 14:55:09,929 INFO org.apache.hadoop.mapred.MapTask: Index: (15227410, 3818963, 3818963)
2009-05-07 14:55:10,241 INFO org.apache.hadoop.mapred.MapTask: Index: (19046373, 3780875, 3780875)
2009-05-07 14:55:10,559 INFO org.apache.hadoop.mapred.MapTask: Index: (22827248, 3814950, 3814950)
2009-05-07 14:55:10,882 INFO org.apache.hadoop.mapred.MapTask: Index: (26642198, 3871426, 3871426)
2009-05-07 14:55:11,197 INFO org.apache.hadoop.mapred.MapTask: Index: (30513624, 3799971, 3799971)
2009-05-07 14:55:11,513 INFO org.apache.hadoop.mapred.MapTask: Index: (34313595, 3813327, 3813327)
2009-05-07 14:55:11,834 INFO org.apache.hadoop.mapred.MapTask: Index: (38126922, 3835208, 3835208)
2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Index: (41962130, 3747048, 3747048)
2009-05-07 14:55:12,146 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2009-05-07 14:55:12,160 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200905071451_0001_m_000000_0: No outputs to promote from hdfs://hadoop00.corp.millennialmedia.com:54313/user/hadoop/kerry.common/_temporary/_attempt_200905071451_0001_m_000000_0
2009-05-07 14:55:12,162 INFO org.apache.hadoop.mapred.TaskRunner: Task 'attempt_200905071451_0001_m_000000_0' done.
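The numbers in this log can be cross-checked with a few lines of Python. The sketch below is only a sanity check, and it rests on assumptions the log itself does not state: that each "Index: (a, b, c)" entry is (start offset, raw length, segment length) for one reduce partition of the spill, and that kvend is the number of records the map task buffered. The values are copied from the log and the post above; the variable and function names are made up for illustration.

```python
# Spill index entries copied from the map task log. Assumed layout:
# (start offset, raw length, segment length), one entry per reduce partition.
INDEX_ENTRIES = [
    (0, 3786199, 3786199),
    (3786199, 3789579, 3789579),
    (7575778, 3859183, 3859183),
    (11434961, 3792449, 3792449),
    (15227410, 3818963, 3818963),
    (19046373, 3780875, 3780875),
    (22827248, 3814950, 3814950),
    (26642198, 3871426, 3871426),
    (30513624, 3799971, 3799971),
    (34313595, 3813327, 3813327),
    (38126922, 3835208, 3835208),
    (41962130, 3747048, 3747048),
]

NUM_REDUCE_TASKS = 12          # "numReduceTasks: 12"
KVEND = 87923                  # "kvend = 87923" (assumed: buffered records)
GZIPPED_LINES = 86851          # output lines with gzipped input
UNCOMPRESSED_LINES = 6569303   # output lines with uncompressed input


def segments_are_contiguous(entries):
    """Return True if every segment starts where the previous one ended."""
    expected_start = 0
    for start, _raw_length, segment_length in entries:
        if start != expected_start:
            return False
        expected_start = start + segment_length
    return True


spill_bytes = sum(segment_length for _, _, segment_length in INDEX_ENTRIES)
output_ratio = GZIPPED_LINES / UNCOMPRESSED_LINES

print("partitions in spill:", len(INDEX_ENTRIES))
print("segments contiguous:", segments_are_contiguous(INDEX_ENTRIES))
print("spill size in bytes:", spill_bytes)
print("records buffered (kvend):", KVEND)
print("gzipped/uncompressed output ratio: {:.2%}".format(output_ratio))
```

If those assumptions hold, the map task buffered 87923 records in a single spill, which is in the same range as the 86851 output lines and far below the 6569303 lines of the uncompressed run. That would suggest the map task read only a small part of the input, rather than records being lost later in the job.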
Am I doing something wrong? Is there anything else I can do to debug
this? Is it a known bug?
Let me know if you need anything else, thanks.
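One check that can be run outside Hadoop is to compare how many lines a decompressor sees in the first gzip member of the file with how many it sees in the whole file. This is a guess at a possible cause, not something established in this thread: a .gz file may consist of several gzip members concatenated together, and some readers stop at the end of the first member. The sketch uses only the Python standard library; the function names are hypothetical.

```python
import gzip
import zlib


def count_newlines_all_members(path, chunk_size=1 << 20):
    """Count newline bytes across every gzip member in the file."""
    total = 0
    # Python's gzip module keeps reading through concatenated members.
    with gzip.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total


def count_newlines_first_member(path, chunk_size=1 << 20):
    """Count newline bytes in the first gzip member only."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=31)
    total = 0
    with open(path, "rb") as f:
        # decompressor.eof becomes True at the end of the first member;
        # any bytes after that point are left in unused_data.
        while not decompressor.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += decompressor.decompress(chunk).count(b"\n")
    return total
```

Calling both functions on a local copy of the input file and comparing the results with the 86851 and 6569303 figures above would show whether the file has more than one member. If the two counts are equal, this explanation can be ruled out.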
RE: .gz input files having less output than uncompressed version
Posted by Malcolm Matalka <mm...@millennialmedia.com>.
This is the result of running gzip on the input files. Hadoop appears to have some support for gzipped input, for two reasons:
1) I do get some output in my results. There are 86851 lines in my output file, and they are valid results.
2) The map task output I pasted contains "org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library", which suggests it has determined which compression codec to use.
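As far as I know, Hadoop decides whether to decompress an input file by looking at its filename suffix (in the Java API this lookup is done by CompressionCodecFactory.getCodec), so a file ending in .gz would be read through the gzip codec. The Python sketch below only illustrates that idea. The table and function names are invented for the example and are not Hadoop's code.

```python
# Hypothetical suffix table, loosely modelled on suffix-based codec lookup.
CODEC_BY_SUFFIX = {
    ".gz": "GzipCodec",
    ".deflate": "DefaultCodec",
}


def codec_for(filename):
    """Return the codec name whose suffix matches filename, or None."""
    for suffix, codec in CODEC_BY_SUFFIX.items():
        if filename.endswith(suffix):
            return codec
    return None  # no match: the file is treated as uncompressed


for name in ("input.txt", "input.txt.gz"):
    print(name, "->", codec_for(name))
```

A gzip stream generally cannot be decoded starting from an arbitrary byte offset, so a file matched this way is normally read from the beginning by a single task.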
-----Original Message-----
From: tim robertson [mailto:timrobertson100@gmail.com]
Sent: Thursday, May 07, 2009 15:29
To: core-user@hadoop.apache.org
Subject: Re: .gz input files having less output than uncompressed version
Hi,
What input format are you using for the GZipped file?
I don't believe there is a GZip input format although some people have
discussed whether it is feasible...
Cheers
Tim
Re: .gz input files having less output than uncompressed version
Posted by tim robertson <ti...@gmail.com>.
Hi,
What input format are you using for the GZipped file?
I don't believe there is a GZip input format although some people have
discussed whether it is feasible...
Cheers
Tim