Posted to common-user@hadoop.apache.org by Martin Traverso <mt...@gmail.com> on 2007/10/02 22:54:53 UTC

Slow maps

Hi all,

I just got started running Hadoop, and I'm seeing extremely low map
performance.

I'm running the grep example over about 8.3 GB of data (~23 million lines),
and the map step alone takes more than 3 hours to complete. During that time,
Hadoop fully occupies two CPUs on each of the slave nodes. As a point of
comparison, processing the same files with Unix grep from the command line
takes just 10 minutes.

My setup is as follows:

  Hadoop 0.14.1, r571288
  Java 1.5.0_07
  5-node cluster (including namenode/jobtracker), each node with 4 CPU cores
  All nodes connected to a Gigabit switch

I'm using the default hadoop config plus the following overrides:

  mapred.map.tasks = 27
  mapred.reduce.tasks = 11
  mapred.child.java.opts = -Xmx512M

This is the output of 'hadoop fsck':

Status: HEALTHY
 Total size:    8934685280 B
 Total blocks:  876 (avg. block size 10199412 B)
 Total dirs:    1
 Total files:   876
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Target replication factor:     3
 Real replication factor:       3.0


It takes about 3 minutes to complete one map with the following counters:

    Map input records      53,679
    Map output records     1,973
    Map input bytes     20,118,818
    Map output bytes     608,890
    Combine input records     1,973
    Combine output records     1,971
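
To put the numbers above side by side, here is a rough back-of-the-envelope
sketch in Python. It uses only the figures quoted in this message. The
durations are the approximate ones I gave ("about 3 minutes", "more than 3h",
"10 minutes"), so the results are estimates, not measurements, and the
variable and function names are just for illustration.

```python
# Figures copied from this message; durations are approximate.
TOTAL_BYTES = 8_934_685_280    # "Total size" reported by fsck
MAP_INPUT_BYTES = 20_118_818   # "Map input bytes" for one map
MAP_INPUT_RECORDS = 53_679     # "Map input records" for one map
MAP_SECONDS = 3 * 60           # "about 3 minutes" per map
JOB_SECONDS = 3 * 60 * 60      # "more than 3h", so a lower bound
GREP_SECONDS = 10 * 60         # Unix grep over the same files


def rate_kb_per_s(num_bytes, seconds):
    """Throughput in KB/s (1 KB = 1024 bytes)."""
    return num_bytes / seconds / 1024


per_map = rate_kb_per_s(MAP_INPUT_BYTES, MAP_SECONDS)
whole_job = rate_kb_per_s(TOTAL_BYTES, JOB_SECONDS)
unix_grep = rate_kb_per_s(TOTAL_BYTES, GREP_SECONDS)

# Average line length in the sampled map, and the line count it implies
# for the whole data set (assumes the sampled map is representative).
avg_line_bytes = MAP_INPUT_BYTES / MAP_INPUT_RECORDS
est_total_lines = TOTAL_BYTES / avg_line_bytes

print(f"one map:       {per_map:10.1f} KB/s")
print(f"whole map step:{whole_job:10.1f} KB/s")
print(f"unix grep:     {unix_grep:10.1f} KB/s")
print(f"grep is about {unix_grep / whole_job:.0f}x faster than the map step")
print(f"avg line: {avg_line_bytes:.0f} bytes, "
      f"estimated lines: {est_total_lines / 1e6:.1f} million")
```

The implied line count agrees with the ~23 million lines mentioned earlier,
so the counters for this one map look representative of the data set.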


Would somebody give me a couple of pointers on where to start
troubleshooting this problem?

Thanks in advance!

Martin