Posted to mapreduce-user@hadoop.apache.org by Todd Lipcon <to...@cloudera.com> on 2009/08/14 08:52:44 UTC

Reduce input records >> Map output records

Hey all,

Has anyone seen behavior where the number of reduce input records is
significantly larger than the number of map output records? There's no
combiner involved in the job at hand, and it's not particularly large (250GB
in, about the same output). The numbers on one example job are:
2,202,290,092 map input records, 2,198,215,987 map output records,
2,200,081,377 reduce input records. The job in question had no failed tasks
and no killed speculative task attempts. Running Hadoop 0.18.3 on JVM 1.6.0u14.
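
For scale, the gap between the two counters can be worked out directly from
the figures quoted above. This is only arithmetic on those numbers; the class
and method names are made up for illustration and are not part of Hadoop.

```java
// Hypothetical helper: plain arithmetic on the counter values quoted above.
class CounterDelta {
    // Number of reduce input records that have no matching map output record.
    static long surplus(long mapOutputRecords, long reduceInputRecords) {
        return reduceInputRecords - mapOutputRecords;
    }

    public static void main(String[] args) {
        long mapOut = 2198215987L;    // map output records, from the job counters
        long reduceIn = 2200081377L;  // reduce input records, from the job counters

        long extra = surplus(mapOut, reduceIn);
        System.out.println(extra);    // prints 1865390

        // Surplus as a percentage of the map output count.
        double pct = 100.0 * extra / mapOut;
        System.out.println(String.format("%.4f%%", pct));
    }
}
```

So the reducers saw about 1.87 million more records than the maps reported
emitting, which is under 0.1% of the total.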

Anyone have any thoughts? Could a broken comparator trip up the merge in
such a way that it would invent records? I searched JIRA and svn logs but
nothing caught my eye. If no one has seen this before I'll keep digging and
certainly open a JIRA if I can find some more useful data.
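
One way to test the comparator theory outside the cluster is to check that the
comparator obeys its contract on a sample of keys. The following is a minimal
sketch, assuming the key ordering can be expressed as a plain
java.util.Comparator; the real job may use a raw byte-level comparator, and
ComparatorCheck and isConsistent are hypothetical names, not Hadoop APIs. It
only shows whether a comparator is inconsistent, not whether the merge would
actually duplicate records because of it.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-alone check; not part of Hadoop.
class ComparatorCheck {
    // Returns true if cmp is reflexive, antisymmetric and transitive
    // over every combination of the given sample keys.
    static <T> boolean isConsistent(Comparator<T> cmp, List<T> samples) {
        for (T a : samples) {
            if (cmp.compare(a, a) != 0) return false;          // reflexive
            for (T b : samples) {
                int ab = Integer.signum(cmp.compare(a, b));
                int ba = Integer.signum(cmp.compare(b, a));
                if (ab != -ba) return false;                   // antisymmetric
                for (T c : samples) {
                    int bc = Integer.signum(cmp.compare(b, c));
                    int ac = Integer.signum(cmp.compare(a, c));
                    // a > b and b > c must imply a > c
                    if (ab > 0 && bc > 0 && ac <= 0) return false;
                    // a == b must imply a and b order the same against c
                    if (ab == 0 && ac != bc) return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(Integer.MIN_VALUE, 0, 1, Integer.MAX_VALUE);

        // Classic bug: subtraction overflows for keys that are far apart.
        Comparator<Integer> broken = (a, b) -> a - b;
        Comparator<Integer> correct = Integer::compare;

        System.out.println(isConsistent(broken, keys));   // prints false
        System.out.println(isConsistent(correct, keys));  // prints true
    }
}
```

The check is cubic in the number of sample keys, so it is only practical on a
small sample that includes boundary values.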

Thanks
-Todd