Posted to mapreduce-user@hadoop.apache.org by Saptarshi Guha <sa...@gmail.com> on 2013/02/22 21:17:44 UTC

Single JVM, many tasks: How do I know when I'm on the last map task

Hello,

In my Java Hadoop job, I have set the JVM reuse variable to -1,
so a single JVM will process multiple tasks.
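For reference, in Hadoop 0.20.x the reuse setting described above is the mapred.job.reuse.jvm.num.tasks property, which the old-API JobConf exposes directly; -1 means no limit on tasks per JVM. A minimal snippet (job class name is illustrative):

```java
// assumed: old-API driver code in Hadoop 0.20.x
JobConf conf = new JobConf(MyJob.class);
conf.setNumTasksToExecutePerJvm(-1); // same as mapred.job.reuse.jvm.num.tasks = -1
```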

I have also arranged that, instead of writing to the job context, the
keys and values are accumulated in a hashtable.
When the bytes written to this table reach BUFSIZE (e.g. 150 MB),
I call my reducer (what some would call a combiner) inside the map task.

However, if BUFSIZE is never reached, my reducer is never called,
so I have to flush the table myself. I could flush it in the map
class's cleanup() method; in that case, the data would be rewritten
to the same hashtable.
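The buffer-and-flush pattern described above can be sketched independently of the Hadoop API. This is only an illustration of the idea, not the poster's actual code: the class name, the sink callback (standing in for context.write), and the per-entry byte estimate are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative, framework-free sketch of in-mapper combining:
// accumulate partial sums per key in a hashtable, and flush them to a
// sink (standing in for the job context) once an approximate size
// budget is crossed -- or explicitly, e.g. from the mapper's cleanup().
class KeyValueBuffer {
    private final long bufSize;                  // flush threshold in bytes
    private final BiConsumer<String, Long> sink; // stands in for context.write
    private final Map<String, Long> table = new HashMap<>();
    private long approxBytes = 0;

    KeyValueBuffer(long bufSize, BiConsumer<String, Long> sink) {
        this.bufSize = bufSize;
        this.sink = sink;
    }

    // Called once per map output record: combine into the table,
    // then flush if the estimated buffer size has reached the budget.
    void add(String key, long value) {
        table.merge(key, value, Long::sum);
        approxBytes += key.length() + 8;         // rough per-entry estimate
        if (approxBytes >= bufSize) {
            flush();
        }
    }

    // Emit all partial aggregates and reset. Calling this from cleanup()
    // ensures a buffer that never reached bufSize is still written out.
    void flush() {
        table.forEach(sink);
        table.clear();
        approxBytes = 0;
    }
}
```

The open question in this post is exactly when the *last* such flush should go to the real job context rather than back into a table shared across tasks.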

But at some point this hashtable must be written to the job context so
it reaches the Hadoop reduce stage. The way I see it, if I intend to
share this hashtable across map tasks (within the same JVM), I need to
know when the JVM has reached its final map task. Once that task
completes, I know I *must* flush the table to the job context.

Hopefully I've been somewhat clear. Does Hadoop 0.20.2 have an API
that tells the child JVM when it is on its last map task?

Cheers
Saptarshi