Posted to common-user@hadoop.apache.org by Alex Baranau <al...@gmail.com> on 2010/12/08 22:06:02 UTC

Making input in Map iterable

Hello,

My data processing logic takes an Iterable<Some> as input, i.e. pretty much
the same as the reducer's API. But I need to use this code in a Mapper,
where each element arrives as a separate map() method invocation.
To solve the problem (at least for now), I'm doing the following:
* run the processing code in a thread which I start in setup() and whose
completion I wait for in cleanup()
* keep a buffer which I fill with map input items (the Iterable object is
fed from this buffer for as long as it has items)
* write to the buffer until it is full, and only then switch to the thread
which does the processing.
(Assumption: the processing logic always reads data from the buffer to the
end; if processing fails, the whole job is marked as failed.)
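
A minimal sketch of the hand-off described in the steps above. It is not the
code from the linked class: it swaps the "fill the buffer, then switch"
scheme for a bounded java.util.concurrent.ArrayBlockingQueue, which blocks
the producer when the buffer is full. The class name MapInputFeeder and its
methods start(), put() and finish() are hypothetical names chosen for
illustration, and the sketch assumes input items are never null.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: exposes items pushed one at a time as an Iterable
// that is consumed by a separate processing thread.
final class MapInputFeeder<T> {
  // Sentinel marking the end of input (assumes real items are never null).
  private static final Object END = new Object();

  private final BlockingQueue<Object> buffer;
  private volatile Throwable failure;
  private Thread worker;

  MapInputFeeder(int capacity) {
    this.buffer = new ArrayBlockingQueue<>(capacity);
  }

  // Starts the processing thread; would be called from setup().
  void start(Consumer<Iterable<T>> processor) {
    Iterable<T> input = () -> new Iterator<T>() {
      private Object pending;

      @Override
      public boolean hasNext() {
        if (pending == null) {
          try {
            pending = buffer.take(); // blocks until an item or END arrives
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
          }
        }
        return pending != END;
      }

      @Override
      @SuppressWarnings("unchecked")
      public T next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        T item = (T) pending;
        pending = null;
        return item;
      }
    };
    worker = new Thread(() -> {
      try {
        processor.accept(input);
      } catch (Throwable t) {
        failure = t; // reported to the caller in put() or finish()
      }
    }, "map-input-processor");
    worker.start();
  }

  // Hands one item to the processing thread; would be called from map().
  void put(T item) throws InterruptedException {
    enqueue(item);
  }

  // Signals end of input and waits for processing; called from cleanup().
  void finish() throws InterruptedException {
    enqueue(END);
    worker.join();
    if (failure != null) {
      throw new IllegalStateException("processing failed", failure);
    }
  }

  // Waits for free space, but gives up if the processing thread has died,
  // so that a failed processor cannot block the caller forever.
  private void enqueue(Object item) throws InterruptedException {
    while (!buffer.offer(item, 100, TimeUnit.MILLISECONDS)) {
      if (!worker.isAlive()) {
        throw new IllegalStateException("processing stopped early", failure);
      }
    }
  }
}
```

In a Mapper this would be wired up by calling start() from setup(), put()
from map() and finish() from cleanup(), so that an exception thrown by the
processing code fails the task instead of being lost in the worker thread.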

I don't expect this to cause any noticeable performance degradation:
switches between threads are quite rare. The approach also looks safe to
me. Could anyone please confirm that? Or if there's a better solution,
please let me know.

Btw, you can find a rough cut of the implementation here (small class):
https://github.com/sematext/HBaseHUT/blob/master/src/main/java/com/sematext/hbase/hut/UpdatesProcessingMrJob.java
It is in a working state (the unit tests pass, at least).

Thank you in advance!

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase