Posted to common-user@hadoop.apache.org by Narinder Kumar <nk...@inphina.com> on 2010/11/30 12:06:40 UTC

Map-Reduce Applicability With All-In Memory Data

Hi All,

We have a problem at hand that we would like to solve using distributed and
parallel processing.

Brief context: we have a Map of (entity, associated value) pairs. An entity
can have a parent, which in turn has its own parent, and so on until we
reach the head. We have to traverse this tree and do some calculation at
every step. As you can see, the tree can be quite deep, and we have a huge
list of these maps to process before arriving at the final result.
Processing them sequentially takes quite a long time, so we were thinking of
using Map-Reduce to split the computation across multiple nodes in a Hadoop
cluster and then aggregate the results to get the final output.
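
To make the computation concrete, here is a tiny sequential version of what
we do today. The entity names, the parent table, and the summing are
placeholders for our real per-step calculation:

import java.util.HashMap;
import java.util.Map;

// Sequential version of the traversal described above. The data and
// the "sum the values up the chain" calculation are placeholders.
public class SequentialWalk {
    public static void main(String[] args) {
        // parent pointers: child -> parent (the head has no entry)
        Map<String, String> parentOf = new HashMap<String, String>();
        parentOf.put("leaf", "mid");
        parentOf.put("mid", "head");

        // associated value per entity
        Map<String, Long> valueOf = new HashMap<String, Long>();
        valueOf.put("leaf", 1L);
        valueOf.put("mid", 10L);
        valueOf.put("head", 100L);

        // walk from each entity up to the head, accumulating a result
        for (String entity : valueOf.keySet()) {
            long total = 0;
            for (String cur = entity; cur != null; cur = parentOf.get(cur)) {
                total += valueOf.get(cur); // placeholder for the real calculation
            }
            System.out.println(entity + " -> " + total);
        }
    }
}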

From a quick read of the documentation and the samples, I see that the
Mapper and Reducer work with implementations of InputFormat and OutputFormat
respectively. All of the implementations I found appear to be either file or
DB based. Is there an input/output format that reads and writes data
directly in memory?
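
In case it helps, this is roughly how I understand the contract a custom
source has to satisfy (a sketch only: the class names and the static array
are my own placeholders, and a static array would only work in local
single-JVM mode, since on a real cluster the map tasks run in separate
JVMs):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Sketch of the InputFormat contract: any custom source (file, DB, or
// memory) only has to answer "what are the splits?" and "how do I read
// records from one split?". The static in-memory array is a placeholder.
public class InMemoryInputFormat extends InputFormat<Text, NullWritable> {

    private static final String[] DATA = { "leaf\t1", "mid\t10" };

    public static class WholeSplit extends InputSplit implements Writable {
        public long getLength() { return DATA.length; }
        public String[] getLocations() { return new String[0]; }
        public void write(DataOutput out) throws IOException {}
        public void readFields(DataInput in) throws IOException {}
    }

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        return Collections.<InputSplit>singletonList(new WholeSplit());
    }

    @Override
    public RecordReader<Text, NullWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new RecordReader<Text, NullWritable>() {
            private int pos = -1;
            public void initialize(InputSplit s, TaskAttemptContext c) {}
            public boolean nextKeyValue() { return ++pos < DATA.length; }
            public Text getCurrentKey() { return new Text(DATA[pos]); }
            public NullWritable getCurrentValue() { return NullWritable.get(); }
            public float getProgress() { return (pos + 1) / (float) DATA.length; }
            public void close() {}
        };
    }
}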

In order to be able to use Map-Reduce, my understanding is that the steps
would be as follows (a sketch of the Mapper and Reducer follows this list):

   - Put the starting list of (entity, value) maps into one or more input files
   - Use these files as input to the Mapper, traverse up to the head of the
   tree, and do the corresponding calculation there
   - Emit the Mapper's results to output files
   - Use the Reducer to aggregate/combine these results
The potential issue I see in this approach is that I will have to do a
round trip, taking the data out of memory into files and then loading it
back into memory, which might cost me the performance I am trying to gain.
Is this roughly the correct approach for using Map-Reduce in my context, or
am I missing the point completely?
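
For step 1, the round trip I mean would look something like this: dumping
the in-memory map into a SequenceFile that the job can then read back (the
path and entries are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Dump the in-memory (entity, value) map to a SequenceFile so the
// Map-Reduce job can read it as its input.
public class DumpToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/tmp/entities.seq");
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, LongWritable.class);
        try {
            writer.append(new Text("leaf"), new LongWritable(1));
            writer.append(new Text("mid"), new LongWritable(10));
        } finally {
            writer.close();
        }
    }
}

If the job read this file via SequenceFileInputFormat, the Mapper's input
types would of course be Text and LongWritable rather than the text lines
sketched above.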

Further, I would like to know whether Map-Reduce is an appropriate platform
for this kind of scenario, or whether we should think of it only for huge
DB- or file-based data sets?

Best Regards
Narinder