Posted to common-dev@hadoop.apache.org by Anis Ahmed <an...@gmail.com> on 2007/01/24 15:20:28 UTC

Question on static chunking.

Hi,

I need to solve the following problem in the MapReduce paradigm and am
looking for advice.

I have a million entries in a file, one per line, which is my input.
I have a series of Hadoop jobs that work on these entries one entry at a
time, but for one specific Hadoop job I need to look at 50 entries at a
time, analyze them, and then apply some business logic. My problem has been
accessing exactly 50 entries in one go.

The options I am considering are:

1. Do the processing as part of REDUCE. I will ensure that I use the same
intermediate key for a batch of 50 entries inside MAP (keep a counter and
change the intermediate key after every 50 entries), so that REDUCE gets an
iterator over 50 values.

2. The option above involves a lot of I/O, sorting, etc. So instead:
inside MAP, create an in-memory pool (initialized in configure()) and,
once 50 entries are reached, run the business logic and clear the pool.
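
A minimal sketch of option 2, assuming the old org.apache.hadoop.mapred
API; the class name and the processBatch() placeholder are illustrative,
not part of Hadoop:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Option 2: buffer lines inside the mapper and run the business logic
// once 50 have accumulated, with no extra shuffle or sort.
public class PoolingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, NullWritable> {

  private static final int BATCH_SIZE = 50;
  private List<String> pool;                     // the in-memory pool

  public void configure(JobConf job) {
    pool = new ArrayList<String>(BATCH_SIZE);    // initialized in configure()
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, NullWritable> output, Reporter reporter)
      throws IOException {
    pool.add(line.toString());
    if (pool.size() == BATCH_SIZE) {
      processBatch(pool, output);                // business logic on exactly 50
      pool.clear();
    }
  }

  // Placeholder for the real analysis; here it just emits one line per batch.
  private void processBatch(List<String> batch,
                            OutputCollector<Text, NullWritable> output)
      throws IOException {
    output.collect(new Text("processed " + batch.size() + " entries"),
                   NullWritable.get());
  }
}

One caveat with this approach: a split rarely ends on a multiple of 50, so
any leftover entries still need to be flushed, e.g. by keeping a reference
to the OutputCollector in a field and processing the remainder in close().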

I was looking to see if there is a better way to statically group entries
by a predetermined number and process them in Hadoop.

thanks
Anis

Re: Question on static chunking.

Posted by Doug Cutting <cu...@apache.org>.
[ This question is probably more appropriate for hadoop-user. ]

Anis Ahmed wrote:
> 1. Do the processing as part of REDUCE. I will ensure that I use the same
> intermediate key for a batch of 50 entries inside MAP (keep a counter and
> change the intermediate key after every 50 entries), so that REDUCE gets
> an iterator over 50 values.

This is probably the simplest approach.  Has it proven too slow?
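
A minimal sketch of that keying scheme, assuming the old
org.apache.hadoop.mapred API; class and field names are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Option 1: give every run of 50 input lines the same intermediate key,
// so each reduce() call iterates over one batch of 50 values.
public class BatchKeyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  private int seen = 0;     // lines this map task has processed so far
  private int batchId = 0;  // current batch number, used as the key

  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, Text> output, Reporter reporter)
      throws IOException {
    output.collect(new IntWritable(batchId), line);
    if (++seen % 50 == 0) {
      batchId++;            // the next 50 lines go to a new reduce group
    }
  }
}

If the job runs more than one map task, the key should also encode the
task's identity (taken from the JobConf in configure()), otherwise batches
from different splits would collapse into the same reduce group.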

> 2. The option above involves a lot of I/O, sorting, etc. So instead:
> inside MAP, create an in-memory pool (initialized in configure()) and,
> once 50 entries are reached, run the business logic and clear the pool.

Alternately you could define an InputFormat that reads 50 lines at a 
time instead of a single line.
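
A rough sketch of that idea, again assuming the old
org.apache.hadoop.mapred API; the MultiLineInputFormat name is
illustrative. It wraps the standard LineRecordReader and concatenates up
to 50 lines into each record:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// An InputFormat whose records are batches of up to 50 lines, so the
// mapper sees one whole batch per call to map().
public class MultiLineInputFormat extends FileInputFormat<LongWritable, Text> {

  private static final int LINES_PER_RECORD = 50;

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    reporter.setStatus(split.toString());
    return new MultiLineRecordReader(job, (FileSplit) split);
  }

  static class MultiLineRecordReader implements RecordReader<LongWritable, Text> {
    private final LineRecordReader in;
    private final LongWritable lineKey = new LongWritable();
    private final Text lineValue = new Text();

    MultiLineRecordReader(JobConf job, FileSplit split) throws IOException {
      in = new LineRecordReader(job, split);
    }

    public LongWritable createKey() { return new LongWritable(); }
    public Text createValue() { return new Text(); }

    // Read up to 50 underlying lines and join them with '\n' into one value.
    public boolean next(LongWritable key, Text value) throws IOException {
      StringBuilder batch = new StringBuilder();
      int lines = 0;
      while (lines < LINES_PER_RECORD && in.next(lineKey, lineValue)) {
        if (lines == 0) {
          key.set(lineKey.get());   // offset of the first line in the batch
        } else {
          batch.append('\n');
        }
        batch.append(lineValue.toString());
        lines++;
      }
      if (lines == 0) {
        return false;               // end of the split
      }
      value.set(batch.toString());
      return true;
    }

    public long getPos() throws IOException { return in.getPos(); }
    public float getProgress() throws IOException { return in.getProgress(); }
    public void close() throws IOException { in.close(); }
  }
}

Batches will not line up across split boundaries, and the last batch in a
split may hold fewer than 50 lines, so the map logic should tolerate short
batches.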

Doug