Posted to mapreduce-user@hadoop.apache.org by "George P. Stathis" <gs...@traackr.com> on 2010/09/27 20:27:47 UTC

Limiting the number of data records processed per reduce process

Possible beginner's question here, but I can't find an obvious answer in the
docs. Is there a way to configure a job so that it imposes a cap on the
number of records each reduce process receives at a time, regardless of how
the data was partitioned or how many reducers were configured for the job?
The limitation is that the number of records to be processed is not known
ahead of time, so the number of reducers can't be set manually up front. The
obvious workaround is to do a first pass that counts the records and then a
second one that sets the number of reducers needed to match the target
maximum records per reducer. But I'm hoping for a more elegant alternative.
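
To make the workaround concrete, here is a rough, untested sketch of what I
mean, assuming the new org.apache.hadoop.mapreduce API. The class names
(TwoPassDriver, CountMapper) and the MAX_RECORDS_PER_REDUCER constant are
just illustrative, not real Hadoop settings, and it assumes the default
TextInputFormat-style input:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoPassDriver {

    // Custom counter bumped once per input record by the counting mapper.
    public static enum RecordCounter { TOTAL }

    // Illustrative cap; not a real Hadoop configuration property.
    private static final long MAX_RECORDS_PER_REDUCER = 1000000L;

    // Pass 1 mapper: tallies records and emits nothing.
    public static class CountMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.getCounter(RecordCounter.TOTAL).increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Pass 1: a map-only job whose only effect is the record counter.
        Job countJob = new Job(conf, "count records");
        countJob.setJarByClass(TwoPassDriver.class);
        countJob.setMapperClass(CountMapper.class);
        countJob.setNumReduceTasks(0); // map-only
        FileInputFormat.addInputPath(countJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(countJob, new Path(args[1] + "/count"));
        if (!countJob.waitForCompletion(true)) {
            System.exit(1);
        }
        long total = countJob.getCounters()
                .findCounter(RecordCounter.TOTAL).getValue();

        // Size the reducer count so each reducer sees at most the target
        // number of records on average; key skew can still exceed the cap.
        int numReducers = (int) Math.max(1L,
                (total + MAX_RECORDS_PER_REDUCER - 1) / MAX_RECORDS_PER_REDUCER);

        // Pass 2: the real job, with the computed reducer count. The real
        // mapper/reducer classes would be set here; defaults are identity.
        Job realJob = new Job(conf, "real work");
        realJob.setJarByClass(TwoPassDriver.class);
        realJob.setNumReduceTasks(numReducers);
        FileInputFormat.addInputPath(realJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(realJob, new Path(args[1] + "/out"));
        System.exit(realJob.waitForCompletion(true) ? 0 : 1);
    }
}

Reading the total from a job counter avoids writing and re-reading any count
output, but the reducer count only caps the average per reducer; a badly
skewed key distribution could still overload one reducer. Hence my hope for
something built in.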

Thank you in advance for your time.

-GS