Posted to common-user@hadoop.apache.org by Shi Yu <sh...@uchicago.edu> on 2012/05/15 03:33:15 UTC
Random Sample in Map/Reduce
Hi,
Before raising this question I searched the relevant topics.
One suggestion that comes up online is:
"Mappers: Output all qualifying values, each with a random
integer key.
Single reducer: Output the first N values, throwing away the
keys."
However, that scheme seems inefficient when the data set is
very large, for example when sampling 100 records out of one
billion. Things are especially bad when the Map task is
computationally demanding. I tried to write a program that
does the sampling in the Mappers, but I ended up storing
everything in memory and doing the final sampling in the
Mapper.cleanup() stage. That still does not seem like a
graceful way to do it, because it requires a lot of memory.
Maybe a better way is to control the random sample at the
file.split() stage; is there any good existing approach?
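For what it's worth, the in-mapper approach described above
can be kept to a fixed memory footprint with reservoir
sampling (Algorithm R), which holds only an N-element sample
while streaming. A minimal sketch, with illustrative record
types and per-mapper sample size:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.Random;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Keeps at most N records in memory per mapper and emits
  // them in cleanup(). A downstream step can merge the
  // per-mapper samples (weighting by records seen per mapper
  // if the splits differ in size).
  public class ReservoirSampleMapper
      extends Mapper<Object, Text, NullWritable, Text> {
    private static final int N = 100;  // sample size per mapper
    private final List<Text> reservoir = new ArrayList<Text>(N);
    private final Random rnd = new Random();
    private long seen = 0;

    @Override
    protected void map(Object offset, Text value, Context context) {
      seen++;
      if (reservoir.size() < N) {
        reservoir.add(new Text(value));  // copy: Hadoop reuses value
      } else {
        // Replace a random slot with probability N / seen.
        long j = (long) (rnd.nextDouble() * seen);
        if (j < N) {
          reservoir.set((int) j, new Text(value));
        }
      }
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      for (Text t : reservoir) {
        context.write(NullWritable.get(), t);
      }
    }
  }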
Best,
Shi
Re: Random Sample in Map/Reduce
Posted by Shi Yu <sh...@uchicago.edu>.
To answer my own question: I applied a non-repeating random
number generator in the mapper. At the mapper setup stage I
generate a pre-defined number of distinct random numbers, and
I keep a counter running through the mapper. When the counter
is contained in the random number set, the Mapper executes
and outputs data. The problem now becomes how to choose the
ceiling of the random range [1...ceiling]. The ceiling cannot
be too small, or the sampling is not valid, and it also
cannot exceed the total number of data records contained in
each split. The problem is that my data is not divided by
line; sometimes a complete data record is composed of
multiple lines, so I am not sure how to estimate that ceiling
... Of course, if each line were a complete record, the
ceiling would be easy to obtain.
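In code, that scheme might look roughly like the sketch below
(SAMPLES and CEILING are placeholder values; as noted above,
estimating CEILING per split is the open problem when records
span multiple lines):

  import java.io.IOException;
  import java.util.HashSet;
  import java.util.Random;
  import java.util.Set;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Pick SAMPLES distinct random record indices in
  // [1..CEILING] at setup, then emit a record only when the
  // running counter hits one of them.
  public class IndexSampleMapper
      extends Mapper<Object, Text, NullWritable, Text> {
    private static final int SAMPLES = 100;       // records per split
    private static final long CEILING = 1000000L; // estimated split size
    private final Set<Long> picks = new HashSet<Long>();
    private long counter = 0;

    @Override
    protected void setup(Context context) {
      // Drawing into a Set guarantees the indices are
      // distinct, i.e. the generator is non-repeating.
      Random rnd = new Random();
      while (picks.size() < SAMPLES) {
        picks.add(1 + (long) (rnd.nextDouble() * CEILING));
      }
    }

    @Override
    protected void map(Object offset, Text value, Context context)
        throws IOException, InterruptedException {
      counter++;
      if (picks.contains(counter)) {
        context.write(NullWritable.get(), value);
      }
    }
  }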