Posted to mapreduce-user@hadoop.apache.org by Jeff Zhang <zj...@gmail.com> on 2011/06/27 09:11:23 UTC
How to select random n records using mapreduce ?
Hi all,
I'd like to select N random records from a large amount of data using
Hadoop; I just wonder how I can achieve this. Currently my idea is to let
each mapper task select N / mapper_number records. Does anyone have such
experience?
--
Best Regards
Jeff Zhang
Re: How to select random n records using mapreduce ?
Posted by David Rosenstrauch <da...@darose.net>.
Building on this, you could do something like the following to make it
more random:
if (numRecordsWritten < NUM_RECORDS_DESIRED) {
    int n = generateARandomNumberBetween1and100();
    if (n == 100) {
        context.write(key, value);
        numRecordsWritten++;  // count the emitted record so the cap is reached
    }
}
The above would emit, on average, about 1 record out of every 100, up
to the specified maximum desired, and discard all the rest.
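If you want exactly N records with each input record equally likely to be
chosen, the standard single-pass technique is reservoir sampling (not
mentioned in this thread; a minimal sketch in plain Java, class and method
names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative reservoir sampler: keeps a uniform random sample of
// size n from a stream whose length is not known in advance.
public class ReservoirSampler<T> {
    private final int n;
    private final List<T> reservoir;
    private final Random rng;
    private long seen = 0;

    public ReservoirSampler(int n, long seed) {
        this.n = n;
        this.reservoir = new ArrayList<>(n);
        this.rng = new Random(seed);
    }

    public void offer(T item) {
        seen++;
        if (reservoir.size() < n) {
            reservoir.add(item);  // fill the reservoir with the first n items
        } else {
            // keep the new item with probability n / seen, evicting a
            // uniformly chosen current resident
            long j = (long) (rng.nextDouble() * seen);
            if (j < n) {
                reservoir.set((int) j, item);
            }
        }
    }

    public List<T> sample() {
        return reservoir;
    }
}
```

Each mapper could keep such a reservoir and emit it in cleanup(), leaving a
single reducer to combine the per-mapper samples.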
HTH,
DR
On 06/27/2011 03:28 PM, Niels Basjes wrote:
> The only solution I can think of is by creating a counter in Hadoop
> that is incremented each time a mapper lets a record through.
> As soon as the value reaches a preselected value the mappers simply
> discard the additional input they receive.
>
> Note that this will not at all be random.... yet it's the best I can
> come up with right now.
>
> HTH
>
Re: How to select random n records using mapreduce ?
Posted by Niels Basjes <Ni...@basjes.nl>.
The only solution I can think of is by creating a counter in Hadoop
that is incremented each time a mapper lets a record through.
As soon as the value reaches a preselected value the mappers simply
discard the additional input they receive.
Note that this will not at all be random.... yet it's the best I can
come up with right now.
HTH
--
Best regards / Met vriendelijke groeten,
Niels Basjes
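The capped pass-through idea Niels describes can be sketched as plain Java
(illustrative only, not the Hadoop Mapper API; in a real job each task would
keep its own counter):

```java
// Pass records through until a preselected cap is reached, then
// discard everything else. As Niels notes, this takes the first
// records seen, so the selection is not random.
public class CappedEmitter<T> {
    private final int cap;
    private int emitted = 0;

    public CappedEmitter(int cap) {
        this.cap = cap;
    }

    // Returns true if the record should be emitted, false once the
    // cap has been reached.
    public boolean offer(T record) {
        if (emitted < cap) {
            emitted++;
            return true;
        }
        return false;  // cap reached: discard
    }
}
```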
Re: How to select random n records using mapreduce ?
Posted by Anthony Urso <an...@cs.ucla.edu>.
On Mon, Jun 27, 2011 at 12:11 AM, Jeff Zhang <zj...@gmail.com> wrote:
>
> Hi all,
> I'd like to select N random records from a large amount of data using
> Hadoop; I just wonder how I can achieve this. Currently my idea is to let
> each mapper task select N / mapper_number records. Does anyone have such experience?
I've done this before, and it will work fine as long as all of your
splits have identical numbers of records.
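When splits do not have identical record counts, one common pattern (an
assumption on my part, not from this thread) is to tag every record with a
uniform random key and keep the N records with the smallest keys; the result
is uniform regardless of how the input was split. A minimal non-Hadoop
sketch of the keep-smallest-N step:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

public class RandomKeySample {
    // A record paired with its random sort key.
    private static final class Tagged<T> {
        final double key;
        final T rec;
        Tagged(double key, T rec) {
            this.key = key;
            this.rec = rec;
        }
    }

    // Tag each record with a uniform random key and keep the n records
    // with the smallest keys; every size-n subset is equally likely,
    // no matter how the input is split across tasks.
    public static <T> List<T> sample(Iterable<T> records, int n, long seed) {
        Random rng = new Random(seed);
        // Max-heap on the key: the head is the worst current candidate,
        // so it can be evicted cheaply when a smaller key arrives.
        PriorityQueue<Tagged<T>> heap =
            new PriorityQueue<>((a, b) -> Double.compare(b.key, a.key));
        for (T rec : records) {
            double key = rng.nextDouble();
            if (heap.size() < n) {
                heap.add(new Tagged<>(key, rec));
            } else if (key < heap.peek().key) {
                heap.poll();
                heap.add(new Tagged<>(key, rec));
            }
        }
        List<T> out = new ArrayList<>();
        for (Tagged<T> t : heap) {
            out.add(t.rec);
        }
        return out;
    }
}
```

In MapReduce terms, the mappers would emit each record under its random key
and a single reducer would keep the first N it receives.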