Posted to common-issues@hadoop.apache.org by Rakesh Davanum <ra...@gmail.com> on 2011/01/12 19:03:56 UTC

Restricting number of records from map output

Hi,

I have a sort job consisting of only the Mapper (no Reducer) task. I want my
results to contain only the top n records. Is there any way of restricting
the number of records that are emitted by the Mappers?

Basically I am looking to see if there is an equivalent of achieving
the behavior similar to LIMIT in SQL queries.

Thanks & Regards,
Rakesh

Re: Restricting number of records from map output

Posted by Alex Kozlov <al...@cloudera.com>.
Hi Rakesh, what do you mean by the top N? The first N records, or do you
need to sort them in memory first? You can always output records in the
cleanup() method at the end of the mapper run.
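
To illustrate the cleanup() idea, here is a minimal plain-Java sketch of buffering the top N values and emitting them only once the task is done. It is not Hadoop code: the class TopNBuffer and its map()/cleanup() methods are hypothetical stand-ins for the per-record and end-of-task hooks of a Mapper, and it assumes records can be ranked by a single long value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for a mapper: map() runs once per record,
// cleanup() runs once after the last record of the task.
class TopNBuffer {
    private final int limit;
    // Min-heap: the smallest retained value is at the head and is evicted first.
    private final PriorityQueue<Long> heap = new PriorityQueue<>();

    TopNBuffer(int limit) {
        this.limit = limit;
    }

    // Per-record hook: buffer the value, emit nothing yet.
    void map(long value) {
        heap.offer(value);
        if (heap.size() > limit) {
            heap.poll(); // drop the current smallest so at most 'limit' values remain
        }
    }

    // End-of-task hook: return the retained values, largest first.
    List<Long> cleanup() {
        List<Long> out = new ArrayList<>(heap);
        out.sort(Collections.reverseOrder());
        return out;
    }
}
```

Memory stays bounded by N rather than by the input size. With several mappers, each one would produce its own top N, so the outputs would still need to be merged to get a global result.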

On Fri, Jan 14, 2011 at 7:05 AM, Hari Sreekumar <hs...@clickable.com> wrote:

> Ideally, mappers should be independent of other mappers. Still, you can use
> counters and start skipping records when counter>some value to achieve
> similar behavior. It will not be very reliable if you want very exact
> results though.
>
> On Thu, Jan 13, 2011 at 12:43 AM, Anthony Urso <an...@cs.ucla.edu> wrote:
>
> > Either use an instance variable or a Combiner.  The latter is correct
> > if you want the top-n per key from the mapper.
> >
> > On Wed, Jan 12, 2011 at 10:03 AM, Rakesh Davanum <ra...@gmail.com> wrote:
> > > Hi,
> > >
> > > I have a sort job consisting of only the Mapper (no Reducer) task. I want my
> > > results to contain only the top n records. Is there any way of restricting
> > > the number of records that are emitted by the Mappers?
> > >
> > > Basically I am looking to see if there is an equivalent of achieving
> > > the behavior similar to LIMIT in SQL queries.
> > >
> > > Thanks & Regards,
> > > Rakesh
> > >
> >
>

Re: Restricting number of records from map output

Posted by Hari Sreekumar <hs...@clickable.com>.
Ideally, mappers should be independent of other mappers. Still, you can use
a counter and start skipping records once it exceeds some value to achieve
similar behavior. It will not be very reliable if you need exact results,
though.
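
As a rough sketch of the skip-after-a-threshold idea, the plain-Java class below keeps a count in an instance variable and ignores records once the limit is reached. It is not Hadoop code: LimitingMapper, map() and output() are hypothetical names, and the count here is local to one task rather than a job-wide counter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a mapper that emits at most 'limit' records.
class LimitingMapper {
    private final int limit;
    private int emitted = 0; // instance variable, local to this task only
    private final List<String> output = new ArrayList<>();

    LimitingMapper(int limit) {
        this.limit = limit;
    }

    // Per-record hook: emit until the limit is hit, then skip everything else.
    void map(String record) {
        if (emitted >= limit) {
            return; // limit reached, skip this record
        }
        output.add(record);
        emitted++;
    }

    // Records this task emitted, in arrival order.
    List<String> output() {
        return output;
    }
}
```

Because the count is per task, a job with M mappers could still emit up to M times the limit, which is one reason the result is not exact.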

On Thu, Jan 13, 2011 at 12:43 AM, Anthony Urso <an...@cs.ucla.edu> wrote:

> Either use an instance variable or a Combiner.  The latter is correct
> if you want the top-n per key from the mapper.
>
> On Wed, Jan 12, 2011 at 10:03 AM, Rakesh Davanum <ra...@gmail.com> wrote:
> > Hi,
> >
> > I have a sort job consisting of only the Mapper (no Reducer) task. I want my
> > results to contain only the top n records. Is there any way of restricting
> > the number of records that are emitted by the Mappers?
> >
> > Basically I am looking to see if there is an equivalent of achieving
> > the behavior similar to LIMIT in SQL queries.
> >
> > Thanks & Regards,
> > Rakesh
> >
>

Re: Restricting number of records from map output

Posted by Anthony Urso <an...@cs.ucla.edu>.
Either use an instance variable or a Combiner.  The latter is correct
if you want the top-n per key from the mapper.

On Wed, Jan 12, 2011 at 10:03 AM, Rakesh Davanum <ra...@gmail.com> wrote:
> Hi,
>
> I have a sort job consisting of only the Mapper (no Reducer) task. I want my
> results to contain only the top n records. Is there any way of restricting
> the number of records that are emitted by the Mappers?
>
> Basically I am looking to see if there is an equivalent of achieving
> the behavior similar to LIMIT in SQL queries.
>
> Thanks & Regards,
> Rakesh
>

Re: Restricting number of records from map output

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

> I have a sort job consisting of only the Mapper (no Reducer) task. I want my
> results to contain only the top n records. Is there any way of restricting
> the number of records that are emitted by the Mappers?
>
> Basically I am looking to see if there is an equivalent of achieving
> the behavior similar to LIMIT in SQL queries.

I think I understand your goal. However, the question is aimed at what I
think is the wrong solution.

A mapper gets 1 record as input and only knows about that one record.
There is no way to limit there.

If you implement a simple reducer, you can very easily let it stop
reading the input iterator after N records and limit the output in
that way.

Doing it in the reducer also allows you to easily add a concept of
"Top N" by using the "Secondary Sort" trick to sort the input before
it arrives at the reducer.
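
A minimal plain-Java sketch of the "stop reading the iterator after N records" step follows. It is not Hadoop code: LimitingReducer and its static reduce() method are hypothetical, and the sketch assumes the values already arrive in the desired order (for example via the secondary sort mentioned above) and that a single reducer sees all of them.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a reducer that keeps only the first 'limit' values.
class LimitingReducer {
    static <T> List<T> reduce(Iterator<T> values, int limit) {
        List<T> out = new ArrayList<>();
        // Stop pulling from the iterator as soon as 'limit' values are collected.
        while (out.size() < limit && values.hasNext()) {
            out.add(values.next());
        }
        return out;
    }
}
```

The remaining values are never read. In a real reducer, reduce() is called once per key, so a limit across keys would also need a count kept between calls.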

HTH

Niels Basjes
