You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Jimmy Wan <ji...@indeed.com> on 2008/03/06 19:40:14 UTC

Equivalent of cmdline head or tail?

I've got some jobs where I'd like to just pull out the top N or bottom N  
values.

It seems like I can't do this from the map or combine phases (due to not  
having enough data), but I could aggregate this data during the reduce  
phase. The problem I have is that I won't know when to actually write them  
out until I've gone through the entire set, at which point reduce isn't  
called anymore.

It's easy enough to post-process with some combination of sort, head, and  
tail, but I was wondering if I was missing something obvious.

-- 
Jimmy

RE: Equivalent of cmdline head or tail?

Posted by Jeff Eastman <je...@collab.net>.
I think the accepted pattern for this is to accumulate your top N and
bottom N values while you reduce and then output them in the close()
call. The files from your config can be obtained during the configure()
call.

Jeff

-----Original Message-----
From: Jimmy Wan [mailto:jimmy@indeed.com] 
Sent: Thursday, March 06, 2008 10:40 AM
To: core-user@hadoop.apache.org
Subject: Equivalent of cmdline head or tail?

I've got some jobs where I'd like to just pull out the top N or bottom N

values.

It seems like I can't do this from the map or combine phases (due to not

having enough data), but I could aggregate this data during the reduce  
phase. The problem I have is that I won't know when to actually write
them  
out until I've gone through the entire set, at which point reduce isn't

called anymore.

It's easy enough to post-process with some combination of sort, head,
and  
tail, but I was wondering if I was missing something obvious.

-- 
Jimmy

Re: Equivalent of cmdline head or tail?

Posted by Ted Dunning <td...@veoh.com>.
I thought so as well until I reflected for a moment.

But if you include the top N from every combiner, then you are guaranteed to
have the global top N in the output of all of the combiners.


On 3/6/08 11:50 PM, "Owen O'Malley" <oo...@yahoo-inc.com> wrote:

> 
> On Mar 6, 2008, at 5:02 PM, Ted Dunning wrote:
> 
>>  I don't know if the combiner sees things in
>> order.  IF it does, then you can prune on both levels to minimize data
>> transfer.
> 
> The input to the combiners is sorted. However, when filtering to the
> top N, you need to be careful to include enough that the partial view
> doesn't distort the global view.
> 
> -- O


Re: Equivalent of cmdline head or tail?

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Mar 6, 2008, at 5:02 PM, Ted Dunning wrote:

>  I don't know if the combiner sees things in
> order.  IF it does, then you can prune on both levels to minimize data
> transfer.

The input to the combiners is sorted. However, when filtering to the  
top N, you need to be careful to include enough that the partial view  
doesn't distort the global view.

-- O

Re: Equivalent of cmdline head or tail?

Posted by Ted Dunning <td...@veoh.com>.

First, there is a close method on reducers.  That means that combiners and
reducers can do first N values pretty easily.

Secondly, you can define sort orders so that the reduce can just process the
first N items and then quit.  I don't know if the combiner sees things in
order.  IF it does, then you can prune on both levels to minimize data
transfer.


On 3/6/08 10:40 AM, "Jimmy Wan" <ji...@indeed.com> wrote:

> I've got some jobs where I'd like to just pull out the top N or bottom N
> values.
> 
> It seems like I can't do this from the map or combine phases (due to not
> having enough data), but I could aggregate this data during the reduce
> phase. The problem I have is that I won't know when to actually write them
> out until I've gone through the entire set, at which point reduce isn't
> called anymore.
> 
> It's easy enough to post-process with some combination of sort, head, and
> tail, but I was wondering if I was missing something obvious.