Posted to common-user@hadoop.apache.org by Martin Nilsson <ni...@pike.ida.liu.se> on 2008/02/12 21:53:08 UTC

Binary tree reduction

Hello,

I'm looking at a problem where we need to determine the number of unique 
entities for a specific period of time. As an example, consider that we 
log all outgoing URLs in a set of proxies. We can easily create a mapper 
that turns every log file or slice thereof into a sorted list of URLs. 
Unix sort scales very well, but we are only operating on output from at 
most one log file for a day anyway.
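
Roughly, the per-file step I have in mind looks like this (plain Java
instead of Unix sort, and assuming just for illustration that the URL is
the last whitespace-separated field of each log line):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.TreeSet;

// Turns one log file (or slice) into a sorted, duplicate-free list of URLs.
public class UrlsOfSlice {
  public static void main(String[] args) throws IOException {
    TreeSet<String> urls = new TreeSet<String>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    for (String line = in.readLine(); line != null; line = in.readLine()) {
      String[] fields = line.trim().split("\\s+");
      if (fields.length > 0) {
        urls.add(fields[fields.length - 1]);  // TreeSet keeps them sorted and unique
      }
    }
    in.close();
    PrintWriter out = new PrintWriter(args[1]);
    for (String url : urls) {
      out.println(url);
    }
    out.close();
  }
}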

Now for reduction I would like to take two (or more?) of these files, 
each simply containing a line-based list of sorted URLs, and merge them 
into a single file, removing any duplicates. This is a fast operation 
and takes constant memory, but it requires that the complete files are 
operated on by the same reducer. Also, the key-value paradigm doesn't 
really apply here.
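
In code, the merge step I mean is roughly this kind of thing (again
plain Java, just to illustrate the constant-memory part):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// Merges two already-sorted URL lists into one sorted list without
// duplicates, holding only the current line of each input in memory.
public class MergeSorted {
  public static void main(String[] args) throws IOException {
    BufferedReader a = new BufferedReader(new FileReader(args[0]));
    BufferedReader b = new BufferedReader(new FileReader(args[1]));
    PrintWriter out = new PrintWriter(args[2]);
    String x = a.readLine(), y = b.readLine(), last = null;
    while (x != null || y != null) {
      String next;
      if (y == null || (x != null && x.compareTo(y) <= 0)) {
        next = x; x = a.readLine();
      } else {
        next = y; y = b.readLine();
      }
      if (!next.equals(last)) {  // drop duplicates within and across files
        out.println(next);
        last = next;
      }
    }
    a.close(); b.close(); out.close();
  }
}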

The end product would be a big file with all unique URLs for that day. 
When URLs for e.g. a week or a month are available, those should be 
merged into aggregates. I'm really only interested in the final row 
count, but I need to keep all the URLs to be able to combine the 
statistics correctly, since unique counts from different days overlap 
and can't simply be added.

Is what I've described readily available within Hadoop (I did some 
looking but didn't find anything)? If not, do you have any pointers for 
how to achieve this type of processing?

/Martin Nilsson

Re: Binary tree reduction

Posted by Ted Dunning <td...@veoh.com>.

You don't need to use a single reducer for this.

Just make URLs the key for the reduce and carry any extra data (such as
counts) along for the ride.  Use combiners to knock down the data volume
a bit if you can (that won't help when merging two files that are
already uniqued, but it does help if you are doing something like
counting).

Each reducer would be given all of the records (1 or 2 of them when
merging two days) for a given URL, and it would simply output a single
record for that key.  You can have multiple reducers because the
partitioning of keys ensures that records with the same key won't go to
different reducers.
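
A rough sketch of that job using the standard word-count pattern (class
names and the log format below are just placeholders; it assumes the URL
is the last whitespace-separated field of each input line):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class UniqueUrls {

  // Emits (url, 1) for every log line; the URL is assumed to be the last field.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text url = new Text();

    public void map(LongWritable key, Text line,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      String[] fields = line.toString().trim().split("\\s+");
      if (fields.length > 0) {
        url.set(fields[fields.length - 1]);
        out.collect(url, ONE);
      }
    }
  }

  // Used as both combiner and reducer: one output record per unique URL,
  // with the number of times it was seen carried along for the ride.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text url, Iterator<LongWritable> counts,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      out.collect(url, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(UniqueUrls.class);
    conf.setJobName("unique-urls");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);  // knocks the data volume down before the shuffle
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

The reduce output then has exactly one record per unique URL, so the
count you are after is just the number of output lines, and the same job
can be pointed at several days of output to build the weekly or monthly
aggregates.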


On 2/12/08 12:53 PM, "Martin Nilsson" <ni...@pike.ida.liu.se> wrote:

> 
> Hello,
> 
> I'm looking at a problem where we need to determine the number of unique
> entities for a specific period of time. As an example, consider that we
> log all outgoing URLs in a set of proxies. We can easily create a mapper
> that turns every log file or slice thereof into a sorted list of URLs.
> Unix sort scales very well, but we are only operating on output from at
> most one log file for a day anyway.
> 
> Now for reduction I would like to take two (or more?) of these files,
> simply containing a line based list of sorted URLs, and merge them into
> a single file, removing any duplicates. This is a fast operation and
> takes constant memory, but requires that the complete files are operated
> on by the same reducer. Also the key-value paradigm doesn't apply.
> 
> The end product would be a big file with URLs for that day. When URLs
> for e.g. a week or a month are available, those should be merged into
> aggregates. I'm really only interested in the final row count, but I
> need to keep all the URLs to be able to add the statistics properly.
> 
> Is what I've described readily available within Hadoop (I did some
> looking but didn't find anything)? If not, do you have any pointers for
> how to achieve this type of processing?
> 
> /Martin Nilsson