Posted to common-user@hadoop.apache.org by "Natarajan, Senthil" <se...@pitt.edu> on 2008/04/08 17:37:08 UTC

Reduce Sort

Hi,
I am new to MapReduce.

After slightly modifying the example wordcount to count IP addresses,
I have two files, part-00000 and part-00001, with contents something like:

IP Addr      Count
1.2.5.42     27
2.8.6.6      24
7.9.24.13    8
7.9.6.9      201

I want to sort by count in descending order, i.e. I would expect to see:

7.9.6.9      201
1.2.5.42     27
2.8.6.6      24
7.9.24.13    8

Could you please suggest how to do this?
Also, to merge both partitions (part-00000 and part-00001) into one output file, are there any functions already available in the MapReduce framework, or do we need to use Java I/O?
Thanks,
Senthil

Re: Reduce Sort

Posted by Ted Dunning <td...@veoh.com>.


On 4/8/08 10:43 AM, "Natarajan, Senthil" <se...@pitt.edu> wrote:

> I would like to try using Hadoop.

That is good for education, probably bad for run time.  It could take
SECONDS longer to run (oh my).

> Do you mean to write another MapReduce program which takes the output of the
> first MapReduce (the already existing file of this format)?

Yes.

> And use the count as the key and the IP address as the value.

Yes.

> Is it possible to do this in the same program instead of writing another one?

No.

> If it is not possible, is there something in Hadoop so that, once the first
> program is done, I can call a second program to do the sorting?

Yes.  If you are using Java, just create a second configuration and do the
same thing as you did the first time to run the program.
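
Something like the following, roughly (a sketch only -- the mapper/reducer
class names and the paths here are placeholders for whatever your code
actually uses, not anything from your program):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IpCountDriver {
  public static void main(String[] args) throws Exception {
    // First job: count IP addresses (your modified wordcount).
    JobConf count = new JobConf(IpCountDriver.class);
    count.setJobName("ip-count");
    count.setMapperClass(CountMapper.class);        // hypothetical class
    count.setReducerClass(CountReducer.class);      // hypothetical class
    count.setOutputKeyClass(Text.class);
    count.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(count, new Path("logs"));
    FileOutputFormat.setOutputPath(count, new Path("counts"));
    JobClient.runJob(count);   // blocks until the first job finishes

    // Second job: read the counts back in and sort them.
    JobConf sort = new JobConf(IpCountDriver.class);
    sort.setJobName("ip-sort");
    sort.setMapperClass(SortMapper.class);          // emits (count, ip)
    sort.setOutputKeyClass(IntWritable.class);
    sort.setOutputValueClass(Text.class);
    sort.setNumReduceTasks(1);                      // one sorted output file
    FileInputFormat.setInputPaths(sort, new Path("counts"));
    FileOutputFormat.setOutputPath(sort, new Path("sorted"));
    JobClient.runJob(sort);
  }
}
```

Since JobClient.runJob() doesn't return until the job completes, the second
job simply starts when the first one is done.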
 
> If I set the number of reducers to 1, then it will take more time to reduce all
> the maps and hence affect performance, right?

Not really.  Most of the sorting work will be done by the mappers.  The
reducer will only be merging the data so it will be pretty fast.  The
largest cost will be startup time, the second largest will be network
transfer time.



RE: Reduce Sort

Posted by "Natarajan, Senthil" <se...@pitt.edu>.
Thanks Ted.

I would like to try using Hadoop.
Do you mean to write another MapReduce program which takes the output of the first MapReduce (the already existing file of this format)?

IP Addr      Count
1.2.5.42     27
2.8.6.6      24
7.9.24.13    8
7.9.6.9      201

And use the count as the key and the IP address as the value.
Is it possible to do this in the same program instead of writing another one?
If it is not possible, is there something in Hadoop so that, once the first
program is done, I can call a second program to do the sorting?

If I set the number of reducers to 1, then it will take more time to reduce all the maps and hence affect performance, right?

Thanks,
Senthil






RE: Reduce Sort

Posted by "Natarajan, Senthil" <se...@pitt.edu>.
Ted,
I am using IntWritable; the default, as you mentioned, is an ascending-order sort. Do you know how to set this to sort in descending order?
I checked the API for IntWritable, LongWritable and JobConf, but couldn't find any relevant methods.
Thanks,
Senthil



Re: Reduce Sort

Posted by Ted Dunning <td...@veoh.com>.
There are two ways to do this.  Both of them assume that you have counted
the addresses using map-reduce and the results are in HDFS.

First, since the number of unique IP addresses is likely to be relatively
small, simply sorting the results using conventional sort is probably as
good as it gets.  This will take just a few lines of scripting code:

   base=http://your-nameserver-here/data
   wget $base/your-results-directory/part-00000 --output-document=a
   wget $base/your-results-directory/part-00001 --output-document=b
   sort -k1nr a b > where-you-want-the-output

It would be convenient if there were a URL that would allow you to retrieve
the concatenation of a wild-carded list of files, but the method I show
above isn't bad.
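
If you would rather stay inside Hadoop for the merge, there is a helper that
concatenates the part files for you.  A sketch (the paths are made up, adapt
them to your setup):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);           // HDFS
    FileSystem local = FileSystem.getLocal(conf);   // local disk
    // Concatenate part-00000, part-00001, ... from the results directory
    // into a single file on the local file system.
    FileUtil.copyMerge(fs, new Path("your-results-directory"),
                       local, new Path("merged-counts.txt"),
                       false /* don't delete the source */, conf, null);
  }
}
```

If I remember right, the bin/hadoop shell also has a "dfs -getmerge" command
that does much the same thing from the command line.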

You are likely to be unhappy at the perceived impurity of this approach, but
I would ask you to think about why one might use Hadoop at all.  The best reason
is to get high performance on large problems.  The sorting part of this
problem is not all that big a deal and using a conventional sort is probably
the most effective approach here.

You can also do the sorting using Hadoop.  Just use a mapper that moves the
count to the key and keeps the IP as the value.  I think that if you use an
IntWritable or LongWritable as the key then the default sorting would give
you ascending order.  You can also define the sort order so that you get
descending order.  Make sure you set the number of reducers to 1 so that you
only get a single output file.
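
For the descending order, the usual trick is to give the job a key comparator
that inverts IntWritable's natural byte-level ordering.  A sketch, assuming
the old mapred API (the class name is made up):

```java
import org.apache.hadoop.io.IntWritable;

// Comparator that inverts IntWritable's natural (ascending) sort order.
public class DescendingIntComparator extends IntWritable.Comparator {
  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // Negate the result of the ascending comparison to sort descending.
    return -super.compare(b1, s1, l1, b2, s2, l2);
  }
}

// Then, when configuring the sorting job (JobConf set up elsewhere):
//   sortJob.setOutputKeyComparatorClass(DescendingIntComparator.class);
//   sortJob.setNumReduceTasks(1);   // single, fully sorted output file
```

The mapper for that job just emits (count, ip) pairs; the framework's sort,
driven by this comparator, does the rest.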

If you have less than 10 million values, the conventional sort is likely to
be faster simply because of Hadoop's startup time.

