Posted to common-user@hadoop.apache.org by Aayush Garg <aa...@gmail.com> on 2010/03/03 18:21:39 UTC

Sorting

Hi,

Suppose I need to sort a big file (in GB). How would I accomplish
this task using Hadoop?
My main problem is how to merge the output of the individual reduce phases.

thanks

Re: Sorting

Posted by Thejas Nair <te...@yahoo-inc.com>.
If you don't want to implement all that, then just use 3 lines of Pig:
 l = load 'file';
 o = order l by $1;
 store o into 'file.sorted';

-Thejas



On 3/4/10 2:17 PM, "Alex Kozlov" <al...@cloudera.com> wrote:

> Hi Aayush,
> 
> In short, you write a special partitioner that partitions the data in
> non-overlapping intervals.
> 
> There are a few articles on this with a lot more detail:
> 
> http://sortbenchmark.org/YahooHadoop.pdf
> http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
> 
> Alex K
> 
> On Wed, Mar 3, 2010 at 9:21 AM, Aayush Garg <aa...@gmail.com> wrote:
> 
>> Hi,
>> 
>> Suppose I do need to sort a big file(in GB). How would I accomplish
>> this task using hadoop.
>> My main problem is how to merge the output of individual reduce phases?
>> 
>> thanks
>> 


Re: Sorting

Posted by Alex Kozlov <al...@cloudera.com>.
Hi Aayush,

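In short, you write a special partitioner that splits the key space into
non-overlapping, ordered intervals, one per reducer. Each reducer then
sorts its own interval, and the output files, read in reducer order,
already form one globally sorted result, so no separate merge is needed.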
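
To illustrate the idea, here is a minimal sketch in plain Python. It is
not Hadoop's Java Partitioner API; the function names and the hand-picked
split points "h" and "q" are assumptions made up for this example.

```python
from bisect import bisect_right

def range_partition(key, split_points):
    """Return the index of the reducer whose key interval contains `key`.

    `split_points` must be sorted. Reducer i receives the keys in
    [split_points[i-1], split_points[i]), so the intervals never overlap.
    """
    return bisect_right(split_points, key)

def total_order_sort(records, split_points):
    # One bucket per reducer: len(split_points) + 1 intervals in total.
    buckets = [[] for _ in range(len(split_points) + 1)]
    for key in records:
        buckets[range_partition(key, split_points)].append(key)
    # Each "reducer" sorts only the keys in its own bucket.
    sorted_parts = [sorted(bucket) for bucket in buckets]
    # Concatenating the parts in reducer order gives a globally sorted list.
    return [key for part in sorted_parts for key in part]

records = ["pear", "apple", "zebra", "mango", "kiwi", "banana"]
result = total_order_sort(records, split_points=["h", "q"])
```

Because every key in bucket i is smaller than every key in bucket i+1,
the final step is a plain concatenation rather than a merge.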

There are a few articles on this with a lot more detail:

http://sortbenchmark.org/YahooHadoop.pdf
http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html

Alex K

On Wed, Mar 3, 2010 at 9:21 AM, Aayush Garg <aa...@gmail.com> wrote:

> Hi,
>
> Suppose I do need to sort a big file(in GB). How would I accomplish
> this task using hadoop.
> My main problem is how to merge the output of individual reduce phases?
>
> thanks
>

Re: Sorting

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
Sample your input data and use the sample to drive your partitioner.

Please take a look at the TeraSort example in
org.apache.hadoop.examples.terasort.
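
A rough sketch of the sampling step, in plain Python rather than the
TeraSort code itself: sort a random sample of the keys and take evenly
spaced values from it as partition boundaries. The function name, its
parameters, and the synthetic input are assumptions for illustration only.

```python
import random

def choose_split_points(keys, num_reducers, sample_size, seed=0):
    """Pick num_reducers - 1 boundaries from a random sample of `keys`."""
    rng = random.Random(seed)
    # Sorting a small sample is cheap compared with sorting the full input.
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced sample quantiles approximate the quantiles of the
    # full data set, so each reducer gets a similar share of the keys.
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]

# Synthetic, deliberately skewed input: squaring crowds keys near zero.
data_rng = random.Random(1)
keys = [data_rng.random() ** 2 for _ in range(10000)]
splits = choose_split_points(keys, num_reducers=4, sample_size=500)
```

The resulting list can be used as the sorted boundaries of a range
partitioner. A larger sample gives more evenly sized partitions at the
cost of a longer sampling pass.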

Arun

On Mar 3, 2010, at 9:21 AM, Aayush Garg wrote:

> Hi,
>
> Suppose I do need to sort a big file(in GB). How would I accomplish
> this task using hadoop.
> My main problem is how to merge the output of individual reduce phases?
>
> thanks