Posted to mapreduce-user@hadoop.apache.org by Steve Lewis <lo...@gmail.com> on 2010/06/23 19:15:12 UTC

Newbie - question - how do I use Hadoop to sort a very large file

Assume I have a large file called BigData.unsorted (say 500 GB)
consisting of lines of text, and assume that these lines are in random
order. I understand how to assign a key to each line and that Hadoop
will pass the lines to my reducers in order of that key.

Now assume I want a single file called BigData.sorted with the lines in
the order of the keys.

I think I understand how to get files part-00000, part-00001, ... but not:
1) How do I get just the lines from the reducer, without the keys?
2) How do I make the reducer generate a file with the name that I want,
"BigData.sorted"?
3) How do I get a single output file without using a single reducer
instance, or is a single reducer the right choice for this task?

Also, it would be very nice if the output of the reducer were
compressed, say as BigData.sorted.gz.

Any suggestions?
--
Steven M. Lewis PhD
Institute for Systems Biology
Seattle WA

Re: Newbie - question - how do I use Hadoop to sort a very large file

Posted by James Hammerton <ja...@mendeley.com>.
Hi,

Regarding getting part-00000, part-00001, ... joined together: assuming the
files are numbered in order, you can use:

hadoop fs -getmerge <src-dir> <local-dst-file>

This concatenates every file under <src-dir> into a single local file. See
the following URL for details:

http://hadoop.apache.org/common/docs/current/hdfs_shell.html#getmerge
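
If you would rather do the merge from Java, say right after the job
finishes, FileUtil.copyMerge does the same thing as -getmerge. A rough
sketch; the paths here are invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    FileSystem local = FileSystem.getLocal(conf);
    // Concatenate every file under the job's output directory into one
    // local file; 'false' keeps the source files, and the final 'null'
    // means no separator string is added between them.
    FileUtil.copyMerge(hdfs, new Path("/user/steve/sorted-out"),
                       local, new Path("BigData.sorted"),
                       false, conf, null);
  }
}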

As for removing the keys: if the file is tab-separated, you could strip
them with the unix/linux 'cut' command, e.g.:

cut -f2,3,4 file.txt

This will give you the 2nd, 3rd and 4th columns from file.txt; 'cut -f2-'
would keep every field after the first, i.e. everything after the key.
Don't know if there's a similar command for Windows though.
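
You can also sidestep the keys inside Hadoop itself: if the reducer
emits NullWritable as its output key, TextOutputFormat writes only the
value, so there is nothing to cut afterwards. The same driver can
switch on gzip output and rename the part file once the job finishes,
which covers the naming and compression questions too. A rough sketch
against the 0.20 (org.apache.hadoop.mapreduce) API; the class names and
path handling are invented for illustration, not tested code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BigSort {

  // The map step assigns the sort key. Using the whole line as its own
  // key sorts the file lexicographically; substitute your own key
  // extraction here if you need a different order.
  public static class LineKeyMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(line, NullWritable.get());
    }
  }

  // A NullWritable output key makes TextOutputFormat emit only the
  // value, so the output lines carry no key and no tab. Writing the
  // line once per occurrence keeps duplicate lines intact.
  public static class LinesOnlyReducer
      extends Reducer<Text, NullWritable, NullWritable, Text> {
    @Override
    protected void reduce(Text line, Iterable<NullWritable> occurrences,
                          Context context)
        throws IOException, InterruptedException {
      for (NullWritable ignored : occurrences) {
        context.write(NullWritable.get(), line);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "sort BigData");
    job.setJarByClass(BigSort.class);
    job.setMapperClass(LineKeyMapper.class);
    job.setReducerClass(LinesOnlyReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(NullWritable.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    // A single reducer yields one globally sorted output file, at the
    // cost of pushing all the data through one node's sort/merge.
    job.setNumReduceTasks(1);

    // Gzip the output: the part file comes out as part-r-00000.gz.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

    Path out = new Path(args[1]);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, out);

    if (job.waitForCompletion(true)) {
      // There is no direct way to choose the part file's name, but you
      // can rename it once the job is done.
      FileSystem fs = FileSystem.get(conf);
      fs.rename(new Path(out, "part-r-00000.gz"),
                new Path(out, "BigData.sorted.gz"));
    }
  }
}

If one reducer turns out to be too slow for 500 GB, the usual
alternative is many reducers plus a partitioner that preserves global
order (e.g. org.apache.hadoop.mapred.lib.TotalOrderPartitioner, the
approach the TeraSort example takes), then -getmerge the sorted part
files. Conveniently, concatenated gzip files form a valid gzip stream,
so the merged BigData.sorted.gz still gunzips.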

Regards,

James

On Wed, Jun 23, 2010 at 6:15 PM, Steve Lewis <lo...@gmail.com> wrote:

> Assume I have a large file called BigData.unsorted (say 500 GB)
> consisting of lines of text, and assume that these lines are in random
> order. I understand how to assign a key to each line and that Hadoop
> will pass the lines to my reducers in order of that key.
>
> Now assume I want a single file called BigData.sorted with the lines in
> the order of the keys.
>
> I think I understand how to get files part-00000, part-00001, ... but not:
> 1) How do I get just the lines from the reducer, without the keys?
> 2) How do I make the reducer generate a file with the name that I want,
> "BigData.sorted"?
> 3) How do I get a single output file without using a single reducer
> instance, or is a single reducer the right choice for this task?
>
> Also, it would be very nice if the output of the reducer were
> compressed, say as BigData.sorted.gz.
>
> Any suggestions?
> --
> Steven M. Lewis PhD
> Institute for Systems Biology
> Seattle WA
>



-- 
James Hammerton | Senior Data Mining Engineer
www.mendeley.com/profiles/james-hammerton

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015