Posted to mapreduce-user@hadoop.apache.org by Arun Vasu <ar...@gmail.com> on 2013/02/15 15:39:09 UTC

Sorting huge text files in Hadoop

Hi,

Is it possible to sort a huge text file lexicographically using a MapReduce
job which has only map tasks and zero reduce tasks?

The records of the text file are separated by newline characters, and the
size of the file is around 1 terabyte.

It would be great if anyone could suggest a way to sort this huge file.

Thanks in advance,
Arun

Re: Sorting huge text files in Hadoop

Posted by Sandy Ryza <sa...@cloudera.com>.
A map-only job does not result in the standard shuffle-sort.  Map outputs
are written directly to HDFS.
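
To make that concrete, here is a minimal sketch of a map-only job using the
org.apache.hadoop.mapreduce API (the class names and the upper-casing mapper
are illustrative assumptions, not code from this thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyExample {

      public static class UpperCaseMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws java.io.IOException, InterruptedException {
          // Each output record is written directly to this task's
          // part file on HDFS; nothing is partitioned or sorted.
          context.write(new Text(line.toString().toUpperCase()),
              NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-example");
        job.setJarByClass(MapOnlyExample.class);
        job.setMapperClass(UpperCaseMapper.class);
        job.setNumReduceTasks(0); // zero reducers: no shuffle, no sort
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }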

-Sandy

On Fri, Feb 15, 2013 at 12:23 PM, Jay Vyas <ja...@gmail.com> wrote:

> Maybe I'm mistaken about what is meant by map-only.  Does a map-only job
> still result in the standard shuffle-sort?  Or does that get cut short?
>
> Hmm, I think I see what you mean. I guess a map-only sort is possible as
> long as you use a custom partitioner and you let the shuffle/sort run to
> completion.
>
> I think the shuffle/sort, if you use a partitioner that partitions the
> keys in order (i.e. part-0 is all lines starting with "a", part-1 is all
> starting with "b", etc.) does still run in spite of the fact that you're
> not running reducers.
>
>
>
>
> On Fri, Feb 15, 2013 at 3:09 PM, Michael Segel <mi...@hotmail.com> wrote:
>
>> Why do you need a 1TB block?
>>
>> On Feb 15, 2013, at 1:29 PM, Jay Vyas <ja...@gmail.com> wrote:
>>
>> Well... OK... I guess you could have a 1TB block, do an in-place sort on
>> the file, write it to a tmp directory, and then spill the records in
>> order or something.  At that point you might as well not use Hadoop.
>>
>>
>> Michael Segel  <ms...@segel.com> | (m) 312.755.9623
>>
>> Segel and Associates
>>
>>
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
>

Re: Sorting huge text files in Hadoop

Posted by Jay Vyas <ja...@gmail.com>.
Maybe I'm mistaken about what is meant by map-only.  Does a map-only job
still result in the standard shuffle-sort?  Or does that get cut short?

Hmm, I think I see what you mean. I guess a map-only sort is possible as
long as you use a custom partitioner and you let the shuffle/sort run to
completion.

I think the shuffle/sort, if you use a partitioner that partitions the
keys in order (i.e. part-0 is all lines starting with "a", part-1 is all
starting with "b", etc.) does still run in spite of the fact that you're
not running reducers.
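
For illustration, a partitioner along those lines might look like the sketch
below. The first-letter bucketing, the class name, and the assumption of
lowercase ASCII keys are all mine; real data would need proper key-range
boundaries:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Route keys to partitions by first letter so that the
    // concatenation of part-00000, part-00001, ... is globally sorted.
    public class FirstLetterPartitioner extends Partitioner<Text, Text> {
      @Override
      public int getPartition(Text key, Text value, int numPartitions) {
        if (key.getLength() == 0) {
          return 0; // empty keys sort first
        }
        // Text.charAt returns the Unicode code point at that position.
        int bucket = key.charAt(0) - 'a';
        if (bucket < 0) bucket = 0;
        if (bucket >= numPartitions) bucket = numPartitions - 1;
        return bucket;
      }
    }

This only balances well if keys are spread evenly across first letters; skew
is why sampling-based partitioning is usually preferred in practice.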




On Fri, Feb 15, 2013 at 3:09 PM, Michael Segel <mi...@hotmail.com> wrote:

> Why do you need a 1TB block?
>
> On Feb 15, 2013, at 1:29 PM, Jay Vyas <ja...@gmail.com> wrote:
>
> Well... OK... I guess you could have a 1TB block, do an in-place sort on
> the file, write it to a tmp directory, and then spill the records in order
> or something.  At that point you might as well not use Hadoop.
>
>
> Michael Segel  <ms...@segel.com> | (m) 312.755.9623
>
> Segel and Associates
>
>


-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Sorting huge text files in Hadoop

Posted by Michael Segel <mi...@hotmail.com>.
Why do you need a 1TB block? 

On Feb 15, 2013, at 1:29 PM, Jay Vyas <ja...@gmail.com> wrote:

> Well... OK... I guess you could have a 1TB block, do an in-place sort on the file, write it to a tmp directory, and then spill the records in order or something.  At that point you might as well not use Hadoop.

Michael Segel  | (m) 312.755.9623

Segel and Associates



Re: Sorting huge text files in Hadoop

Posted by Jay Vyas <ja...@gmail.com>.
Well... OK... I guess you could have a 1TB block, do an in-place sort on
the file, write it to a tmp directory, and then spill the records in order
or something.  At that point you might as well not use Hadoop.

Re: Sorting huge text files in Hadoop

Posted by Michael Segel <mi...@hotmail.com>.
Why not? 

Who said you had to parallelize anything?

On Feb 15, 2013, at 12:09 PM, Jay Vyas <ja...@gmail.com> wrote:

> I don't think you can do an embarrassingly parallel sort of a randomly ordered file without merging results.
> 
> However, if you know that the file is pseudo-ordered: 
> 
> 10000123
> 10000232
> 10000000
> 19991019
> 20200222
> 301111111
> 30000000
> 
> Then you can (maybe) sort the individual blocks in mappers using some black magic... but it would be very, very ugly.
> 
> You're better off simply running the mappers with the default reducer - the shuffle will sort the file for you naturally :)
> 




Re: Sorting huge text files in Hadoop

Posted by Jay Vyas <ja...@gmail.com>.
I don't think you can do an embarrassingly parallel sort of a randomly
ordered file without merging results.

However, if you know that the file is pseudo-ordered:

10000123
10000232
10000000
19991019
20200222
301111111
30000000

Then you can (maybe) sort the individual blocks in mappers using some black
magic... but it would be very, very ugly.

You're better off simply running the mappers with the default reducer - the
shuffle will sort the file for you naturally :)
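
A minimal sketch of that "let the shuffle do the work" approach, with
hypothetical class names: each line becomes the map output key, the shuffle
sorts the keys, and the default (identity) reducer writes them back out. With
a single reducer the output is one globally sorted file, at the cost of
funneling the whole terabyte through that one reducer:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NaturalSort {

      public static class LineAsKeyMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws java.io.IOException, InterruptedException {
          // Emit the whole line as the key; the shuffle sorts the keys.
          ctx.write(line, NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "natural-sort");
        job.setJarByClass(NaturalSort.class);
        job.setMapperClass(LineAsKeyMapper.class);
        // No reducer class set: the default identity reducer just
        // writes the sorted keys back out.
        job.setNumReduceTasks(1); // one reducer => one sorted file
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }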

Re: Sorting huge text files in Hadoop

Posted by Azuryy Yu <az...@gmail.com>.
This is a typical total sort using MapReduce.  It can be done, but it needs
both map and reduce.
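
The usual building blocks for a total sort in Hadoop are
TotalOrderPartitioner and InputSampler. A sketch under assumptions of my own
(the reducer count, the sampler parameters, the partition-file path, and the
requirement that lines contain no tabs, since KeyValueTextInputFormat splits
on the first tab):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class TotalSort {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "total-sort");
        job.setJarByClass(TotalSort.class);

        // Each line becomes the key (assuming no tabs in the lines).
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(Mapper.class);   // identity map
        job.setReducerClass(Reducer.class); // identity reduce
        job.setNumReduceTasks(32);          // sized to the cluster
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Sample the input to pick 31 split keys, then route each key
        // range to its own reducer.
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
            new Path("/tmp/sort-partitions")); // placeholder path
        InputSampler.writePartitionFile(job,
            new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Each reducer then covers a contiguous key range, so concatenating
part-00000, part-00001, ... in order yields the fully sorted file.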


On Fri, Feb 15, 2013 at 10:39 PM, Arun Vasu <ar...@gmail.com> wrote:

> Hi,
>
> Is it possible to sort a huge text file lexicographically using a
> MapReduce job which has only map tasks and zero reduce tasks?
>
> The records of the text file are separated by newline characters, and the
> size of the file is around 1 terabyte.
>
> It would be great if anyone could suggest a way to sort this huge file.
>
> Thanks in advance,
> Arun
>
>
