Posted to mapreduce-user@hadoop.apache.org by Niels Basjes <Ni...@basjes.nl> on 2012/02/28 21:10:53 UTC

Merge sorting reduce output files

Hi,

We have a job that outputs a set of files that are several hundred MB of
text each.
Using the comparators and such we can produce output files that are each
sorted by themselves.

What we want is to have one giant output file (outside of the cluster) that
is sorted.

Now we see the following options:
1) Run the last job with 1 reducer. This is not really an option because
that would push a significant part of the processing through a single CPU
(this would take too long).
2) Create an additional job that sorts the existing files and has 1 reducer.
3) Download all of the files and run the standard commandline tool "sort
-m"
4) Install HDFS fuse and run the standard commandline tool "sort -m"
5) Create a Hadoop-specific tool that can do "hadoop fs -text" and "sort
-m" in one go.

During our discussion we were wondering: What is the best way of doing this?
What do you recommend?

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Merge sorting reduce output files

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

On Thu, Mar 1, 2012 at 00:07, Robert Evans <ev...@yahoo-inc.com> wrote:

>  Sorry it has taken me so long to respond.  Today has been a very crazy
> day.
>

No worries.


> I am just guessing what your algorithm is for auto-complete.
>

What we have has a lot more features, yet the basic idea is similar enough
to what you describe for this discussion.


>  If we want the keys to come out in sorted order, we need to have a
> sequence file with the partition keys for the total order partitioner.
> TeraSort generates a partition file by getting ....
> This only really works for TeraSort because it assumes that all of the
> partitions are more or less random already.
>

And that is something I don't have.


> This is the case for the output of a typical map/reduce job where the
> reduce does not change the keys passed in and the output of the reducer is
> less than a block in size.  That sure sounds like what wordcount does to
> me.  The only real way to get around that is to do it as part of a
> map/reduce job, and do some random sampling instead of reading the first N.
>  It should be a map/reduce job because it is going to be reading a lot more
> data than TeraSort’s partition generation code.  In this case you would
> have a second M/R job that runs after the first and randomly samples
> words/phrases to work on.  It would then generate the increasingly long
> phrases and send them all to a single reducer that would buffer them up,
> and when the Reducer has no more input it would output every Nth key so
> that you get the proper number of partitions for the Reducers.  You could
> sort these keys yourself to be sure, but they should come in in sorted
> order so why bother resorting.
>
> If my assumptions are totally wrong here please let me know.
>

I've had a discussion with some coworkers and we came to a possible
solution that is very closely related to your idea.
Because this is a job that runs periodically, we think we can assume the
distribution of the dataset will have a similar "shape" from one run to the
next.
If this assumption holds, we can:
1) Create a job that takes the output of run 1 and creates an aggregate
that can be used to partition the dataset.
2) Use the partitioning dataset from '1)' to distribute the processing for
the next run (a rough sketch of step 1 is below).
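
For step 1), a rough sketch of what such a driver could look like. It
assumes the previous run's aggregate is a single sorted plain-text file
with one key per line and at least as many keys as partitions; the class
name, paths and arguments are invented for illustration:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BuildPartitionFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path previousRunKeys = new Path(args[0]); // sorted keys from run 1, one per line
    Path partitionFile   = new Path(args[1]); // boundary file for the next run
    int numPartitions    = Integer.parseInt(args[2]);

    // Read the aggregated keys of the previous run.
    List<String> keys = new ArrayList<String>();
    BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(previousRunKeys), "UTF-8"));
    for (String line = in.readLine(); line != null; line = in.readLine()) {
      keys.add(line);
    }
    in.close();

    // Write numPartitions - 1 evenly spaced boundary keys into a SequenceFile,
    // which is the format TotalOrderPartitioner reads.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, partitionFile, Text.class, NullWritable.class);
    int step = Math.max(1, keys.size() / numPartitions);
    for (int i = 1; i < numPartitions; i++) {
      writer.append(new Text(keys.get(Math.min(i * step, keys.size() - 1))),
                    NullWritable.get());
    }
    writer.close();
  }
}

The next run then only has to point TotalOrderPartitioner at this file and
use the matching number of reducers.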

Thanks for your suggestions.

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Merge sorting reduce output files

Posted by Robert Evans <ev...@yahoo-inc.com>.
Niels,

Sorry it has taken me so long to respond.  Today has been a very crazy day.

I am just guessing what your algorithm is for auto-complete.  I really don't know, so I will just design a back-of-the-envelope one myself as a starting point.  My guess is that you have a few map/reduce jobs.  The first M/R job is mostly a glorified word count to get a word or phrase with how often it is searched for.  In the next job the map splits the phrases up so that they are output with an ever-increasing number of letters as the key along with the original phrase and its weight as the value.  The Reducer/Combiner groups them by the key and produces a top N list of phrases that have the highest weights for each key.  If we want the keys to come out in sorted order, we need to have a sequence file with the partition keys for the total order partitioner.

TeraSort generates a partition file by getting the number of splits and then reading the first N records from each split, where N is based off of the number of samples desired and the number of splits.  The keys for all of the sampled entries are sorted and divided into mostly equal length partitions that are stored in the partition file.  This only really works for TeraSort because it assumes that all of the partitions are more or less random already.  The worst input dataset to TeraSort would be one where each partition is sorted internally, but made up of fairly evenly distributed data.  This is the case for the output of a typical map/reduce job where the reduce does not change the keys passed in and the output of the reducer is less than a block in size.  That sure sounds like what wordcount does to me.  The only real way to get around that is to do it as part of a map/reduce job, and do some random sampling instead of reading the first N.  It should be a map/reduce job because it is going to be reading a lot more data than TeraSort's partition generation code.  In this case you would have a second M/R job that runs after the first and randomly samples words/phrases to work on.  It would then generate the increasingly long phrases and send them all to a single reducer that would buffer them up, and when the Reducer has no more input it would output every Nth key so that you get the proper number of partitions for the Reducers.  You could sort these keys yourself to be sure, but they should arrive in sorted order anyway, so why bother resorting.
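
A rough sketch of what that single reducer could look like (the class name
and the "sample.num.partitions" knob are invented; in practice the job
would route all sampled keys to this one reducer and write its output as a
SequenceFile of keys so that TotalOrderPartitioner can read it):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PartitionBoundaryReducer
    extends Reducer<Text, NullWritable, Text, NullWritable> {

  private final List<String> sampledKeys = new ArrayList<String>();
  private int numPartitions;

  @Override
  protected void setup(Context context) {
    // Hypothetical knob: the number of reducers the "real" sorting job will use.
    numPartitions = context.getConfiguration().getInt("sample.num.partitions", 30);
  }

  @Override
  protected void reduce(Text key, Iterable<NullWritable> values, Context context) {
    // Keys arrive in sorted order; just buffer them until the input is exhausted.
    sampledKeys.add(key.toString());
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit every Nth buffered key as a partition boundary
    // (numPartitions - 1 boundaries in total).
    int step = Math.max(1, sampledKeys.size() / numPartitions);
    int written = 0;
    for (int i = step; i < sampledKeys.size() && written < numPartitions - 1; i += step) {
      context.write(new Text(sampledKeys.get(i)), NullWritable.get());
      written++;
    }
  }
}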

If my assumptions are totally wrong here please let me know.

--Bobby Evans

On 2/29/12 4:59 AM, "Niels Basjes" <Ni...@basjes.nl> wrote:

Robert,

On Tue, Feb 28, 2012 at 23:28, Robert Evans <ev...@yahoo-inc.com> wrote:
I am not sure I can help with that unless I know better what "a special distribution" means.

The thing is that this application is a "Auto Complete" feature that has a key that is "the letters that have been typed so far".
Now for several reasons we need this to be sorted by length of the input. So the '1 letter suggestions' first, then the '2 letter suggestions' etc.
I've been trying to come up with an automatic partitioning that would split the dataset into something like 30 parts that, when concatenated, do what you suggest.

Unless you are doing a massive amount of processing in your reducer, having a partition that only comes close to balancing the distribution is a big win over all of the other options that put the data on a single machine and sort it there.  Even if you are doing a lot of processing in the reducer, or you need a special grouping to make the reduce work properly, I would suspect that a second map/reduce job to sort the data with a partitioning that is just close to balanced would beat out all of the other options.

Thanks, this is a useful suggestion. I'll see if there is a pattern in the data and from there simply manually define the partitions based on the pattern we find.

Re: Merge sorting reduce output files

Posted by Niels Basjes <Ni...@basjes.nl>.
Robert,

On Tue, Feb 28, 2012 at 23:28, Robert Evans <ev...@yahoo-inc.com> wrote:

>  I am not sure I can help with that unless I know better what “a special
> distribution” means.
>

The thing is that this application is a "Auto Complete" feature that has a
key that is "the letters that have been typed so far".
Now for several reasons we need this to be sorted by length of the input.
So the '1 letter suggestions' first, then the '2 letter suggestions' etc.
I've been trying to come up with an automatic partitioning that would split
the dataset into something like 30 parts that, when concatenated, do what
you suggest.

> Unless you are doing a massive amount of processing in your reducer,
> having a partition that only comes close to balancing the distribution is
> a big win over all of the other options that put the data on a single
> machine and sort it there.  Even if you are doing a lot of processing in
> the reducer, or you need a special grouping to make the reduce work
> properly, I would suspect that a second map/reduce job to sort the data
> with a partitioning that is just close to balanced would beat out all of
> the other options.
>

Thanks, this is a useful suggestion. I'll see if there is a pattern in the
data and from there simply manually define the partitions based on the
pattern we find.
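
For example, a hand-defined partitioner along these lines (the length and
letter boundaries are invented; the real ones would come from the pattern
found in the data, and it assumes the sort comparator also orders keys by
length first so that the reducer outputs concatenate into the overall order
we need):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PrefixLengthPartitioner<V> extends Partitioner<Text, V> {
  @Override
  public int getPartition(Text key, V value, int numPartitions) {
    String prefix = key.toString();
    int length = prefix.length();
    int bucket;
    if (length <= 2) {
      // The short prefixes get one partition per length.
      bucket = length;
    } else if (length <= 5) {
      // The busier lengths are sub-split by first letter (made-up boundaries).
      char first = Character.toLowerCase(prefix.charAt(0));
      bucket = 3 + (length - 3) * 3 + (first < 'h' ? 0 : first < 'p' ? 1 : 2);
    } else {
      // Everything longer ends up in the last bucket.
      bucket = 12;
    }
    return Math.min(bucket, numPartitions - 1);
  }
}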

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Merge sorting reduce output files

Posted by Robert Evans <ev...@yahoo-inc.com>.
Niels,

I am not sure I can help with that unless I know better what "a special distribution" means.  Unless you are doing a massive amount of processing in your reducer, having a partition that only comes close to balancing the distribution is a big win over all of the other options that put the data on a single machine and sort it there.  Even if you are doing a lot of processing in the reducer, or you need a special grouping to make the reduce work properly, I would suspect that a second map/reduce job to sort the data with a partitioning that is just close to balanced would beat out all of the other options.

If that is still not an option, then I would run some benchmarks comparing a mapreduce job with a single reducer against copying the data to a single machine and running sort -m.  I don't have much experience with HDFS fuse, but from what I have seen most older versions of it have some performance issues, especially with the page cache.  Although sort -m really would not need the page cache, so it may not be a problem.  If download-then-sort is even close to as fast as the mapreduce job, then you might want to try option 5, because it would reduce the disk overhead when moving the data and would, I suspect, be faster than just downloading and sorting.
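
For reference, a minimal sketch of what option 5 could look like, assuming
plain-text part files that are each already sorted line by line (a real
tool would reuse the same record readers as "hadoop fs -text" so it could
also handle compressed or sequence files; the class name and arguments are
invented):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.util.PriorityQueue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMergeSort {

  // Holds one open part file plus the line that still has to be emitted from it.
  static class Cursor implements Comparable<Cursor> {
    final BufferedReader reader;
    String line;
    Cursor(BufferedReader reader) throws IOException {
      this.reader = reader;
      this.line = reader.readLine();
    }
    boolean advance() throws IOException {
      line = reader.readLine();
      return line != null;
    }
    public int compareTo(Cursor other) {
      return line.compareTo(other.line);
    }
  }

  public static void main(String[] args) throws IOException {
    Path jobOutputDir = new Path(args[0]); // the reduce output directory on HDFS
    FileSystem fs = FileSystem.get(new Configuration());

    // One cursor per sorted part file.
    PriorityQueue<Cursor> heap = new PriorityQueue<Cursor>();
    for (FileStatus status : fs.listStatus(jobOutputDir)) {
      if (!status.getPath().getName().startsWith("part-")) continue;
      Cursor c = new Cursor(new BufferedReader(
          new InputStreamReader(fs.open(status.getPath()), "UTF-8")));
      if (c.line != null) heap.add(c);
    }

    // Classic k-way merge: always emit the smallest current line.
    PrintWriter out = new PrintWriter(System.out);
    while (!heap.isEmpty()) {
      Cursor smallest = heap.poll();
      out.println(smallest.line);
      if (smallest.advance()) heap.add(smallest);
      else smallest.reader.close();
    }
    out.flush();
  }
}

It would be run with the job output directory as its only argument, with
stdout redirected into the final local file.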

A disclaimer, the only real way to know what to do is to run benchmarks on all of these options.  Hadoop is so complex trying to really reason about exactly which will be faster is very difficult.

--Bobby Evans

On 2/28/12 3:46 PM, "Niels Basjes" <Ni...@basjes.nl> wrote:

Hi Robert,

On Tue, Feb 28, 2012 at 21:41, Robert Evans <ev...@yahoo-inc.com> wrote:
I would recommend that you do what TeraSort does and use a different partitioner, to ensure that all keys within a given range will go to a single reducer.  If your partitioner is set up correctly then all you have to do is to concatenate the files together, if you even need to do that.

Look at TotalOrderPartitioner.  It should do what you want.

I know about that partitioner.
The trouble I have is coming up with a partitioning that "evenly" balances the data for this specific problem.
Taking a sample and basing the partitioning on that (like the approach used in TeraSort) wouldn't help.
The data has a special distribution...


Niels Basjes



--Bobby Evans


On 2/28/12 2:10 PM, "Niels Basjes" <Niels@basjes.nl> wrote:

Hi,

We have a job that outputs a set of files that are several hundred MB of text each.
Using the comparators and such we can produce output files that are each sorted by themselves.

What we want is to have one giant output file (outside of the cluster) that is sorted.

Now we see the following options:
1) Run the last job with 1 reducer. This is not really an option because that would push a significant part of the processing through a single CPU (this would take too long).
2) Create an additional job that sorts the existing files and has 1 reducer.
3) Download all of the files and run the standard commandline tool "sort -m"
4) Install HDFS fuse and run the standard commandline tool "sort -m"
5) Create a Hadoop-specific tool that can do "hadoop fs -text" and "sort -m" in one go.

During our discussion we were wondering: What is the best way of doing this?
What do you recommend?



Re: Merge sorting reduce output files

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi Robert,

On Tue, Feb 28, 2012 at 21:41, Robert Evans <ev...@yahoo-inc.com> wrote:

>  I would recommend that you do what TeraSort does and use a different
> partitioner, to ensure that all keys within a given range will go to a
> single reducer.  If your partitioner is set up correctly then all you have
> to do is to concatenate the files together, if you even need to do that.
>
> Look at TotalOrderPartitioner.  It should do what you want.
>

I know about that partitioner.
The trouble I have is coming up with a partitioning that "evenly" balances
the data for this specific problem.
Taking a sample and basing the partitioning on that (like the approach used
in TeraSort) wouldn't help.
The data has a special distribution...


Niels Basjes



>
> --Bobby Evans
>
>
> On 2/28/12 2:10 PM, "Niels Basjes" <Ni...@basjes.nl> wrote:
>
> Hi,
>
> We have a job that outputs a set of files that are several hundred MB of
> text each.
> Using the comparators and such we can produce output files that are each
> sorted by themselves.
>
> What we want is to have one giant output file (outside of the cluster) that
> is sorted.
>
> Now we see the following options:
> 1) Run the last job with 1 reducer. This is not really an option because
> that would push a significant part of the processing through a single CPU
> (this would take too long).
> 2) Create an additional job that sorts the existing files and has 1
> reducer.
> 3) Download all of the files and run the standard commandline tool "sort
> -m"
> 4) Install HDFS fuse and run the standard commandline tool "sort -m"
> 5) Create a Hadoop-specific tool that can do "hadoop fs -text" and "sort
> -m" in one go.
>
> During our discussion we were wondering: What is the best way of doing
> this?
> What do you recommend?
>
>


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Merge sorting reduce output files

Posted by Robert Evans <ev...@yahoo-inc.com>.
I would recommend that you do what TeraSort does and use a different partitioner, to ensure that all keys within a given range will go to a single reducer.  If your partitioner is set up correctly then all you have to do is to concatenate the files together, if you even need to do that.

Look at TotalOrderPartitioner.  It should do what you want.
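
Wiring it up looks roughly like this with the new-API classes in recent
Hadoop releases (older releases have equivalents under
org.apache.hadoop.mapred.lib). The sampling parameters, reducer count and
paths are invented, and it assumes the map does not change the keys, since
InputSampler samples the map input keys:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSortJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "globally sorted output");
    // Mapper, reducer and output key/value classes for the real job go here.
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setNumReduceTasks(30); // number of globally ordered key ranges
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // The partition boundaries live in a sequence file of sample keys.
    Path partitionFile = new Path(args[1] + "_partitions");
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
    job.setPartitionerClass(TotalOrderPartitioner.class);

    // Randomly sample the input keys to pick the boundaries.
    InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each reducer then covers one key range, so the part files, concatenated in
partition order, form one globally sorted file.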

--Bobby Evans

On 2/28/12 2:10 PM, "Niels Basjes" <Ni...@basjes.nl> wrote:

Hi,

We have a job that outputs a set of files that are several hundred MB of text each.
Using the comparators and such we can produce output files that are each sorted by themselves.

What we want is to have one giant output file (outside of the cluster) that is sorted.

Now we see the following options:
1) Run the last job with 1 reducer. This is not really an option because that would push a significant part of the processing through a single CPU (this would take too long).
2) Create an additional job that sorts the existing files and has 1 reducer.
3) Download all of the files and run the standard commandline tool "sort -m"
4) Install HDFS fuse and run the standard commandline tool "sort -m"
5) Create a Hadoop-specific tool that can do "hadoop fs -text" and "sort -m" in one go.

During our discussion we were wondering: What is the best way of doing this?
What do you recommend?