You are viewing a plain text version of this content. The canonical link for it is here.

Posted to common-dev@hadoop.apache.org by "Doug Judd (JIRA)" <ji...@apache.org> on 2007/01/25 23:30:49 UTC

[jira] Created: (HADOOP-939) No-sort optimization

No-sort optimization
--------------------

                 Key: HADOOP-939
                 URL: https://issues.apache.org/jira/browse/HADOOP-939
             Project: Hadoop
          Issue Type: New Feature
          Components: mapred
         Environment: all
            Reporter: Doug Judd


There should be a way to tell the mapred framework that the output of the map() phase will already be sorted.  The Reduce phase can just merge the intermediate files together without sorting.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-939) No-sort optimization

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467728 ] 

Doug Cutting commented on HADOOP-939:
-------------------------------------

> 14 (more than half) are unavoidable. 

Make that 15: those associated with the input and output.  So the remaining 12 are associated with sort & reduce.  9 of those could be eliminated when input is largely pre-sorted and reduces can be placed on the same rack as the vast majority of their input, reducing the sort/reduce overhead from 12 out of 27 to 3 out of 18.

> No-sort optimization
> --------------------
>
>                 Key: HADOOP-939
>                 URL: https://issues.apache.org/jira/browse/HADOOP-939
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> There should be a way to tell the mapred framework that the output of the map() phase will already be sorted.  The Reduce phase can just merge the intermediate files together without sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-939) No-sort optimization

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467693 ] 

Owen O'Malley commented on HADOOP-939:
--------------------------------------

I think the complexity of the general case makes this problematic. I wouldn't want to see a config option to do this, because it will be easy for users to get it wrong. 

There are some more specific cases that might be interesting:
  1. After the spill of the map outputs, it would make sense to continue appending to the spill as long as the outputs from the map are sorted. Note that the partition is the primary key for that sort.
  2. The reduces should be scheduled near the map output. That would help in the case where each reduce is getting inputs from a small number of maps.

Note that even if the map outputs are sorted, the reduce needs to do a merge sort because there the map outputs are fetched in a fairly random order.

> No-sort optimization
> --------------------
>
>                 Key: HADOOP-939
>                 URL: https://issues.apache.org/jira/browse/HADOOP-939
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> There should be a way to tell the mapred framework that the output of the map() phase will already be sorted.  The Reduce phase can just merge the intermediate files together without sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-939) No-sort optimization

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517672 ] 

Joydeep Sen Sarma commented on HADOOP-939:
------------------------------------------

am a new hadoop user - i am looking into how much of a warehouse type infrastructure can be implemented over hadoop. in some cases - i would like to have the flexibility of having the output partitioned by the mapoutput key - but i am not interested in it being sorted. the directmapoutputcollector takes away sorting - but it also takes away partitioning. a lighter hammer would be useful imho.

> No-sort optimization
> --------------------
>
>                 Key: HADOOP-939
>                 URL: https://issues.apache.org/jira/browse/HADOOP-939
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> There should be a way to tell the mapred framework that the output of the map() phase will already be sorted.  The Reduce phase can just merge the intermediate files together without sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-939) No-sort optimization

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499965 ] 

Owen O'Malley commented on HADOOP-939:
--------------------------------------

Doug Judd,
    Has the recent change to support reduces = 0 addressed your need? If you set the number of reduces to 0, the output collector is fed directly from the Mapper output. If the map output is already sorted this saves all of the costs associated with the shuffle and the distributed sort.

> No-sort optimization
> --------------------
>
>                 Key: HADOOP-939
>                 URL: https://issues.apache.org/jira/browse/HADOOP-939
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> There should be a way to tell the mapred framework that the output of the map() phase will already be sorted.  The Reduce phase can just merge the intermediate files together without sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-939) No-sort optimization

Posted by "Doug Judd (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467705 ] 

Doug Judd commented on HADOOP-939:
----------------------------------

Bryan (bp@geedom.net) writes:

Seems like it wouldn't be more expensive than a few calls to the appropriate
Comparator to figure this out - the OutputCollector merely compares each
output key to the previously output key. If order is preserved, output this
extra truth when the "spill" to disk happens as a header field. If not, you
can stop calling the comparator as soon as output fails to be ordered a
single time. In any case, this means that sorts can be skipped on any output
sequences that are already sorted, and only applied to output sequences that
aren't.

My (doug) thought was to name the intermediate output file with a .sorted extension if comes out sorted.

As far as Owen's comment goes, the reducer should merge the intermediate files with the .sorted extension in parallel.  The non-sorted ones can get pulled and sorted in any random order.

> No-sort optimization
> --------------------
>
>                 Key: HADOOP-939
>                 URL: https://issues.apache.org/jira/browse/HADOOP-939
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> There should be a way to tell the mapred framework that the output of the map() phase will already be sorted.  The Reduce phase can just merge the intermediate files together without sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (HADOOP-939) No-sort optimization

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting resolved HADOOP-939.
---------------------------------

    Resolution: Duplicate

Sorting can be disabled by setting the number of reduce tasks to zero, so that map outputs are written directly.  Reduce makes no sense without sorting.

> No-sort optimization
> --------------------
>
>                 Key: HADOOP-939
>                 URL: https://issues.apache.org/jira/browse/HADOOP-939
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> There should be a way to tell the mapred framework that the output of the map() phase will already be sorted.  The Reduce phase can just merge the intermediate files together without sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (HADOOP-939) No-sort optimization

Posted by Doug Cutting <cu...@apache.org>.

Arkady Borkovsky wrote:
> So it is 27*D+13*M  vs.  17*D+2*M  .  With D<<M,  the gain is about 6x 
> or 12x, with ("anywhere" and "local" reduce, correspondingly).

If reduces are placed close to maps, then M would not need to be 
shuffled through the central switch, with nearly identical savings.

Doug

Re: [jira] Commented: (HADOOP-939) No-sort optimization

Posted by Arkady Borkovsky <ar...@yahoo-inc.com>.

How savings can be much higher than 1/3.
(probably I'm already preaching to the quire)

1. The situation I have in mind looks like this:
-- you have a data set M with 10^11 to 10^12  records that is produced 
once in  a while and used hundreds of times (before its new version is 
generated).  This data set has been produced by a key-preserving 
reduce, so all its fragments are sorted, and the keys are split between 
the fragments in a known way.
-- Each time you "use" this data set, you have another data set D with 
the same keys space (and, for simplicity, same record type) with 10^8 
or 10^9 records.

The job is a JOIN -- it produces an output record for each key in the 
intersection of M and D.  Think of M as a database and D as a query.
The the number of output records is less or equal to the number of 
records in D.

If D fits into memory, you do not need a reduce, and everything can be 
done in map by copying D to all the map tasks. A pain, but no real 
problems -- map with no reduce does it.
If D does not fit into memory, a natural way do this processing is to
-- sort and split D into buckets so that there is one bucket for each 
block of M, with the same keys.
-- run a reduce task on each block of M close to this block and merge 
(join) this block with the corresponding bucket of D while reading the 
input.
So  steps b-d  are needed only for D, and steps e-g -- for the output 
that is the same size as D -- few percent of the total data involved.

In this kind of applications, Eric's model becomes
now     no-sort    step
M+D     M+D        a.  1 read input data from local drive on map node
[ identity map ]
M+D      D         b.  1 write batches of sorted output data to 
temporary file on map node
M+D      D         c. 10 shuffle batches of sorted data
M+D      D  ("local" reduce)
         M+D ("anywhere" reduce)
                    d.  1 write batches of sorted data to reduce node
[ reduce]
D        D         e.  1 write one copy of output locally
D        D         f.  2 transfer and write one copy to another node on 
the same rack
D        D         g. 11 transfer and write one copy to an off-rack node

So it is 27*D+13*M  vs.  17*D+2*M  .  With D<<M,  the gain is about 6x 
or 12x, with ("anywhere" and "local" reduce, correspondingly).

2. A variation of the situation described above is when in addition to 
M ("database") and D ("query"), you have a third input data set U 
("incremental updates" or "deltas"). U has the size similar to D;  the 
record type of U is exactly the same as that in M.  The JOIN works 
similar to described above, but it takes a record from D, and one or 
more records from U and M.



On Jan 29, 2007, at 10:17 AM, Doug Cutting wrote:

> Arkady Borkovsky wrote:
>> Does this model assume that the size of the output of reduce is 
>> similar to the size of the input?
>> An important class of applications (mentioned in this thread before) 
>> uses two inputs:
>> -- M ("master file") -- very large, presorted and not changing from 
>> run to run,
>> -- D ("details file") -- smaller, different from run  to run, not 
>> necessarily presorted
>> and the output size is proportional to the size of D.
>> In this case the gain from "no-sort" may be much higher, as the 13 
>> "transfer and write" to DFS are applied to a smaller amount of data, 
>> while 11 (b-d) sort-n-shuffle-related are saved on the larger data).
>
> Could a combiner be used in this hypothetical case?  If so, then the 
> b-d steps might be faster too.
>
> Doug

Re: [jira] Commented: (HADOOP-939) No-sort optimization

Posted by Doug Cutting <cu...@apache.org>.

Arkady Borkovsky wrote:
> Does this model assume that the size of the output of reduce is similar 
> to the size of the input?
> 
> An important class of applications (mentioned in this thread before) 
> uses two inputs:
> -- M ("master file") -- very large, presorted and not changing from run 
> to run,
> -- D ("details file") -- smaller, different from run  to run, not 
> necessarily presorted
> and the output size is proportional to the size of D.
> In this case the gain from "no-sort" may be much higher, as the 13 
> "transfer and write" to DFS are applied to a smaller amount of data, 
> while 11 (b-d) sort-n-shuffle-related are saved on the larger data).

Could a combiner be used in this hypothetical case?  If so, then the b-d 
steps might be faster too.

Doug

Re: [jira] Commented: (HADOOP-939) No-sort optimization

Posted by Arkady Borkovsky <ar...@yahoo-inc.com>.

Doug's calculation shows that the total gain can be only 1/3 (15 are  
unavoidable, and taking advantage of largely pre-sorted input reduces  
overhead from 12/27 to 3/18, so the maximum total gain is  27->18.)

Does this model assume that the size of the output of reduce is similar  
to the size of the input?

An important class of applications (mentioned in this thread before)  
uses two inputs:
-- M ("master file") -- very large, presorted and not changing from run  
to run,
-- D ("details file") -- smaller, different from run  to run, not  
necessarily presorted
and the output size is proportional to the size of D.
In this case the gain from "no-sort" may be much higher, as the 13  
"transfer and write" to DFS are applied to a smaller amount of data,  
while 11 (b-d) sort-n-shuffle-related are saved on the larger data).


On Jan 25, 2007, at 5:21 PM, Doug Cutting (JIRA) wrote:

>
>     [  
> https://issues.apache.org/jira/browse/HADOOP-939? 
> page=com.atlassian.jira.plugin.system.issuetabpanels:comment- 
> tabpanel#action_12467717 ]
>
> Doug Cutting commented on HADOOP-939:
> -------------------------------------
>
> I suspect that most of the performance gains to be had by declaring  
> input to be sorted can also be had by using heuristics that also speed  
> things when input is only nearly sorted.  (By "nearly sorted" I mean  
> things like merging a set of updates into a sorted database, e.g., the  
> crawl db update task in Nutch.)
>
> Eric Baldeschwieler proposed a simple model for MapReduce performance.  
>  If you assume that disks can read and write at 100MB/s, and that  
> nodes can talk within rack at 100MB/s (Gb/s) and to nodes in another  
> rack at 10MB/s, then a MapReduce requires the following number of  
> seconds per 100MB.  (Note that this assumes various sort optimizations  
> that are already in progress, where map outputs are buffered and  
> sorted before they're spilled to the local disk on map nodes, and  
> reduce inputs are buffered and merged before they're spilled to the  
> local disk on the reduce node, so that, in many cases, reduce can  
> proceed without an explicit sort stage but simply by merging a set of  
> already sorted input files from the local disk.)
>
> a.  1 read input data from local drive on map node
> [ map ]
> b.  1 write batches of sorted output data to temporary file on map node
> c. 10 shuffle batches of sorted data to reduce node
> d.  1 write batches of sorted data to reduce node
> [ reduce]
> e.  1 write one copy of output locally
> f.  2 transfer and write one copy to another node on the same rack
> g. 11 transfer and write one copy to an off-rack node
>
> So the total is 27s/100MB.  Only two of those are really  
> sort-specific, (b) and (d).  14 (more than half) are unavoidable.
>
> The biggest chunk of fat to go after for pre-sorted input is (c).   
> This can be eliminated if maps can be placed near reduces.  For  
> example, tasktrackers might report the size of each partition they're  
> generating and the jobtracker might use this to schedule reduces on  
> racks which already have a lot of their input.
>
>
>> No-sort optimization
>> --------------------
>>
>>                 Key: HADOOP-939
>>                 URL: https://issues.apache.org/jira/browse/HADOOP-939
>>             Project: Hadoop
>>          Issue Type: New Feature
>>          Components: mapred
>>         Environment: all
>>            Reporter: Doug Judd
>>
>> There should be a way to tell the mapred framework that the output of  
>> the map() phase will already be sorted.  The Reduce phase can just  
>> merge the intermediate files together without sorting.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>

[jira] Commented: (HADOOP-939) No-sort optimization

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467717 ] 

Doug Cutting commented on HADOOP-939:
-------------------------------------

I suspect that most of the performance gains to be had by declaring input to be sorted can also be had by using heuristics that also speed things when input is only nearly sorted.  (By "nearly sorted" I mean things like merging a set of updates into a sorted database, e.g., the crawl db update task in Nutch.)

Eric Baldeschwieler proposed a simple model for MapReduce performance.  If you assume that disks can read and write at 100MB/s, and that nodes can talk within rack at 100MB/s (Gb/s) and to nodes in another rack at 10MB/s, then a MapReduce requires the following number of seconds per 100MB.  (Note that this assumes various sort optimizations that are already in progress, where map outputs are buffered and sorted before they're spilled to the local disk on map nodes, and reduce inputs are buffered and merged before they're spilled to the local disk on the reduce node, so that, in many cases, reduce can proceed without an explicit sort stage but simply by merging a set of already sorted input files from the local disk.)

a.  1 read input data from local drive on map node
[ map ]
b.  1 write batches of sorted output data to temporary file on map node
c. 10 shuffle batches of sorted data to reduce node
d.  1 write batches of sorted data to reduce node
[ reduce]
e.  1 write one copy of output locally
f.  2 transfer and write one copy to another node on the same rack
g. 11 transfer and write one copy to an off-rack node

So the total is 27s/100MB.  Only two of those are really sort-specific, (b) and (d).  14 (more than half) are unavoidable.

The biggest chunk of fat to go after for pre-sorted input is (c).  This can be eliminated if maps can be placed near reduces.  For example, tasktrackers might report the size of each partition they're generating and the jobtracker might use this to schedule reduces on racks which already have a lot of their input.


> No-sort optimization
> --------------------
>
>                 Key: HADOOP-939
>                 URL: https://issues.apache.org/jira/browse/HADOOP-939
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> There should be a way to tell the mapred framework that the output of the map() phase will already be sorted.  The Reduce phase can just merge the intermediate files together without sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-939) No-sort optimization

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517719 ] 

Joydeep Sen Sarma commented on HADOOP-939:
------------------------------------------

not really - what i was looking for was partition but no sort.

> No-sort optimization
> --------------------
>
>                 Key: HADOOP-939
>                 URL: https://issues.apache.org/jira/browse/HADOOP-939
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> There should be a way to tell the mapred framework that the output of the map() phase will already be sorted.  The Reduce phase can just merge the intermediate files together without sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-939) No-sort optimization

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517690 ] 

Arun C Murthy commented on HADOOP-939:
--------------------------------------

@Joydeep

You can control key-partions using a custom partitioner: http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/Partitioner.html 

Does that help?

> No-sort optimization
> --------------------
>
>                 Key: HADOOP-939
>                 URL: https://issues.apache.org/jira/browse/HADOOP-939
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> There should be a way to tell the mapred framework that the output of the map() phase will already be sorted.  The Reduce phase can just merge the intermediate files together without sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.