Posted to mapreduce-issues@hadoop.apache.org by "Scott Chen (JIRA)" <ji...@apache.org> on 2010/12/07 20:50:13 UTC

[jira] Created: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

MapTask and ReduceTask should only compress/decompress the final map output file
--------------------------------------------------------------------------------

                 Key: MAPREDUCE-2212
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2212
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: task
    Affects Versions: 0.23.0
            Reporter: Scott Chen
            Assignee: Scott Chen
             Fix For: 0.23.0


Currently, if we set mapred.map.output.compression.codec:
1. MapTask will compress every spill, decompress every spill, then merge and compress the final map output file.
2. ReduceTask will decompress, merge, and compress every map output file, and repeat the compression/decompression on every pass.

This causes all the data to be compressed and decompressed many times.
The reason we need mapred.map.output.compression.codec is network traffic.
We should not compress and decompress the data again and again during the merge sort.

We should only compress the final map output file that will be transmitted over the network.
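
For context, this is roughly how a job opts into map output compression today (a minimal sketch against the old-style JobConf API; the LzoCodec class name assumes the hadoop-lzo package is installed):
{code}
import org.apache.hadoop.mapred.JobConf;

public class MapOutputCompressionExample {
  public static JobConf configure() {
    JobConf conf = new JobConf();
    // Compress map outputs -- today this covers every spill as well as
    // the final merged file, which is exactly what this issue is about.
    conf.setCompressMapOutput(true);
    // Equivalent to setting mapred.map.output.compression.codec.
    conf.setMapOutputCompressorClass(com.hadoop.compression.lzo.LzoCodec.class);
    return conf;
  }
}
{code}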

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969152#action_12969152 ] 

Joydeep Sen Sarma commented on MAPREDUCE-2212:
----------------------------------------------

Todd - do you know for sure that the benefit is due to the compression of the final spill, or because of the compression of the intermediate sort runs? I am thinking that if the experiment just turns compression on/off and runs some benchmark, then it wouldn't be clear whether any win is from lower network latencies (map->reduce) or from faster mappers (if they were disk bound without compression).

In general I have seen that the map-reduce stack consumes data at a very low rate (it's CPU bound by the time it gets to 10-20 MBps). (Obviously this is a very loose statement and depends a lot on what the mappers are doing, etc.) So even with 6 disks (say a total of 300 MBps streaming read/write bandwidth) and 8 cores (say about 200 MBps processing bandwidth), it would seem that we would be CPU bound before we would be disk-throughput bound. It would be nice to get more accurate numbers along these lines.



[jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Scott Chen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969099#action_12969099 ] 

Scott Chen commented on MAPREDUCE-2212:
---------------------------------------

In our case, the bottleneck is usually the CPU or the network.
I like Joydeep's idea. It would be nice to have two separate codec options: one for intermediate compression (for disk IO) and one for final output compression (for network traffic).
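
A hypothetical sketch of what those two options might look like (neither property name exists in Hadoop; both are invented here for illustration):
{code}
import org.apache.hadoop.mapred.JobConf;

public class TwoCodecSketch {
  public static JobConf configure() {
    JobConf conf = new JobConf();
    // Hypothetical: a fast, cheap codec for intermediate spills and
    // merge passes, where the win is disk IO.
    conf.set("mapred.map.output.spill.compression.codec",
        "com.hadoop.compression.lzo.LzoCodec");
    // Hypothetical: a stronger codec for the final map output file,
    // where the win is network traffic.
    conf.set("mapred.map.output.final.compression.codec",
        "org.apache.hadoop.io.compress.GzipCodec");
    return conf;
  }
}
{code}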



[jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969049#action_12969049 ] 

Mahadev konar commented on MAPREDUCE-2212:
------------------------------------------

Joydeep,
 Shouldn't compressing the on-disk version be the default when JobConf.setCompressMapOutput() is set to true? Jobs that have been running with this property set should keep the same disk/network footprint they have today, no?



[jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Scott Chen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970925#action_12970925 ] 

Scott Chen commented on MAPREDUCE-2212:
---------------------------------------

I have done some experiments on the latency.
In the experiment, 500 MB of data is read from disk, compressed, and written back to disk.
It shows that the throughput of LZO is slightly worse than with no codec, but they are very close.

I think for latency there is not much difference.
The question here is about the trade-off between disk IO and CPU.
Using LZO costs more CPU (I don't have numbers for this) but cuts the disk IO to about 50%.

{code}
================================================
Initialize codec lzo 
Finished. Time: 10278 ms
File size: 239.19908142089844MB Compression ratio: 0.501636832
Throughput: 47.50741875851333MB/s
================================================
Initialize codec gz
Finished. Time: 38132 ms
File size: 161.91629219055176MB Compression ratio: 0.339563076
Throughput: 12.805025962446239MB/s
================================================
Initialize codec none
Finished. Time: 8783 ms
File size: 476.837158203125MB Compression ratio: 1.0
Throughput: 55.59390299442104MB/s
================================================
{code}
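
Reading those numbers back: with LZO, the 500 MB input shrinks to ~239 MB written in ~10.3 s, so the disk only absorbs ~23 MB/s while the pipeline consumes input at ~47.5 MB/s; with no codec the disk takes the full ~55.6 MB/s. In other words, on this machine LZO roughly halves the bytes hitting disk at the cost of about 15% of input throughput.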

Here is a simple example that produces these numbers.
{code}
// Imports added for completeness. Compression here is assumed to be the
// HBase hfile compression helper (org.apache.hadoop.hbase.io.hfile.Compression),
// whose Algorithm values match the lzo/gz/none names in the output above.
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import junit.framework.TestCase;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.io.compress.CompressionCodec;

public class TestCodecDiskIO extends TestCase {

  Log LOG = LogFactory.getLog(TestCodecDiskIO.class);

  static {
    System.setProperty(Compression.Algorithm.CONF_LZO_CLASS,
        "com.hadoop.compression.lzo.LzoCodec");
  }

  public void testCodecWrite()
      throws Exception {
    File dataFile = new File("/home/schen/data/test_data");
    print("Data file: " + dataFile.getName());
    DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream(dataFile)));
    // 500 MB of input, matching the file size reported above.
    int dataLength = 500 * 1000 * 1000;
    byte buff[] = new byte[dataLength];
    print("Start reading file. Read length = " + dataLength);
    long start = now();
    // read() may return fewer bytes than requested; readFully fills the buffer.
    in.readFully(buff);
    long timeSpent = now() - start;
    in.close();
    print("Reading time: " + timeSpent);

    // Baseline: time a pure in-memory copy of the same data.
    byte buff2[] = new byte[dataLength];
    start = now();
    System.arraycopy(buff, 0, buff2, 0, buff.length);
    timeSpent = now() - start;
    print("Memory copy time: " + timeSpent);

    int count = 3;

    for (int i = 0; i < count; ++i) {
      for (Compression.Algorithm algo : Compression.Algorithm.values()) {
        print("================================================");
        print("Initialize codec " + algo.getName());
        CompressionCodec codec = algo.getCodec();
        File temp = File.createTempFile("test", "", new File("/tmp"));
        temp.deleteOnExit();
        FileOutputStream fout = new FileOutputStream(temp);
        BufferedOutputStream bout = new BufferedOutputStream(fout);
        OutputStream out;
        if (codec != null) {
          out = codec.createOutputStream(bout);
        } else {
          out = bout;  // the "none" algorithm writes raw bytes
        }
        print("Start writing");
        start = now();
        out.write(buff);
        out.flush();
        // Force the bytes to disk before stopping the clock.
        fout.getFD().sync();
        out.close();
        timeSpent = now() - start;
        print("Finished. Time: " + timeSpent + " ms");
        print("File size: " + (temp.length() / 1024.0 / 1024.0) + "MB" +
            " Compression ratio: " + temp.length() / (double) dataLength);
        // bytes / ms / 1024 = KiB per ms, within a few percent of MB/s.
        print("Throughput: " + dataLength / (double) timeSpent / 1024.0 + "MB/s");
      }
    }
    print("================================================");
  }

  private void print(String s) {
    System.out.println(s);
  }

  private long now() {
    return System.currentTimeMillis();
  }
}
{code}



[jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Scott Chen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970962#action_12970962 ] 

Scott Chen commented on MAPREDUCE-2212:
---------------------------------------

I think maybe we should leave this the way it is right now, since there is no huge difference.
More changes would increase the complexity.

What do you guys think?



[jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969034#action_12969034 ] 

Joydeep Sen Sarma commented on MAPREDUCE-2212:
----------------------------------------------

We should have a separate option for compressing intermediate runs (to optimize disk bandwidth for folks who need it).



[jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969085#action_12969085 ] 

Chris Douglas commented on MAPREDUCE-2212:
------------------------------------------

Todd's point on disk bandwidth matches some benchmarks we did a couple years ago. Compressing the intermediate data improved the spill and merge times. It would be interesting to see if those results hold today, and for which codecs.

In the case where no records are collected after the soft spill, the intermediate output will either need to be rewritten (since the reduce is expecting compressed output) or the shuffle will need to handle mixed segments. It's a rare case, but one the framework would need to handle.



[jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969021#action_12969021 ] 

Todd Lipcon commented on MAPREDUCE-2212:
----------------------------------------

Do we have data to confirm that intermediate compression is only useful for reducing network traffic? It seems we're also reducing disk IO, which can be a bottleneck, especially when the core:disk ratio is high.



[jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969142#action_12969142 ] 

Todd Lipcon commented on MAPREDUCE-2212:
----------------------------------------

We've found that even on single-rack clusters (where bandwidth is usually not the bottleneck), LZO intermediate compression almost always helps. That indicates that in many workloads, at least, we're bound more on intermediate IO than on CPU. This is consistent with what we see on most clusters with 4-6 disks. Clusters with 12 local disks are more often bound on network or CPU.



[jira] Updated: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Scott Chen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Chen updated MAPREDUCE-2212:
----------------------------------

    Description: 
Currently, if we set mapred.map.output.compression.codec:
1. MapTask will compress every spill, decompress every spill, then merge and compress the final map output file.
2. ReduceTask will decompress, merge, and compress every map output file, and repeat the compression/decompression on every pass.

This causes all the data to be compressed and decompressed many times.
The reason we need mapred.map.output.compression.codec is network traffic.
We should not compress and decompress the data again and again during the merge sort.

We should only compress the final map output file that will be transmitted over the network.

  was:
Currently if we set mapred.map.output.compression.codec
1. MapTask will compress every spill, decompress every spill, merge and compress the final map output file
2. ReduceTask will decompress, merge and compress every map output file. And repeat the compression/decompression every pass.

This cause all the data being compressed/decompressed many times.
The reason we need mapred.map.output.compression.codec is for network traffic.
We should not compress/decompress the data again and again during merge sort.

We should do the compression only for the final map output file that is been transmit over the network.




[jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969071#action_12969071 ] 

Joydeep Sen Sarma commented on MAPREDUCE-2212:
----------------------------------------------

Makes sense - I hadn't thought about backwards compatibility. So that would imply an additional (new) option to turn off intermediate-run compression.



[jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Scott Chen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969137#action_12969137 ] 

Scott Chen commented on MAPREDUCE-2212:
---------------------------------------

bq. In our case, the bottleneck is usually the CPU or the network.

Let me take that back. I don't have numbers for this one.

I think the intuition is that latency should be better if we do LZO compression on the intermediate data.
For throughput, it varies. If the TT is constantly running at 100% CPU, we should probably accept some extra disk IO to save CPU.



[jira] Resolved: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Posted by "Scott Chen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Chen resolved MAPREDUCE-2212.
-----------------------------------

    Resolution: Won't Fix

I am closing this now because I think there is not much benefit in doing this.
It would increase the complexity of the code.
