Posted to common-user@hadoop.apache.org by 侯锐 <ho...@ict.ac.cn> on 2011/07/05 16:20:17 UTC

Where does the compression take place for MapOutputStream in Map phase?

Hello guys, 
We would like to know where compression takes place for the MapOutputStream in the Map phase.

We guess there are two possible places in sortAndSpill() in MapTask.java:
Writer.append() or Writer.close()
Which one performs the compression? 
Thanks very much for your response.

See the lines marked with /******/ below (from sortAndSpill() in MapTask.java).

for (int i = 0; i < partitions; ++i) {
  IFile.Writer<K, V> writer = null;
  try {
    writer = new Writer<K, V>(job, out, keyClass, valClass, codec,
                              spilledRecordsCounter);
    if (combinerRunner == null) {
        …
        key.reset(kvbuffer, kvindices[kvoff + KEYSTART],
                  (kvindices[kvoff + VALSTART] - 
                   kvindices[kvoff + KEYSTART]));
        /**************************************/
        writer.append(key, value);   // The 1st possible place
        ++spindex;
      }
    } else {
      …
    }
    …

    // close the writer
    /**************************************/
    writer.close();   // The 2nd possible place

--
Rui Hou (侯锐)
Institute of Technology, Chinese Academy of Sciences





Re: Where does the compression take place for MapOutputStream in Map phase?

Posted by Harsh Chouraria <ha...@cloudera.com>.
Hello Rui Hou,

If you look at the Writer constructor used here, you'll get your answer quite easily: it takes a codec (a CompressionCodec, to be specific) as an argument. If the codec is not null (it is null when compression is disabled), it wraps the actual output stream, so all data written through the writer is compressed on the way out.

The codec variable itself is initialized accordingly when the MapOutputStream is constructed.

If you'd like to see how a codec works internally, the code lives in the common code for the chosen algorithm; the DefaultCodec class is a good starting point.
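To see the stream-wrapping pattern concretely: each append() pushes bytes through the compressing stream, and close() flushes the final compressed block, so both of the lines you marked are involved. The following is only a sketch using the JDK's java.util.zip wrappers rather than Hadoop's CompressionCodec API, but the wrapping idea is the same:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class StreamWrapDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        // Wrap the destination stream: everything written through the
        // wrapper comes out compressed. This is the same pattern the codec
        // applies to IFile.Writer's underlying output stream.
        DeflaterOutputStream compressed = new DeflaterOutputStream(raw);

        byte[] record = "key\tvalue\n".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < 1000; i++) {
            compressed.write(record);  // analogous to writer.append(key, value)
        }
        compressed.close();            // flushes the final compressed block,
                                       // analogous to writer.close()

        System.out.println("uncompressed bytes: " + 1000 * record.length);
        System.out.println("compressed bytes:   " + raw.size());

        // Round-trip to confirm nothing was lost.
        InflaterInputStream in =
            new InflaterInputStream(new ByteArrayInputStream(raw.toByteArray()));
        int total = 0;
        while (in.read() != -1) total++;
        System.out.println("decompressed bytes: " + total);
    }
}
```

Running this shows the compressed size is far smaller than the 10,000 input bytes (the records are highly repetitive), and the round-trip recovers every byte.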

I hope this helps! :)

P.s. Please do not cross-post to multiple lists while seeking an answer. For future mapreduce development questions such as this, please direct them to mapreduce-dev@hadoop.apache.org

On 05-Jul-2011, at 7:50 PM, 侯锐 wrote:

> Hello guys, 
> We would like to know where compression takes place for the MapOutputStream in the Map phase.
> 
> We guess there are two possible places in sortAndSpill() in MapTask.java:
> Writer.append() or Writer.close()
> Which one performs the compression? 
> Thanks very much for your response.