Posted to common-dev@hadoop.apache.org by "Espen Amble Kolstad (JIRA)" <ji...@apache.org> on 2007/07/13 13:42:04 UTC

[jira] Created: (HADOOP-1609) Optimize MapTask.MapOutputBuffer.spill() by not deserialize/serialize keys/values but use appendRaw

Optimize MapTask.MapOutputBuffer.spill() by not deserialize/serialize keys/values but use appendRaw
---------------------------------------------------------------------------------------------------

                 Key: HADOOP-1609
                 URL: https://issues.apache.org/jira/browse/HADOOP-1609
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.14.0
            Reporter: Espen Amble Kolstad
         Attachments: spill.patch

In MapTask.MapOutputBuffer.spill(), every key and value is read from the buffer, deserialized, and then written to the spill file with append(key, value), which serializes them again:

{code}
      DataInputBuffer keyIn = new DataInputBuffer();
      DataInputBuffer valIn = new DataInputBuffer();
      DataOutputBuffer valOut = new DataOutputBuffer();
      while (resultIter.next()) {
        // deserialize the key from its raw bytes
        keyIn.reset(resultIter.getKey().getData(), 
                    resultIter.getKey().getLength());
        key.readFields(keyIn);
        // copy the raw value bytes into a buffer, then deserialize the value
        valOut.reset();
        (resultIter.getValue()).writeUncompressedBytes(valOut);
        valIn.reset(valOut.getData(), valOut.getLength());
        value.readFields(valIn);
        // append() re-serializes both objects when writing them to the spill file
        writer.append(key, value);
        reporter.progress();
      }
{code}

With complex objects, like Nutch's ParseData or Inlinks, this deserialize/serialize round trip takes time and creates a lot of garbage.

I've created a patch; it seems to work, but I've only tested it on 0.13.0.
It's a bit clumsy, since ValueBytes has to be cast to UncompressedBytes/CompressedBytes in SequenceFile.Writer.
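
Roughly, the spill loop would instead hand the raw bytes straight to the writer, something like the sketch below (illustrative only, not the attached patch; it assumes SequenceFile.Writer.appendRaw(byte[], int, int, ValueBytes) and the raw key/value accessors on the sort iterator):

{code}
      // Sketch: pass the already-serialized key bytes and the raw ValueBytes
      // directly to the writer, skipping the readFields()/append() round trip.
      while (resultIter.next()) {
        DataOutputBuffer rawKey = resultIter.getKey();
        SequenceFile.ValueBytes rawValue = resultIter.getValue();
        writer.appendRaw(rawKey.getData(), 0, rawKey.getLength(), rawValue);
        reporter.progress();
      }
{code}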

Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1609) Optimize MapTask.MapOutputBuffer.spill() by not deserialize/serialize keys/values but use appendRaw

Posted by "Espen Amble Kolstad (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Espen Amble Kolstad updated HADOOP-1609:
----------------------------------------

    Attachment: spill.patch

Patch for trunk


[jira] Updated: (HADOOP-1609) Optimize MapTask.MapOutputBuffer.spill() by not deserialize/serialize keys/values but use appendRaw

Posted by "Espen Amble Kolstad (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Espen Amble Kolstad updated HADOOP-1609:
----------------------------------------

    Attachment: spill.patch

Fixed bug in constructor
