Posted to common-dev@hadoop.apache.org by "Devaraj Das (JIRA)" <ji...@apache.org> on 2007/01/13 12:27:27 UTC

[jira] Commented: (HADOOP-874) merge code is really slow

    [ https://issues.apache.org/jira/browse/HADOOP-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464436 ] 

Devaraj Das commented on HADOOP-874:
------------------------------------

I have uploaded a patch that keeps the spill files open for the entire duration of the segment merges. This avoids a seek into the spill files for each merge. Could you please test your application with this patch? Thanks. Note that this somewhat contradicts HADOOP-868, in which we try to minimize the number of open files, but if it helps, we can think about how to proceed with this patch. One option: if the number of spills exceeds a certain threshold, we don't keep all the spills open, and the code falls back to what exists in the trunk today. Just to clarify, with today's trunk, at most io.sort.factor spills are open at any one time (some spills are closed as the merge finishes with the segment in question, and so on, so seeks will happen).
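
To make the fallback option concrete, here is a minimal sketch of how the decision could look. It is not code from the attached patch: the class name, the method name, and the keepOpenLimit parameter are all hypothetical, and only io.sort.factor is an existing setting.

```java
// Hypothetical sketch; names are illustrative and not taken from merge-no-seek.patch.
public class SpillOpenPolicy {

    /**
     * Returns the maximum number of spill files held open at once.
     *
     * @param numSpills     number of spill files produced by the map
     * @param sortFactor    value of io.sort.factor (the trunk's limit on open spills)
     * @param keepOpenLimit assumed threshold above which we stop keeping every spill open
     */
    static int maxOpenSpillFiles(int numSpills, int sortFactor, int keepOpenLimit) {
        if (numSpills <= keepOpenLimit) {
            // Few enough spills: keep them all open for the whole merge, avoiding seeks.
            return numSpills;
        }
        // Too many spills: fall back to the trunk behaviour, bounded by io.sort.factor.
        return Math.min(numSpills, sortFactor);
    }

    public static void main(String[] args) {
        System.out.println(maxOpenSpillFiles(2, 10, 50));   // all spills stay open
        System.out.println(maxOpenSpillFiles(200, 10, 50)); // falls back to io.sort.factor
    }
}
```

With a threshold like this, the common case of a handful of spills would avoid the seeks, while a map with a very large number of spills would still respect the open-file concern raised in HADOOP-868.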

> merge code is really slow
> -------------------------
>
>                 Key: HADOOP-874
>                 URL: https://issues.apache.org/jira/browse/HADOOP-874
>             Project: Hadoop
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.10.0
>            Reporter: Owen O'Malley
>         Assigned To: Devaraj Das
>             Fix For: 0.11.0
>
>         Attachments: merge-no-seek.patch
>
>
> I had a case where the map output buffer size (io.sort.mb) was set too low and caused a spill and merge. Fixing the configuration caused it to not spill until it was finished. With the spill it took 9.5 minutes per map. Without the spill it took 45 seconds. Therefore, I assume it was taking ~9 minutes to do the two-file merge. That is really slow. The input files to the merge were two 25 MB sequence files (default codec (java), block compressed).
