Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2009/01/11 03:42:01 UTC

[jira] Updated: (HADOOP-4888) Use Apache HttpClient for fetching map outputs

     [ https://issues.apache.org/jira/browse/HADOOP-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-4888:
----------------------------------

    Attachment: 4888-1.patch

@Zheng: You're right, I shouldn't have said "degraded."

@Steve: Thanks for the Ivy settings; I hadn't started to consider that yet. The goal here is really identical to that of HADOOP-1338. Reimplementing connection pooling in Hadoop could offer some advantages (e.g. more granular progress reporting), but reusing the work already done in HttpClient seems like a clear win until that work is completed.

I tried a similar, still-preliminary patch, but with max connections per host set to 1 and on a job with different parameters: mapred.reduce.slowstart.completed.maps=1.0, 38272 maps, 448 reducers, and 32MB of (generated) data per map on ~300 nodes. Times are measured from the start of the reduce (after all maps have finished, so stragglers are not a factor) to the end of the shuffle, and are reported as avg / std.d:

|| Version || 1 || 2 || 3 || 4 || 5 || avg || avg job ||
| r732838 | 786.89 / 45.55 | 842.596 / 70.69 | 1458.75 / 83.88 | 1140.93 / 44.22 | 1294.67 / 58.87 | 1104.77 | 2479.8 |
| r732838 + patch | 803.261 / 73.36 | 783.243 / 93.34 | 792.106 / 78.94 | 917.153 / 52.91 | 776.756 / 113.56 | 814.50 | 1955.2 |
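
As a rough check on the table above, the per-run averages can be combined and compared. The following is only an illustrative sketch: the class and method names (ShuffleTimes, mean, relativeReduction) are hypothetical, the numbers are copied from the table, and the standard deviations are ignored.

```java
import java.util.Arrays;

public class ShuffleTimes {
    // Per-run average shuffle times copied from the table (std.d omitted).
    static final double[] BASELINE = {786.89, 842.596, 1458.75, 1140.93, 1294.67};
    static final double[] PATCHED = {803.261, 783.243, 792.106, 917.153, 776.756};

    // Average job times copied from the "avg job" column.
    static final double BASELINE_JOB = 2479.8;
    static final double PATCHED_JOB = 1955.2;

    // Arithmetic mean of the given values; NaN for an empty array.
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    // Fraction by which "after" is smaller than "before".
    static double relativeReduction(double before, double after) {
        return (before - after) / before;
    }

    public static void main(String[] args) {
        double base = mean(BASELINE);
        double patched = mean(PATCHED);
        System.out.printf("baseline shuffle avg: %.2f%n", base);
        System.out.printf("patched shuffle avg:  %.2f%n", patched);
        System.out.printf("shuffle reduction:    %.1f%%%n",
                100 * relativeReduction(base, patched));
        System.out.printf("job reduction:        %.1f%%%n",
                100 * relativeReduction(BASELINE_JOB, PATCHED_JOB));
    }
}
```

With these inputs the average shuffle time falls by roughly a quarter, though five runs per version is a small sample and the baseline runs vary widely.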

Many of the parameters still need to be adjusted. In particular, the timeouts are worth revisiting, as are the numbers of connections and threads at the server and the client. Whether the HEAD + GET imposes a measurable penalty may also merit consideration before this can be committed. Still, the results above demonstrate that a measurable improvement is possible and that this part of the pipeline could be mined for further performance gains.

> Use Apache HttpClient for fetching map outputs
> ----------------------------------------------
>
>                 Key: HADOOP-4888
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4888
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Chris Douglas
>            Assignee: Chris Douglas
>         Attachments: 4888-0.patch, 4888-1.patch
>
>
> It's worth experimenting with the [HttpClient|http://hc.apache.org/httpclient-3.x/] library to speed up the shuffle.
