Posted to common-dev@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2008/04/07 18:29:24 UTC

[jira] Commented: (HADOOP-3196) get rid of excessive flushes from PipeMapper/Reducer

    [ https://issues.apache.org/jira/browse/HADOOP-3196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586457#action_12586457 ] 

Doug Cutting commented on HADOOP-3196:
--------------------------------------

I think the purpose of these flushes is to signal activity to the parent process, so that the streaming task is not killed for inactivity.  Removing the flushes, while more efficient, changes this contract.  Perhaps we need something like a timer-based flush running in a separate thread?
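A rough sketch of what such a timer-based flush could look like follows. It is not the streaming code and not a proposed patch: the class name PeriodicFlusher, the interval parameter and the explicit lock are all made up for illustration, and whether a flush is what actually counts as activity for the task timeout is only the supposition stated above. Records are written without a per-record flush, and a single daemon thread flushes the buffer at a fixed delay.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Hadoop): flushes a stream on a timer instead of per record. */
final class PeriodicFlusher implements AutoCloseable {
    private final OutputStream out;
    private final Object lock = new Object();      // guards every write and flush on 'out'
    private final ScheduledExecutorService timer;
    private volatile IOException failure;          // first flush error seen by the timer thread

    PeriodicFlusher(OutputStream out, long intervalMillis) {
        this.out = out;
        this.timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "periodic-flusher");
            t.setDaemon(true);                     // never keep the JVM alive just to flush
            return t;
        });
        timer.scheduleWithFixedDelay(this::flushQuietly,
                intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Writes one record without flushing; rethrows an earlier background flush failure. */
    void write(byte[] record) throws IOException {
        IOException e = failure;
        if (e != null) {
            throw e;
        }
        synchronized (lock) {
            out.write(record);
        }
    }

    private void flushQuietly() {
        try {
            synchronized (lock) {
                out.flush();
            }
        } catch (IOException e) {
            failure = e;                           // surfaced on the next write()
        }
    }

    /** Stops the timer and performs a final flush so no buffered data is lost. */
    @Override
    public void close() throws IOException {
        timer.shutdownNow();
        synchronized (lock) {
            out.flush();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();   // stands in for the child's stdin
        try (PeriodicFlusher flusher =
                 new PeriodicFlusher(new BufferedOutputStream(sink), 50)) {
            flusher.write("key\tvalue\n".getBytes(StandardCharsets.UTF_8));
            Thread.sleep(200);                     // give the timer a few chances to run
            System.out.println("bytes flushed by timer: " + sink.size());
        }
    }
}
```

The trade-off is that records may sit in the buffer for up to one interval before the child process sees them, so the interval would have to stay well below whatever inactivity timeout applies.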

> get rid of excessive flushes from PipeMapper/Reducer
> ----------------------------------------------------
>
>                 Key: HADOOP-3196
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3196
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/streaming
>    Affects Versions: 0.16.2
>            Reporter: Joydeep Sen Sarma
>
> There is a flush on the buffered output streams in the mapper/reducer for every row of data:
>       // 2/4 Hadoop to Tool
>       if (numExceptions_ == 0) {
>         if (!this.ignoreKey) {
>           write(key);
>           clientOut_.write('\t');
>         }
>         write(value);
>         if (!this.skipNewline) {
>           clientOut_.write('\n');
>         }
>         clientOut_.flush();
>       } else {
>         numRecSkipped_++;
>       }
> I tried to measure the impact of removing this. The number of context switches (cs) reported by vmstat shows a marked decline.
> with flush (10 second intervals):
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  4  2    784  23140  83352 3114648    0    0  4819 32397 1175 13220 59 11 13 17
>  1  2    784 129724  80704 3075696    0    0  4614 27196 1156 14797 49 11 19 21
>  4  0    784  24160  83440 3174880    0    0    96 36070 1337 10976 67 11  9 12
>  5  0    784 155872  84400 3158840    0    0   125 44084 1280 11044 68 14 10  8
>  2  1    784 365128  87048 2892032    0    0   119 38472 1317 11610 69 14 10  7
> without flush:
>  5  0    784  24652  56056 3217864    0    0   310 29499 1379  7603 76  9  7  8
>  5  3    784 118456  54568 3209992    0    0  3249 33426 1173  6828 63 11 12 14
>  0  2    784 227628  54820 3198560    0    0  7840 30063 1146  8899 60 10 15 15
>  3  1    784  25608  55048 3313512    0    0  3251 36276 1194  7915 60 10 15 15
>  1  2    784 197324  49968 3194572    0    0  4714 35479 1281  8204 62 13 12 13
> cs goes down by about 20-30%, but I am having trouble measuring the overall speed improvement (too many variables due to speculative execution etc.; a better benchmark is needed).
> It can't hurt.
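
As a rough cross-check of the cs figures quoted above, the snippet below averages the cs column of the five samples from each run and computes the relative drop. The class and method names are made up for illustration; the numbers are copied from the vmstat output in the description.

```java
import java.util.Locale;

/** Hypothetical helper: compares mean context switches across the two quoted vmstat runs. */
final class ContextSwitchDelta {
    // cs column of the five samples quoted for each run
    static final int[] WITH_FLUSH = {13220, 14797, 10976, 11044, 11610};
    static final int[] WITHOUT_FLUSH = {7603, 6828, 8899, 7915, 8204};

    static double mean(int[] samples) {
        long sum = 0;
        for (int s : samples) {
            sum += s;
        }
        return (double) sum / samples.length;
    }

    /** Fraction by which the mean of 'after' is lower than the mean of 'before'. */
    static double relativeReduction(int[] before, int[] after) {
        return 1.0 - mean(after) / mean(before);
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "mean cs with flush:    %.1f%n", mean(WITH_FLUSH));
        System.out.printf(Locale.ROOT, "mean cs without flush: %.1f%n", mean(WITHOUT_FLUSH));
        System.out.printf(Locale.ROOT, "relative reduction:    %.1f%%%n",
                100.0 * relativeReduction(WITH_FLUSH, WITHOUT_FLUSH));
    }
}
```

On these samples the means come out at about 12,329 and 7,890, a drop of roughly 36%, which is somewhat above the 20-30% estimate in the description. Five samples per run is a small basis either way.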
