Posted to mapreduce-issues@hadoop.apache.org by "He Tianyi (JIRA)" <ji...@apache.org> on 2016/06/08 23:35:20 UTC

[jira] [Comment Edited] (MAPREDUCE-6712) Support grouping values for reducer on java-side

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321658#comment-15321658 ] 

He Tianyi edited comment on MAPREDUCE-6712 at 6/8/16 11:34 PM:
---------------------------------------------------------------

Actually, in my experiments (in-house workload), converting strings back and forth is not the bottleneck (switching to typedbytes makes no difference). But grouping values alone makes a simple reducer 20% faster (for both text and typedbytes).
Also, many users implement their mappers/reducers in C/C++, which I think can be more efficient than Java/Scala (smaller memory footprint, less GC, better SIMD support, etc.).


was (Author: he tianyi):
Actually, in my experiments (in-house workload), converting strings back and forth is not the bottleneck (switching to typedbytes makes no difference). But grouping values alone makes a simple reducer 20% faster (for both text and typedbytes).
Also, many users implement their mappers/reducers in C/C++, which I think can be more efficient than Java/Scala (smaller memory footprint, less GC, no virtual call overhead, better SIMD support, etc.).

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
>                 Key: MAPREDUCE-6712
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: He Tianyi
>            Priority: Minor
>
> In Hadoop streaming, with TextInputWriter, the reducer program receives each line from {{stdin}} as a single (k, v) tuple; values with an identical key are not grouped.
> This introduces some inefficiency, especially for interpreter-based runtimes (e.g. CPython), because:
> A. the user program has to compare each key with the previous one (even though, on the Java side, records already arrive at the reducer in groups),
> B. the user program has to perform a {{read}}, then a {{find}} or {{split}}, on each record, even if there are multiple values with an identical key,
> C. if keys are long, repeating the key with every value is inefficient for caching.
> Presumably we need another InputWriter. But this is not enough, since the {{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not {{writeValues}}. We could compare keys in a custom InputWriter and group them there, but this is also inefficient. Some other changes are also needed.
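
For illustration, here is a minimal sketch of what points A and B mean for a Python streaming reducer today. It assumes the default tab separator between key and value; the function names ({{parse_line}}, {{grouped}}, {{main}}) and the value-counting logic are hypothetical and not taken from any patch on this issue.

```python
import sys
from itertools import groupby


def parse_line(line, sep="\t"):
    """Split one streaming record into (key, value).

    This per-record split is the cost described in point B.
    """
    key, _, value = line.rstrip("\n").partition(sep)
    return key, value


def grouped(stream, sep="\t"):
    """Yield (key, [values]) for each run of consecutive identical keys.

    groupby compares every key with the previous one (point A), even though
    the Java side already delivers the records sorted and grouped.
    """
    records = (parse_line(line, sep) for line in stream)
    for key, group in groupby(records, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Example reducer body (hypothetical): count the values seen per key.
    for key, values in grouped(stdin):
        stdout.write(f"{key}\t{len(values)}\n")
```

A real reducer script would call {{main()}} at the end of the file. Note that the key is re-read and re-split for every value, which is the overhead that grouping on the Java side would avoid.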


