
Posted to mapreduce-issues@hadoop.apache.org by "Daniel Templeton (JIRA)" <ji...@apache.org> on 2016/06/08 14:57:20 UTC

[jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15320692#comment-15320692 ] 

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

Hadoop Streaming is limited by the fact that all intermediate data is passed as strings.  In most cases, the cost of translating those strings back into the intended data types makes Hadoop Streaming so much slower than Java MapReduce that tuning the Hadoop Streaming implementation won't make a significant dent.  Turning strings into numbers is expensive.  Using interpreted languages is expensive.  If you want better performance, you should consider Java MapReduce or, better yet, Spark, e.g. PySpark.

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
>                 Key: MAPREDUCE-6712
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: He Tianyi
>            Priority: Minor
>
> In Hadoop Streaming with TextInputWriter, the reducer program receives each line from {{stdin}} as a single (k, v) tuple; values with identical keys are not grouped.
> This causes inefficiency, especially for interpreter-based runtimes (e.g. CPython), because:
> A. the user program has to compare each key with the previous one (even though, on the Java side, records already arrive at the reducer in groups),
> B. the user program has to perform a {{read}}, then a {{find}} or {{split}}, on every record, even if there are multiple values with an identical key,
> C. if keys are long, this apparently also hurts caching efficiency.
> Presumably we need another InputWriter. But that alone is not enough, since the {{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not {{writeValues}}. We could compare keys in a custom InputWriter and group them there, but that is also inefficient. Some other changes are also needed.
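
For illustration, points A and B in the quoted description correspond roughly to the work sketched below. This is only a minimal sketch of what a Python streaming reducer typically has to do today, assuming tab-separated key/value lines (the Hadoop Streaming default). The helper names parse and grouped are hypothetical and are not part of Hadoop.

```python
from itertools import groupby


def parse(line, sep="\t"):
    """Split one streaming record into (key, value).

    The value is an empty string when the line has no separator.
    """
    key, _, value = line.rstrip("\n").partition(sep)
    return key, value


def grouped(lines, sep="\t"):
    """Yield (key, [values]) for runs of consecutive records sharing a key.

    Every line is still split individually (point B), and groupby
    compares each key with the previous one (point A), even though the
    Java side has already delivered the records in key order.
    """
    records = (parse(line, sep) for line in lines)
    for key, group in groupby(records, key=lambda kv: kv[0]):
        yield key, [value for _, value in group]


# Hypothetical sample input standing in for the reducer's stdin.
sample = ["a\t1\n", "a\t2\n", "b\t3\n"]
for key, values in grouped(sample):
    print(key, values)
```

In a real reducer, sys.stdin would be passed to grouped in place of the sample list. The per-line split and per-record key comparison remain in the user program either way, which is the overhead the issue proposes moving to the Java side.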


