Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2014/07/24 09:53:40 UTC

[jira] [Commented] (TEZ-1157) Optimize broadcast :- Tasks pertaining to same job in same machine should not download multiple copies of broadcast data

    [ https://issues.apache.org/jira/browse/TEZ-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072955#comment-14072955 ] 

Bikas Saha commented on TEZ-1157:
---------------------------------

I am guessing this is a WIP patch? vertexParallelism > 32K?
{code}+      final long sourceRead = vertexParallelism * shufflePayload.getOutputSize();
+      if(vertexParallelism > 32*1024 || sourceRead > 1024*1024*1024) {
+        // TODO: configurables
+      }{code}

The transfer of parallelism info from AM to task looks good.
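
One way the "TODO: configurables" in the snippet above might be addressed is to pull the two literals into a small holder whose limits are passed in. This is only a sketch under assumptions: the class name `BroadcastThresholds`, its constructor, and `exceeds` are hypothetical and not part of the patch or of any Tez API; the defaults simply mirror the literals quoted above.

```java
// Hypothetical helper, not from the patch: holds the two limits as parameters
// instead of hard-coded literals.
final class BroadcastThresholds {
    // Defaults mirror the literals in the quoted snippet (32K tasks, 1 GiB).
    static final int DEFAULT_MAX_PARALLELISM = 32 * 1024;
    static final long DEFAULT_MAX_SOURCE_BYTES = 1024L * 1024 * 1024;

    private final int maxParallelism;
    private final long maxSourceBytes;

    BroadcastThresholds(int maxParallelism, long maxSourceBytes) {
        this.maxParallelism = maxParallelism;
        this.maxSourceBytes = maxSourceBytes;
    }

    /** Returns true when either the parallelism limit or the byte limit is exceeded. */
    boolean exceeds(int vertexParallelism, long outputSizeBytes) {
        // Cast to long before multiplying so the product is computed in 64 bits.
        long sourceRead = (long) vertexParallelism * outputSizeBytes;
        return vertexParallelism > maxParallelism || sourceRead > maxSourceBytes;
    }
}
```

With the defaults, `new BroadcastThresholds(BroadcastThresholds.DEFAULT_MAX_PARALLELISM, BroadcastThresholds.DEFAULT_MAX_SOURCE_BYTES).exceeds(parallelism, outputSize)` would behave like the condition in the patch, while the limits could later be read from configuration.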

> Optimize broadcast :- Tasks pertaining to same job in same machine should not download multiple copies of broadcast data
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1157
>                 URL: https://issues.apache.org/jira/browse/TEZ-1157
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>              Labels: performance
>         Attachments: TEZ-1152.WIP.patch, TEZ-broadcast-shuffle+vertex-parallelism.patch
>
>
> Currently, tasks belonging to the same job and running on the same machine each download their own copy of the broadcast data. An optimization would be to download one copy per machine and have the remaining tasks refer to that downloaded copy.
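
The idea in the description above (one download per machine, shared by the other tasks) could be sketched roughly as follows. This is not the approach taken in the attached patches: `SharedBroadcastFetch`, `Downloader`, and `fetchOnce` are hypothetical names, and the sketch assumes tasks run as separate processes that can all see a common local directory. It uses a file lock so that only the first task downloads, and an atomic rename so readers never see a partial file.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: share one downloaded copy among processes on a machine.
final class SharedBroadcastFetch {

    /** Hypothetical callback that writes the broadcast data to the given file. */
    interface Downloader {
        void fetchTo(Path target) throws IOException;
    }

    /**
     * Returns the shared local copy for the given key, downloading it only if
     * no earlier caller has already done so. Intended for separate processes;
     * threads in one JVM would need their own coordination, because
     * overlapping file locks within a single JVM are not permitted.
     */
    static Path fetchOnce(Path sharedDir, String key, Downloader downloader)
            throws IOException {
        Files.createDirectories(sharedDir);
        Path data = sharedDir.resolve(key + ".data");
        Path lockFile = sharedDir.resolve(key + ".lock");
        try (FileChannel channel = FileChannel.open(lockFile,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) {
            // While the lock is held, at most one process can reach this check.
            if (!Files.exists(data)) {
                Path tmp = Files.createTempFile(sharedDir, key, ".tmp");
                downloader.fetchTo(tmp);
                // Publish the finished file under its final name in one step.
                Files.move(tmp, data, StandardCopyOption.ATOMIC_MOVE);
            }
        }
        return data;
    }
}
```

Later callers block on the lock until the first download finishes, then find the file present and return its path without fetching again. Cleanup of the shared directory when the job ends is left out of the sketch.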


