Posted to issues@tez.apache.org by "Hitesh Shah (JIRA)" <ji...@apache.org> on 2013/08/30 02:04:52 UTC

[jira] [Commented] (TEZ-410) Refactor Edge Connection Pattern to be more clear

    [ https://issues.apache.org/jira/browse/TEZ-410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754234#comment-13754234 ] 

Hitesh Shah commented on TEZ-410:
---------------------------------

Comments:

{code}
+      default : throw new RuntimeException("unknown 'SchedulingType'");
{code}
  - It might help to include the actual enum value that was not handled in the exception message.
  - The same change may be required in other places in the same class ( DagTypeConverters.java ).
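
To illustrate the first point, here is a minimal sketch of a default branch that reports the offending value. The class name, the describe method, and the OTHER constant are hypothetical stand-ins for illustration only; this is not the actual code in DagTypeConverters.java.

```java
// Sketch only: SchedulingTypeSketch, describe() and OTHER are hypothetical
// stand-ins, not the real Tez classes or the contents of the patch.
class SchedulingTypeSketch {

    // Stand-in enum; OTHER exists only to exercise the default branch.
    enum SchedulingType { SEQUENTIAL, OTHER }

    // Converts the enum to a string, failing loudly on unhandled values.
    static String describe(SchedulingType type) {
        switch (type) {
            case SEQUENTIAL:
                return "SEQUENTIAL";
            default:
                // Appending the value shows which constant was not handled.
                throw new RuntimeException("unknown 'SchedulingType': " + type);
        }
    }
}
```

With this shape, an unhandled constant shows up by name in the exception message, so the failing case can be identified from the log alone.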

{code}
+    /**
+     * Data produced by the source task is persisted and available even when the
+     * task is not running. The data may be unavailable and may cause the source
+     * task to be re-executed.
+     */
+    PERSISTED,
{code}
   - "... data may be*come* unavailable ... "

   - "source task is stored in reliably" --> remove the "in" ?

Looks good apart from the minor nits above. Good to commit once they are addressed.
                
> Refactor Edge Connection Pattern to be more clear
> -------------------------------------------------
>
>                 Key: TEZ-410
>                 URL: https://issues.apache.org/jira/browse/TEZ-410
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-410.1.patch, TEZ-410.2.patch, TEZ-410.3.patch, TEZ-410.4.patch
>
>
> During discussions with users, there was feedback that edge properties need better names to make them clearer. There was a suggestion to look at MPI for inspiration. Based on that feedback, the proposal is to rename ConnectionPattern to DataMovement, as that is essentially what the property defines. A Bipartite connection pattern can be constructed from both broadcast and scatter-gather data movement types. There will be 3 kinds of data movement initially.
> ONE_TO_ONE - Defines that an output produced by the ith upstream task is available to the ith downstream task.
> BROADCAST - Defines that an output produced by any upstream task is available to all downstream tasks.
> SCATTER_GATHER - Defines that the ith output produced by all upstream tasks is available to the same downstream task. Upstream tasks scatter their outputs and they are gathered by designated downstream tasks.
> To be clear, output being available to a task does not imply that the entire output is transferred to or read by it. The task can choose to read any amount of the total data.
> Current users: In the EdgeProperty object
> Please change EdgeConnectionPattern.BIPARTITE -> DataMovementType.SCATTER_GATHER
> Please change SourceType.STABLE -> DataSourceType.PERSISTED
> Please add SchedulingType.SEQUENTIAL to EdgeProperty objects.
> The getter methods have similar name changes.
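
The three data movement types described in the issue can be sketched as a mapping from an upstream output to the downstream tasks that may read it. This is a hedged illustration, not Tez code: the class DataMovementSketch and the method consumers are hypothetical, and the sketch assumes that under SCATTER_GATHER the ith output is gathered by downstream task i, which the description does not state explicitly.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: DataMovementSketch and consumers() are hypothetical names
// used to illustrate the semantics; they are not part of the Tez API.
class DataMovementSketch {

    enum DataMovementType { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    // Returns the indices of the downstream tasks to which output number
    // outputIndex of upstream task upstreamTask is available.
    static List<Integer> consumers(DataMovementType type, int upstreamTask,
                                   int outputIndex, int numDownstream) {
        List<Integer> result = new ArrayList<>();
        switch (type) {
            case ONE_TO_ONE:
                // The ith upstream task feeds the ith downstream task.
                result.add(upstreamTask);
                break;
            case BROADCAST:
                // Any upstream output is available to every downstream task.
                for (int d = 0; d < numDownstream; d++) {
                    result.add(d);
                }
                break;
            case SCATTER_GATHER:
                // Assumption: the ith output of every upstream task is
                // gathered by downstream task i.
                result.add(outputIndex);
                break;
            default:
                throw new IllegalArgumentException("unknown DataMovementType: " + type);
        }
        return result;
    }
}
```

Availability here only means that a downstream task may read the output; as the description notes, the task can choose to read any amount of it.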
