Posted to yarn-issues@hadoop.apache.org by "Zhijie Shen (JIRA)" <ji...@apache.org> on 2015/03/18 23:42:41 UTC
[jira] [Comment Edited] (YARN-3040) [Data Model] Implement client-side API for handling flows

    [ https://issues.apache.org/jira/browse/YARN-3040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368064#comment-14368064 ] 

Zhijie Shen edited comment on YARN-3040 at 3/18/15 10:42 PM:
-------------------------------------------------------------

I've just uploaded a patch. It's an end-to-end modification that allows the context information to be passed from the client to the backend storage. The context information includes *clusterId*, *userId*, *flowId*, *flowRunId* and *appId*. According to YARN-3240, a new TimelineClient is constructed per application, and within the context of one application we can reasonably assume this context information does not change. Therefore, it only needs to be specified when the client is constructed. The context information should be gathered by or passed to the AM and NM so that they can construct the timeline client properly. For example, for the AM, this information can be passed via env inside the CLC. However, that is out of the scope of this Jira; we will cover that integration once we make a particular framework AM use the new timeline client.

Back to the context information: some of the fields can be null, and some don't need to be specified explicitly:

* *clusterId*: The application should specify a unique cluster ID; otherwise the cluster ID defaults to cluster_<start timestamp of RM>.
* *userId*: The user doesn't need to specify this information. Instead, it is obtained from the current UGI of the client.
* *flowId*: The user either passes in a flowId or, if it is an orphan application, the flowId is the appId with its prefix replaced by "flow".
* *flowRunId*: If it is an orphan application, it's 0. The reason it should be 0 rather than the current timestamp at timeline client creation is that there may be multiple clients in the AM and NMs, constructed at different times. They need to be synced on the same flowRunId.
* *appId*: It's the only mandatory context information as we defined before. The client is constructed to only work with one application.
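
To make the defaulting rules above concrete, here is a minimal sketch. It is not the code in the patch: the class name {{TimelineContextSketch}}, its constructor signature and the {{defaultFlowId}} helper are hypothetical. It also assumes that the appId has the usual {{application_<clusterTimestamp>_<sequence>}} form, that the prefix is everything before the first underscore, and that the caller passes in the current user name (standing in for the UGI lookup) and the RM start timestamp.

```java
import java.util.Objects;

/** Hypothetical holder for the five context fields; illustrative only. */
final class TimelineContextSketch {
    final String clusterId;
    final String userId;
    final String flowId;
    final long flowRunId;
    final String appId;

    /**
     * clusterId, flowId and flowRunId may be null and are then defaulted.
     * currentUser stands in for the user taken from the client's UGI.
     * rmStartTimestamp is assumed to be known to the caller.
     */
    TimelineContextSketch(String clusterId, String currentUser, String flowId,
                          Long flowRunId, String appId, long rmStartTimestamp) {
        // appId is the only mandatory piece of context information.
        this.appId = Objects.requireNonNull(appId, "appId is mandatory");
        this.userId = Objects.requireNonNull(currentUser, "currentUser");
        // Default cluster ID: cluster_<start timestamp of RM>.
        this.clusterId = clusterId != null ? clusterId : "cluster_" + rmStartTimestamp;
        // Orphan application: derive the flow ID from the app ID.
        this.flowId = flowId != null ? flowId : defaultFlowId(appId);
        // Orphan application: a fixed 0, so that clients created at
        // different times in the AM and NMs agree on the same value.
        this.flowRunId = flowRunId != null ? flowRunId : 0L;
    }

    /** Replaces the part before the first underscore with "flow" (assumed rule). */
    static String defaultFlowId(String appId) {
        int sep = appId.indexOf('_');
        return sep < 0 ? "flow_" + appId : "flow" + appId.substring(sep);
    }
}
```

A fixed default of 0 matters here because a timestamp taken at construction time would differ between clients, and entities of the same application would then be written under different flow runs.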

I changed the web service endpoint accordingly to make it RESTful, and changed the writer interface to pass in the context information when putting an entity. In addition, I've modified the FS-based writer implementation to reflect the change. The entity file is placed at {{root/entities/<clusterId>/<userId>/<flowId>/<flowRunId>/<appId>/<entityType>/<entityId>.thist}}. The change has been verified by TestDistributedShell and TestFileSystemTimelineWriterImpl.
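
As an illustration of the directory layout above, the following sketch assembles the entity file path from the context fields. The class {{EntityPathSketch}} and its {{entityFile}} method are hypothetical names, and the use of {{java.nio.file}} is an assumption made to keep the example self-contained; the actual writer in the patch may build the path differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper mirroring the layout described above; illustrative only. */
final class EntityPathSketch {
    /** File extension used for entity files in the layout above. */
    static final String EXTENSION = ".thist";

    /**
     * Builds
     * root/entities/clusterId/userId/flowId/flowRunId/appId/entityType/entityId.thist
     * from the given root directory and context fields.
     */
    static Path entityFile(String root, String clusterId, String userId,
                           String flowId, long flowRunId, String appId,
                           String entityType, String entityId) {
        return Paths.get(root, "entities", clusterId, userId, flowId,
                Long.toString(flowRunId), appId, entityType,
                entityId + EXTENSION);
    }
}
```

Because every context field is one path component, all entities of one application end up under a single directory, and all applications of one flow run share a common parent.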





> [Data Model] Implement client-side API for handling flows
> ---------------------------------------------------------
>
>                 Key: YARN-3040
>                 URL: https://issues.apache.org/jira/browse/YARN-3040
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Zhijie Shen
>         Attachments: YARN-3040.1.patch
>
>
> Per design in YARN-2928, implement client-side API for handling *flows*. Frameworks should be able to define and pass in all attributes of flows and flow runs to YARN, and they should be passed into ATS writers.
> YARN tags were discussed as a way to handle this piece of information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)