Posted to yarn-issues@hadoop.apache.org by "Vrushali C (JIRA)" <ji...@apache.org> on 2018/08/03 10:43:00 UTC
[jira] [Comment Edited] (YARN-5357) Timeline service v2 integration with Federation

    [ https://issues.apache.org/jira/browse/YARN-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568085#comment-16568085 ] 

Vrushali C edited comment on YARN-5357 at 8/3/18 10:42 AM:
-----------------------------------------------------------

Update:
[~prabham] [~Sushil-K-S] [~abmodi] [~rohithsharma] and I have been discussing federation integration with timeline service in the recent community calls.

Here is a summary of our current assumptions, understanding and direction:
- There exists a managed AM which we will refer to as 'original AM' for this discussion
- There exist sub cluster ids, which are the physical cluster ids defined in yarn-site.xml
- Unmanaged AMs, which coordinate containers in other sub clusters, run in the JVM context of the NodeManager that manages the original AM.
- The same app id is used on all sub clusters
- Containers launched on other sub clusters will have different epoch ids. 

Current thought process:
- As per our current understanding, the flow context is initialized once and then used by all containers launched as part of that application.
- Say an application is launched on sub Cluster A and runs containers on sub clusters A, B and C.
- Entities from containers launched on the sub cluster A will have the cluster id as sub cluster A in their row key
- The entities from containers that run on different sub clusters B and C will also have the cluster id as sub cluster A.
- The containers will be updated to emit to timeline storage the actual physical sub cluster that they run on. For instance, containers that run on sub clusters B and C will have sub cluster A in their row key but will store sub cluster B (or C, as the case may be) in their info column family.
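
To make the row key versus info split concrete, here is a minimal Python sketch. It is not the actual timeline service schema: the make_entity_row helper, the "!" separator, the simplified key layout, the SUB_CLUSTER field name and the ids are all hypothetical and only illustrate the idea.

```python
# Hypothetical illustration only; this is not the real timeline service schema.

def make_entity_row(home_sub_cluster, app_id, entity_id, physical_sub_cluster):
    """Build a simplified (row_key, info) pair for one entity.

    The row key always carries the sub cluster where the application was
    launched; the info map records where the container actually ran.
    """
    row_key = "!".join([home_sub_cluster, app_id, entity_id])
    info = {"SUB_CLUSTER": physical_sub_cluster}  # assumed field name
    return row_key, info


app_id = "application_1_0001"  # made-up app id, the same on all sub clusters
rows = [
    make_entity_row("subClusterA", app_id, "container_1", "subClusterA"),
    make_entity_row("subClusterA", app_id, "container_2", "subClusterB"),
    make_entity_row("subClusterA", app_id, "container_3", "subClusterC"),
]

for row_key, info in rows:
    print(row_key, info)
```

Every row key starts with sub cluster A, so a scan by application stays contiguous, while the info value still says where each container physically ran.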

This ensures that all entities that belong to one application have one cluster identifier (sub cluster A). Storing sub clusters B & C for entities that are emitted from containers run on sub clusters B & C enables answering federation-related queries such as: How many containers from this application id ran on subCluster B? What were the metrics associated with entities run on subCluster C? What were the average (or min/max/median) runtimes for entities of this application on sub clusters A, B and C?
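
As a rough sketch of how such queries could be answered once the physical sub cluster is stored with each entity, the Python below groups made-up entities of one application by sub cluster. The SUB_CLUSTER and runtime_ms field names and all values are assumptions for illustration, not actual timeline service fields or data.

```python
from collections import defaultdict
from statistics import mean

# Made-up entities for one application launched on sub cluster A.
# "SUB_CLUSTER" and "runtime_ms" are assumed field names.
entities = [
    {"SUB_CLUSTER": "subClusterA", "runtime_ms": 1000},
    {"SUB_CLUSTER": "subClusterB", "runtime_ms": 2000},
    {"SUB_CLUSTER": "subClusterB", "runtime_ms": 4000},
    {"SUB_CLUSTER": "subClusterC", "runtime_ms": 3000},
]


def runtimes_by_sub_cluster(entities):
    """Group entity runtimes by the physical sub cluster they ran on."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["SUB_CLUSTER"]].append(entity["runtime_ms"])
    return dict(grouped)


grouped = runtimes_by_sub_cluster(entities)

# How many containers from this application ran on subCluster B?
containers_on_b = len(grouped.get("subClusterB", []))

# Average runtime per sub cluster (min/max/median work the same way).
avg_runtime = {sc: mean(values) for sc, values in grouped.items()}

print(containers_on_b, avg_runtime)
```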

Adding the physical sub cluster as a column / value in info also helps differentiate between application entities that belong to applications that run on just one sub cluster versus entities that belong to applications that run on multiple sub clusters. For entities that belong to applications that run on exactly one subcluster, this field will have just one physical cluster id which will be the same as the cluster id in the row key. 
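
A small Python sketch of that differentiation, assuming each entity's info carries the physical sub cluster under a hypothetical SUB_CLUSTER field (the function names and sample data are made up for illustration):

```python
def physical_sub_clusters(entity_infos):
    """Collect the distinct physical sub clusters seen across entities."""
    return {info["SUB_CLUSTER"] for info in entity_infos}


def spans_multiple_sub_clusters(entity_infos):
    """True if the application's entities ran on more than one sub cluster."""
    return len(physical_sub_clusters(entity_infos)) > 1


# Made-up data: one application confined to sub cluster A, one spread out.
single = [{"SUB_CLUSTER": "subClusterA"}, {"SUB_CLUSTER": "subClusterA"}]
multi = [{"SUB_CLUSTER": "subClusterA"}, {"SUB_CLUSTER": "subClusterB"}]

print(spans_multiple_sub_clusters(single))
print(spans_multiple_sub_clusters(multi))
```

For the single sub cluster case, the one value collected is the same as the cluster id in the row key.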

We also need to store the logical federated cluster name, which will be a new config variable added in YARN-5358.

















> Timeline service v2 integration with Federation 
> ------------------------------------------------
>
>                 Key: YARN-5357
>                 URL: https://issues.apache.org/jira/browse/YARN-5357
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Vrushali C
>            Assignee: Prabha Manepalli
>            Priority: Major
>
> Jira to note the discussion points from an initial chat about integrating Timeline Service v2 with Federation (YARN-2915).
> cc [~subru] [~curino] 
> For Federation:
> - all entities that belong to the same flow run should have the same cluster name
> - app id in the same flow run strongly ordered in time
> - need a logical cluster name and physical cluster name
> - a possibility to implement the Application TimelineCollector as an interceptor in the AMRMProxyService.
> For Timeline Service:
> - need to store physical cluster id and logical cluster id so that we don't lose information at any level (flow/app/entity etc)
> - add a new app id to cluster mapping table
> - need a different entity table/some table to store node level metrics for physical cluster stats. Once we get to node-level rollup, we probably have to store something in a dc, cluster, rack, node hierarchy. In that case a physical cluster makes sense, but we'd still need some way to tie physical and logical together in order to make automatic error detection etc that we're envisioning feasible within a federated setup.
> For the Cluster Naming convention:
> - three situations for cluster name:
> ----> app submitted to router should take federated (aka logical) cluster name
> ----> app submitted directly to RM should take physical cluster name
> ----> Info about the physical cluster  in entities?
> - suggestion to set the cluster name as yarn tag at the router level (in the app submission context) 
> Other points to note:
> - for federation to work smoothly in environments that use HDFS some additional considerations are needed, and possibly some solution like what is being used at Twitter with the nFly approach.
> Email thread context:
> {code}
> ---------- Forwarded message ----------
> From: Joep Rottinghuis 
> Date: Fri, Jul 8, 2016 at 1:22 PM
> Subject: Re: Federation -Timeline Service meeting notes
> To: Subramaniam Venkatraman Krishnan 
> Cc: Sangjin Lee, Vrushali Channapattan , Carlo Curino
> Thanks for the notes.
> I think that for federation to work smoothly in environments that use HDFS some additional considerations are needed, and possibly some solution like what we're using at Twitter with our nFly approach.
> bq. - need a different entity table/some table to store node level metrics for physical cluster stats
> Once we get to node-level rollup, we probably have to store something in a dc, cluster, rack, node hierarchy. In that case a physical cluster makes sense, but we'd still need some way to tie physical and logical together in order to make automatic error detection etc that we're envisioning feasible within a federated setup.
> Cheers,
> Joep
> On Fri, Jul 8, 2016 at 1:00 PM, Subramaniam Venkatraman Krishnan  wrote:
>     Thanks Vrushali for crisply capturing the essential from our rambling discussion :).
>      
>     Sangjin, I just want to add one comment to yours – we want to retain the physical cluster name (possibly as a new entity type) so that we don’t lose information & we can do cluster level rollups even if they are not efficient.
>      
>     Additionally, based on the walkthrough of Federation design:
>     - There was general agreement with the proposed approach.
>     - There is a possibility to implement the Application TimelineCollector as an interceptor in the AMRMProxyService.
>     - Joep raised the concern that it would be better if the RMs obtain the epoch from FederationStateStore. This is not currently in the roadmap of our MVP but we definitely plan to address this in future.
>      
>     Regards,
>     Subru
>      
>     From: Sangjin Lee
>     Sent: Thursday, July 07, 2016 6:22 PM
>     To: Vrushali Channapattan 
>     Cc: Joep Rottinghuis; Carlo Curino; Subramaniam Venkatraman Krishnan 
>     Subject: Re: Federation -Timeline Service meeting notes
>      
>     Thanks for the summary Vrushali!
>      
>     Just so that we're on the same page regarding the terminology, I understand we're using the terms "logical cluster" and "federated cluster" interchangeably.
>      
>     Also, between using the federated cluster name and the home cluster name as a solution, I think we were leaning towards the federated cluster name (although not concluded).
>      
>     On Thu, Jul 7, 2016 at 4:33 PM, Vrushali Channapattan wrote:
>          
>         For Federation:
>         - all entities that belong to the same flow run should have the same cluster name
>         - app id in the same flow run strongly ordered in time
>         - need a logical cluster name and physical cluster name
>         For Timeline Service:
>         - need to store physical cluster id and logical cluster id so that we don't lose information at any level (flow/app/entity etc)
>         - add a  new table app id to cluster mapping table
>         - need a different entity table/some table to store node level metrics for physical cluster stats
>         For the Cluster Naming convention:
>         - three situations for cluster name:
>         ----> app submitted to router should take federated cluster name
>         ----> app submitted directly to RM should take physical cluster name
>         ----> Info about the physical cluster  in entities?
>         - suggestion to set the cluster name as yarn tag at the router level (in the app submission context)
>  {code}


