Posted to issues@flink.apache.org by "Xintong Song (Jira)" <ji...@apache.org> on 2020/01/02 03:21:00 UTC

[jira] [Commented] (FLINK-15448) Make "ResourceID#toString" more descriptive

    [ https://issues.apache.org/jira/browse/FLINK-15448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006554#comment-17006554 ] 

Xintong Song commented on FLINK-15448:
--------------------------------------

Thank you [~victor-wong] for bringing this up.

I agree it is a real pain point that finding the host of a TM is not convenient enough, especially when you have lots of TM failures and are wondering whether they are on the same machine.

However, I'm not sure about the proposal of changing `ResourceID#toString`. 

* `ResourceID` is a general-purpose identifier for Flink's distributed components. In addition to `TaskExecutor`, `JobMaster` and `ResourceManager` also use this identifier. We should not add TM-specific information to such a common-purpose class.
* It is counterintuitive to put additional information in an identifier. The extra host information provides no value for identifying the distributed components, which is the only responsibility of this class.
* The host information is not used for anything in production other than logging. I don't think we should complicate the data structures and code paths purely for logging purposes.

I think the right approach is to provide more information at the places where the logs are generated, rather than modifying `ResourceID`. WDYT?
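
To make the alternative concrete, here is a minimal sketch of what enriching the message at the logging site could look like. It is not Flink code: the class `HeartbeatTimeoutLogging`, its methods, and the host lookup map are all hypothetical names used only for illustration, and it assumes the host is already known to the component that reports the timeout.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Flink: the identifier stays a plain string,
// and the host is looked up only when the log message is built.
class HeartbeatTimeoutLogging {

    // Assumption: the reporting component records host:port when a TM registers.
    private final Map<String, String> hostByResourceId = new ConcurrentHashMap<>();

    void registerTaskManager(String resourceId, String hostAndPort) {
        hostByResourceId.put(resourceId, hostAndPort);
    }

    // Builds the timeout message with the host appended, falling back to a
    // placeholder when no host was recorded for this id.
    String describeTimeout(String resourceId) {
        String host = hostByResourceId.getOrDefault(resourceId, "unknown host");
        return "The heartbeat of TaskManager with id " + resourceId
                + " (" + host + ") timed out.";
    }

    public static void main(String[] args) {
        HeartbeatTimeoutLogging logging = new HeartbeatTimeoutLogging();
        logging.registerTaskManager("container_1", "host-a:6121");
        System.out.println(logging.describeTimeout("container_1"));
        System.out.println(logging.describeTimeout("container_2"));
    }
}
```

With this shape, `ResourceID` keeps its single responsibility, and only the code that produces the message needs to know about hosts.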

> Make "ResourceID#toString" more descriptive
> -------------------------------------------
>
>                 Key: FLINK-15448
>                 URL: https://issues.apache.org/jira/browse/FLINK-15448
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.9.1
>            Reporter: Victor Wong
>            Priority: Major
>
> With Flink on Yarn, sometimes we ran into an exception like this:
> {code:java}
> java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id container_xxxx  timed out.
> {code}
> To find out the host of the lost TaskManager so that we can log into it for more details, we have to check the previous logs for the host information, which is a little time-consuming.
> Maybe we can add more descriptive information to ResourceID of Yarn containers, e.g. "container_xxx@host_name:port_number".
> Here's the demo:
> {code:java}
> class ResourceID {
>   final String resourceId;
>   final String details;
>   public ResourceID(String resourceId) {
>     this.resourceId = resourceId;
>     this.details = resourceId;
>   }
>   public ResourceID(String resourceId, String details) {
>     this.resourceId = resourceId;
>     this.details = details;
>   }
>   @Override
>   public String toString() {
>     return details;
>   }
> }
> // in flink-yarn
> private void startTaskExecutorInContainer(Container container) {
>   final String containerIdStr = container.getId().toString();
>   final String containerDetail = container.getId() + "@" + container.getNodeId();  
>   final ResourceID resourceId = new ResourceID(containerIdStr, containerDetail);
>   ...
> }
> {code}


