You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/11/16 08:01:35 UTC

[GitHub] [flink] reswqa opened a new pull request, #21331: [FLINK-29639] Print resourceId of remote taskmanager when encounter transport exception.

reswqa opened a new pull request, #21331:
URL: https://github.com/apache/flink/pull/21331

   ## What is the purpose of the change
   
   *Print resourceId of remote taskmanager when encounter transport exception.*
   
   
   ## Brief change log
   
     - *Add resourceId to `TransportException`.*
     - *Print resourceId of remote taskmanager when encounter transport exception.*
   
   
   ## Verifying this change
   
   Please make sure both new and modified tests in this PR follows the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
   
   *(Please pick either of the following options)*
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] xintongsong commented on a diff in pull request #21331: [FLINK-29639] Print resourceId of remote taskmanager when encounter transport exception.

Posted by GitBox <gi...@apache.org>.
xintongsong commented on code in PR #21331:
URL: https://github.com/apache/flink/pull/21331#discussion_r1024797669


##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/ConnectionID.java:
##########
@@ -75,15 +84,18 @@ public boolean equals(Object other) {
         }
 
         final ConnectionID ra = (ConnectionID) other;
-        if (!ra.getAddress().equals(address) || ra.getConnectionIndex() != connectionIndex) {
-            return false;
-        }
-
-        return true;
+        return ra.getAddress().equals(address)
+                && ra.getConnectionIndex() == connectionIndex
+                && ra.getResourceID().equals(resourceID);
     }
 
     @Override
     public String toString() {
-        return address + " [" + connectionIndex + "]";
+        return address
+                + " ["
+                + connectionIndex
+                + "]"
+                + " resourceID: "
+                + resourceID.getStringWithMetadata();

Review Comment:
   Below is how a connection-id is printed.
   ```
   localhost/127.0.0.1:10000 [10] resourceID: producerLocation
   ```
   It would be nice to have a more concise format. E.g.,
   ```
   localhost/127.0.0.1:10000 (producerLocation) [10]
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] xintongsong commented on pull request #21331: [FLINK-29639] Print resourceId of remote taskmanager when encounter transport exception.

Posted by GitBox <gi...@apache.org>.
xintongsong commented on PR #21331:
URL: https://github.com/apache/flink/pull/21331#issuecomment-1318634320

   @flinkbot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] xintongsong commented on a diff in pull request #21331: [FLINK-29639] Print resourceId of remote taskmanager when encounter transport exception.

Posted by GitBox <gi...@apache.org>.
xintongsong commented on code in PR #21331:
URL: https://github.com/apache/flink/pull/21331#discussion_r1023845581


##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/ConnectionID.java:
##########
@@ -43,18 +46,36 @@ public class ConnectionID implements Serializable {
 
     private final int connectionIndex;
 
+    /**
+     * Resource ID can be known from shuffleDescriptor, to avoid rpc overhead,
+     * PartitionConnectionInfo will carry a connectionId with null resourceID.

Review Comment:
   Why `PartitionConnectionInfo` has a null resourceId? It seems `NetworkPartitionConnectionInfo` is always created in `NettyShuffleDescriptor#fromProducerDescriptor`, where you can get the resourceId from `producerDescriptor#getProducerLocation`.



##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/exception/TransportException.java:
##########
@@ -27,17 +29,25 @@ public abstract class TransportException extends IOException {
 
     private final SocketAddress address;
 
-    public TransportException(String message, SocketAddress address) {
-        this(message, address, null);
+    private final ResourceID resourceID;
+
+    public TransportException(String message, SocketAddress address, ResourceID resourceID) {
+        this(message, address, resourceID, null);
     }
 
-    public TransportException(String message, SocketAddress address, Throwable cause) {
+    public TransportException(
+            String message, SocketAddress address, ResourceID resourceID, Throwable cause) {
         super(message, cause);
 
+        this.resourceID = resourceID;
         this.address = address;
     }
 
     public SocketAddress getAddress() {
         return address;
     }
+
+    public ResourceID getResourceID() {
+        return resourceID;
+    }

Review Comment:
   It seems this is never called. The resourceId is already in the error message.



##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/exception/LocalTransportException.java:
##########
@@ -18,17 +18,20 @@
 
 package org.apache.flink.runtime.io.network.netty.exception;
 
+import org.apache.flink.runtime.clusterframework.types.ResourceID;
+
 import java.net.SocketAddress;
 
 public class LocalTransportException extends TransportException {
 
     private static final long serialVersionUID = 2366708881288640674L;
 
-    public LocalTransportException(String message, SocketAddress address) {
-        super(message, address);
+    public LocalTransportException(String message, SocketAddress address, ResourceID resourceID) {

Review Comment:
   Why does a `LocalTransportException` need a resourceId? I think the purpose of printing resource id is to help identify the remote TM that the broken connection connected to. For a local exception, there's no remote TM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] xintongsong closed pull request #21331: [FLINK-29639] Print resourceId of remote taskmanager when encounter transport exception.

Posted by GitBox <gi...@apache.org>.
xintongsong closed pull request #21331: [FLINK-29639] Print resourceId of remote taskmanager when encounter transport exception.
URL: https://github.com/apache/flink/pull/21331


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] reswqa commented on a diff in pull request #21331: [FLINK-29639] Print resourceId of remote taskmanager when encounter transport exception.

Posted by GitBox <gi...@apache.org>.
reswqa commented on code in PR #21331:
URL: https://github.com/apache/flink/pull/21331#discussion_r1023930244


##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/exception/LocalTransportException.java:
##########
@@ -18,17 +18,20 @@
 
 package org.apache.flink.runtime.io.network.netty.exception;
 
+import org.apache.flink.runtime.clusterframework.types.ResourceID;
+
 import java.net.SocketAddress;
 
 public class LocalTransportException extends TransportException {
 
     private static final long serialVersionUID = 2366708881288640674L;
 
-    public LocalTransportException(String message, SocketAddress address) {
-        super(message, address);
+    public LocalTransportException(String message, SocketAddress address, ResourceID resourceID) {

Review Comment:
   make sense, here will be modified.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] reswqa commented on a diff in pull request #21331: [FLINK-29639] Print resourceId of remote taskmanager when encounter transport exception.

Posted by GitBox <gi...@apache.org>.
reswqa commented on code in PR #21331:
URL: https://github.com/apache/flink/pull/21331#discussion_r1023929635


##########
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/ConnectionID.java:
##########
@@ -43,18 +46,36 @@ public class ConnectionID implements Serializable {
 
     private final int connectionIndex;
 
+    /**
+     * Resource ID can be known from shuffleDescriptor, to avoid rpc overhead,
+     * PartitionConnectionInfo will carry a connectionId with null resourceID.

Review Comment:
   This is because `NettyShuffleDescriptor#fromProducerDescriptor` will create a `NetworkPartitionConnectionInfo`, which will carry a `ConnectionID`. If this connectionId carries a `ResourceID`, the RPC message will become larger.
   
   BTW. Through offline discussion with @xintongsong , it is better to directly maintain address and connectionIndex in `NetworkPartitionConnectionInfo`. This can avoid the potentially null `ResourceID` and reduce the network transmission cost also.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] reswqa commented on pull request #21331: [FLINK-29639] Print resourceId of remote taskmanager when encounter transport exception.

Posted by GitBox <gi...@apache.org>.
reswqa commented on PR #21331:
URL: https://github.com/apache/flink/pull/21331#issuecomment-1318723729

   @flinkbot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] flinkbot commented on pull request #21331: [FLINK-29639] Print resourceId of remote taskmanager when encounter transport exception.

Posted by GitBox <gi...@apache.org>.
flinkbot commented on PR #21331:
URL: https://github.com/apache/flink/pull/21331#issuecomment-1316560200

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d8f4e058208f770fd49d9c06922292498b9dadc6",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "d8f4e058208f770fd49d9c06922292498b9dadc6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d8f4e058208f770fd49d9c06922292498b9dadc6 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org