Posted to reviews@spark.apache.org by "vicennial (via GitHub)" <gi...@apache.org> on 2023/02/24 15:50:09 UTC

[GitHub] [spark] vicennial commented on a diff in pull request #40147: [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client

vicennial commented on code in PR #40147:
URL: https://github.com/apache/spark/pull/40147#discussion_r1117219600


##########
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##########
@@ -183,6 +183,60 @@ message ExecutePlanResponse {
   }
 }
 
+// Request to transfer client-local artifacts.
+message AddArtifactsRequest {
+
+  // Definition of an Artifact.
+  message Artifact {
+    // The name of the artifact is expected in the form of a "Relative Path" that is made up of a
+    // sequence of directories and the final file element.
+    // Examples of "Relative Path"s: "jars/test.jar", "classes/xyz.class", "abc.xyz", "a/b/X.jar".
+    // The server is expected to maintain the hierarchy of files as defined by their names
+    // (i.e. the relative path of the file on the server's filesystem will be the same as
+    // the name of the provided artifact).
+    string name = 1;
+    // Raw data.
+    bytes data = 2;
+  }
+
+  // A number of small artifacts batched into a single RPC.
+  message Batch {
+    repeated Artifact artifacts = 1;
+  }
+
+  // The client_id is set by the client to be able to collate streaming responses from
+  // different queries.
+  string client_id = 1;
+
+  // User context
+  UserContext user_context = 2;
+
+  // The payload is either a batch of artifacts or a partial chunk of a large artifact.
+  oneof payload {
+    Batch batch = 3;
+    // A large artifact chunked into multiple requests. The server should assume that the
+    // artifact has been completely uploaded either when it encounters a new artifact name or
+    // when the stream is completed.
+    Artifact chunk = 4;
+  }
+}
+
+// Response to adding an artifact. Contains relevant metadata to verify successful transfer of
+// artifact(s).
+message AddArtifactsResponse {
+  // Metadata of an artifact.
+  message ArtifactSummary {
+    string name = 1;
+    // Size in bytes.
+    int64 size = 2;
+    // CRC to verify integrity of the transferred artifact.
+    int64 crc = 3;

Review Comment:
   > Usually the sender of the artifact (the client) would provide the crc, so that the receiver (the server) can validate it after receiving. 
   
   True, I've updated it to follow the typical convention and moved CRC validation to the server side. An extra benefit is that the server can drop the corrupted data if the CRC check fails. Thanks for the suggestion!
   
   > How is the client supposed to respond if the crc is incorrect?
   
   The client may choose to reupload the artifact if the CRC check fails. I have added this to the documentation for clarity.
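   To make the server-side convention concrete, here is a minimal Scala sketch of the check described above. Only java.util.zip.CRC32 is a real JDK API; ReceivedArtifact and ArtifactVerifier are hypothetical names used purely for illustration, not part of the actual server implementation.

   import java.util.zip.CRC32

   // Hypothetical container for a fully received artifact: its name, the
   // accumulated bytes, and the CRC value supplied by the client.
   case class ReceivedArtifact(name: String, data: Array[Byte], expectedCrc: Long)

   object ArtifactVerifier {
     // Recompute CRC-32 over the received bytes and compare against the
     // client-supplied value. A mismatch means the server should drop the
     // data, after which the client may choose to reupload the artifact.
     def verifyCrc(artifact: ReceivedArtifact): Boolean = {
       val crc = new CRC32()
       crc.update(artifact.data)
       crc.getValue == artifact.expectedCrc
     }
   }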


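   Similarly, a hedged sketch of the client side of the chunking protocol: every chunk of a large artifact carries the same name, so the server can append chunks until it sees a new name or the stream completes. The 1 MiB chunk size and the ArtifactChunker helper are assumptions for illustration; for simplicity this reads the whole file into memory, whereas a real client would stream it.

   import java.nio.file.{Files, Path, Paths}

   object ArtifactChunker {
     // Arbitrary chunk size, chosen only for illustration.
     val ChunkSize: Int = 1024 * 1024 // 1 MiB

     // Split a local file into (name, bytes) chunks. Every chunk reuses the
     // same artifact name, matching the `oneof payload` above: the server
     // appends chunks until it encounters a different name or end-of-stream.
     def chunkArtifact(name: String, file: Path): Iterator[(String, Array[Byte])] =
       Files.readAllBytes(file).grouped(ChunkSize).map(bytes => (name, bytes))
   }

   For example, ArtifactChunker.chunkArtifact("jars/test.jar", Paths.get("/tmp/test.jar")) yields 1 MiB slices that all carry the name "jars/test.jar".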

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org