Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/25 00:25:54 UTC

[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031921894


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   This is an interesting topic. Given the current withColumns API design, users already cannot maintain or control the order of schema fields (please correct me if I am wrong).
   
   It would be nice if, when a user calls withColumns with the same parameters on the same DataFrame twice, the user sees the same output schema through this proto (the ordering would not be predictable, but it would at least be consistent).
   
   However, I don't know if this is feasible. For Python/Scala, we offer APIs that take a Dict/Map, and once one is used, the value iteration is not deterministic, so we already cannot produce a stable ordering on the client side and thus cannot preserve it through the proto. Is this true?
   
   If we can ever stably produce an ordering on clients, we can of course maintain that ordering through the proto.
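
A minimal sketch of what a stable client-side ordering could look like, assuming the client is free to sort by column name. `ordered_columns` is a hypothetical helper, not part of the Spark Connect client, and plain strings stand in for real `Expression` messages:

```python
from typing import Dict, List, Tuple


def ordered_columns(cols_map: Dict[str, str]) -> List[Tuple[str, str]]:
    """Hypothetical helper: return (name, expression) pairs in a stable order.

    Sorting by column name makes the result independent of the order in
    which the mapping happens to be iterated.
    """
    return sorted(cols_map.items(), key=lambda kv: kv[0])


# The same columns passed in two different orders; the expressions are
# placeholder strings, not real Spark expressions.
first = ordered_columns({"b": "x + 1", "a": "y * 2"})
second = ordered_columns({"a": "y * 2", "b": "x + 1"})

# Both calls yield the same sequence of pairs.
print(first == second)
```

Note that carrying such an ordering over the wire would probably need something other than a `map` field (for example a repeated name/expression pair, which is an assumption here and not what this PR defines), since protobuf leaves the iteration order of map fields undefined.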
   
   cc @cloud-fan @HyukjinKwon 




