Posted to reviews@spark.apache.org by "hvanhovell (via GitHub)" <gi...@apache.org> on 2023/03/06 02:34:59 UTC

[GitHub] [spark] hvanhovell commented on a diff in pull request #40270: [SPARK-42662][CONNECT][PYTHON][PS] Support `withSequenceColumn` as PySpark DataFrame internal function.

hvanhovell commented on code in PR #40270:
URL: https://github.com/apache/spark/pull/40270#discussion_r1125815690


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -781,3 +782,10 @@ message FrameMap {
   CommonInlineUserDefinedFunction func = 2;
 }
 
+message WithSequenceColumn {

Review Comment:
   My argument against this is that it is just a project with a specific type of expression attached to it. That is all it is, so there is no need to complicate the protocol with a dedicated message.
   
   As for what this does (and this comment is aimed more at the original PR), three things:
   - The IDs are not stable at all. They are order-based, and we do not guarantee that the order is stable during processing.
   - The IDs can contain gaps or duplicates if any of the shuffles of the input contains a non-deterministic column and a retry of one of the input tasks/stages occurs. This is a result of the double scanning that RDD.zipWithIndex requires.
   - Finally, two scans can be slow.
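
To make the gap/duplicate point concrete, here is a minimal pure-Python sketch of two-pass indexing in the style of RDD.zipWithIndex. It is not Spark code: `zip_with_index`, `replay`, and the partition layouts are hypothetical names made up for illustration, and a "scan" is modelled as a callable that recomputes the input each time it is called. The assumption is that a retry over non-deterministic input can hand the second scan a different row layout than the one the first scan counted.

```python
from itertools import accumulate


def zip_with_index(scan):
    """Assign IDs with two scans: one to count rows, one to number them."""
    # Scan 1: count the rows per partition to derive each partition's start offset.
    counts = [len(part) for part in scan()]
    offsets = [0, *accumulate(counts[:-1])]
    # Scan 2: recompute the input and number rows from the offsets found in scan 1.
    return [
        (row, offset + i)
        for part, offset in zip(scan(), offsets)
        for i, row in enumerate(part)
    ]


def replay(*layouts):
    """Hypothetical helper: a scan that returns the given layouts on successive calls."""
    it = iter(layouts)
    return lambda: next(it)


stable = [["a", "b"], ["c", "d"]]
skewed = [["a", "b", "c"], ["d"]]
other = [["a"], ["b", "c", "d"]]

# Same layout in both scans: IDs are 0, 1, 2, 3.
ok = zip_with_index(replay(stable, stable))
# Layout changes between scans: IDs are 0, 3, 4, 5 (1 and 2 are skipped).
gaps = zip_with_index(replay(skewed, other))
# Layout changes the other way: IDs are 0, 1, 2, 1 (1 is assigned twice).
dups = zip_with_index(replay(other, skewed))
```

The offsets are only valid for the layout seen in the first scan, so any change between the two scans shows up directly as skipped or repeated IDs.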



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

