Posted to reviews@spark.apache.org by "jdesjean (via GitHub)" <gi...@apache.org> on 2023/09/01 16:09:52 UTC

[GitHub] [spark] jdesjean commented on pull request #42772: [SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable

jdesjean commented on PR #42772:
URL: https://github.com/apache/spark/pull/42772#issuecomment-1702991483

   > I am not sure I understand the use case here. Why exactly do we need them to be sortable? And is this a must-have?
   > 
   > One of the problems I see here is that you rely on the client to generate a proper v7 UUID. We do not control the client; it is an open protocol, so a new implementation can just provide a v4 UUID or generate an improper v7. There is also the matter of time drift between client and server: how will this affect the generated UUIDs?
   
   When the operation ID is used as a primary key, UUIDv7 gives us the nice property that ID order roughly matches the order in which the queries started. No one should rely on this property exclusively, but having the records roughly ordered improves sorting performance.
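
To illustrate why ID order tracks start time, here is a minimal sketch of the UUIDv7 bit layout (48-bit Unix millisecond timestamp in the most significant bits, then version, variant, and random bits). The helper names `uuid7_sketch` and `timestamp_ms` are hypothetical and are not part of Spark or of this PR; this is not the generator the PR uses.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUIDv7-style value (hypothetical helper, for illustration only)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0x0FFF          # 12 random bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 random bits
    value = (unix_ms & 0xFFFFFFFFFFFF) << 80  # timestamp in the top 48 bits
    value |= 0x7 << 76                        # version 7
    value |= rand_a << 64
    value |= 0b10 << 62                       # RFC 4122 variant
    value |= rand_b
    return uuid.UUID(int=value)


def timestamp_ms(u):
    """Recover the millisecond timestamp embedded in the top 48 bits."""
    return u.int >> 80


# IDs created at increasing timestamps sort in creation order,
# both as UUID values and as their canonical strings.
ids = [uuid7_sketch(ms) for ms in (1000, 2000, 3000)]
```

IDs generated within the same millisecond differ only in their random bits, so their relative order is arbitrary. That is why the ordering is only "rough" and should not be relied on exclusively.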
   Additionally, most lookups sort by start time, and sorting by operation ID as a secondary key is useful to obtain a consistent ordering when several records share the same start time. Roughly ordered records again help performance.
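
The tie-breaking idea can be sketched as a sort on the pair (start time, operation ID). The `OperationRecord` type, the `order_operations` function, and the ID strings below are made up for illustration and do not come from Spark.

```python
from typing import NamedTuple


class OperationRecord(NamedTuple):
    """Hypothetical record shape: start time in ms plus an operation ID string."""
    start_ms: int
    operation_id: str


def order_operations(records):
    """Sort by start time, using the operation ID to break ties."""
    return sorted(records, key=lambda r: (r.start_ms, r.operation_id))


# Two records share a start time; the ID decides their relative order.
records = [
    OperationRecord(2000, "018a5b2e-0000-7000-8000-000000000003"),
    OperationRecord(1000, "018a5b2e-0000-7000-8000-000000000002"),
    OperationRecord(1000, "018a5b2e-0000-7000-8000-000000000001"),
]
ordered = order_operations(records)
```

Sorting on start time alone would leave the relative order of the two records at `1000` dependent on input order. Adding the ID as a secondary key makes the result the same no matter how the input is arranged.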


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

