You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "dillitz (via GitHub)" <gi...@apache.org> on 2023/09/01 12:36:59 UTC

[GitHub] [spark] dillitz opened a new pull request, #42772: [SPARK-45051][CONNECT] Use UUIDv7 for operation IDs to make operations chronologically sortable

dillitz opened a new pull request, #42772:
URL: https://github.com/apache/spark/pull/42772

   ### What changes were proposed in this pull request?
   This changes the operations ID format from UUIDv4 to UUIDv7. This is done on the server and on the client side (both Scala and Python) since both sides can set this ID first.
   
   
   ### Why are the changes needed?
   Spark Connect currently uses UUIDv4 for operation IDs. Using UUIDv7 instead allows us to sort operations by ID to receive a chronological order while keeping the collision-free properties we require from this ID.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Existing tests.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] juliuszsompolski commented on pull request #42772: [SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable

Posted by "juliuszsompolski (via GitHub)" <gi...@apache.org>.
juliuszsompolski commented on PR #42772:
URL: https://github.com/apache/spark/pull/42772#issuecomment-1707008850

   ... although, the currently existing (Spark 3.4) clients never generate operationId client side, so we can get away with adding an assertion that the client side id is UUID7 in `ExecuteHolder.operationId`.
   But then this has to go into Spark 3.5.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dillitz commented on pull request #42772: [SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable

Posted by "dillitz (via GitHub)" <gi...@apache.org>.
dillitz commented on PR #42772:
URL: https://github.com/apache/spark/pull/42772#issuecomment-1707033407

   We agreed that the benefits of adding this are not big enough because we can not rely on the operation ID being UUIDv7 and need to sort by startDate anyway. Closing this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] hvanhovell commented on pull request #42772: [SPARK-45051][CONNECT] Use UUIDv7 for operation IDs to make operations chronologically sortable

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on PR #42772:
URL: https://github.com/apache/spark/pull/42772#issuecomment-1702819169

   I am not sure I understand the use case here. Why do we exactly need them to be sortable? And is this a must-have?
   
   One of the problems I see here is that you rely on the client to generate a proper v7 UUID, we do not control the client it is an open protocol, so a new implementation can just provide a v4 UUID, or generate an improper v7. There is also the matter of time drift between client and server, who will this affect the generated UUIDs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] jdesjean commented on pull request #42772: [SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable

Posted by "jdesjean (via GitHub)" <gi...@apache.org>.
jdesjean commented on PR #42772:
URL: https://github.com/apache/spark/pull/42772#issuecomment-1702991483

   > I am not sure I understand the use case here. Why do we exactly need them to be sortable? And is this a must-have?
   > 
   > One of the problems I see here is that you rely on the client to generate a proper v7 UUID, we do not control the client it is an open protocol, so a new implementation can just provide a v4 UUID, or generate an improper v7. There is also the matter of time drift between client and server, who will this affect the generated UUIDs?
   
   When operation id is used as a PK, UUIDv7 gives us the nice property that the order will roughly match the start time order for the query. While no one should rely on this property exclusively, having the records roughly ordered improves sorting performance.
   Additionally, for most lookup sorting by start time, sorting by operation id is useful to obtain consistent ordering in the case of duplicates. Roughly ordered records again help improve the performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dillitz commented on pull request #42772: [SPARK-45051][CONNECT] Use UUIDv7 for operation IDs to make operations chronologically sortable

Posted by "dillitz (via GitHub)" <gi...@apache.org>.
dillitz commented on PR #42772:
URL: https://github.com/apache/spark/pull/42772#issuecomment-1702862775

   > I am not sure I understand the use case here. Why do we exactly need them to be sortable? And is this a must-have?
   
   This was a request from @jdesjean. I believe he can give you a bit more background on this. From what I understand, this would allow us to present a more comprehensible history of executed operations to the user.
   
   > One of the problems I see here is that you rely on the client to generate a proper v7 UUID, we do not control the client it is an open protocol, so a new implementation can just provide a v4 UUID, or generate an improper v7. There is also the matter of time drift between client and server, who will this affect the generated UUIDs?
   
   I agree with you that in this current state, we can not rely on the ID being in the v7 format, but this is also not the goal of this PR. We just want to change the default format from v4 to v7 since it has nicer properties while fulfilling the same requirements.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] juliuszsompolski commented on pull request #42772: [SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable

Posted by "juliuszsompolski (via GitHub)" <gi...@apache.org>.
juliuszsompolski commented on PR #42772:
URL: https://github.com/apache/spark/pull/42772#issuecomment-1707000060

   We maintain backwards compatibility, where older clients can connect to newer server. These older clients will not provide such UUIDs.
   What will happen then? Does it break any critical assumptions?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dillitz closed pull request #42772: [SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable

Posted by "dillitz (via GitHub)" <gi...@apache.org>.
dillitz closed pull request #42772: [SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable
URL: https://github.com/apache/spark/pull/42772


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org