Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/18 18:59:31 UTC

[GitHub] [iceberg] rdblue commented on pull request #3983: Spark: Spark3 ZOrder Rewrite Strategy

rdblue commented on pull request #3983:
URL: https://github.com/apache/iceberg/pull/3983#issuecomment-1045020189


   I'm a bit suspicious of those benchmarks because the error range is so high. The error for sortFourColumns is 6.7x the reported value, so I'm not sure we can conclude much from those numbers. The scores look good in favor of plain integers (5-10% lower), but the error range is too high to know.
   
   > I'm guessing we could do better with a custom sort expression which doesn't materialize the entire ZValue.
   
   That's one thing that using a float will do. We can convert the first 8 bytes back into a long and then use a secondary column of the remaining bytes or just discard them if we're okay with the loss of precision. That reduces the overall shuffle size and also makes Spark's internal operations much more efficient:
   * UnsafeRow returns primitive longs, rather than allocating arrays and copying bytes out of unsafe memory
   * The value is stored in the fixed-width portion of UnsafeRow, rather than using both length and the variable portion
   * Bytes are compared 8 at a time rather than 1-by-1
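
A minimal sketch of the prefix idea described above, assuming the z-value is a byte array whose ordering is the unsigned, byte-by-byte (big-endian) ordering. The class and method names here are hypothetical and are not part of Iceberg or Spark. Because Spark sorts long columns as signed values, the sketch also flips the sign bit so that a signed comparison of the longs matches the unsigned comparison of the original bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
public class ZValuePrefix {

  // Packs the first 8 bytes of the z-value into a long (big-endian).
  // Values shorter than 8 bytes are zero-padded on the right.
  static long prefixAsLong(byte[] zValue) {
    byte[] padded = Arrays.copyOf(zValue, Math.max(zValue.length, Long.BYTES));
    return ByteBuffer.wrap(padded, 0, Long.BYTES).order(ByteOrder.BIG_ENDIAN).getLong();
  }

  // Flips the sign bit so that signed long ordering equals the unsigned
  // byte ordering of the original prefix.
  static long signedSortablePrefix(byte[] zValue) {
    return prefixAsLong(zValue) ^ Long.MIN_VALUE;
  }

  // Returns the bytes after the 8-byte prefix; these could go in a secondary
  // column or be discarded if the loss of precision is acceptable.
  static byte[] remainder(byte[] zValue) {
    int start = Math.min(Long.BYTES, zValue.length);
    return Arrays.copyOfRange(zValue, start, zValue.length);
  }
}
```

Sorting by `signedSortablePrefix` first and `remainder` second would preserve the full byte ordering; sorting by the prefix alone trades precision for a fixed-width 8-byte sort key.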
   
   I'm not very concerned with the cost of producing the zorder representation, just with the accesses Spark is going to do and the overall size. The floating point idea allows us to reduce the size and discard additional bits if we choose to. That seems like a win for keeping the additional data size down as well as avoiding allocations and the like.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


