You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by "jackye1995 (via GitHub)" <gi...@apache.org> on 2023/03/17 16:05:50 UTC

[GitHub] [iceberg] jackye1995 commented on pull request #7128: Core: Optimize S3 layout of Datafiles by expanding first character set of the hash

jackye1995 commented on PR #7128:
URL: https://github.com/apache/iceberg/pull/7128#issuecomment-1474064817

   @danielcweeks this comes out of a test we did with one of our biggest customers, where the current object storage layout still resulted in throttling, but by extending it we were able to eliminate throttling. 
   
   There are 2 issues with the current layout, here are some data we collected internally:
   
   1. The left padding is an area of concern.  If the hex-encoded version of the hash is not always 8 characters long, then there will be a large number of entries that start with one or more “0” characters, which could create uneven heat distribution for S3.  To test this theory we use UUID to simulate file paths, generate 10M random entries, and then apply the algorithm above.  We can plot the histogram of the length of a non-padded, hex-encoded entropy:
   
   ```
   length[1] = 0 
   length[2] = 2 
   length[3] = 24 
   length[4] = 281 
   length[5] = 4589 
   length[6] = 73243 
   length[7] = 1173442 
   length[8] = 8748419 
   ```
   
   This indicates that ~12.5% of all file paths will start with “0”, which for a hex-encoded string should have been 6.25% since there are 16 characters in the set.  
   
   2. All hex-encodings start with the inclusive range 0-8:
   
   ```
   count(partition["0"]) = 1249993 
   count(partition["1"]) = 1250614 
   count(partition["2"]) = 1250830 
   count(partition["3"]) = 1250695 
   count(partition["4"]) = 1249028 
   count(partition["5"]) = 1249782 
   count(partition["6"]) = 1249436 
   count(partition["7"]) = 1249622 
   ```
   
   The issue here is that the default algorithm generates values up to `Integer.MAX_VALUE`, which in Java is `(2^31)-1`.  When this is converted to hex, it produces 0x7FFFFFFF, so it’s understandable why the first half of the range is restricted to 8 characters using this default algorithm.
   
   With that being said, maybe we do not need all the character range to solve it. An alternative approach is to use something like a reverse hex, which we have also tested to be working. But this approach will maximize the entropy in any cases without the need to worry about S3 internals. I will ask S3 team to take a look and see if the current approach is necessary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org