Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/01/31 12:50:18 UTC
[GitHub] [pinot] richardstartin opened a new issue #8094: Make it possible to read strings from `Dictionary` as `byte[]`

richardstartin opened a new issue #8094:
URL: https://github.com/apache/pinot/issues/8094


   Currently, reading values from `StringDictionary` is very expensive in terms of allocations.
   
   First, an oversized `byte[]` of `_numBytesPerValue` bytes is allocated in `StringDictionary` on every call:
   
   ```java
     @Override
     public String getStringValue(int dictId) {
       return getUnpaddedString(dictId, getBuffer());
     }
   
     protected byte[] getBuffer() {
       return new byte[_numBytesPerValue];
     }
   ```
   
   Then, in `FixedByteValueReaderWriter`, a `String` is allocated once the unpadded length is known:
   
   ```java
     @Override
     public String getUnpaddedString(int index, int numBytesPerValue, byte paddingByte, byte[] buffer) {
       // Based on the ZeroInWord algorithm: http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
       assert buffer.length >= numBytesPerValue;
       long startOffset = (long) index * numBytesPerValue;
       long pattern = (paddingByte & 0xFFL) * 0x101010101010101L;
       ByteBuffer wrapper = ByteBuffer.wrap(buffer);
       if (_dataBuffer.order() == ByteOrder.LITTLE_ENDIAN) {
         wrapper.order(ByteOrder.LITTLE_ENDIAN);
       }
       int position = 0;
       for (int i = 0; i < ((numBytesPerValue >>> 3) << 3); i += 8) {
         long word = _dataBuffer.getLong(startOffset + i);
         wrapper.putLong(i, word);
         long zeroed = word ^ pattern;
         long tmp = (zeroed & 0x7F7F7F7F7F7F7F7FL) + 0x7F7F7F7F7F7F7F7FL;
         tmp = ~(tmp | zeroed | 0x7F7F7F7F7F7F7F7FL);
         if (tmp == 0) {
           position += 8;
         } else {
           position += _dataBuffer.order() == ByteOrder.LITTLE_ENDIAN
               ? Long.numberOfTrailingZeros(tmp) >>> 3
               : Long.numberOfLeadingZeros(tmp) >>> 3;
           return new String(buffer, 0, position, UTF_8);
         }
       }
       return getUnpaddedStringTail(startOffset, position, numBytesPerValue, paddingByte, buffer);
     }
   
     private String getUnpaddedStringTail(long startOffset, int position, int numBytesPerValue, byte paddingByte,
         byte[] buffer) {
       for (; position < numBytesPerValue; position++) {
         byte b = _dataBuffer.getByte(startOffset + position);
         if (b == paddingByte) {
           break;
         }
         buffer[position] = b;
       }
       return new String(buffer, 0, position, UTF_8);
     }
   ```
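   
   For readers unfamiliar with the bit trick used in the loop above, here is a minimal standalone sketch of the word-at-a-time search. `ZeroInWordSketch` and `firstIndexOf` are hypothetical names, not part of Pinot, and the sketch assumes the word was read in little-endian order (for big-endian reads the snippet above uses `Long.numberOfLeadingZeros` instead).
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   
   public class ZeroInWordSketch {
     private static final long LOW_7_BITS = 0x7F7F7F7F7F7F7F7FL;
   
     // Hypothetical helper (not a Pinot API): returns the index (0-7) of the first
     // byte in `word` equal to `target`, or 8 if no byte matches.
     // Assumes byte 0 is the least significant byte, i.e. a little-endian read.
     static int firstIndexOf(long word, byte target) {
       // Broadcast the target byte into all eight lanes of a long.
       long pattern = (target & 0xFFL) * 0x0101010101010101L;
       // After the XOR, every lane that matched the target is exactly zero.
       long zeroed = word ^ pattern;
       // Adding 0x7F to the low 7 bits sets a lane's high bit iff those bits are non-zero.
       long tmp = (zeroed & LOW_7_BITS) + LOW_7_BITS;
       // OR in the original high bits, then invert: only all-zero lanes keep 0x80.
       tmp = ~(tmp | zeroed | LOW_7_BITS);
       // numberOfTrailingZeros(0) is 64, so "no match" naturally maps to 8.
       return Long.numberOfTrailingZeros(tmp) >>> 3;
     }
   
     public static void main(String[] args) {
       byte[] padded = {'a', 'b', 'c', 0, 0, 0, 0, 0};
       long word = ByteBuffer.wrap(padded).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
       // The first padding byte sits at index 3, which is also the unpadded length.
       System.out.println(firstIndexOf(word, (byte) 0));
     }
   }
   ```
   
   The returned index doubles as the number of unpadded bytes in that word, which is why the loop above can stop as soon as `tmp` is non-zero.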
   
   A `byte[]` is often preferable to a `String` anyway, so this could be streamlined by first calculating the unpadded length of the value and then allocating a correctly sized `byte[]`, since the buffer is never reused.
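   
   As a rough illustration of that idea, the sketch below measures the unpadded length first and only then allocates an exactly sized `byte[]`. It is not the Pinot implementation: `UnpaddedBytesSketch`, `buildFixedWidth`, `unpaddedLength` and `getUnpaddedBytes` are hypothetical names, a plain `java.nio.ByteBuffer` stands in for `_dataBuffer`, and a simple byte-wise scan replaces the word-at-a-time search to keep the example short.
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.charset.StandardCharsets;
   
   public class UnpaddedBytesSketch {
     // Hypothetical stand-in for the dictionary's fixed-width data buffer.
     // Assumes every value fits in numBytesPerValue bytes once UTF-8 encoded.
     static ByteBuffer buildFixedWidth(int numBytesPerValue, byte paddingByte, String... values) {
       ByteBuffer data = ByteBuffer.allocate(values.length * numBytesPerValue);
       for (int i = 0; i < values.length; i++) {
         byte[] utf8 = values[i].getBytes(StandardCharsets.UTF_8);
         for (int j = 0; j < numBytesPerValue; j++) {
           data.put(i * numBytesPerValue + j, j < utf8.length ? utf8[j] : paddingByte);
         }
       }
       return data;
     }
   
     // Step 1: find the length without copying anything into a scratch buffer.
     static int unpaddedLength(ByteBuffer data, int index, int numBytesPerValue, byte paddingByte) {
       int start = index * numBytesPerValue;
       int length = 0;
       while (length < numBytesPerValue && data.get(start + length) != paddingByte) {
         length++;
       }
       return length;
     }
   
     // Step 2: allocate exactly `length` bytes and copy the value once.
     static byte[] getUnpaddedBytes(ByteBuffer data, int index, int numBytesPerValue, byte paddingByte) {
       int length = unpaddedLength(data, index, numBytesPerValue, paddingByte);
       byte[] value = new byte[length];
       int start = index * numBytesPerValue;
       for (int i = 0; i < length; i++) {
         value[i] = data.get(start + i);
       }
       return value;
     }
   
     public static void main(String[] args) {
       ByteBuffer data = buildFixedWidth(8, (byte) 0, "abc", "pinotdb!");
       byte[] first = getUnpaddedBytes(data, 0, 8, (byte) 0);
       byte[] second = getUnpaddedBytes(data, 1, 8, (byte) 0);
       // Callers that still need a String can decode the exact-size array themselves.
       System.out.println(first.length + " " + new String(first, StandardCharsets.UTF_8));
       System.out.println(second.length + " " + new String(second, StandardCharsets.UTF_8));
     }
   }
   ```
   
   With this shape, a caller that only needs the bytes pays for a single right-sized allocation, and a caller that needs a `String` no longer allocates an oversized scratch array first.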


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


