Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/04/05 07:25:55 UTC

[GitHub] [pinot] siddharthteotia commented on a diff in pull request #8398: Allow disabling dict generation for High cardinality columns

siddharthteotia commented on code in PR #8398:
URL: https://github.com/apache/pinot/pull/8398#discussion_r842448677


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java:
##########
@@ -260,6 +260,26 @@ private boolean createDictionaryForColumn(ColumnIndexCreationInfo info, SegmentG
         .containsKey(column)) {
       return false;
     }
+
+    // Do not create dictionary if size with dictionary is going to be larger than size without dictionary
+    // This is done to reduce the cost of dictionary for high cardinality columns
+    // Off by default and needs optimizeDictionaryEnabled to be set to true
+    if (config.isOptimizeDictionaryEnabled() && spec.getFieldType() == FieldType.METRIC
+        && spec.isSingleValueField() && spec.getDataType().isFixedWidth()) {
+      long dictionarySize = info.getDistinctValueCount() * spec.getDataType().size();
+      long forwardIndexSize =
+          ((long) info.getTotalNumberOfEntries() * PinotDataBitSet.getNumBitsPerValue(info.getDistinctValueCount() - 1)
+              + Byte.SIZE - 1) / Byte.SIZE;
+
+      double indexWithDictSize = dictionarySize + forwardIndexSize;
+      double indexWithoutDictSize = info.getTotalNumberOfEntries() * spec.getDataType().size();
+
+      double storageSaved = (indexWithDictSize - indexWithoutDictSize) / indexWithDictSize;

Review Comment:
   Can we instead do the check based on a ratio?
   
   `indexWithDictSize / indexWithoutDictSize`
   
   Also, if we want to use a percentage, the reverse calculation would be more intuitive IMO, since we want to compute the storage saved by creating a dictionary compared with no dictionary:
   
   `double storageSaved = (indexWithoutDictSize - indexWithDictSize) / indexWithoutDictSize;`
   
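   E.g., a long column with 1 million rows and a cardinality of 16 (dictionary = 16 * 8 = 128 bytes; forward index at 4 bits per row is roughly 500000 bytes):
   
   ((8 * 1024 * 1024) - (128 + 500000)) / (8 * 1024 * 1024) ≈ 0.94, or 94%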
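
The arithmetic above can be checked with a small standalone sketch. This is not the code from the PR: the class `DictionarySizeEstimate` and all of its method names are hypothetical, and `bitsPerDictId` is only an assumed stand-in for what `PinotDataBitSet.getNumBitsPerValue` is expected to return for the largest dictionary id. It uses `long` arithmetic throughout and computes the saving relative to the no-dictionary size, as suggested in the comment.

```java
// Hypothetical helper for illustration only; not part of Pinot.
final class DictionarySizeEstimate {
  private DictionarySizeEstimate() {
  }

  // Bits needed to encode dictionary ids 0..cardinality-1 (at least 1 bit).
  // Assumed stand-in for PinotDataBitSet.getNumBitsPerValue(cardinality - 1).
  static int bitsPerDictId(int cardinality) {
    if (cardinality <= 1) {
      return 1;
    }
    return 32 - Integer.numberOfLeadingZeros(cardinality - 1);
  }

  // Dictionary bytes plus bit-packed forward index bytes (rounded up).
  static long sizeWithDictionary(long numRows, int cardinality, int valueSizeBytes) {
    long dictionaryBytes = (long) cardinality * valueSizeBytes;
    long forwardIndexBytes = (numRows * bitsPerDictId(cardinality) + Byte.SIZE - 1) / Byte.SIZE;
    return dictionaryBytes + forwardIndexBytes;
  }

  // Raw forward index: one fixed-width value per row.
  static long sizeWithoutDictionary(long numRows, int valueSizeBytes) {
    return numRows * valueSizeBytes;
  }

  // Fraction of storage saved by creating a dictionary; negative means the
  // dictionary-encoded layout is larger than the raw one.
  static double storageSaved(long numRows, int cardinality, int valueSizeBytes) {
    double withoutDict = sizeWithoutDictionary(numRows, valueSizeBytes);
    double withDict = sizeWithDictionary(numRows, cardinality, valueSizeBytes);
    return (withoutDict - withDict) / withoutDict;
  }

  public static void main(String[] args) {
    long numRows = 1024L * 1024L;
    // Low cardinality: the dictionary saves most of the space.
    System.out.printf("cardinality 16: %.4f%n", storageSaved(numRows, 16, Long.BYTES));
    // Every value distinct: the saving goes negative.
    System.out.printf("all distinct:   %.4f%n", storageSaved(numRows, (int) numRows, Long.BYTES));
  }
}
```

With this orientation a positive value means the dictionary helps and a negative value means it costs space, so the "skip the dictionary" check reduces to comparing the result against whatever threshold is chosen.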
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org