You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/11/02 08:59:41 UTC
[GitHub] [pinot] KKcorps commented on a diff in pull request #9527: Do not create dictionary for high-cardinality columns

KKcorps commented on code in PR #9527:
URL: https://github.com/apache/pinot/pull/9527#discussion_r1011392449


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java:
##########
@@ -373,23 +373,33 @@ private boolean createDictionaryForColumn(ColumnIndexCreationInfo info, SegmentG
       return false;
     }
 
-    // Do not create dictionary if index size with dictionary is going to be larger than index size without dictionary
-    // This is done to reduce the cost of dictionary for high cardinality columns
-    // Off by default and needs optimizeDictionaryEnabled to be set to true
-    if (config.isOptimizeDictionaryForMetrics() && spec.getFieldType() == FieldType.METRIC && spec.isSingleValueField()
-        && spec.getDataType().isFixedWidth()) {
-      long dictionarySize = info.getDistinctValueCount() * spec.getDataType().size();
-      long forwardIndexSize =
-          ((long) info.getTotalNumberOfEntries() * PinotDataBitSet.getNumBitsPerValue(info.getDistinctValueCount() - 1)
-              + Byte.SIZE - 1) / Byte.SIZE;
-
-      double indexWithDictSize = dictionarySize + forwardIndexSize;
-      double indexWithoutDictSize = info.getTotalNumberOfEntries() * spec.getDataType().size();
-
-      double indexSizeRatio = indexWithoutDictSize / indexWithDictSize;
-      if (indexSizeRatio <= config.getNoDictionarySizeRatioThreshold()) {
+
+

Review Comment:
   > I can see that for json and text column, we might not want to create dictionary, but for other dimensions, in most cases we still want to create dictionaries, or a lot of indexes cannot be applied.
   > With the current change, for existing users who have optimize dictionary set for metrics, this will automatically apply that to dimensions, which can cause serious regression (inverted index cannot be added).
   > How about adding a config to only apply this to json/text column?
   
   Actually the reason for this change was to introduce this config for dimension columns (after complaints about space amplification and memory usage from users). 
   Json and text index got introduced later in the scope. 
   IMO, what we can do though is then introduce a seperate metric `optimizeDictionaryForDimensions` but mention the risk with setting this config. 
   
   users do have cases where they keep String columns as dimensions but don't really do any filtering on top of them. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org