You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2020/05/30 23:13:44 UTC

[GitHub] [incubator-pinot] siddharthteotia opened a new pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

siddharthteotia opened a new pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470


   (1) PR https://github.com/apache/incubator-pinot/pull/5256 added support for deriving num docs per chunk for var byte raw index create from column length. This was specifically
   done as part of supporting large text values. For use cases that don't want this feature and are high QPS, they see a negative impact since size of chunk increases (earlier value
   of numDocsPerChunk was hardcoded to 1000) and based on the access pattern we might end up uncompressing a bigger chunk to get values for a set of docIds. We have made this change configurable. So the default behavior is same as old (1000 docs per chunk. It can be enabled as follows
   
   `fieldConfigList":[
      {
        "name":"textCol",
        "encodingType":"RAW",
        "indexType":"TEXT",
        "properties":{
           "derive.num.chunks.raw.index":"true",
         }
       }
   `
   
   (2) PR https://github.com/apache/incubator-pinot/pull/4791 added support for noDict for STRING/BYTES in consuming segments. Before PR 4791, even if user had STRING/BYTES as no dictionary in table config, consuming segment still created dictionary because of the lack of support for raw index.  There is a particular impact of this change on the use cases that have set noDict on their STRING dimension columns for other performance reasons and also want metricsAggregation. These use cases don't get to aggregateMetrics because the new implementation was able to honor their table config setting of noDict on STRING/BYTES and created a raw index. Without metrics aggregation, memory pressure increases. So to continue aggregating metrics for such cases, we will create dictionary for STRING/BYTES even if the column is part of noDictionary set from table config.
   
   ## Description
   Add a description of your PR here.
   A good description should include pointers to an issue or design document, etc.
   ## Upgrade Notes
   Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)
   * [ ] Yes (Please label as **<code>backward-incompat</code>**, and complete the section below on Release Notes)
   
   Does this PR fix a zero-downtime upgrade introduced earlier?
   * [ ] Yes (Please label this as **<code>backward-incompat</code>**, and complete the section below on Release Notes)
   
   Does this PR otherwise need attention when creating release notes? Things to consider:
   - New configuration options
   - Deprecation of configurations
   - Signature changes to public methods/interfaces
   - New plugins added or old plugins removed
   * [ ] Yes (Please label this PR as **<code>release-notes</code>** and complete the section on Release Notes)
   ## Release Notes
   If you have tagged this as either backward-incompat or release-notes,
   you MUST add text here that you would like to see appear in release notes of the
   next release.
   
   If you have a series of commits adding or enabling a feature, then
   add this section only in final commit that marks the feature completed.
   Refer to earlier release notes to see examples of text
   
   ## Documentation
   If you have introduced a new feature or configuration, please add it to the documentation as well.
   See https://docs.pinot.apache.org/developers/developers-and-contributors/update-document
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] mcvsubbu commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432897422



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentColumnarIndexCreator.java
##########
@@ -213,6 +215,14 @@ public void init(SegmentGeneratorConfig segmentCreationSpec, SegmentIndexCreatio
     }
   }
 
+  public static boolean shouldDeriveNumDocsPerChunk(String columnName, Map<String, Map<String, String>> columnProperties) {
+    if (columnProperties != null) {
+      Map<String, String> properties = columnProperties.get(columnName);
+      return properties != null && Boolean.parseBoolean(properties.get(FieldConfig.DERIVE_NUM_DOCS_PER_CHUNK_RAW_INDEX_KEY));

Review comment:
       Can we derive it automatically? (e.g. if column is text index then we derive it from metadata) Or, do you see this being usefiul in other cases as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] snleee commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
snleee commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432898475



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentColumnarIndexCreator.java
##########
@@ -213,6 +215,14 @@ public void init(SegmentGeneratorConfig segmentCreationSpec, SegmentIndexCreatio
     }
   }
 
+  public static boolean shouldDeriveNumDocsPerChunk(String columnName, Map<String, Map<String, String>> columnProperties) {
+    if (columnProperties != null) {
+      Map<String, String> properties = columnProperties.get(columnName);
+      return properties != null && Boolean.parseBoolean(properties.get(FieldConfig.DERIVE_NUM_DOCS_PER_CHUNK_RAW_INDEX_KEY));

Review comment:
       @mcvsubbu deriving was the issue for our production issue.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] mcvsubbu commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432897195



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
    */
   private boolean isNoDictionaryColumn(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
       Set<String> textIndexColumns, FieldSpec fieldSpec, String column) {
-    return textIndexColumns.contains(column) || (noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-        && !invertedIndexColumns.contains(column));
+    if (textIndexColumns.contains(column)) {

Review comment:
       This check here seems a little dangerous to me. We do have column level settings, it is better to throw an exception if a column has both text index as well as dictionary. If we somehow add a dictionary for text column later, we will have to remember to come back to change this place.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] mcvsubbu commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432897195



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
    */
   private boolean isNoDictionaryColumn(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
       Set<String> textIndexColumns, FieldSpec fieldSpec, String column) {
-    return textIndexColumns.contains(column) || (noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-        && !invertedIndexColumns.contains(column));
+    if (textIndexColumns.contains(column)) {

Review comment:
       This check here seems a little dangerous to me. We do have column level settings, it is better to throw an exception if a column has both text index as well as dictionary.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432898453



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentColumnarIndexCreator.java
##########
@@ -213,6 +215,14 @@ public void init(SegmentGeneratorConfig segmentCreationSpec, SegmentIndexCreatio
     }
   }
 
+  public static boolean shouldDeriveNumDocsPerChunk(String columnName, Map<String, Map<String, String>> columnProperties) {
+    if (columnProperties != null) {
+      Map<String, String> properties = columnProperties.get(columnName);
+      return properties != null && Boolean.parseBoolean(properties.get(FieldConfig.DERIVE_NUM_DOCS_PER_CHUNK_RAW_INDEX_KEY));

Review comment:
       > Defaults to false, right?
   
   Yes
   
   > Can we derive it automatically? (e.g. if column is text index then we derive it from metadata) Or, do you see this being usefiul in other cases as well?
   
   We could. Even for columns with text indexes, I don't think we should use it by default (since now that we have seen the potential -ve impact related to access pattern). Yes, most likely this will be used for columns with text index but only if the average column value size is very large (around 1-2MB) since that is the case which takes the chunk size and compressed chunk size (2 * raw) > 2GB and deriving the numDocsPerChunk becomes useful. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] mayankshriv commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
mayankshriv commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432896660



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
    */
   private boolean isNoDictionaryColumn(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
       Set<String> textIndexColumns, FieldSpec fieldSpec, String column) {
-    return textIndexColumns.contains(column) || (noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-        && !invertedIndexColumns.contains(column));
+    if (textIndexColumns.contains(column)) {
+      // text column is no dictionary currently
+      return true;
+    }
+    FieldSpec.DataType dataType = fieldSpec.getDataType();
+    if (noDictionaryColumns.contains(column)) {
+      // Earlier we didn't support noDict in consuming segments for STRING and BYTES columns.
+      // So even if the user had the column in noDictionaryColumns set in table config, we still
+      // created dictionary in consuming segments.
+      // Later on we added this support. There is a particular impact of this change on the use cases
+      // that have set noDict on their STRING dimension columns for other performance
+      // reasons and also want metricsAggregation. These use cases don't get to
+      // aggregateMetrics because the new implementation is able to honor their table config setting
+      // of noDict on STRING/BYTES. Without metrics aggregation, memory pressure increases.
+      // So to continue aggregating metrics for such cases, we will create dictionary even
+      // if the column is part of noDictionary set from table config
+      if (fieldSpec instanceof DimensionFieldSpec && _aggregateMetrics && (dataType == FieldSpec.DataType.STRING ||

Review comment:
       Log message for this?

##########
File path: pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentColumnarIndexCreator.java
##########
@@ -213,6 +215,14 @@ public void init(SegmentGeneratorConfig segmentCreationSpec, SegmentIndexCreatio
     }
   }
 
+  public static boolean shouldDeriveNumDocsPerChunk(String columnName, Map<String, Map<String, String>> columnProperties) {
+    if (columnProperties != null) {
+      Map<String, String> properties = columnProperties.get(columnName);
+      return properties != null && Boolean.parseBoolean(properties.get(FieldConfig.DERIVE_NUM_DOCS_PER_CHUNK_RAW_INDEX_KEY));

Review comment:
       Defaults to false, right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] snleee commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
snleee commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432898475



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentColumnarIndexCreator.java
##########
@@ -213,6 +215,14 @@ public void init(SegmentGeneratorConfig segmentCreationSpec, SegmentIndexCreatio
     }
   }
 
+  public static boolean shouldDeriveNumDocsPerChunk(String columnName, Map<String, Map<String, String>> columnProperties) {
+    if (columnProperties != null) {
+      Map<String, String> properties = columnProperties.get(columnName);
+      return properties != null && Boolean.parseBoolean(properties.get(FieldConfig.DERIVE_NUM_DOCS_PER_CHUNK_RAW_INDEX_KEY));

Review comment:
       @mcvsubbu deriving was the issue for our production issue. So, we need a way to control when we want to derive or use the default value.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] snleee commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
snleee commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432898475



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentColumnarIndexCreator.java
##########
@@ -213,6 +215,14 @@ public void init(SegmentGeneratorConfig segmentCreationSpec, SegmentIndexCreatio
     }
   }
 
+  public static boolean shouldDeriveNumDocsPerChunk(String columnName, Map<String, Map<String, String>> columnProperties) {
+    if (columnProperties != null) {
+      Map<String, String> properties = columnProperties.get(columnName);
+      return properties != null && Boolean.parseBoolean(properties.get(FieldConfig.DERIVE_NUM_DOCS_PER_CHUNK_RAW_INDEX_KEY));

Review comment:
       @mcvsubbu deriving was the issue for our production issue. So, we need a way to control when we want to derive or use the default value.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] mcvsubbu commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432897362



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
    */
   private boolean isNoDictionaryColumn(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
       Set<String> textIndexColumns, FieldSpec fieldSpec, String column) {
-    return textIndexColumns.contains(column) || (noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-        && !invertedIndexColumns.contains(column));
+    if (textIndexColumns.contains(column)) {
+      // text column is no dictionary currently
+      return true;
+    }
+    FieldSpec.DataType dataType = fieldSpec.getDataType();
+    if (noDictionaryColumns.contains(column)) {
+      // Earlier we didn't support noDict in consuming segments for STRING and BYTES columns.
+      // So even if the user had the column in noDictionaryColumns set in table config, we still
+      // created dictionary in consuming segments.
+      // Later on we added this support. There is a particular impact of this change on the use cases
+      // that have set noDict on their STRING dimension columns for other performance
+      // reasons and also want metricsAggregation. These use cases don't get to
+      // aggregateMetrics because the new implementation is able to honor their table config setting
+      // of noDict on STRING/BYTES. Without metrics aggregation, memory pressure increases.
+      // So to continue aggregating metrics for such cases, we will create dictionary even
+      // if the column is part of noDictionary set from table config
+      if (fieldSpec instanceof DimensionFieldSpec && _aggregateMetrics && (dataType == FieldSpec.DataType.STRING ||

Review comment:
       should the name of the method be changed to `shouldCreateDictionaryForColumn()`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] mcvsubbu commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432905885



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
    */
   private boolean isNoDictionaryColumn(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
       Set<String> textIndexColumns, FieldSpec fieldSpec, String column) {
-    return textIndexColumns.contains(column) || (noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-        && !invertedIndexColumns.contains(column));
+    if (textIndexColumns.contains(column)) {

Review comment:
       Excellent. I think we still rename the method as something along the lines if `shouldCreateDictionaryForColumn()`, since it has some logic and is not just checking the table config for noDictionary setting,




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia merged pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
siddharthteotia merged pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432898231



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
    */
   private boolean isNoDictionaryColumn(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
       Set<String> textIndexColumns, FieldSpec fieldSpec, String column) {
-    return textIndexColumns.contains(column) || (noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-        && !invertedIndexColumns.contains(column));
+    if (textIndexColumns.contains(column)) {

Review comment:
       Actually it is not needed anymore. I already do the validation upfront in TableConfig (already in master). So we can remove it. Per column config is enough




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] snleee commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
snleee commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432895436



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
    */
   private boolean isNoDictionaryColumn(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
       Set<String> textIndexColumns, FieldSpec fieldSpec, String column) {
-    return textIndexColumns.contains(column) || (noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-        && !invertedIndexColumns.contains(column));
+    if (textIndexColumns.contains(column)) {
+      // text column is no dictionary currently
+      return true;
+    }
+    FieldSpec.DataType dataType = fieldSpec.getDataType();
+    if (noDictionaryColumns.contains(column)) {
+      // Earlier we didn't support noDict in consuming segments for STRING and BYTES columns.
+      // So even if the user had the column in noDictionaryColumns set in table config, we still
+      // created dictionary in consuming segments.
+      // Later on we added this support. There is a particular impact of this change on the use cases
+      // that have set noDict on their STRING dimension columns for other performance
+      // reasons and also want metricsAggregation. These use cases don't get to
+      // aggregateMetrics because the new implementation is able to honor their table config setting
+      // of noDict on STRING/BYTES. Without metrics aggregation, memory pressure increases.
+      // So to continue aggregating metrics for such cases, we will create dictionary even
+      // if the column is part of noDictionary set from table config
+      if (fieldSpec instanceof DimensionFieldSpec && _aggregateMetrics && (dataType == FieldSpec.DataType.STRING ||

Review comment:
       What was the original behavior before your recent change? Did we explicitly check `STRING` and `BYTES` types also?

##########
File path: pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentColumnarIndexCreator.java
##########
@@ -193,9 +194,10 @@ public void init(SegmentGeneratorConfig segmentCreationSpec, SegmentIndexCreatio
             getColumnCompressionType(segmentCreationSpec, fieldSpec);
 
         // Initialize forward index creator
+        boolean deriveNumChunksForVarByteRawIndex = shouldDeriveNumChunksForRawIndex(columnName, segmentCreationSpec.getColumnProperties());

Review comment:
       `deriveNumDocsPerChunk` sounds more accurate? (applicable to all related config & variable names)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] snleee commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
snleee commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432896575



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
    */
   private boolean isNoDictionaryColumn(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
       Set<String> textIndexColumns, FieldSpec fieldSpec, String column) {
-    return textIndexColumns.contains(column) || (noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-        && !invertedIndexColumns.contains(column));
+    if (textIndexColumns.contains(column)) {
+      // text column is no dictionary currently
+      return true;
+    }
+    FieldSpec.DataType dataType = fieldSpec.getDataType();
+    if (noDictionaryColumns.contains(column)) {
+      // Earlier we didn't support noDict in consuming segments for STRING and BYTES columns.
+      // So even if the user had the column in noDictionaryColumns set in table config, we still
+      // created dictionary in consuming segments.
+      // Later on we added this support. There is a particular impact of this change on the use cases
+      // that have set noDict on their STRING dimension columns for other performance
+      // reasons and also want metricsAggregation. These use cases don't get to
+      // aggregateMetrics because the new implementation is able to honor their table config setting
+      // of noDict on STRING/BYTES. Without metrics aggregation, memory pressure increases.
+      // So to continue aggregating metrics for such cases, we will create dictionary even
+      // if the column is part of noDictionary set from table config
+      if (fieldSpec instanceof DimensionFieldSpec && _aggregateMetrics && (dataType == FieldSpec.DataType.STRING ||

Review comment:
       Sounds good to me 👍 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] jackjlli commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
jackjlli commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432896019



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
    */
   private boolean isNoDictionaryColumn(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
       Set<String> textIndexColumns, FieldSpec fieldSpec, String column) {
-    return textIndexColumns.contains(column) || (noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-        && !invertedIndexColumns.contains(column));
+    if (textIndexColumns.contains(column)) {

Review comment:
       The description of textIndexColumns is missing in this method.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432896416



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
    */
   private boolean isNoDictionaryColumn(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
       Set<String> textIndexColumns, FieldSpec fieldSpec, String column) {
-    return textIndexColumns.contains(column) || (noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-        && !invertedIndexColumns.contains(column));
+    if (textIndexColumns.contains(column)) {
+      // text column is no dictionary currently
+      return true;
+    }
+    FieldSpec.DataType dataType = fieldSpec.getDataType();
+    if (noDictionaryColumns.contains(column)) {
+      // Earlier we didn't support noDict in consuming segments for STRING and BYTES columns.
+      // So even if the user had the column in noDictionaryColumns set in table config, we still
+      // created dictionary in consuming segments.
+      // Later on we added this support. There is a particular impact of this change on the use cases
+      // that have set noDict on their STRING dimension columns for other performance
+      // reasons and also want metricsAggregation. These use cases don't get to
+      // aggregateMetrics because the new implementation is able to honor their table config setting
+      // of noDict on STRING/BYTES. Without metrics aggregation, memory pressure increases.
+      // So to continue aggregating metrics for such cases, we will create dictionary even
+      // if the column is part of noDictionary set from table config
+      if (fieldSpec instanceof DimensionFieldSpec && _aggregateMetrics && (dataType == FieldSpec.DataType.STRING ||

Review comment:
       Yes we checked for STRING. See here https://github.com/apache/incubator-pinot/pull/4791
   
   Note that there is a method  `enableMetricsAggregationIfPossible` that decides whether metrics can be aggregated or not and that checks whether all dimensions have dictionary, all metrics should not have dictionary and should be SV etc. That method is still intact.
   
   Just that during initialization of MutableSegmentImpl, we used to check for STRING and remove it from noDictionaryColumns set since raw index wasn't supported. This was actually the reason why the use cases were able to specify it as noDict in config and still able to aggregate metrics. 

##########
File path: pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##########
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
    */
   private boolean isNoDictionaryColumn(Set<String> noDictionaryColumns, Set<String> invertedIndexColumns,
       Set<String> textIndexColumns, FieldSpec fieldSpec, String column) {
-    return textIndexColumns.contains(column) || (noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-        && !invertedIndexColumns.contains(column));
+    if (textIndexColumns.contains(column)) {

Review comment:
       done

##########
File path: pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentColumnarIndexCreator.java
##########
@@ -193,9 +194,10 @@ public void init(SegmentGeneratorConfig segmentCreationSpec, SegmentIndexCreatio
             getColumnCompressionType(segmentCreationSpec, fieldSpec);
 
         // Initialize forward index creator
+        boolean deriveNumChunksForVarByteRawIndex = shouldDeriveNumChunksForRawIndex(columnName, segmentCreationSpec.getColumnProperties());

Review comment:
       done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] mcvsubbu commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432905837



##########
File path: pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentColumnarIndexCreator.java
##########
@@ -213,6 +215,14 @@ public void init(SegmentGeneratorConfig segmentCreationSpec, SegmentIndexCreatio
     }
   }
 
+  public static boolean shouldDeriveNumDocsPerChunk(String columnName, Map<String, Map<String, String>> columnProperties) {
+    if (columnProperties != null) {
+      Map<String, String> properties = columnProperties.get(columnName);
+      return properties != null && Boolean.parseBoolean(properties.get(FieldConfig.DERIVE_NUM_DOCS_PER_CHUNK_RAW_INDEX_KEY));

Review comment:
       The reason I ask is that if we introduce a config it is hard to remove/deprecate, etc. if we make it a default for text column, we can always introduce a config later to adjust. In both offline and realtime cases, we know the average column size (or, can compute easily) at the segment generation time, so it seems to me that this can be done automatically without introducing a configuration. I would propose to NOT introduce a config at this time




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org