Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/01/21 23:44:25 UTC

[GitHub] [incubator-pinot] kkrugler opened a new issue #6473: Support no forward index for column

kkrugler opened a new issue #6473:
URL: https://github.com/apache/incubator-pinot/issues/6473


   Currently a text column can be created without any forward index, which is useful when using the column only for filtering. In this situation, the raw (original) text data is not needed, only the text index (see https://github.com/apache/incubator-pinot/pull/6284/).
   
   There are other situations for non-text columns where this same functionality is useful to reduce the size of the column. In our particular use case, we're generating unique terms for a (large) string field, which we save as a multi-value STRING column. We need an inverted index for fast filtering, but we do not need the forward index, which (leaving aside the inverted index, which is built at load time) accounts for more than 80% of the total segment size.
   
   @kishoreg suggested "having a empty forward Index reader impl" as a way of implementing this.
   
   We could possibly handle the configuration of this via a new `noFwdIndexColumns` table config field, similar to the `noDictionaryColumns` config setting.
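
A minimal sketch of what such a setting might look like, written as a Python dict that serializes to a table config fragment. `noFwdIndexColumns` is only the name proposed above and is hypothetical; placing it in `tableIndexConfig` next to `noDictionaryColumns` and `invertedIndexColumns` is an assumption, not a confirmed design.

```python
import json

# Hypothetical sketch: `noFwdIndexColumns` is the field proposed in this issue,
# not an existing Pinot setting. `noDictionaryColumns` and
# `invertedIndexColumns` are existing `tableIndexConfig` fields.
table_index_config = {
    "invertedIndexColumns": ["creativeText_terms"],
    "noDictionaryColumns": [],
    # Proposed (hypothetical): columns for which no forward index is built.
    "noFwdIndexColumns": ["creativeText_terms"],
}

table_config_fragment = {"tableIndexConfig": table_index_config}

# Serialize to the JSON form a table config would use.
print(json.dumps(table_config_fragment, indent=2))
```

Keeping the new list beside `noDictionaryColumns` would mirror how per-column index choices are already expressed, which is why the sketch puts it there.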
   
   There would be situations where specifying no forward index for a column would trigger a table config error, for example doing this for a metrics column (or so I assume).
   
   I'm also not sure whether it would be valid to have a column with no index, no dictionary, and no forward index; would that mean the field is simply ignored in the input data?
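
The validation concerns above could be sketched roughly as follows. This is not Pinot code: the function name, the error messages, and both rules are assumptions. Rejecting metric columns follows the guess in the issue text, and rejecting a column that has no inverted index is just one possible answer to the open question about columns with no index at all.

```python
def validate_no_fwd_index(column_types, inverted_index_columns, no_fwd_index_columns):
    """Return error strings for a hypothetical `noFwdIndexColumns` setting.

    column_types maps column name -> "DIMENSION" or "METRIC".
    Both rules below are assumptions made for illustration.
    """
    errors = []
    inverted = set(inverted_index_columns)
    for column in no_fwd_index_columns:
        column_type = column_types.get(column)
        if column_type is None:
            errors.append(f"{column}: not defined in the schema")
        elif column_type == "METRIC":
            # Assumed rule: metric values must be readable, so keep the forward index.
            errors.append(f"{column}: metric columns need a forward index")
        elif column not in inverted:
            # Assumed rule: without any other index the column would be unusable.
            errors.append(f"{column}: no forward index and no inverted index")
    return errors


schema = {"creativeText_terms": "DIMENSION", "clicks": "METRIC", "notes": "DIMENSION"}
print(validate_no_fwd_index(schema, ["creativeText_terms"], ["creativeText_terms"]))
print(validate_no_fwd_index(schema, ["creativeText_terms"], ["clicks", "notes"]))
```

The column names `clicks` and `notes` are made up for the example; only `creativeText_terms` comes from this thread.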


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia edited a comment on issue #6473: Support no forward index for column

Posted by GitBox <gi...@apache.org>.
siddharthteotia edited a comment on issue #6473:
URL: https://github.com/apache/incubator-pinot/issues/6473#issuecomment-770247494


   What's the size of the forward index for the multi-value column? Dictionary IDs in the forward index are bit-encoded. It looks like the column has very high cardinality, and there must be several million rows per segment for the forward index to add up to a sizable overhead.




[GitHub] [incubator-pinot] kkrugler commented on issue #6473: Support no forward index for column

Posted by GitBox <gi...@apache.org>.
kkrugler commented on issue #6473:
URL: https://github.com/apache/incubator-pinot/issues/6473#issuecomment-782425259


   Hi @siddharthteotia - yes, one example segment has 2,637,935 rows, and `metadata.properties` for the column of interest (`creativeText_terms`) shows a cardinality of 48,591 (though that's lower than what I was expecting).
   
   ```
   column.creativeText_terms.cardinality = 48591
   column.creativeText_terms.totalDocs = 2637935
   column.creativeText_terms.dataType = STRING
   column.creativeText_terms.bitsPerElement = 16
   column.creativeText_terms.lengthOfEachEntry = 60
   column.creativeText_terms.columnType = DIMENSION
   column.creativeText_terms.isSorted = false
   column.creativeText_terms.hasNullValue = false
   column.creativeText_terms.hasDictionary = true
   column.creativeText_terms.textIndexType = NONE
   column.creativeText_terms.hasInvertedIndex = true
   column.creativeText_terms.hasFSTIndex = false
   column.creativeText_terms.hasJsonIndex = false
   column.creativeText_terms.isSingleValues = false
   column.creativeText_terms.maxNumberOfMultiValues = 49
   column.creativeText_terms.totalNumberOfEntries = 14628086
   column.creativeText_terms.isAutoGenerated = false
   column.creativeText_terms.minValue = 0.01
   column.creativeText_terms.maxValue = \u1EE9ng
   column.creativeText_terms.defaultNullValue = null
   ```
   
   The dictionary is 2.9MB, and the forward index is 31MB:
   
   ```
   creativeText_terms.dictionary.startOffset = 1648876
   creativeText_terms.dictionary.size = 2915468
   creativeText_terms.forward_index.startOffset = 4564344
   creativeText_terms.forward_index.size = 31110427
   ```
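
As a rough cross-check of the figures above, the following sketch recomputes the forward index size from the metadata values. It assumes the forward index stores one fixed-width dictionary ID per entry; the actual on-disk layout for multi-value columns is not described in this thread, so the leftover bytes are only labeled as unexplained overhead.

```python
# Values copied from the metadata and index map listings above.
cardinality = 48591
total_docs = 2637935
total_entries = 14628086
bits_per_element = 16
dictionary_start, dictionary_size = 1648876, 2915468
forward_start, forward_size = 4564344, 31110427

# Minimum bits needed to address every dictionary ID (0 .. cardinality - 1).
min_bits = (cardinality - 1).bit_length()

# Assumption: one fixed-width dictionary ID per entry, bit-packed.
packed_id_bytes = total_entries * bits_per_element // 8

# Whatever remains is not accounted for by the packed IDs alone.
overhead_bytes = forward_size - packed_id_bytes

# Share of the forward index within dictionary + forward index only.
forward_share = forward_size / (dictionary_size + forward_size)
avg_values_per_doc = total_entries / total_docs

print(f"min bits per ID:        {min_bits}")
print(f"packed dictionary IDs:  {packed_id_bytes:,} bytes")
print(f"unexplained overhead:   {overhead_bytes:,} bytes")
print(f"forward share:          {forward_share:.1%}")
print(f"avg values per doc:     {avg_values_per_doc:.2f}")
```

Under that assumption the packed IDs alone come to 29,256,172 bytes, which accounts for most of the 31,110,427-byte forward index. The listing is also internally consistent: the dictionary's start offset plus its size equals the forward index's start offset.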



