You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "gortiz (via GitHub)" <gi...@apache.org> on 2023/04/14 07:36:31 UTC

[GitHub] [pinot] gortiz opened a new issue, #10615: About FieldConfig.IndexType values

gortiz opened a new issue, #10615:
URL: https://github.com/apache/pinot/issues/10615

   # Objective
   The objective of this issue is to discuss about FieldConfig.IndexType, which:
   * Has been superseded by #10183 
   * Is being used inconsistently
   
   We cannot remove this enum because we need to keep backward compatibility, but we should decide what to do with the four literals that are not documented and are not being used (or used in an inconsistent way).
   
   # Description
   
   `FieldConfig.indexTypes` was intended to be the _new_ way to create all indexes, but it was superseded by `FieldConfig.indexes` attribute introduced in #10183. The main problems with `FieldConfig.indexTypes` were that:
   * Most indexes can be configured in some way. To do so, `FieldConfig.indexTypes` required to introduce unsafe string-based configs in `FieldConfig.properties`.
   * IndexTypes is an enum that doesn't contain all valid indexes and includes some extra values that are not valid.
   
   This issue is related to the last point. Reading the [documentation](https://docs.pinot.apache.org/configuration-reference/table#field-config-list), the valid values here should be: Text, FST, Timestamp and h3. But the enum right now contains the following literals:
   - TEXT
   - FST
   - JSON
   - H3
   - INVERTED
   - SORTED
   - RANGE 
   - TIMESTAMP
   
   The last 4 are not included in the documentation and they are not being consistently used in the code.
   
   # Inconsistent usage
   
   To do this analysis I used tag `release-0.12.1`. The reason to do not use master is that #10183 (already merged) modifies a lot of code related to indexes, so I want to be sure that the inconsistent usage of these literals was already there before `index-spi` was merged:
   
   ## Inverted
   * It was introduced in the first version of this document (2020-02-07).
   * It is not read in production code. The only two usages are in MutableSegmentImpl ([here](https://github.com/apache/pinot/blob/release-0.12.1/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java#L676), and [here](https://github.com/apache/pinot/blob/release-0.12.1/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java#L783)). In both cases it is being used to report some error.
   * It is being used in test, but its usage doesn't seem actually important.
   
   ## Sorted
   * It was introduced in the first version of this document (2020-02-07).
   * It is only being used in [HadoopSegmentPreprocessingJob](https://github.com/apache/pinot/blob/release-0.12.1/pinot-plugins/pinot-batch-ingestion/v0_deprecated/pinot-hadoop/src/main/java/org/apache/pinot/hadoop/job/HadoopSegmentPreprocessingJob.java#L179). This class does not exist in master and it is not being used in master. The fact that it is not an actual index type and it was only used here makes me think that it was introduced as an attempt to do something, but it wasn't finished.
   
   ## Range
   * This was introduced in #7615 (2021-10-25).
   * It is not used in production code. Like inverted, it is being used in tests, but just doesn't seem to do anything.
   * Range is an actual index type, but no range index is created when this literal is added in `indexTypes`.
   
   ## Timestamp
   * This was introduced in #8343 (2022-4-11).
   * The only usage in production code is a check. This check fails if the FieldConfig points to a column that is not a timestamp but `indexTypes` contains the TIMESTAMP literal.
   * Timestamp indexes are not actual indexes. There is no timestamp index writer, reader or handler. They are syntactic sugar (or an alias). When pinot detects that this syntactic sugar has to be applied (more on that later) what actually happens is that Pinot generates one extra row for each timestamp granularity and index that column with a range index.
   * Contrary to what documentation says,  `FieldConfig.indexTypes` doesn't have to contain the TIMESTAMP literal in order to activate this syntactic sugar. It is activated if and only if the `FieldConfig.timestampConfig` is not null. Whether or not TIMESTAMP is included in `FieldConfig.indexTypes` has zero impact on that decision.
   * The timestamp syntactic sugar is just a modification of the table config and the schema and it is applied in:
     * [SegmentProcessorConfig (when it is created)](https://github.com/apache/pinot/blob/release-0.12.1/pinot-core/src/main/java/org/apache/pinot/core/segment/processing/framework/SegmentProcessorConfig.java#L54)
     * [IndexLoadingConfig (when it is configured from a TableConfig)](https://github.com/apache/pinot/blob/release-0.12.1/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/IndexLoadingConfig.java)
     * [SegmentGeneratorConfig (when it is created)](https://github.com/apache/pinot/blob/release-0.12.1/pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/creator/SegmentGeneratorConfig.java#L144)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on issue #10615: About FieldConfig.IndexType values

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on issue #10615:
URL: https://github.com/apache/pinot/issues/10615#issuecomment-1513694877

   Trying to understand the proposal here. Are you suggesting removing the 4 unused enums?
   In long term, we want to use `indexes` config and remove `IndexType`, so I'm okay removing the unused one and add some comments stating that the remaining 4 are for backward compatibility purpose


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] gortiz commented on issue #10615: About FieldConfig.IndexType values

Posted by "gortiz (via GitHub)" <gi...@apache.org>.
gortiz commented on issue #10615:
URL: https://github.com/apache/pinot/issues/10615#issuecomment-1514302380

   I'm just describing the problems I found to open a discussion looking for solutions.
   Recently, I was involved in at least two discussion about the _timestamp index_ or the _sorted index_ where different people had different assumptions about the _indexTypes_ values and their meaning and I wanted to make clear how are they used from code's perspective in order to use that as a base on which we can build the _business_ or _product_ terms we want to define to talk with users and even between devs.
   
   About what to do with them. The options I see are:
   1. Remove them
   2. Ignore them (like what we do right now, but in a more consistent way)
   3. Honor them.
   
   I think the second is the better right now. Easier to implement and the one that doesn't actually impact existent table configs. It is not great from the developer perspective, but it is not that harmful.
   
   ## Remove them
   
   As you say, we cannot just remove literals without breaking backward compatibility.
   At most, we may remove literals that were never documented.
   If we do that, Pinot won't be able to read TableConfigs that were using undocumented things, but we can consider that if they are not documented, they are not part of the stable config and therefore there is no backward compatibility break.
   
   ## Ignore them
   
   But even if we do not remove them, we can use them in a deterministic way. 
   For example, we annotate them as deprecated and ensure that we don't use them anywhere (including tests).
   We can also include a warning when they are using, to indicate users that these literals are actually doing nothing.
   
   Specially in the case of _timestamp_, we should actually ignore the field and actually do the check if _timestampConfig_ is not null, which is what actually enable the _index_.
   
   ## Honor them
   
   Another alternative would be just the opposite.
   Modify the code to actually change its behavior when these literals are present.
   That would also be a backward incompatible change, but it may be seen as a bugfix, given that if the user included `range` in their `indexTypes`, we can assume they though that a range index was going to be created.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org