Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/11/09 18:55:52 UTC

[GitHub] [pinot] nizarhejazi opened a new issue, #9770: Sorted index on a non-dictionary-encoded column results in full segment scan

nizarhejazi opened a new issue, #9770:
URL: https://github.com/apache/pinot/issues/9770

   Defining a `sorted index` on a `non-dictionary-encoded` column causes Pinot to fully scan the content of segments (no inverted index is utilized).
   
   `user` table config:
   ```
         "noDictionaryColumns": [
           "id"
         ],
         "sortedColumn": [
           "id"
         ],
   ```
   
   Example query: 
   `EXPLAIN PLAN FOR select id, "home.state" from user where id IN ('some_id')`
   
   Output:
   <img width="1137" alt="image" src="https://user-images.githubusercontent.com/96436550/200915630-691efbcc-6f65-4172-8759-893fadf25aff.png">
   
   For comparison, I created another user table with a _sorted index_ defined on `id`, where `id` is _dictionary-encoded_.
   
   `user_id_dict_encoded` table config:
   ```
         "noDictionaryColumns": [
         ],
         "sortedColumn": [
           "id"
         ],
   ```
   
   Example query: 
   `EXPLAIN PLAN FOR select id, "home.state" from user_id_dict_encoded where id IN ('some_id')`
   
   Output:
   <img width="1139" alt="image" src="https://user-images.githubusercontent.com/96436550/200916480-672d0d0c-493d-46c7-8894-6cc040486be3.png">
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on issue #9770: Sorted index on a non-dictionary-encoded column results in full segment scan

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #9770:
URL: https://github.com/apache/pinot/issues/9770#issuecomment-1320872744

   I'd suggest throwing an error (during table config validation) when an explicitly configured sorted column is also configured as a no-dictionary column, to avoid the confusion. We should make the exception message very clear about why a sorted column should not be configured as no-dictionary.
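
The validation suggested above could look roughly like the following. This is a minimal Python sketch, not Pinot's actual validator: the function name `validate_sorted_columns` is hypothetical, and it assumes the indexing section of the table config has already been parsed into a dict with the `sortedColumn` and `noDictionaryColumns` keys shown in the issue.

```python
def validate_sorted_columns(indexing_config: dict) -> None:
    """Reject configs where a sorted column is also a no-dictionary column.

    Hypothetical helper for illustration only; it mirrors the check proposed
    in this thread, not code that exists in Pinot.
    """
    sorted_cols = indexing_config.get("sortedColumn") or []
    no_dict_cols = set(indexing_config.get("noDictionaryColumns") or [])

    # Any column present in both lists is a conflict.
    conflicts = [col for col in sorted_cols if col in no_dict_cols]
    if conflicts:
        raise ValueError(
            "Sorted column(s) %s must not be listed in noDictionaryColumns: "
            "per this thread, the sorted index is only used when the column "
            "is dictionary-encoded, so filters on a raw sorted column fall "
            "back to a full scan." % conflicts
        )
```

With the `user` config from the issue (`id` in both lists) this raises a `ValueError`; with the `user_id_dict_encoded` config it returns without error.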


[GitHub] [pinot] mayankshriv commented on issue #9770: Sorted index on a non-dictionary-encoded column results in full segment scan

Posted by GitBox <gi...@apache.org>.
mayankshriv commented on issue #9770:
URL: https://github.com/apache/pinot/issues/9770#issuecomment-1309226060

   In the case of real-time ingestion, the following happens:
   - The column does get stored with no-dict (i.e. as raw values).
   - The column is also sorted on raw values, and that is reflected in the metadata as well.
   - However, sorted raw values != sorted index, so this leads to a full scan.
   
   While this is valid, expected behavior, I recommend that we change the precedence. If a user specifies both sorted and no-dict for a column in a real-time table, it would be better to default to a dictionary-based sorted index, so as to avoid a full scan. I recommend this because the penalty of a full scan is much higher than the dict vs. no-dict storage overhead.
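
To illustrate why "sorted raw values != sorted index", here is a simplified Python model. It is a sketch under assumptions, not Pinot's storage format: it treats a sorted index as a dictionary of distinct values plus one contiguous doc-ID range per value, and contrasts that with scanning raw values. All function names are hypothetical.

```python
from bisect import bisect_left


def build_sorted_index(sorted_values):
    """Build (dictionary, ranges) from values already sorted by doc ID.

    ranges[i] is the inclusive (start, end) doc-ID range for dictionary[i].
    """
    dictionary, ranges = [], []
    for doc_id, value in enumerate(sorted_values):
        if dictionary and dictionary[-1] == value:
            # Same value as the previous doc: extend the current range.
            ranges[-1] = (ranges[-1][0], doc_id)
        else:
            dictionary.append(value)
            ranges.append((doc_id, doc_id))
    return dictionary, ranges


def lookup_with_index(dictionary, ranges, target):
    """Binary-search the dictionary, then return the doc-ID range (or None)."""
    i = bisect_left(dictionary, target)
    if i < len(dictionary) and dictionary[i] == target:
        return ranges[i]
    return None


def lookup_with_scan(values, target):
    """Without the index, every value is compared, even though they are sorted."""
    return [doc_id for doc_id, value in enumerate(values) if value == target]
```

In this model the indexed lookup costs one binary search over the distinct values, while the scan touches every document in the segment, which is the penalty described above.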


[GitHub] [pinot] klsince commented on issue #9770: Sorted index on a non-dictionary-encoded column results in full segment scan

Posted by GitBox <gi...@apache.org>.
klsince commented on issue #9770:
URL: https://github.com/apache/pinot/issues/9770#issuecomment-1309223054

   Thanks for the report. The planning logic for the IN clause in the query hits this [if-check](https://github.com/apache/pinot/blob/master/pinot-core/src/main/java/org/apache/pinot/core/operator/filter/FilterOperatorUtils.java#L77), which uses the sorted index for a column only when the column is sorted AND dictionary-encoded.
   
   We may consider enforcing dictionary encoding for columns set in `sortedColumn` to make this more intuitive.
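
The condition described above can be paraphrased as a small Python sketch. This is an illustration of the stated rule only, not a translation of `FilterOperatorUtils`: the function name and the returned labels are hypothetical, and any other index types the real planner may consider are omitted.

```python
def choose_filter_strategy(is_sorted: bool, has_dictionary: bool) -> str:
    """Pick a filter strategy for a column, per the rule stated in this thread.

    Hypothetical simplification: the sorted index is chosen only when the
    column is both sorted and dictionary-encoded; everything else scans.
    """
    if is_sorted and has_dictionary:
        return "sorted_index"
    # A sorted but raw (no-dictionary) column lands here.
    return "scan"
```

Under this rule the `user` table from the issue (sorted, no dictionary) gets `"scan"`, while `user_id_dict_encoded` (sorted, dictionary-encoded) gets `"sorted_index"`.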


[GitHub] [pinot] klsince commented on issue #9770: Sorted index on a non-dictionary-encoded column results in full segment scan

Posted by GitBox <gi...@apache.org>.
klsince commented on issue #9770:
URL: https://github.com/apache/pinot/issues/9770#issuecomment-1309257784

   FYI, currently, column encoding can't be changed (e.g. from raw to dict encoding or vice versa) by reloading segments with the reload or reset Swagger APIs.
   
   There is [an ongoing effort](https://github.com/apache/pinot/issues/9348) to extend the segment loading logic so that it can change column encoding, which should fix such issues in the future.
   
   For now, to work around this issue, one needs to re-ingest the data with the corrected tableConfig: either from upstream into the REALTIME table, or by moving data from the REALTIME table to an OFFLINE table with the RealtimeToOfflineSegment minion task.

