You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/09/15 14:23:57 UTC

[GitHub] [pinot] richardstartin opened a new issue #7437: Range Index Enhancement Proposal

richardstartin opened a new issue #7437:
URL: https://github.com/apache/pinot/issues/7437


   I would like to propose the introduction of bit-sliced range encoded indexes to improve range index query performance, using the data structure implemented [here](https://github.com/RoaringBitmap/RoaringBitmap/pull/519). Please review the proposal document https://docs.google.com/document/d/1se2OgqXJiD7r7S7U6SUmTIAApO66QIrAYosxvXHEXlw/edit?usp=sharing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] amrishlal commented on issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
amrishlal commented on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-924631371


   @mayankshriv @kishoreg I am also a strong +1 on this proposal :-) as this appears to be a good approach. I didn't mean to imply in any way this should not be done. I was just wondering (mainly from a curiosity point of view) as to how well this would perform if the current RangeIndex if we were to make certain "optimizations" to the current range index to bring it closer to log2(N) search time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin closed issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
richardstartin closed issue #7437:
URL: https://github.com/apache/pinot/issues/7437


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-975799144


   Should we make this the default in 0.10.0? We have already had reports that this improves query latency by integer multiples. @mayankshriv @Jackie-Jiang @kishoreg 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] amrishlal edited a comment on issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
amrishlal edited a comment on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-924631371


   @mayankshriv @kishoreg I am also a strong +1 on this proposal :-) as this appears to be a good approach. I didn't mean to imply in any way that this should not be done. I was just wondering (mainly from a curiosity point of view) as to how well this would perform if the current RangeIndex if we were to make certain "optimizations" to the current range index to bring it closer to log2(N) search time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-975822761


   @richardstartin Yes. I'd suggest making both this and the new raw forward format default in 0.10.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-924715637


   I think it's important to discuss data distribution when considering bucket assignment by value. The Achilles heel of bucketing by value is skewed distributions, which might be produced by e.g. a bursty process leading to clustered timestamps, or maybe latencies in APM data, which tend to be multi-modal with huge outliers extending the range significantly (mean=1ms but p99.9=5s isn't unheard of). If you get, say, 80% of rows falling in a single bucket, evaluation time can't be logarithmic in the number of buckets, even if that limit holds for uniformly distributed data. You can ameliorate this for some distributions by using more buckets (but this means you need to merge more of them at query time), but it's always possible to construct some degenerate case where the buckets become imbalanced (e.g. consider artefacts of upstream processing where e.g. null has been collated to zero and a sparse set of values becomes dense with zeroes). 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kishoreg commented on issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
kishoreg commented on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-924054486


   @amrishlal I suggest reading the design doc in detail again.
   
   The problem with any combination of approx filter and then scan to filter again can have bad edge cases. Moreover, the proposal from @richardstartin will completely eliminate the need for a linear scan. 
   
   - square root of N does not really lead to optimal distribution. Also, this might lead to a lot of buckets which will be really bad
   - binary searching of buckets will not make a huge difference when the number of buckets is low. Also, note that we need to find a list of buckets, not just one bucket. A range can span multiple buckets.
   -We cannot do binary search within the bucket since the posting lists are ordered by docIds and not values.
   
   Overall, I am strong +1 on this proposal. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kishoreg commented on issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
kishoreg commented on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-920161362


   This is an amazing doc. We should probably add this doc as examples in our contribution guidelines


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] amrishlal edited a comment on issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
amrishlal edited a comment on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-924631371


   @mayankshriv @kishoreg I am also a strong +1 on this proposal :-) as this appears to be a good approach. I didn't mean to imply in any way that this should not be done. I was just wondering (mainly from a curiosity point of view) as to how this approach would compare with current RangeIndex if we were to make certain "optimizations" to the current range index to bring it closer to log2(N) search time. While sqrt(N) is the optimal number of buckets, I agree it can lead to a large number of buckets (approx 30K for a billion value column)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] kishoreg edited a comment on issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
kishoreg edited a comment on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-920161362


   This is an amazing doc. We should probably add this doc as example in our contribution guidelines


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] amrishlal commented on issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
amrishlal commented on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-923691071


   The proposal doc is very interesting reading. Thanks for creating it. I am still in the process of reading through it, but wanted to add a few points regarding the current RangeIndex. A while back, we had done a high-level evaluation of the current RangeIndex algorithm and came up with the following optimizations:
   - If there are N values in the column being indexed, then `RangeIndexCreator` should automatically set the number of buckets to `sqrt(N)` as this is proven to be the most optimal size for searching the right bucket and then searching within the bucket for the right position.
   - Instead of carrying out linear search, `RangeIndexBasedFilterOperator` should carry out a binary search to locate the right bucket.
   - A binary search should also be carried out to locate the position with the bucket (not completely sure if this can be done).
   
   We concluded that these three improvements would improve the performance of RangeIndex to 1/2 * log2(N) for any given starting or ending value. We didn't get around to making these optimizations, but I thought I would put these here for completeness and performance comparison against the "Bit Slice Range Index". The changes (except for maybe the third one) should be straight-forward to implement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] mayankshriv commented on issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
mayankshriv commented on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-924026665


   @amrishlal Thanks for sharing your thoughts on the existing range index impl. Do you have an implementation or numbers to compare with? From what I understand, the proposal here is to have a separate implementation (based on offline experiments that show better results), but not to replace the existing one. Once we have the impl, we can perform detailed comparisons and pick the right one based on data driven results.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on issue #7437: Range Index Enhancement Proposal

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #7437:
URL: https://github.com/apache/pinot/issues/7437#issuecomment-976003129


   @Jackie-Jiang let’s keep the forward index version separate because it will need a little bit of work to allow the fixed and variable length writer versions to evolve independently, which isn’t a problem when enabling on a column by column basis. I’ll open an issue for that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org