Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/07/13 17:24:35 UTC

[GitHub] [pinot] siddharthteotia commented on issue #5246: Pagination support for groupBy queries

siddharthteotia commented on issue #5246:
URL: https://github.com/apache/pinot/issues/5246#issuecomment-1183488539

   At LinkedIn, we have started working on pagination as a priority, given the multiple requests we have received internally.
   
   At a high level, the customer requirement is to run a query in Pinot that can potentially return a large response, where the user application does not want to hold the entire response in memory at once and instead wants the ability to paginate the response as multiple result sets (with the size of each result set dictated by the user application).
   
   The current pagination implementation in Pinot (even though it exists only for selection queries) is sub-optimal in the sense that it treats each page as a fresh query: it re-executes the query for every pagination window, discards the results outside the window, and returns only the results within the M, N window that the user asked for.
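   
   To make that cost concrete, here is a rough back-of-the-envelope sketch in Java (the page size and page count are hypothetical numbers, not from this issue): to serve page k of size N as a fresh query, the engine has to materialize the first k*N + N ordered rows and discard the first k*N of them, so serving P pages materializes roughly N*P*(P+1)/2 rows in total to deliver only N*P.
   
      // Hypothetical illustration only: rows materialized vs. rows delivered
      // when every pagination window is executed as a fresh OFFSET/LIMIT query.
      public class PaginationCost {
        public static void main(String[] args) {
          long pageSize = 100_000L;   // N, rows returned per call (hypothetical)
          long pages = 100L;          // P, number of calls to fetch 10M rows
      
          long materialized = 0L;
          for (long k = 0; k < pages; k++) {
            // Page k is a fresh query with offset k*N and fetch N: the engine must
            // produce the first k*N + N rows and discard the first k*N of them.
            materialized += k * pageSize + pageSize;
          }
          long delivered = pages * pageSize;
      
          System.out.println("Rows delivered to the app: " + delivered);       // 10,000,000
          System.out.println("Rows materialized in total: " + materialized);   // 505,000,000
        }
      }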
   
   The main thing to note about pagination is that it has to be treated as a single query. 
   
   Our customers don't want to run a one-off pagination query (OFFSET M, FETCH N) where M and N are essentially arbitrary; in that case it is not possible to reason about the results, and it is even hard for the user to pick M as a one-off starting point.
   
   The semantics we want to provide are: "I want to fetch 10 million records from Pinot for a query, and I want to fetch 100K at a time".
   
   So the customer will typically start with M at 0, keep N fixed (say at 10000 or so), and keep paging through the results via multiple calls from their app, each of which simply advances M (and potentially refreshes the results returned by Pinot in the UI on every call).
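   
   For illustration, here is a minimal sketch of that client-side paging loop in Java. The QueryClient interface, the table/column names, and the LIMIT <offset>, <count> query shape are hypothetical stand-ins for the example, not the actual Pinot client API or a committed design:
   
      import java.util.List;
      
      // Hypothetical stand-in for a real Pinot client; execute(...) returns the
      // rows of the current page.
      interface QueryClient {
        List<String[]> execute(String sql);
      }
      
      // Sketch of the intended usage pattern: fixed page size, offset advanced on
      // every call, stop when the app has seen enough or the page comes back short.
      public class PagingLoop {
        public static void fetchAll(QueryClient client, long pageSize) {
          long offset = 0;
          while (true) {
            String sql = "SELECT memberId, metric FROM myTable ORDER BY memberId"
                + " LIMIT " + offset + ", " + pageSize;
            List<String[]> page = client.execute(sql);
      
            // hand the page to the app / refresh the UI here
      
            if (page.size() < pageSize) {
              break;            // last page reached (or the app can stop earlier)
            }
            offset += pageSize; // M advances by N on every call
          }
        }
      }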
   
   I think we should look at the pagination problem from this perspective rather than as a random one-off pagination query, because M and N don't make any sense for a random query. The result of a random pagination query doesn't add any value to the user, since they want to look at the entire result as a continuous stream of results with the option to stop at any time.
   
   We are trying to tackle the problem from the above perspective while providing pagination semantics. Detailed design discussion is in progress.
   
   Some more thoughts slightly related to this -- 
   
   Now, one problem is that users who run such queries may tend to think that pagination support means they can run "any" query in Pinot, no matter how long-running, and that Pinot is guaranteed to finish it and provide results. This can easily cause OOM (out of memory) errors and bring down the cluster.
   
   Pinot is unlikely to enter the territory of running very long-running queries and producing the entire, 100% accurate result by spilling to disk and avoiding OOM at all times. Presto should be used for those cases.
   
   However, for some of our users (who are OK with multi-second latency and prefer a slightly more accurate response for GROUP BY queries), as a follow-up / next phase we want to consider enhancing Pinot's support for queries that return large responses and/or process / aggregate larger-than-usual amounts of data. We want to do this by moving some of the memory-intensive query execution operations into off-heap (direct) memory, as sketched below.
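   
   As a rough illustration of the off-heap direction (a sketch of the general idea only, not Pinot's actual design), intermediate GROUP BY aggregation state could be held in direct ByteBuffers so that large intermediate results do not put pressure on the JVM heap:
   
      import java.nio.ByteBuffer;
      
      // Sketch only: a fixed-size table of per-group long sums kept in direct
      // (off-heap) memory instead of on the JVM heap. Group-to-slot mapping,
      // resizing, spilling, etc. are all omitted.
      public class OffHeapGroupBySums {
        private final ByteBuffer buffer;
      
        public OffHeapGroupBySums(int numGroups) {
          // 8 bytes per group for a long-valued SUM aggregate
          buffer = ByteBuffer.allocateDirect(numGroups * Long.BYTES);
        }
      
        public void add(int groupId, long value) {
          int pos = groupId * Long.BYTES;
          buffer.putLong(pos, buffer.getLong(pos) + value);
        }
      
        public long get(int groupId) {
          return buffer.getLong(groupId * Long.BYTES);
        }
      }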


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

