Posted to dev@pinot.apache.org by TING CHEN <ti...@uber.com.INVALID> on 2020/02/18 22:20:05 UTC

Does Pinot support early termination?

Does Pinot stop early once enough results have been collected?

We have queries of form
"SELECT * FROM table WHERE userID='H' AND sourceEventTimestamp>=t1 AND
sourceEventTimestamp<=t2 ORDER BY sourceEventTimestamp DESC LIMIT 500".

The table is sorted by sourceEventTimestamp, and userID has an inverted
index. I notice that the selectivity of the query is low (meaning many rows
pass the condition), so the first 500 results should be collected
relatively quickly. But the execution times are too long, i.e., > 10 s.

=============== Slack conversation with Kishore and Xiang attached below===
Kishore G <https://app.slack.com/team/UDRJ7G85T>  12:32 PM
<https://apache-pinot.slack.com/archives/CDRCA57FC/p1582057958151100>
We can do early termination if there is no order by
12:33 <https://apache-pinot.slack.com/archives/CDRCA57FC/p1582058018152300>
But with order by, there is nothing much we can do to terminate early...
12:34 <https://apache-pinot.slack.com/archives/CDRCA57FC/p1582058040153100>
What is the problem you are trying to solve?
Ting Chen <https://app.slack.com/team/UG3BZ4ALQ>  12:59 PM
<https://apache-pinot.slack.com/archives/CDRCA57FC/p1582059583158200>
The main issue we have is that query latency is too long (~15 s). As for
early termination: since the table is physically sorted by the ORDER BY
column, I suppose an ideal plan is to check the relevant segments (starting
with the segments with the largest values in the filtering range) and stop
when enough results have been collected?
Kishore G <https://app.slack.com/team/UDRJ7G85T>  1:02 PM
<https://apache-pinot.slack.com/archives/CDRCA57FC/p1582059747160100>
That’s possible. What is the time range in the query?
Ting Chen <https://app.slack.com/team/UG3BZ4ALQ>  1:04 PM
<https://apache-pinot.slack.com/archives/CDRCA57FC/p1582059864161200>
From 7 days ago to a few seconds ago. Basically the past 7 days' data.
Kishore G <https://app.slack.com/team/UDRJ7G85T>  1:10 PM
<https://apache-pinot.slack.com/archives/CDRCA57FC/p1582060200165000>
It’s a good optimization to have. Worth starting a thread and discussing
further. For now, is it possible for the client to break it up into
multiple queries, one for each day?
Ting Chen <https://app.slack.com/team/UG3BZ4ALQ>  1:12 PM
<https://apache-pinot.slack.com/archives/CDRCA57FC/p1582060338167500>
I will file an issue for this and investigate the code. Yes, your idea is
basically the workaround for now. I asked the customers to query only the
past 1 day's data instead: they still get the results they need, and the
latency is halved.
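
[Illustrative sketch of the client-side workaround discussed above, not
code from Pinot. It assumes timestamps are epoch milliseconds. The names
build_query, newest_first_windows, fetch_latest and the run_window callable
are hypothetical; run_window stands in for whatever client call sends a
query to the broker and returns rows sorted newest first.]

```python
from typing import Callable, Dict, Iterator, List, Tuple

DAY_MS = 24 * 60 * 60 * 1000  # assumption: timestamps are epoch milliseconds

Row = Dict[str, object]


def build_query(user_id: str, lo: int, hi: int, limit: int) -> str:
    """Render the query from the thread for one time window.

    Plain string formatting is for illustration only; real code should
    escape or parameterize user-supplied values.
    """
    return (
        f"SELECT * FROM table WHERE userID='{user_id}' "
        f"AND sourceEventTimestamp>={lo} AND sourceEventTimestamp<={hi} "
        f"ORDER BY sourceEventTimestamp DESC LIMIT {limit}"
    )


def newest_first_windows(t1: int, t2: int,
                         width: int = DAY_MS) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, non-overlapping [lo, hi] windows over [t1, t2],
    starting with the most recent one."""
    hi = t2
    while hi >= t1:
        lo = max(t1, hi - width + 1)
        yield lo, hi
        hi = lo - 1


def fetch_latest(run_window: Callable[[int, int, int], List[Row]],
                 t1: int, t2: int, limit: int = 500,
                 width: int = DAY_MS) -> List[Row]:
    """Query one window at a time, newest first, and stop as soon as
    `limit` rows have been collected."""
    rows: List[Row] = []
    for lo, hi in newest_first_windows(t1, t2, width):
        remaining = limit - len(rows)
        if remaining <= 0:
            break  # enough rows; older windows are never queried
        rows.extend(run_window(lo, hi, remaining))
    return rows
```

[Because the windows do not overlap and are visited newest first, simply
appending each window's rows keeps the overall result in descending
timestamp order. In practice run_window would send
build_query(user_id, lo, hi, limit) to the cluster.]
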
Kishore G <https://app.slack.com/team/UDRJ7G85T>  1:14 PM
<https://apache-pinot.slack.com/archives/CDRCA57FC/p1582060485168700>
Cool. What you want is doable with some optimization in the planning
phase..
Xiang Fu <https://app.slack.com/team/UGRJA9TEH>  1:46 PM
<https://apache-pinot.slack.com/archives/CDRCA57FC/p1582062381169400>
@Ting Chen <https://apache-pinot.slack.com/team/UG3BZ4ALQ>
1:46 <https://apache-pinot.slack.com/archives/CDRCA57FC/p1582062405170000>
one thing about this is that the query will hit many segments and merge the
results
1:47 <https://apache-pinot.slack.com/archives/CDRCA57FC/p1582062448170800>
so it’s hard to tell the global ordering to do early termination
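
[Illustrative sketch of the optimization discussed above, not Pinot's
actual planner or executor. It assumes each segment exposes the minimum and
maximum sourceEventTimestamp it contains and that rows within a segment are
sorted on that column. Segment and top_k_desc are hypothetical names. Under
those assumptions the per-segment maximum gives a bound on global ordering:
once the k-th best timestamp found so far is at least the maximum of every
unvisited segment, no remaining segment can change the result.]

```python
import heapq
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    min_ts: int
    max_ts: int
    # Timestamps of rows that pass the userID filter, ascending, because
    # the segment is assumed to be sorted on the time column.
    timestamps: List[int]


def top_k_desc(segments: List[Segment], t1: int, t2: int,
               k: int) -> Tuple[List[int], int]:
    """Return the k largest timestamps in [t1, t2] and the number of
    segments that actually had to be scanned."""
    # Skip segments whose time range cannot intersect the filter.
    candidates = [s for s in segments if s.max_ts >= t1 and s.min_ts <= t2]
    # Visit segments holding the newest data first.
    candidates.sort(key=lambda s: s.max_ts, reverse=True)

    heap: List[int] = []  # min-heap holding the best k timestamps so far
    scanned = 0
    for seg in candidates:
        # Every remaining segment has max_ts <= seg.max_ts, so if the
        # current k-th best is already at least that, we can stop.
        if len(heap) == k and heap[0] >= seg.max_ts:
            break
        scanned += 1
        for ts in reversed(seg.timestamps):  # newest rows first
            if ts > t2:
                continue
            if ts < t1:
                break
            if len(heap) < k:
                heapq.heappush(heap, ts)
            elif ts > heap[0]:
                heapq.heapreplace(heap, ts)
            else:
                break  # older rows in this segment cannot help either
    return sorted(heap, reverse=True), scanned
```

[With non-overlapping segments, such as one per day, a LIMIT 500 query
over 7 days would typically stop after the newest segment or two. With
overlapping segments the same bound still holds, but more segments may
need to be scanned before it triggers.]
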