Posted to users@pulsar.apache.org by Sijie Guo <gu...@gmail.com> on 2019/11/02 04:33:09 UTC

Re: Presto - event time predicate push-down?

Hi Brian,

I think it is already implemented. The Pulsar Presto Connector supports
predicate pushdown based on publish time.
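
As a rough illustration of what publish-time pushdown buys, here is a
minimal Python sketch. It is not the connector's code: it assumes publish
times are non-decreasing within a topic partition and are stored as integer
milliseconds, and the names Entry, first_index_at_or_after and
entries_in_range are hypothetical. Under those assumptions, the two bounds
of a publish-time range can be turned into a start and end position by
binary search, so only the entries inside the range are read.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    """Hypothetical stand-in for one stored message."""
    publish_time: int  # milliseconds since the epoch (assumed)
    payload: str


def first_index_at_or_after(entries: List[Entry], t: int) -> int:
    """Binary search for the first entry whose publish_time is >= t.

    Assumes entries are ordered by non-decreasing publish_time.
    """
    lo, hi = 0, len(entries)
    while lo < hi:
        mid = (lo + hi) // 2
        if entries[mid].publish_time < t:
            lo = mid + 1
        else:
            hi = mid
    return lo


def entries_in_range(entries: List[Entry], start_time: int, end_time: int) -> List[Entry]:
    """Return entries with start_time <= publish_time <= end_time."""
    start = first_index_at_or_after(entries, start_time)
    # Times are integers, so "first entry after end_time" is "first at or after end_time + 1".
    end = first_index_at_or_after(entries, end_time + 1)
    return entries[start:end]
```

Each bound costs O(log n) comparisons instead of a full scan of the topic.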

> Or it could include min/max/bloom filter on user data too, like ORC
> <https://orc.apache.org/docs/indexes.html> does

Alternatively, we can leverage Pulsar's tiered storage mechanism and
implement a schema-aware columnar offloader to offload a row-based segment
into a columnar segment using Parquet or ORC format.
That's one item in our roadmap.
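
To make the row-to-columnar idea concrete, here is a minimal Python sketch.
It does not use Parquet, ORC, or any Pulsar offloader API; to_columnar and
the sample field names are hypothetical, and it assumes every row carries
the same fields, as a schema would guarantee. It only shows the layout
change: one list per field instead of one record per message.

```python
from typing import Any, Dict, List


def to_columnar(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Transpose a row-based segment into one list of values per field."""
    if not rows:
        return {}
    columns: Dict[str, List[Any]] = {name: [] for name in rows[0]}
    for row in rows:
        # A schema-aware writer can rely on a fixed field set; enforce it here.
        if row.keys() != columns.keys():
            raise ValueError("all rows must share the same schema")
        for name, value in row.items():
            columns[name].append(value)
    return columns


# Hypothetical sample segment with three messages.
segment = [
    {"user": "alice", "status": 200, "bytes": 512},
    {"user": "bob", "status": 404, "bytes": 128},
    {"user": "carol", "status": 200, "bytes": 2048},
]
columns = to_columnar(segment)

# A query that only needs `status` reads one list and skips the others.
status_values = columns["status"]
```

A real columnar format would also compress each column and store
per-column statistics, which this sketch leaves out.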

Thanks,
Sijie

On Wed, Oct 9, 2019 at 3:35 AM Brian Candler <b....@pobox.com> wrote:

> On 08/10/2019 18:33, Brian Candler wrote:
>
>     select * from events where data like "%foo%"
>         and publish_time between "2019-01-01T12:00:00"
>         and "2019-01-01T13:00:00";
>
> Does Presto/Pulsar/Bookkeeper only touch the segments where publish_time
> is within those boundaries?  Is there an index somewhere which says for
> each segment what is the lowest and highest publish_time it contains?
>
> Ah, I found listed under Phase 2 features at
> https://github.com/apache/pulsar/wiki/PIP-19:-Pulsar-SQL
>
> 4. Time boxed queries
> 5. When doing a query over a subset of the data, based on publish time, we
> should be able to only scan the relevant data instead of everything stored
> in the topic
>
> So I guess this is an "upcoming feature".
>
> (Aside: it occurs to me that if every closed segment published its minimum
> and maximum publish time on a meta-topic, that would be an efficient way to
> locate the segments of interest.  Or it could include min/max/bloom filter
> on user data too, like ORC <https://orc.apache.org/docs/indexes.html>
> does)
>
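
The aside above can be sketched in a few lines of Python. This is only an
illustration of the idea, not something Pulsar or BookKeeper provides:
SegmentStats and segments_to_scan are hypothetical names, and the
per-segment minimum and maximum publish times are assumed to have been
recorded when each segment was closed.

```python
from typing import List, NamedTuple


class SegmentStats(NamedTuple):
    """Hypothetical index record published once per closed segment."""
    segment_id: int
    min_publish_time: int
    max_publish_time: int


def segments_to_scan(index: List[SegmentStats], start_time: int, end_time: int) -> List[int]:
    """Return ids of segments whose time range overlaps [start_time, end_time].

    A segment can be skipped when it ends before the query starts
    or starts after the query ends.
    """
    return [
        s.segment_id
        for s in index
        if s.max_publish_time >= start_time and s.min_publish_time <= end_time
    ]


# Hypothetical index covering three closed segments.
index = [
    SegmentStats(1, 0, 99),
    SegmentStats(2, 100, 199),
    SegmentStats(3, 200, 299),
]
selected = segments_to_scan(index, 150, 220)
```

Here only segments 2 and 3 overlap the queried range, so segment 1 is
never opened. The same check extends to min/max values of user columns.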

Re: Presto - event time predicate push-down?

Posted by Brian Candler <b....@pobox.com>.
On 02/11/2019 04:33, Sijie Guo wrote:
> I think it is already implemented. The Pulsar Presto Connector 
> supports predicate pushdown based on publish time.
>
Excellent, thank you.

I couldn't get Presto to work in Pulsar's standalone mode (#5370,
<https://github.com/apache/pulsar/issues/5370>), but that's just waiting
for a newer version of Presto to be integrated (#5386,
<https://github.com/apache/pulsar/pull/5386>).


> Alternatively, we can leverage Pulsar's tiered storage mechanism and 
> implement a schema-aware columnar offloader to offload a row-based 
> segment into a columnar segment using Parquet or ORC format.
> That's one item in our roadmap.

That would be awesome!