Posted to users@pulsar.apache.org by Sijie Guo <gu...@gmail.com> on 2019/11/02 04:33:09 UTC
Re: Presto - event time predicate push-down?
Hi Brian,
I think it is already implemented. The Pulsar Presto Connector supports
predicate pushdown based on publish time.
> Or it could include min/max/bloom filter on user data too, like ORC
> <https://orc.apache.org/docs/indexes.html> does
Alternatively, we can leverage Pulsar's tiered storage mechanism and
implement a schema-aware columnar offloader to offload a row-based segment
into a columnar segment using Parquet or ORC format.
That's one item on our roadmap.
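
As a rough illustration of the row-to-columnar idea only (not the actual offloader, which is not implemented in this thread), the sketch below pivots a row-based segment into one list per schema field. The function name rows_to_columns and the field names are hypothetical, and a real offloader would write Parquet or ORC rather than Python lists.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]], fields: List[str]) -> Dict[str, List[Any]]:
    """Pivot row records into one list per schema field.

    Hypothetical helper for illustration; a field missing from a row
    is stored as None so every column has the same length.
    """
    return {name: [row.get(name) for row in rows] for name in fields}


# A tiny made-up "segment" of two messages.
segment = [
    {"publish_time": 1000, "data": "foo"},
    {"publish_time": 1001, "data": "bar"},
]
columns = rows_to_columns(segment, ["publish_time", "data"])
print(columns)
```

With the data laid out per column, a query that filters only on publish_time would not need to read the data column at all.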
Thanks,
Sijie
On Wed, Oct 9, 2019 at 3:35 AM Brian Candler <b....@pobox.com> wrote:
> On 08/10/2019 18:33, Brian Candler wrote:
>
> select * from events where data like '%foo%'
> and publish_time between timestamp '2019-01-01 12:00:00' and
> timestamp '2019-01-01 13:00:00';
>
> Does Presto/Pulsar/Bookkeeper only touch the segments where publish_time
> is within those boundaries? Is there an index somewhere which says for
> each segment what is the lowest and highest publish_time it contains?
>
> Ah, I found listed under Phase 2 features at
> https://github.com/apache/pulsar/wiki/PIP-19:-Pulsar-SQL
>
> 4. Time boxed queries
> 5. When doing a query over a subset of the data, based on publish time, we
> should be able to only scan the relevant data instead of everything stored
> in the topic
>
> So I guess this is an "upcoming feature".
>
> (Aside: it occurs to me that if every closed segment published its minimum
> and maximum publish time on a meta-topic, that would be an efficient way to
> locate the segments of interest. Or it could include min/max/bloom filter
> on user data too, like ORC <https://orc.apache.org/docs/indexes.html>
> does)
>
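
The min/max idea in the aside above can be sketched as follows. This is only an illustration under stated assumptions: SegmentStats and segments_to_scan are hypothetical names, not Pulsar or BookKeeper APIs, and publish times are taken to be epoch milliseconds.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SegmentStats:
    """Hypothetical index entry for one closed segment (not a real Pulsar type)."""
    segment_id: int
    min_publish_time: int  # epoch milliseconds (assumed unit)
    max_publish_time: int


def segments_to_scan(index: List[SegmentStats], start: int, end: int) -> List[int]:
    """Return ids of segments whose [min, max] range overlaps [start, end]."""
    return [
        s.segment_id
        for s in index
        if s.max_publish_time >= start and s.min_publish_time <= end
    ]


# Made-up index of four segments, each covering 1000 ms.
index = [
    SegmentStats(1, 0, 999),
    SegmentStats(2, 1000, 1999),
    SegmentStats(3, 2000, 2999),
    SegmentStats(4, 3000, 3999),
]
print(segments_to_scan(index, 1500, 2500))  # prints [2, 3]
```

Only segments 2 and 3 overlap the requested window, so segments 1 and 4 would never be read.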
Re: Presto - event time predicate push-down?
Posted by Brian Candler <b....@pobox.com>.
On 02/11/2019 04:33, Sijie Guo wrote:
> I think it is already implemented. The Pulsar Presto Connector
> supports predicate pushdown based on publish time.
>
Excellent, thank you.
I couldn't get Presto to work in Pulsar's standalone mode (#5370
<https://github.com/apache/pulsar/issues/5370>), but that's just waiting
for a newer version of Presto to be integrated (#5386
<https://github.com/apache/pulsar/pull/5386>).
> Alternatively, we can leverage Pulsar's tiered storage mechanism and
> implement a schema-aware columnar offloader to offload a row-based
> segment into a columnar segment using Parquet or ORC format.
> That's one item in our roadmap.
That would be awesome!