You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by Nicolas Paris <ni...@gmail.com> on 2018/04/30 13:30:56 UTC

Re: question about views

Hi

This looks an interesting design.

Am I correct such view
would hit the RDBMS for every query ?
However such view would hit the parquet file only when
the timestamp predicate would match a partition ?

Any news on a recent test to confirm the design ?

Thanks

2018-03-20 6:49 GMT+01:00 Ted Dunning <te...@gmail.com>:

> Aman,
>
> That is exactly the clarification that I needed. I had a hazy memory of a
> problem in this area, but not enough to actually figure out the current
> state.
>
> In case anybody cares, being able to do this is really handy. The basic
> idea is to keep long history in files and recent history in a DB. That
> allows you to create files with data that is advantageously sorted in order
> to get excellent compression. You can get nearly atomic switch-over to
> newly created files with lazy deletion of database entries by using a
> reference to a cutoff date in a database row. The file side would only look
> for data before the cutoff and the DB would only look for data after the
> cut. By positioning new files (created by CTAS on an about to be obsolete
> part of the DB) before changing the cutoff date, we get apparent atomicity.
>
> After the switch, and after a reasonable delay beyond that (to let all
> pending queries finish), the DB can be trimmed.
>
> Without a working pushdown through unions, this is all kind of pointless.
> If that is working now, it would be fabulous.
>
> An example of how big a win this can be, consider a use case where we want
> to keep all old states of customer preferences and context (say for a
> mobile phone). Almost all of the hundreds of settings for an individual
> would be unchanged even if a few do change. That means that if you could
> arrange a day (or more) of data by user id, the columnar compression of
> parquet would crush the data size. This only works, however, if you can
> collect a fair number of rows for each user. Thus the idea of a hybrid
> setup.
>
>
>
> On Mon, Mar 19, 2018 at 11:57 PM, Aman Sinha <am...@apache.org> wrote:
>
> > Due to an infinite loop occurring in Calcite planning, we had to disable
> > the filter pushdown past the union (SetOps).  See
> > https://issues.apache.org/jira/browse/DRILL-3855.
> > Now that we have rebased on Calcite 1.15.0, we should re-enable this and
> > test and if the pushdown works then the partition pruning on both sides
> of
> > the union should automatically work after that.
> >
> > Will follow-up on this..
> >
> > -Aman
> >
> > On Mon, Mar 19, 2018 at 3:02 PM, Kunal Khatua <ku...@gmail.com>
> > wrote:
> >
> > > I think Ted's question is 2 fold, with the former being more important.
> > > 1. Can we push filters past a union.
> > > 2. Will Drill push filters down to the source.
> > >
> > > For the latter, it depends on the source.
> > > For the former, it depends primarily on whether Calcite supports this.
> I
> > > haven't tried it, so I can't say.
> > >
> > > On 3/19/2018 2:22:54 PM, rahul challapalli <challapallirahul@gmail.com
> >
> > > wrote:
> > > First I would suggest to ignore the view and try out a query which has
> > the
> > > required filters as part of the subqueries on both sides of the union
> > (for
> > > both the database and partitioned parquet data). The plan for such a
> > query
> > > should have the answers to your question. If both the subqueries
> > > independently prune out un-necessary data, using partitions or
> indexes, I
> > > don't think adding a union between them would alter that behavior.
> > >
> > > -Rahul
> > >
> > > On Mon, Mar 19, 2018 at 1:44 PM, Ted Dunning wrote:
> > >
> > > > IF I create a view that is a union of partitioned parquet files and a
> > > > database that has secondary indexes, will Drill be able to properly
> > push
> > > > down query limits into both parts of the union?
> > > >
> > > > In particular, if I have lots of archival data and parquet
> partitioned
> > by
> > > > time but my query only asks for recent data that is in the database,
> > will
> > > > the query avoid the parquet files entirely (as you would wish)?
> > > >
> > > > Conversely, if the data I am asking for is entirely in the archive,
> > will
> > > > the query make use of the partitioning on my parquet files correctly?
> > > >
> > >
> >
>

Re: question about views

Posted by Ted Dunning <te...@gmail.com>.

I will see what I can do to set up a test.

On Mon, Apr 30, 2018, 08:10 Vitalii Diravka <vi...@gmail.com>
wrote:

> Ted,
>
> The rules are enabled and DRILL-3855 [1] is resolved.
> Please try your queries with latest Drill master version.
>
> [1] https://issues.apache.org/jira/browse/DRILL-3855
>
> Kind regards
> Vitalii
>
>
> On Mon, Apr 30, 2018 at 4:31 PM Nicolas Paris <ni...@gmail.com> wrote:
>
> > Hi
> >
> > This looks an interesting design.
> >
> > Am I correct such view
> > would hit the RDBMS for every query ?
> > However such view would hit the parquet file only when
> > the timestamp predicate would match a partition ?
> >
> > Any news on a recent test to confirm the design ?
> >
> > Thanks
> >
> > 2018-03-20 6:49 GMT+01:00 Ted Dunning <te...@gmail.com>:
> >
> > > Aman,
> > >
> > > That is exactly the clarification that I needed. I had a hazy memory
> of a
> > > problem in this area, but not enough to actually figure out the current
> > > state.
> > >
> > > In case anybody cares, being able to do this is really handy. The basic
> > > idea is to keep long history in files and recent history in a DB. That
> > > allows you to create files with data that is advantageously sorted in
> > order
> > > to get excellent compression. You can get nearly atomic switch-over to
> > > newly created files with lazy deletion of database entries by using a
> > > reference to a cutoff date in a database row. The file side would only
> > look
> > > for data before the cutoff and the DB would only look for data after
> the
> > > cut. By positioning new files (created by CTAS on an about to be
> obsolete
> > > part of the DB) before changing the cutoff date, we get apparent
> > atomicity.
> > >
> > > After the switch, and after a reasonable delay beyond that (to let all
> > > pending queries finish), the DB can be trimmed.
> > >
> > > Without a working pushdown through unions, this is all kind of
> pointless.
> > > If that is working now, it would be fabulous.
> > >
> > > An example of how big a win this can be, consider a use case where we
> > want
> > > to keep all old states of customer preferences and context (say for a
> > > mobile phone). Almost all of the hundreds of settings for an individual
> > > would be unchanged even if a few do change. That means that if you
> could
> > > arrange a day (or more) of data by user id, the columnar compression of
> > > parquet would crush the data size. This only works, however, if you can
> > > collect a fair number of rows for each user. Thus the idea of a hybrid
> > > setup.
> > >
> > >
> > >
> > > On Mon, Mar 19, 2018 at 11:57 PM, Aman Sinha <am...@apache.org>
> > wrote:
> > >
> > > > Due to an infinite loop occurring in Calcite planning, we had to
> > disable
> > > > the filter pushdown past the union (SetOps).  See
> > > > https://issues.apache.org/jira/browse/DRILL-3855.
> > > > Now that we have rebased on Calcite 1.15.0, we should re-enable this
> > and
> > > > test and if the pushdown works then the partition pruning on both
> sides
> > > of
> > > > the union should automatically work after that.
> > > >
> > > > Will follow-up on this..
> > > >
> > > > -Aman
> > > >
> > > > On Mon, Mar 19, 2018 at 3:02 PM, Kunal Khatua <kunalkhatua@gmail.com
> >
> > > > wrote:
> > > >
> > > > > I think Ted's question is 2 fold, with the former being more
> > important.
> > > > > 1. Can we push filters past a union.
> > > > > 2. Will Drill push filters down to the source.
> > > > >
> > > > > For the latter, it depends on the source.
> > > > > For the former, it depends primarily on whether Calcite supports
> > this.
> > > I
> > > > > haven't tried it, so I can't say.
> > > > >
> > > > > On 3/19/2018 2:22:54 PM, rahul challapalli <
> > challapallirahul@gmail.com
> > > >
> > > > > wrote:
> > > > > First I would suggest to ignore the view and try out a query which
> > has
> > > > the
> > > > > required filters as part of the subqueries on both sides of the
> union
> > > > (for
> > > > > both the database and partitioned parquet data). The plan for such
> a
> > > > query
> > > > > should have the answers to your question. If both the subqueries
> > > > > independently prune out un-necessary data, using partitions or
> > > indexes, I
> > > > > don't think adding a union between them would alter that behavior.
> > > > >
> > > > > -Rahul
> > > > >
> > > > > On Mon, Mar 19, 2018 at 1:44 PM, Ted Dunning wrote:
> > > > >
> > > > > > IF I create a view that is a union of partitioned parquet files
> > and a
> > > > > > database that has secondary indexes, will Drill be able to
> properly
> > > > push
> > > > > > down query limits into both parts of the union?
> > > > > >
> > > > > > In particular, if I have lots of archival data and parquet
> > > partitioned
> > > > by
> > > > > > time but my query only asks for recent data that is in the
> > database,
> > > > will
> > > > > > the query avoid the parquet files entirely (as you would wish)?
> > > > > >
> > > > > > Conversely, if the data I am asking for is entirely in the
> archive,
> > > > will
> > > > > > the query make use of the partitioning on my parquet files
> > correctly?
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: question about views

Posted by Vitalii Diravka <vi...@gmail.com>.

Ted,

The rules are enabled and DRILL-3855 [1] is resolved.
Please try your queries with latest Drill master version.

[1] https://issues.apache.org/jira/browse/DRILL-3855

Kind regards
Vitalii


On Mon, Apr 30, 2018 at 4:31 PM Nicolas Paris <ni...@gmail.com> wrote:

> Hi
>
> This looks an interesting design.
>
> Am I correct such view
> would hit the RDBMS for every query ?
> However such view would hit the parquet file only when
> the timestamp predicate would match a partition ?
>
> Any news on a recent test to confirm the design ?
>
> Thanks
>
> 2018-03-20 6:49 GMT+01:00 Ted Dunning <te...@gmail.com>:
>
> > Aman,
> >
> > That is exactly the clarification that I needed. I had a hazy memory of a
> > problem in this area, but not enough to actually figure out the current
> > state.
> >
> > In case anybody cares, being able to do this is really handy. The basic
> > idea is to keep long history in files and recent history in a DB. That
> > allows you to create files with data that is advantageously sorted in
> order
> > to get excellent compression. You can get nearly atomic switch-over to
> > newly created files with lazy deletion of database entries by using a
> > reference to a cutoff date in a database row. The file side would only
> look
> > for data before the cutoff and the DB would only look for data after the
> > cut. By positioning new files (created by CTAS on an about to be obsolete
> > part of the DB) before changing the cutoff date, we get apparent
> atomicity.
> >
> > After the switch, and after a reasonable delay beyond that (to let all
> > pending queries finish), the DB can be trimmed.
> >
> > Without a working pushdown through unions, this is all kind of pointless.
> > If that is working now, it would be fabulous.
> >
> > An example of how big a win this can be, consider a use case where we
> want
> > to keep all old states of customer preferences and context (say for a
> > mobile phone). Almost all of the hundreds of settings for an individual
> > would be unchanged even if a few do change. That means that if you could
> > arrange a day (or more) of data by user id, the columnar compression of
> > parquet would crush the data size. This only works, however, if you can
> > collect a fair number of rows for each user. Thus the idea of a hybrid
> > setup.
> >
> >
> >
> > On Mon, Mar 19, 2018 at 11:57 PM, Aman Sinha <am...@apache.org>
> wrote:
> >
> > > Due to an infinite loop occurring in Calcite planning, we had to
> disable
> > > the filter pushdown past the union (SetOps).  See
> > > https://issues.apache.org/jira/browse/DRILL-3855.
> > > Now that we have rebased on Calcite 1.15.0, we should re-enable this
> and
> > > test and if the pushdown works then the partition pruning on both sides
> > of
> > > the union should automatically work after that.
> > >
> > > Will follow-up on this..
> > >
> > > -Aman
> > >
> > > On Mon, Mar 19, 2018 at 3:02 PM, Kunal Khatua <ku...@gmail.com>
> > > wrote:
> > >
> > > > I think Ted's question is 2 fold, with the former being more
> important.
> > > > 1. Can we push filters past a union.
> > > > 2. Will Drill push filters down to the source.
> > > >
> > > > For the latter, it depends on the source.
> > > > For the former, it depends primarily on whether Calcite supports
> this.
> > I
> > > > haven't tried it, so I can't say.
> > > >
> > > > On 3/19/2018 2:22:54 PM, rahul challapalli <
> challapallirahul@gmail.com
> > >
> > > > wrote:
> > > > First I would suggest to ignore the view and try out a query which
> has
> > > the
> > > > required filters as part of the subqueries on both sides of the union
> > > (for
> > > > both the database and partitioned parquet data). The plan for such a
> > > query
> > > > should have the answers to your question. If both the subqueries
> > > > independently prune out un-necessary data, using partitions or
> > indexes, I
> > > > don't think adding a union between them would alter that behavior.
> > > >
> > > > -Rahul
> > > >
> > > > On Mon, Mar 19, 2018 at 1:44 PM, Ted Dunning wrote:
> > > >
> > > > > IF I create a view that is a union of partitioned parquet files
> and a
> > > > > database that has secondary indexes, will Drill be able to properly
> > > push
> > > > > down query limits into both parts of the union?
> > > > >
> > > > > In particular, if I have lots of archival data and parquet
> > partitioned
> > > by
> > > > > time but my query only asks for recent data that is in the
> database,
> > > will
> > > > > the query avoid the parquet files entirely (as you would wish)?
> > > > >
> > > > > Conversely, if the data I am asking for is entirely in the archive,
> > > will
> > > > > the query make use of the partitioning on my parquet files
> correctly?
> > > > >
> > > >
> > >
> >
>